PARALLEL LANGUAGE AND PIPELINE CONSTRUCTS FOR CONCURRENT COMPUTATION

by

Zhiwei Xu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1987

Copyright 1987 Zhiwei Xu

This dissertation, written by Zhiwei Xu under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

DEDICATION

To Li Yin and Shao-Meng

ACKNOWLEDGMENTS

I would like to take this opportunity to express my sincere appreciation to Professor Kai Hwang for his guidance and support in my doctorate study. I have found his ideas and constructive criticisms a constant source of inspiration. His patience and encouragement helped me through my graduate study at Purdue and USC.

I would like to thank Professor Dean Jacobs and Professor Wlodek Proskurowski for serving on my dissertation committee. Their valuable suggestions enhanced the presentation of this thesis. I also thank Professor Michel Dubois, Professor Jean-Luc Gaudiot, Professor B. Keith Jenkins, and Professor Sun-Yuan Kung for many educational discussions during this research.

I thank my parents, my wife, and my daughter for their love, encouragement, and understanding. I thank all my friends and colleagues at USC for their various help.

Finally, I gratefully acknowledge the financial support from the National Science Foundation (Grant DMC-84-21022), the Air Force Office of Scientific Computing (Grant 86-0008), and the Office of Navy Research (Contract No. N00014-86-K-0559).

Table of Contents

Dedication ... ii
Acknowledgments ... iii
List of Figures ... viii
List of Tables ... x
Abstract ... xi
1. Introduction ... 1
1.1. Problems of Parallel Computation ... 1
1.1.1. Representation ... 2
1.1.2. Analysis ... 3
1.1.3. Synthesis ... 4
1.2. Towards a Theory of Parallel Computation ... 6
1.3. Organization of the Dissertation ... 9
2. Molecules for Parallel Programming ... 12
2.1. Program Structure ... 12
2.2. Data Declarations ... 14
2.3. Expressions and Statements ... 16
2.4. Molecule Types ... 18
2.4.1. Syntax Rules ... 18
2.4.2. Communication Rules ... 20
2.4.3. Sequencing Rules ... 23
3. Various Parallel Computation Modes ... 27
3.1. SIMD Array Computation ... 27
3.2. Pipelined Computation ... 29
3.3. Shared Variable Multiprocessing ... 29
3.4. Message Passing Multicomputing ... 32
3.5. Dataflow Computation ... 33
3.6. A Layered Programming Methodology ... 37
4. Compound Functions and Pipeline Nets ... 47
4.1. The forpipe Construct ... 49
4.2. Some Extensions ... 58
4.3. The Pipeline Net ... 60
4.4. Vector Instructions ... 67
5. Parallel Pipelined Computations ... 73
5.1. Computation Systems ... 74
5.2. Properties of Computation Systems ... 83
5.3. An Operational Semantics of the forpipe Construct ... 92
5.4. Modeling Pipeline Nets ... 99
6. Synthesis and Performance of Pipeline Nets ... 102
6.1. Networking ... 103
6.2. Partitioning ... 111
6.3. Performance Analysis ... 115
7. Optical Implementation of Pipeline Nets ... 126
7.1. The Architecture of Opcom ... 128
7.2. Simulation of Turing Machine by Opcom ... 135
7.3. Concurrent Operations in Opcom ... 138
7.4. Implementation Issues of Opcom ... 146
8. Conclusions ... 153
8.1. Summary of Research Results ... 153
8.2. Suggestions for Future Research ... 156
References ... 158
Appendices ... 165
A. PAL Programs for Complex Inner Product Computation ... 165
B. Livermore Loops Representable in forpipe Loops ... 169
C. Livermore Loops Not Representable in forpipe Loops ... 171

List of Figures

Figure 1.1. Relationships among representation, analysis, and synthesis
Figure 2.1. Three communication schemes for parallel processing
Figure 2.2. Four sequencing rules in program execution
Figure 3.1. Three layers of parallel program development on an iPSC hypercube multicomputer
Figure 3.2. Three layers of program development for Example 3.2
Figure 3.3. Parallel program development on major classes of parallel computers
Figure 4.1. The dependence graph for Example 4.1
Figure 4.2. Examples of ill-formed dependence graphs
Figure 4.3. The assignment graph for Example 4.1
Figure 4.4. Historical evolution of the concept of pipeline networking
Figure 4.5. The logical architecture of a pipeline net
Figure 4.6. The configuration register used in pipeline networking
Figure 4.7. An example of multipipeline networking
Figure 5.1. An interpreted SP-net derived for Example 4.1
Figure 5.2. Examples of a system firing
Figure 5.3. Operational snapshots of an example system
Figure 5.4. Transforming an MIMO system into an SISO one
Figure 5.5. A system with a 0-capacity cycle
Figure 5.6. Transforming a forpipe loop in Example 5.3 into a system
Figure 5.7. A system modeling the pipeline net in Fig. 4.7
Figure 6.1. Various methods for converting forpipe loops into systolic arrays or pipeline nets
Figure 6.2. An example transformation of a system into a pipeline net
Figure 6.3. Partitioning a large system into several subsystems using Algorithm 6.3
Figure 6.4. The throughput of pipeline nets as a function of vector length
Figure 6.5. A system for Example 6.3
Figure 6.6. The speedup performance of pipeline nets of various sizes and different cyclic graph structures
Figure 6.7. The speedup performance of using pipeline nets in implementing Livermore loops
Figure 7.1. The architecture of an optical computer (Opcom)
Figure 7.2. The operation of a functional cell
Figure 7.3. The control unit used in Opcom
Figure 7.4. The logical circuit of a bit-slice integer adder
Figure 7.5. Throughput performance of using Opcom in vector additions
Figure 7.6. The characteristics of an optical bistable device
Figure 7.7. Implementing a functional cell by an optical bistable device

List of Tables

Table 3.1. Parallel programming using extended C and PAL on an iPSC hypercube multicomputer
Table 4.1. Examples of compound functions suitable for multi-pipeline networking
Table 4.2. Comparisons of systolic arrays, switch lattices, and pipeline nets

Abstract

With increasing research on and applications of parallel computation, there is a strong need for a coherent theory. Such a theory should provide a unified framework in which many problems regarding the design of parallel algorithms, parallel languages, and parallel computers can be rigorously formulated, studied, and solved. In this dissertation, a methodology towards such a theory is developed.

Important problems of parallel computation are identified and classified into three categories: representation, analysis, and synthesis. It is suggested that representations of parallel computation should be made available using a high-level language, a mathematical model, and a computer architecture. A parallel language construct, called molecule, is proposed and studied. It is demonstrated that this construct, armed with the molecule type concept, can be efficiently used to specify parallel computations of various modes. A three-layered method based on molecules is also described for developing programs on various parallel computers.

A theory is established for an important class of parallel computations: evaluations of compound functions. Such computations are specified by a forpipe loop language construct and implemented by a pipeline net parallel architecture. A mathematical model, called computation system, is defined. Important problems of termination, determinacy, equivalence, performance, and optimal synthesis are formulated and solved in this model. The pipeline net architecture is also related to an emerging technology: digital optical computing. Design and implementation issues of an optical pipeline net are investigated.

The molecule construct and the layered methodology can be embedded in a software environment for efficient development of parallel programs on pipelined computers (Cyber 205, FPS 164/MAX), SIMD array processors (MPP, Connection Machine), multiprocessors (Cray X-MP, Alliant FX/8), and multicomputers (iPSC, NCUBE). Such an environment aids the programmer in solving various parallel programming problems, such as partitioning, communication, synchronization, and allocation. The pipeline net architecture offers an attractive alternative for future generations of advanced vector multiprocessors. The results presented are useful for designing systolic arrays, wavefront arrays, and other parallel pipelined computers, using electronic or optical technologies.

Chapter 1
Introduction

With the decreasing cost of hardware and the increasing demand for computational power, parallel computation is becoming a major paradigm of supercomputing. A great deal of work has been done towards the understanding and utilization of various forms of parallel computation, ranging from purely theoretical studies to commercial implementations. The time has come to integrate this experience and knowledge into a coherent theory, whereby various problems can be rigorously formulated, studied, and solved within a unified framework.
Such a theory would be very useful for the design and implementation of parallel architectures, parallel programming languages, and parallel algorithms.

1.1. Problems of Parallel Computation

In the past two decades, researchers in computer science have accumulated a rich set of schemes for various sequential and parallel computations[41]. These schemes are called computation modes in this research. The following is a partial list of the computation modes proposed:

(1) Serial mode: Operations are performed one by one; the order of execution is determined through a program counter in a sequential computer. At any time, at most one operation can be performed.

(2) Pipelining mode: Multiple operations can be performed at the same time in an overlapped fashion[52].

(3) SIMD mode: Multiple operations of the same type can be performed at the same time on multiple processors in a lock-step fashion[10].

(4) MIMD mode: Multiple operations of different types can be performed at the same time on multiple processors asynchronously[53, 57].

(5) Dataflow mode: An operation can be executed as soon as all its operand data become available[8, 21, 26].

(6) Reduction mode: An operation is performed if its result is requested[77].

(7) Mixed modes: An example is a macro dataflow scheme in which a computation combines high-level dataflow and low-level serial execution[9, 26, 40].

To utilize these modes for high-performance computation, one must have a profound understanding of their properties. The following questions need to be answered: What exactly are these computation modes? What properties do they have? How should they be represented in a high-level language? How can they be implemented in a computer architecture? In this dissertation, I classify these problems into three categories: representation, analysis, and synthesis.

1.1.1. Representation

Representation refers to the syntactic and the semantic specifications of computation. Just as knowledge representation is essential to developing expert systems, parallelism representation provides a foundation on which a theory of parallel computation is built. Four levels of parallelism representation schemes are identified below:

(1) Parallel algorithms: At this level one has some informal definition of various computation modes (pipelining, dataflow, SIMD, MIMD, etc.). A computation of a certain mode is represented in English or some pseudo programming language and explained with figures or examples[6, 67].

(2) Parallel languages: High-level language constructs[7, 47] provide another representation scheme. An example is the task construct in Ada, which provides a notation for MIMD computations.

(3) Mathematical models: The most attractive scheme is to represent parallel computations with some mathematical model[16, 63, 65]. This way, the semantics of a computation is defined precisely, and problems can be formulated and solved mathematically.

(4) Parallel architectures: A computation mode can also be represented by a particular computer architecture[41]. Here architecture means the hardware organization together with the instruction set and probably some operating system primitives. For example, the SIMD computation mode can be understood clearly with parallel computation on the MPP array processor[10].

1.1.2. Analysis

Once a proper representation scheme is chosen for parallel computations, we are in a position to study their properties. Analysis refers to the determination of various properties of a given computation.
Important analysis problems include the following[16, 50, 65]:

(1) Determinacy: Does a computation always produce the same result for the same operands?

(2) Termination: Does the computation always stop in a finite amount of time?

(3) Equivalence: Is this computation equivalent to another one in a mathematically precise fashion?

(4) Performance: Given hardware constraints, how much time is needed for a computation?

1.1.3. Synthesis

Synthesis refers to the transformation of a computation at a certain level of representation to another computation at the same or a lower level. Here the order of levels is as shown in Fig. 1.1, with the informal representation at the highest level and the architecture representation at the lowest level. An example of synthesis is to design a systolic array from some informal specification. The synthesis problem can be further divided into two subproblems. Translation is the process of converting computations from one level to a lower level (an example is compilation, which translates high-level language programs into particular machine codes). Optimization is often performed at the same level, and intends to minimize or maximize some parameters under certain constraints. Figure 1.1 illustrates the relationships among representation, analysis, and synthesis.

[Figure 1.1. Relationships among representation, analysis, and synthesis]

1.2. Towards a Theory of Parallel Computation

This dissertation investigates various issues that need to be resolved in developing a theory of parallel computation. Such a theory consists of the following components:

(1) A computation domain that contains a class of desired computations of a certain mode (or modes).

(2) A high-level language that specifies these computations.

(3) A computer architecture that implements these computations.

(4) A mathematical model that defines semantics for these computations.

(5) Analysis and synthesis problems arising in these computations.

(6) Solutions to (or methodologies for) these problems.

For a given computation domain, many theories can be derived by selecting the above components differently. What we need is a "good" theory. Whether a theory is good or not largely depends on the representation schemes used. This is analogous to an axiomatic system for some application domain, where the basic axioms and inference rules determine the power of the system. For this reason, we must set up some criteria for the design and selection of the language, the architecture, and the mathematical model. A representation scheme is considered good if it is simple, flexible, and efficient, as explained below.

Simplicity. A representation scheme is considered simple if it is based on a small number of easily understandable concepts.
A representation scheme is considered flexible, if it can be used to express all computations in the given domain. For instance, Cray X-MP[68] is a flexible architecture for those parallel computations th at utilize both synchro nous pipelining and asynchronous multiprocessing. A high-level language for Cray X-MP should enable a user to write various programs th a t combine these two modes. A m athem atical model is flexible if it can be used (1) to define an operational semantics for any program w ritten in the chosen language, (2) to model the operations of the chosen architecture, and (3) to formulate all analysis and synthesis problems. Efficiency. The efficiency of a high-level language has two different mean ings. It should help a user in developing parallel programs efficiently, and it can be efficiently implemented on the target architecture. Some dataflow language, such as Id[8], is efficient in the first sense, while F ortran is efficient for pipelined computers in the second sense. The efficiency of a parallel architecture can often be measured in term s of speedup. If the hardware parallelism is n , an efficient architecture should exhibit a speedup close to n for most computations in the given domain. Finally, a mathem atical model should be able to solve most of 8 the given analysis and synthesis problems efficiently. In the past, theoretical work on parallel com putation have been centered around the definition of a m athem atical model and its application to solving a num ber of analysis and synthesis problems. These models can be divided into three groups: Abstract Machine Models. Such models are usually m athem atical abstrac tions of parallel extensions of sequential computers. These include Chandra, Kozen, and Stockmeyer’s alternating machines[13], Meyer auf der Heide’s DRAMs[32], Goldschlager’s SIMDAG[28], Fortune and Wyllie’s PRAM[25], and Schwartz’s U ltracom puters[73]. These models are mainly applied to the study of complexity of synchronous parallel computation[l6]. Algebraic M odels. W ith such a model, the concept of process and a num ber of basic operators on processes are defined by some axioms. Processes and operators together form some process algebra. Various properties of processes can then be derived algebraically. Examples of this type of model include M ilner’s calculus of communicating systems (CCS)[62, 63], Hoare’s com municating sequential processes (CSP)[34], and Bergstra and Klop’s process alge- b ra[ll]. This kind of model is usually used to represent asynchronous MIMD com putations based on message passing. Many analysis problems can be con cisely form ulated and solved within this model. However, it is dubious whether this type of model can be used to represent synchronous computation efficiently, and there are comparatively very few results regarding the performance analysis and the synthesis issues. 9 Graph Models. Roughly speaking, such a model usually associates each com putation step w ith a node of a directed graph, and specifies the transmission paths of data and control information by edges. Two such models are Petri nets[65] and Karp and Miller’s com putation graphs[50], which have been used to study various analysis problems of asynchronous com putations. Dataflow graphs[2l] provide another example of this type of model, which are used to define operational semantics of dataflow computations[8]. A theory will be developed in this dissertation for a class of parallel pipe lined computation: evaluations of compound functions (CFs). 
None of the existing models is flexible enough to provide a unified framework for formulating and solving a number of important analysis and synthesis problems in this domain. For this reason, I will define a mathematical model, called computation system, which makes it possible to solve these problems in a systematic way.

1.3. Organization of the Dissertation

A main difference of parallel computation from sequential computation is that many computation modes can be adopted. Motivated by this additional dimension, I introduce a new language construct, called molecule, in Chapter 2. This construct is defined with a procedural language, PAL. To satisfy the simplicity criterion, PAL adopts a Pascal-like syntax. There is only a single new concept, molecule type (or program type), which is used to specify various computation modes. In Chapter 3, the flexibility of PAL is demonstrated by defining six molecule types that characterize the SIMD array processing, pipelining, shared variable multiprocessing, message passing multicomputing, and dataflow computation modes. These cover most computation modes existing in today's parallel computers. A layered methodology is then presented for efficient development of parallel programs for commercial computers.

Beginning with Chapter 4, the dissertation focuses on developing a theory for a class of frequently encountered parallel computations: evaluations of compound functions (CFs). A forpipe construct is proposed to specify CFs in PAL notation. A parallel architecture, called the pipeline net, is designed for implementing CFs.

Chapter 5 is devoted to the definition and study of a mathematical model for CF evaluation. This model, called computation system, is shown to be powerful enough for modeling forpipe loops and pipeline nets. Furthermore, I define and solve the analysis problems of termination, determinacy, and equivalence.

Chapter 6 deals with the problem of optimal synthesis. Algorithms are given that transform a CF represented by a forpipe loop into an equivalent, optimal pipeline net specification. The performance of pipeline nets is then analyzed. The throughput and the speedup of a pipeline net processor are defined and derived for the evaluation of various CFs. The general results are applied to a group of benchmarks, the Livermore loops.

Chapter 7 relates the proposed pipeline net architecture to an emerging technology: digital optical computing. An optical pipeline net architecture, called Opcom, is defined and studied. Special attention is paid to how Opcom takes advantage of optical technology and to its implementation in optical devices.

Finally, in Chapter 8, I summarize the results obtained in this research and discuss directions for further study.

Chapter 2
Molecules for Parallel Programming

This chapter defines a language construct, called molecule, for parallel programming. A molecule is a procedure associated with a molecule type. Each molecule type specifies a particular computation mode (pipelining, SIMD, dataflow, etc.) with some syntax and semantic rules. In the following, I describe the basic concept of molecule with a simple language, called PAL. Pascal-like notations are implicitly used. Terms such as variable, expression, data type, etc.,
are assumed to have the same meaning as in Pascal[78], unless otherwise specified. Henceforth, I adopt the following conventions: (1) square brackets denote an optional sequence of items, (2) angle brackets surround nonterminals, (3) braces enclose a repeated item which may appear zero or many times, and (4) boldfaced items indicate reserved words.

2.1. Program Structure

The BNF syntax definition of a PAL program is given below:

    <program> ::= {<molecule type definition>} [<global declaration>]
                  <molecule declaration> {<molecule declaration>}

PAL adopts a C-like program structure[51]. A program consists of a global declaration followed by a number of molecule declarations. One of the molecules must be named as a "main" program. The difference from C is that a number of molecule type definitions may appear prior to the global declaration. The user defines a molecule type with a typro construct, to be introduced shortly.

Example 2.1. This example illustrates the structure of a PAL program which adopts a macro dataflow mode. The main program specifies a macro dataflow algorithm, which calls a restricted sequential subprogram (fct1) and a dataflow subprogram (fct2).

    typro restricted                        /* molecule type definition */
        detail of restricted type
    typro dataflow                          /* molecule type definition */
        detail of dataflow type
    const n = 100;                          /* global declaration */
    main() oftype dataflow                  /* main molecule declaration */
        detail of the main molecule
    fct1(x: in; y: out) oftype restricted   /* molecule fct1 declaration */
        detail of fct1
    fct2(x: in) oftype dataflow             /* molecule fct2 declaration */
        detail of fct2

A molecule declaration in PAL, as shown below, is quite similar to a function declaration in C:

    <molecule declaration> ::=
        <molecule name> ([<parameter list>]) oftype <molecule type>
        [<parameter declaration>]
        begin
            [<local declaration>]
            <molecule body>
        end [<molecule name>]

Example 2.2. This example demonstrates the structure of a molecule declaration.

    fct1(x: in; y: out) oftype restricted
    x, y: integer;                          /* parameter declaration */
    begin
        i: integer;                         /* local declaration */
        i := x;                             /* molecule body */
        while i > 0 do
        begin
            y := x + i;
            i := i - 1
        end
    end fct1

There are two features which differentiate a molecule from a C function: (1) A molecule type must be specified following the reserved word oftype. Each molecule type determines a distinct computation mode (sequential, pipelining, dataflow, etc.). (2) The parameter class concept is adopted from Ada[20]. A parameter class (in, out, or inout) specifies whether a parameter is intended to pass an operand value into or out of a molecule. This concept helps users specify the input-output behavior of a molecule.

2.2. Data Declarations

A data object in a PAL program is declared as a global, a parameter, or a local object, depending on whether it is shared by all molecules, serves as an interface between molecules, or is local to a single molecule, respectively. A global or local data object is declared by specifying its data type and possibly its initial value. A parameter is declared by specifying its class (in the formal parameter list) and its data type (in the parameter declaration part).

(a) Global declarations: Variables, constants, and user-defined data types which do not appear in the parameter or the local declarations of a molecule are called global with respect to that molecule. If a constant or a data type is to be shared by several molecules, it should be defined with a global declaration. Global variables should be used with caution, since they may cause harmful side effects, which constrain parallelism[79]. A special type of global variable, called a semaphore, can be used to facilitate shared variable multitasking. A semaphore can only be processed by two special built-in procedures: P and V[22].
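As a brief illustration of why unprotected globals constrain parallelism, the C sketch below contrasts a routine that updates a shared global with one that communicates only through parameters; the P and V operations are emulated here with POSIX semaphores, which is only an analogy to, not an implementation of, the PAL built-ins.

    #include <semaphore.h>

    double total = 0.0;        /* shared global variable                        */
    sem_t  total_lock;         /* plays the role of a PAL semaphore;            */
                               /* initialize once with sem_init(&total_lock,0,1) */

    /* Side-effecting version: two concurrent calls race on `total`, so the
     * calls must be serialized with P ... V, which limits parallelism.         */
    void accumulate(double x)
    {
        sem_wait(&total_lock);     /* P operation */
        total = total + x;
        sem_post(&total_lock);     /* V operation */
    }

    /* Parameter-only version: no global state, so concurrent calls are safe.   */
    double add(double partial, double x)
    {
        return partial + x;
    }

The same pattern appears in Example 3.1 below, where the semaphores a[i] and b[i] serialize access to the shared array x.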
(b) Parameter classes: The parameter list specifies a number (possibly empty) of formal parameters. Similar to Ada, each parameter has a class (in, out, inout). An in parameter, such as x in Example 2.2, passes a value into the molecule from the caller molecule. It can be used as an operand in the molecule body but cannot be updated (e.g., appear on the left-hand side of an assignment statement). An out parameter (like y in Example 2.2) passes a value from the molecule to the caller molecule. An inout parameter may pass a value into the molecule from another molecule and may also output a value.

(c) Parameter and local declarations: The parameter declaration part defines the data types of formal parameters. The local declaration defines local objects such as data types, constants, and variables. These objects are invisible from outside the molecule. A difference from Pascal is that no submolecules can be declared locally.

(d) Functions: When a molecule is used as a function, a value should be returned from a molecule call. The molecule name serves as a formal parameter with an implicit out class. Its data type is explicitly specified in the parameter declaration. The following molecule fct2 shows an example case:

    fct2(x: in) oftype dataflow
    x, fct2: integer;
    begin
        fct2 := sin(x) + cos(x);
    end fct2

2.3. Expressions and Statements

The expressions in PAL are defined as in Pascal, except that an additional type of if-then-else expression is included:

    if <boolean expression> then <expression> else <expression>

Thus the following are legal PAL expressions: a+b, a[i] * f(b), 3 * (if a > 1 then x+1 else x+5).

In designing PAL, I follow the simplicity principle[36]: no new concept or construct should be introduced unless necessary. This is reflected in the design of the statement set. Only twelve statements are included in PAL, of which seven already appear in Pascal. These twelve statements can be divided into three groups.

Group A. Statements in this group have the same syntax and semantics as those in Pascal. They include assignment, goto, if, while, and molecule call.

    <assignment> ::= <variable> := <expression>
    <goto>       ::= goto <label>
    <if>         ::= if <expression> then <statement> [else <statement>]
    <while>      ::= while <expression> do <compound>
    <call>       ::= <molecule name> ([<actual parameter list>])

Group B. This group contains a for statement and a compound statement. They have syntax similar to those in Pascal. However, both statements may have different operational semantics when they appear in different types of molecules.

    <for>      ::= for <variable> := <expression> to <expression>
                   [step <expression>] do <compound>
    <compound> ::= begin <statement> {<statement>} end

Group C. This group includes five new statements for parallel processing. The send and receive statements are used to support message passing in a distributed computing environment.
    <send>    ::= send <expression> to <destination> [about <key>]
    <receive> ::= receive <variable> from [<source>] [about <key>]

The following fork statement is used for multiprocessing:

    <fork> ::= fork <process or task name list>
               <statements>
               join <process or task name list>

When a fork is executed, all processes or tasks appearing in the corresponding name list will be invoked to execute. These processes or tasks will proceed in parallel with the program segment <statements>. When a join is encountered in a molecule, the execution of the molecule is suspended until all the processes or tasks in the corresponding name list terminate.

In a forall statement

    <forall> ::= forall <variable> := <expression> to <expression>
                 [step <expression>] do <compound>

the compound statement is called the loop body and the variable is called the loop index. Syntactically, a forall statement differs from a for statement in the following ways: (1) No goto statement can appear in the loop body. (2) The left-hand side of any statement must be an array element indexed by the loop index.

A forpipe statement has the following syntax:

    <forpipe> ::= forpipe <variable> := <expression> to <expression>
                  do <compound>

The details of its syntax and semantic definitions are left to Chapter 4.

2.4. Molecule Types

Each molecule has a specific type, which is expressed in the molecule declaration following oftype. A molecule type encapsulates molecules with the same computation mode. A molecule type can be defined with a typro construct, which characterizes a particular computation mode using seven type rules as follows:

    typro <molecule type>
    begin
        class     := <admissible class set>;
        statement := <admissible statement set>;
        callset   := <admissible submolecule type set>;
        global    := no | yes;
        single    := no | strong | weak;
        comm      := <admissible communication scheme set>;
        sequence  := <sequencing scheme>;
    end

2.4.1. Syntax Rules

The first five statements in a typro construct are called the syntax rules. These rules restrict the syntax of a molecule. A common feature of these rules is that they can be checked (almost) completely at compile time. The first rule determines what classes the formal parameters can have. The admissible class set can be any subset of {in, out, inout}; "null" and "all" can also be used to denote the empty subset and the complete set. Similarly, the second rule specifies the admissible statements, i.e., the statements which can appear in the molecule body. The callset rule specifies what types of molecules can be invoked in the molecule body. The fourth rule shows whether global variables are allowed. The fifth is the single-assignment rule, which is meaningful only in those program segments that consist of for, forpipe, call, and assignment statements.

Definition 2.1. A program segment satisfies the "strong" single-assignment rule if the following three conditions hold:

(a) A variable (either scalar or array) appears on the left-hand side of at most one statement.†

(b) The lexical order of the statements can be rearranged such that any data item appearing on the right-hand side of a statement is either an input to the segment or has appeared on the left-hand side of a previous statement.

(c) The left-hand side of any statement in a k-level nested for or forpipe loop must be a k-dimensional array variable.

† A data item is on the left-hand side (the right-hand side) of a molecule call statement if it is the actual of an out class (in class) parameter.
Definition 2.2. A program segment satisfies the "weak" single-assignment rule if, after expanding all for and forpipe statements, the resulting program satisfies the above conditions (a) and (b).

Example 2.3. Consider the following program segment:

    forpipe i := 1 to 2 do
    begin
        S0: x[i] := y[i] * z[i];
        S1: a[i] := a[i-1] + b[i];
        S2: sum  := 2.5 + c[i];
    end

After expanding the loop, we obtain the following:

    S01: x[1] := y[1] * z[1];
    S11: a[1] := a[0] + b[1];
    S21: sum  := 2.5 + c[1];
    S02: x[2] := y[2] * z[2];
    S12: a[2] := a[1] + b[2];
    S22: sum  := 2.5 + c[2];

This program satisfies no single-assignment rule, since the variable sum appears on the left-hand side of both S21 and S22. Deleting S2, we obtain a program which does not satisfy the strong rule, since the variable a appears on both the left-hand side and the right-hand side of S1, which violates condition (b). However, the modified program satisfies the weak rule, since conditions (a) and (b) hold in the corresponding expanded program. If we delete both S1 and S2, the resulting loop will satisfy the strong single-assignment rule.

2.4.2. Communication Rules

The sixth statement in a typro construct denotes a communication rule. This rule defines when and how data are passed into and out of a molecule. A molecule can exchange information with the outside world (its caller molecule or other molecules) through parameters, global variables, or messages. The admissible communication scheme set can be any subset of {parameter, global, message}. These three schemes are illustrated in Fig. 2.1.

[Figure 2.1. Three communication schemes for parallel processing]

In the by parameter scheme, a molecule receives operands when actual parameters are passed to formal parameters, after the molecule is activated and before the molecule body is executed. Result data are sent out when the result formal parameters are passed to actuals, after the molecule body is executed and before the molecule is terminated. There are many parameter passing schemes, such as by value, by reference, by constant, and by name[36]. The following parameter passing scheme is recommended, as I have found it appropriate when applied to all the investigated molecule types: passing by value is used for those in class parameters whose actuals are expressions that are not variables; all other parameters are passed by reference.

Inter-molecule communication can also be realized through the by global scheme. This scheme is applicable only when global variables are allowed to appear in the molecule body. With this scheme, communication takes place while the molecule body is being executed. The molecule receives an operand value when a global variable is referenced; it passes a result out when a global variable is assigned a value. Semaphores can be used to protect the sharing of global variables in a multiprocessing environment, as shown in Section 3.3.

With the by message scheme, a molecule communicates with other molecules through the two special statements, send and receive, while its body is in execution.
Assume that P1 executes send x to P2, and P2 executes receive y from P1, where P1 and P2 are molecules and x, y are variables. Three strategies have been proposed:

• Nonblocking. In this scheme, P1 (P2) initiates the sending (receiving) and continues to execute as soon as the operating system records the request. Note that the message denoted by x (y) is not necessarily sent (received) yet, so the variable x (y) should not be reassigned (used) until it is checked that the message has been sent (received). This checking is performed on the iPSC by a "STATUS" routine.

• Asynchronous. With this strategy, the execution of the send (receive) statement is not completed until the message has been sent (received). Thus, after the statement is executed, x (y) can be reassigned (used) immediately. Note, however, that completion of the send statement does not imply that the message has been received at the intended destination.

• Synchronous. This scheme forces the execution of the send in P1 to be delayed until a corresponding receive statement is executed in P2. The message is then passed to y and both processes continue to execute. Thus, after the completion of the send statement, the message has been received at the intended destination.

2.4.3. Sequencing Rules

The final statement in a typro construct specifies the internal execution sequencing of a molecule, namely, after a molecule is enabled to execute, in what order the statements in the molecule body should proceed. Four basic sequencing schemes are possible for any complex statement, where a complex statement is defined as follows:

    <complex> ::= <compound> | <for> | <forall> | <forpipe> | <fork>

These four sequencing rules can be explained by considering the following loop:

    for i := 1 to 2 do
    begin
        S1(i): a[i] := b[i-1] + 2;
        S2(i): b[i] := 2 * c[i];
    end

• Sequential. This sequencing rule refers to the conventional von Neumann scheme (Fig. 2.2a). Statements in a complex statement are executed one by one following the lexical order (implicit control flow), unless conditional (if) or unconditional (goto) jumps are encountered (explicit control flow).

• Parallel. With this scheme, all the statements in the compound statement execute simultaneously (Fig. 2.2b). In other words, the complex statement can be viewed as a parbegin construct[22].

• Concurrent. A statement in the compound statement begins execution when all its operand data become available (Fig. 2.2c). This is in fact the data-driven rule[4].

• SIMD. With this scheme, the above loop is executed in 2 sequential steps (m steps if there are m statements in the loop body). At step j (j = 1, 2, ..., m), statements Sj(1), Sj(2), ..., Sj(n) are executed simultaneously. All statements in a step must be finished before starting the next step. This is illustrated in Fig. 2.2d.

[Figure 2.2. Four sequencing rules for program execution: (a) sequential, (b) parallel, (c) concurrent, (d) SIMD]

The sequencing of a molecule type can be any combination of the above four schemes. For instance, the SIMD type to be defined in Section 3.1 has the following sequencing rule:

    sequence := {SIMD in forall, sequential in others}

This rule states that the statements in a molecule of SIMD type are executed following the von Neumann sequential ordering, except for forall statements, which employ the SIMD scheme.
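To make the difference between the sequential and the SIMD orderings concrete, the C sketch below executes the two-statement loop above under both rules; it is only an illustration of the orderings, not of how an SIMD machine is actually programmed, and n stands for the number of index values.

    /* Sequential rule: iteration-major order
     * S1(1), S2(1), S1(2), S2(2), ...                                   */
    void run_sequential(int n, int a[], int b[], const int c[])
    {
        for (int i = 1; i <= n; i++) {
            a[i] = b[i-1] + 2;   /* S1(i) */
            b[i] = 2 * c[i];     /* S2(i) */
        }
    }

    /* SIMD rule: statement-major order
     * step 1: S1(1), ..., S1(n) together; step 2: S2(1), ..., S2(n).
     * Every S1(i) therefore reads the old b[i-1], unlike the sequential
     * order, where S1(i) for i > 1 reads the b[i-1] just written by
     * S2(i-1).                                                          */
    void run_simd_order(int n, int a[], int b[], const int c[])
    {
        for (int i = 1; i <= n; i++)     /* step 1: all instances of S1  */
            a[i] = b[i-1] + 2;
        for (int i = 1; i <= n; i++)     /* step 2: all instances of S2  */
            b[i] = 2 * c[i];
    }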
Chapter 3
Various Parallel Computation Modes

Current parallel computers commercially available can be divided into four classes: SIMD array processors (MPP[10] and the Connection Machine[33]), pipelined machines (Cyber 205[41]), shared memory multiprocessors (Cray X-MP[68] and Cyberplus[24]), and message passing multicomputers (iPSC[17] and FPS T-Series[31]). Equipped with the molecule construct, PAL can be used to program most of these parallel computers. In this chapter, I shall define a molecule type for each parallel computer class that characterizes the computation mode implemented therein. Then a layered program development methodology will be discussed that allows parallel programming in an architecture-transparent manner.

3.1. SIMD Array Computation

A Single-Instruction-stream, Multiple-Data-stream (SIMD) array computer consists of a control processor and multiple data processors connected by a communication network. Such a computer executes two types of instructions. A sequential instruction is executed by the control processor. A parallel instruction is broadcast to all data processors for execution. This computation mode can be characterized by an SIMD molecule type as defined below:

    typro simd
    begin
        class     := all;
        statement := {assignment, goto, if, while, for, forall, call};
        callset   := {simd};
        global    := yes;
        single    := no;
        comm      := {parameter, global};
        sequence  := {SIMD in forall, sequential in others}
    end

A molecule of SIMD type resembles a C function, with the only exception being the forall statement. At run time, forall statements will be executed by the data processors in parallel, while other statements are executed sequentially by the control processor. The semantics of a forall statement

    S0: forall i := 1 to n do
        begin
            S1(i); S2(i); ... Sm(i);
        end

can be simply explained by examining how it is executed on an SIMD machine. First, data arrays are distributed so that an array element with index i is allocated to data processor i. The loop head (S0) is executed by the control processor to identify all possible values of the loop index. The statements S1, ..., Sm in the loop body are executed one by one according to the lexical order. However, the execution of each statement is equivalent to the simultaneous execution of all its instances by multiple data processors, and the i-th instance of a statement is executed on data processor i.

Interprocessor communication is expressed with assignment statements. Different communication patterns may be accommodated through different index expressions.
3.3. Shared V ariable M u ltip rocessing A multiprocessor system consists of a shared memory, m ultiple processors and a memory-processor interconnection network. A com putation on such a com puter is organized as a num ber of tasks executing in parallel. Inter-task communication and synchronization are realized through the use of shared vari ables. A task molecule type is defined to characterize this com putation mode. 30 ty p ro task begin class : = {in}; statem ent : = {assignment, goto, if, while, for, call, fork}; callset : = {pipe}; global : = yes; single : = no; comm : = {parameter,global}; sequence : = {parallel in fork, sequential in others}; end A molecule of task type is different from a C function in two aspects. First, an additional fork statem ent is introduced to invoke several tasks for sim ultane ous execution. Second, inter-task communication is realized mainly by global variables. E xam p le 3.1. The following is a program consisting of molecules of task type. More examples of task molecule can be found in Appendix A. a,b,x:array[l..2] of integer; z: real; sem ap h ore a [l]= a [2 ]= l; b [l]= b [2 ]= 0 ; main() o fty p e task begin 51 input x,z 52 w hile z > 0.005 do begin 53 fork T sk(l), Tsk(2); 54 P(b[l]); P(b[2]); 55 z : = x[l] / x[2]; 56 jo in T sk(l), Tsk(2); end output X en d main 31 Tsk(i: in) o fty p e task i: integer; begin 57 P(a[i]); 58 x[i] : = fctn(x[i]); 59 V(b[i]); V(a[i]); end Tsk The only param eter class admissible in a molecule of task type is in. Such a param eter is usually used to denote a task name. Tasks w ith the same compu tation structure can be encapsulated by a molecule of task type. A task is created by calling a molecule of task type. A parameter-free molecule (like "main" in Example 3.1) can produce at most one task, which shares the same name of the molecule. A molecule w ith formal param eters may produce m ultiple tasks, as the formals are substituted by different constants, when the molecule is called. In the above program, we have two task molecules (main and Tsk) and three tasks (main, T sk(l), Tsk(2)). The identifier "main" serves as both a molecule name and a task name, "Tsk" is a molecule name, and "Tsk(l)" is a task name. The execution of a task molecule is similar to th at of a C function, except the fork statem ent. In the main molecule of Example 3.1, SI and S2 are exe cuted sequentially. The fork statem ent S3 invokes two tasks T sk(l) and Tsk(2), which then proceed in parallel w ith the execution of "main". W hen the join statem ent S6 is encountered, the execution of "main" is suspended until both T sk(l) and Tsk(2) term inate. Inter-task communication and synchronization is realized through the use of shared variables and semaphores, respectively. In Example 3.1, the global variable x[l] is assigned in T sk(l) (S8) and used in "main" (S5). The semaphore 32 variable b[l] is used ensure the data integrity. W hen the P(b[l]) operation of S4 is executed, "main" is blocked until the corresponding V(b[l]) operation of S9 in T sk(l) is executed. Thus when S5 is executed, x[l] is guaranteed to contain the new value computed in S8, instead of some incorrect old value. . 3.4. M essage P assin g M u lticom p u tin g A m ulticom puter is an interconnect of m ultiple processors each w ith its own CPU and memory. There is no shared memory. Interprocessor communica tion is accomplished via message passing. The interconnection topology could be ring, mesh, tree, hypercube, etc. 
3.4. Message Passing Multicomputing

A multicomputer is an interconnect of multiple processors, each with its own CPU and memory. There is no shared memory; interprocessor communication is accomplished via message passing. The interconnection topology could be a ring, mesh, tree, hypercube, etc. Each processor can be viewed as a sequential von Neumann machine. However, the instruction set of the processors is augmented with at least two message passing primitives: send and receive.

The computation mode common to all multicomputer systems is message passing multicomputing. A computation of this mode consists of a number of concurrent processes that communicate and synchronize with each other by sending and receiving messages. This mode can be characterized by a process molecule type as defined below:

    typro process
    begin
        class     := {in};
        statement := {assignment, goto, if, while, for, call, fork, send, receive};
        callset   := {pipe};
        global    := no;
        single    := no;
        comm      := {parameter, message};
        sequence  := {parallel in fork, sequential in others};
    end

A molecule of process type is quite similar to a molecule of task type.
The in ou t class parame ters are disallowed because they may result in multiple assignments to a single variable. In fact, any dataflow molecule can be viewed as a function which maps operand values denoted by in class param eters to result values denoted by o u t class param eters. (b)Any statem ent in a dataflow molecule body should be viewed as a pure function mapping some operand values into some result values. T hat is why the admissible statem ent set only includes assignment, if, for, and call 35 statem ents. By the same token, the type of molecule a dataflow molecule can call m ust be dataflow or restricted. (c)Global variables are disallowed. If a global variable were used in a dataflow molecule body as an operand, it would become very difficult to check the d ata availability. W hen a global variable is assigned in the molecule body, it would lead to side effects. (d)The single-assignment rule: no variable can be assigned more than once in the molecule body. Clearly, if we allow a variable to appear on the left sides of m ultiple assignment statem ents, the concurrent execution of these statem ents may lead to race conditions, thus result in a nondeterministic behavior. (e)Because global variables are not allowed, communication is realized by param eters only. The above restrictions can be found in the dataflow languages ID [8] and VAL[3]. However, the dataflow molecule construct being presented has two features different from existing dataflow languages. F irst, we have replaced the "strong" single-assignment rule w ith a "weak" rule. Single-assignment is now applied at the array element level. This makes programming more flexible in the dataflow mode, and it facilitates structure handling, a difficult issue of dataflow com putation. The programmer should follow the (weak) single assignment rule in writing a dataflow molecule. However, it is the compiler’s responsibility to check whether this rule is satisfied. If the compiler can not decide th a t the rule is obeyed, a syntactical error will be reported. For example, the program m er should avoid indirect references such as x[i] : = x[y[j]], as the single-assignment rule can not be ensured at compile time. The for statem ent has been chosen as the only iteration construct in a dataflow molecule. Semantically, for 1 to n do S is considered as an abbre viation of n instances of the loop body S, w ith each instance being properly param eterized w ith the corresponding value of the loop index i. Thus its seman tics is simply the same as the expanded program segment. For example, the for loop for i : = n to 1 step -1 do begin y[i-l] : = y[i] * i; end is equivalent to the following program segment: y n-l : = y n] * n; y n-2 : = y n-l] * (n-l); y[oj : = y[i] * i; In a molecule of restricted type, the admissible param eter classes are in and ou t. Excluding in ou t class eliminates side-effects when a restricted molecule is called. Also, no global variable is allowed to appear in the molecule body. This rule helps run tim e data availability checking. Finally, since no global variable is allowed, the communication rule is modified to use the param eter scheme only. In sum m ary, a restricted molecule has an internal control-flow behavior, but from the caller’s viewpoint, it is just a function mapping operand values to result values. 
A definition of the restricted molecule type follows: 37 typ ro restricted begin class : = {in, out); statem ent : = {assignment, goto, if, case, while, for, call); callset : = {pipe}; global : = no; single : = no; comm : = {parameter}; sequence : = sequential; end 3.6. A L ayered P rogram m in g M eth od ology The molecule concept inspires a layered methodology for parallel program development. W ith this methodology, three layers of software (machine layer, mode layer, and application layer) are construct for each parallel computer. The purpose is to create a software environment th a t enables parallel programming in a flexible, friendly, and efficient manner[47, 81]. This methodology is illus trated in Fig. 3.1 for the iPSC multicom puter. A t the machine layer is the extended programming language C provided by Intel. This language is specially designed for iPSC. It has m any machine- dependent restrictions. A program in iPSC C can not be executed on an FPS T-Series computer w ithout modification, although both machines have the same hypercube architecture. A t the mode layer, a molecule type is defined for each parallel architecture class, which characterizes the com putation mode common to all parallel comput ers in th a t class. For message passing m ulticom puters, this is the process type as defined in Section 3.4. Because these types are architecture oriented, programs w ritten in these molecules can be efficiently implemented. Application Layer: Mode Layer: PAL programs using dataflow and restricted molecule types 38 Moleculizer for partitioning, sy nchronizat ion, and communication I C PAL programs using parallel and process molecule types Precompiler for communication and allocation Machine Layer: Figure 3.1. Three layers of parallel program development on an iPSC hypercube multicomputer 39 A t the application layer, molecule types are defined to facilitate the efficient coding of application algorithms. Many numerical parallel algorithms can be coded using molecules of dataflow and restricted types. This macro dataflow style is user-friendly in th a t it is independent of the target machine architecture. Program m ers are freed from the tedious tasks of explicit specification of parti tioning, synchronization, communication, and allocation. The end user develops a parallel program m ostly at the application layer. An algorithm is coded as a PAL program th at consists of only dataflow and res tricted molecules. Because no architecture detail or machine idiosynchracy is exposed, program development is much easier. A system software, called molecu lizer, transform s PAL-dataflow program into another PAL program, which comprises only molecules of process types. During this transform ation, the moleculizer takes care of most routine work in partitioning, synchronization, and communication. If a user is willing to exploit particular architecture features for better per formance at the expense of extra software development cost, he can go to the mode layer and fine tune the program generated by a moleculizer. The final PAL program is fed into a source-to-source precompiler to generate an iPSC C program . The precompiler is responsible for process allocation and inter-process communication. E xam p le 3.2. Suppose we use iPSC to compute the values of x and y in the following equations: x=q(p(u),v) y=r(v) 40 where p , q, and r are functions and u, v are operands. The three layers of pro gram development process are illustrated in Fig. 3.2. 
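Before turning to the PAL program, the same dependence structure can be mocked up in a few lines of Python, with one thread per molecule call and a queue in place of each message channel a moleculizer would generate. The bodies of p, q, and r below are placeholders and are not taken from the example; only the dependence structure (Q waits for P's result, R depends only on v) is.

# Example 3.2 mocked up as communicating processes: threads stand in for
# the processes P, Q and R, and Queue.put / Queue.get stand in for send
# and receive.  The bodies of p, q, r are arbitrary placeholders.
import threading
from queue import Queue

def p(u):    return u + 1
def q(w, v): return w * v
def r(v):    return v - 1

w_chan, x_chan, y_chan = Queue(), Queue(), Queue()

def proc_P(u):                 # computes w = p(u) and sends w to Q
    w_chan.put(p(u))

def proc_Q(v):                 # receives w from P, then computes x = q(w, v)
    w = w_chan.get()           # blocks until P's message arrives
    x_chan.put(q(w, v))

def proc_R(v):                 # computes y = r(v); independent of P and Q
    y_chan.put(r(v))

u, v = 3, 4
for t in (threading.Thread(target=proc_P, args=(u,)),
          threading.Thread(target=proc_Q, args=(v,)),
          threading.Thread(target=proc_R, args=(v,))):
    t.start()
print("x =", x_chan.get(), "y =", y_chan.get())   # x = q(p(u), v),  y = r(v)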
Using dataflow and res tricted molecule types, one can very easily write a PAL-dataflow program as fol lows: main() o fty p e dataflow u,v,w,x,y: real; begin P(w,u); Q(x,w,v); R(V>v); end P(w: ou t, u: in) o fty p e restricted detail of P Q(x: o u t, w,v: in) o fty p e restricted v,w,x: real; begin a: real; a : = sin(v)*cos(v); x : = w + a; end R(y: ou t, v: in) o ftyp e restricted detail of R A fter a transform ation by a moleculizer, this macro dataflow program is decomposed into four processes, one for each molecule call, and a "host" to take care of I/O activities. The implicit communication by param eter passing is replaced by explicit message passing through send/receive statem ents. The dataflow synchronization is enforced by inserting a receive (send) statem ent for each operand (result) param eter at the beginning (end) of a process. For instance, w is a result of P and an operand of Q, so send w to Q is inserted at the end of process P , and receive w from P is attached to the beginning of pro cess Q. begin P(w,u); Q(x,w,v); R(y,v); end Process P Process Q receive v from H receive w from P a: =sin(v) *cos (v) x:= a + w send x to H Process R Process H send w to Q Process P: Process Q Process H Process R ci=copen(P) size=sizeof(w) node=mynode() send(ci.msgtp.w.size.node.Q) Figure 3.2. Three layers of program development for Example 3.1 42 The program m er m ay w ant to tune up the new PAL program for better performance. Consider the process Q generated by a moleculizer: begin SI: receive v from host; S2: receive w from P; S3: a : = sin(v)*cos(v); S4: x : = w + a; S5: sen d x to host; end S3 can not be executed until process P finishes, because S2 blocks the execution of S3, waiting for w to arrive from P . Since S3 does not depend on w, it can change places w ith S2. Then S3 and process P can proceed in parallel. A precompiler transform s the final PAL program into iPSC C code. A main part of this transform ation is to derive a good process allocation and to supply details of inter-process communication. For instance, the statem ent send w to Q in process P of the PAL program will be converted into the following: size = sizeof(w); ci = copen(P); node = mynode(); sendw(ci, msgtype, x, size, node, Q); Details such as communication channel, message type, message size, node proces sor num ber are derived autom atically by the precompiler. This methodology can be applied to develop program s for m any parallel computers, as shown in Fig. 3.3. Dataflow molecules are used at the application layer for all computers. A t the mode layer, I define a process type for message passing m ulticom puters (Section 3.4), a task type for shared memory m ultipro cessors (Section 3.3), and an SIMD type for array processors (Section 3.1). 43 Dataflow Molecule Type Moleculizers M2 M3 Ml Process Type Compilers C3 C2 C5 C6 Cl C4 iPSC T-Serieq (Cray X- Occ Cyberpl .Fortran, MPP Pascal CM Fortrai Multicomputers Multiprocessors Array Processors Figure 3.3 Parallel program development on major classes of parallel computers 44 A moleculizer needs to be developed for each architecture class, and a precom piler is needed for each parallel computer. A m ain advantage of the layered methodology is the simplification of paral lel program development. In [81], we compared program developments at the three layers w ith a concrete example. The problem is to invert a large triangu lar m atrix by the block-partitioning algorithm proposed by Hwang and Cheng[39]. The target computer is the iPSC hypercube m ulticom puter. 
A t the machine layer, a program is developed in the iPSC extended C language. A t the mode layer, a PAL program is constructed using process molecules. Finally, at the application layer, a PAL program is developed in term s of dataflow and res tricted molecules. These three programs are compared in Table 3.1. Given a parallel algorithm, it is often necessary to identify a num ber of useful subprogram s. This is done in all three cases by the programmer. In the extended C and the PAL-process approaches, the program m er has to decompose the com putation into a process graph and explicitly specify interprocess com m unication through message passing routines or statem ents. In a PAL-dataflow program , however, all these are implicit. In the extended C, the programmer m ust m anually allocate the process graph to the m ulticom puter. In PAL, this is taken care of by an autom atic allocation algorithm at the precompile time. Because of the above reasons, the PAL-dataflow approach is the most user friendly. In implementing the inversion algorithm for a 4X4 m atrix, it takes us about 3 hours to develop a program in the extended C language, of which about one hour is spent on deriving the process graph. The graph has 21 processes and 45 49 comm unication links. About one hour is needed in developing a PAL-process program and only 10 minutes for a PAL-dataflow program. The PAL-dataflow program is also the shortest. Most im portantly, it is very easy to modify the dataflow program . W hen the m atrix size becomes 8, we only need to change two lines in the dataflow program. For the PAL-process approach, we need to recode two molecules. In the iPSC C, however, we have to repeat the whole program development procedure, even w ith a small change. 46 T ab le 3.1. Comparisons of parallel programming in iPSC Extended C and PAL on an iPSC hypercube m ulticom puter Attribute Extended C (iPSC) PAL (Process) PAL (Dataflow) Identifying Subprograms Programmer Programmer Programmer Decomposition into Processes Programmer Programmer Software Allocation Programmer Software Software Communication Message Passing (Explicit) Message Passing (Explicit) Parameter Passing (Implicit) Development Time > 4 0 mins. (TV— 2) 3 hrs. (TV= 4) 20 mins. (TV = 2 ) 1 hr. (TV = 4) 10 mins. (T V =2) 10 mins. (T V = 4) Number of Lines > 2 5 0 (TV =2) >1000 (TV=4) 44 (T V = 2 ) 80 (T V = 4 ) 26 (T V = 2) 26 (T V = 4) Changing Problem Size Recode the Entire Programs Modify Two Subprograms Change Two Lines 47 C h a p te r 4 C o m p o u n d F u n c tio n s a n d P ip e lin e N e ts In the remaining chapters, I will concentrate on an im portant class of parallel com putation: evaluations of compound function (CF) on pipeline nets [42, 43, 44,45, 80,46]. This class is large enough to cover com putations on m any im portant computing systems, such as pipelined com puters[52], systolic arrays[54], and uniform Boolean circuits [12]. It is expected th at the obtained results have a wide range of application. A CF is a collection of linked scalar operations to be executed repeatedly m any times in a looping structure. Various compound functions for scientific arithm etic com putations have been discussed in an earlier paper[43]. Table 4.1 summarizes some of the compound functions frequently encountered in scientific and engineering computations. A CF can be precisely specified in three ways. In this chapter, I study the representation of a CF using a high-level language construct, called forpipe loop. 
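As a concrete instance, one entry of Table 4.1 is written out below as the kind of loop a CF denotes: a fixed pattern of linked scalar operations applied once per element of the operand vectors. The sketch is in Python, used only as executable pseudo-code, and the operand values are arbitrary.

# Complex multiplication as a compound function: four linked scalar
# operations repeated for every element of the operand vectors.
def complex_multiply(ar, ai, br, bi):
    cr, ci = [], []
    for k in range(len(ar)):
        cr.append(ar[k] * br[k] - ai[k] * bi[k])
        ci.append(ar[k] * bi[k] + ai[k] * br[k])
    return cr, ci

print(complex_multiply([1, 2], [1, 0], [3, 4], [0, 5]))  # ([3, 8], [3, 10])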
For architecture level representation, I will propose a pipeline net processor. A CF is represented by a num ber of SET and START instructions on such a pro cessor. In C hapter 5, a m athem atical model, called computation system, will be defined. A CF is modeled by a com putation on a system. Analysis and synthesis problems are form ulated and solved in the framework provided by this model. 48 T ab le 4 .1. Examples of compound functions Complex Arithmetic Complex Addition/Subtraction Complex Multiplication Complex Division Complex Inner Product Interval Arithmetic Interval Addition/Subtraction Interval Multiplication Interval Inversion Interval Division Vector Reduction Operations Vector Summation Search for Max/Min Chain Vector Product Mean and Standard Deviation Matrix Algebra Matrix Addition/Subtraction Matrix-Vector Multiplication Matrix Multiplication Matrix Inversion Matrix Triangularization Polynomials and Elementary Functions Polynomial Addition/Subtraction Polynomial Multiplication Chebyshev Polynomial Approximation Power Series Evaluation Signal Processing FIR, HR Filtering Convolution Correlation Fast Fourier Transform 49 4.1. T h e forp ipe C on stru ct Syntactically, a CF is specified as follows: tem p < id en tifier> { ,< id en tifier> } ; < initial assignments > forp ipe i : = 1 to n do (I) begin S ,(.•); S 2(0; ■ ■ • sm(i); end Here i is the loop index, and 5 1 (t); S 2(i); • ' • are assignment state m ents of the form a [i]:= < expression > . D ata items appearing in the right-hand side expression can be a constant, a scalar variable, or an array element with index expression i-c, where c is either an positive integer or zero. D ata items appear in the left-hand side are either temporary arrays (if their names appear in the tem p construct) or result arrays. Other d ata items are operands. Given a forpipe loop in the form of (I), its dependence graph is obtained as follows: (1)Expand program (I) into a number of assignments: < initial assignm ents> •S ’i(l); • •• Sm(l);S,(2); Sm(2); • • • Sm(n); (2)For each assignment statem ent w ith r d ata items in the right-hand side, draw a node w ith one outgoing arc and r incoming arcs. Label nodes and arcs w ith corresponding expressions and d ata items. (3)Connect arcs w ith the same label. E xam p le 4.1. The following program specifies a CF: 50 tem p w ; x[0] : = 0; x[-l] w[0] : = 2; forp ip e i 1 to n do begin w[i] : = r - x[i-2] ; x[i] : = if x [i-l]= 0 th en w[i-l] else w[i] ; end In this program , w is a tem porary array, x is the result array, and r is an operand. In fact, this program defines a recurrence equation w ith the initial con ditions given by the two initial assignments. Suppose n = 2 , this program can be expanded into the following code: x[0] : = 0; x[-l] : = w [0 ] : = 2; w[l] : = r - x[-l] ; x[l] : = if x[0]=0 th en w[0] else w[l] ; w[2] : = r - x[0] ; x[2] : = if x [l]= 0 th en w[l] else w[2] ; Its dependence graph is shown in Fig. 4.1. The semantics of a CF can be defined as a function 0 —L ( I ) mapping operand / into result O, where I, O are ordered sets of values. If we consider each node as a combinational circuit realizing the indicated function (expres sion), a dependence graph can be viewed as a logic circuit realizing the function defined by the forpipe loop. (In Section 6.2, an operational semantics of forpipe loop will be given in term s of pipeline net.) It is interesting to know when a forpipe loop defines a pure function, th a t always maps an operand into a unique result. 
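The function defined by Example 4.1 can be evaluated directly by carrying out the expansion, as in the short Python sketch below. The chained initial assignment is read here as x[-1] := w[0] := 2, the operand r and the vector length n are sample values, and arrays are modelled as dictionaries so that negative indices keep their meaning.

# Expanding and evaluating the forpipe loop of Example 4.1.
def example_4_1(r, n):
    x = {0: 0, -1: 2}                     # initial assignments (x[-1] read as 2)
    w = {0: 2}
    for i in range(1, n + 1):             # the expanded loop body
        w[i] = r - x[i - 2]
        x[i] = w[i - 1] if x[i - 1] == 0 else w[i]
    return [x[i] for i in range(1, n + 1)]

print(example_4_1(r=5, n=3))              # [2, 5, 3]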
If a forpipe loop does not define a pure function, we consider it in error, and expect the compiler to detect it. 51 x[0] x[l] x[2] Figure 4.1. The dependence graph of Example 4.1 52 D efin ition 4.1. A dependence graph is well formed if it is acyclic and all node outputs are distinctly labeled. W ith this definition, the dependence graph in Fig. 4.1 is well formed. Two ill formed dependence graphs are shown in Fig. 4.2. The graph in Fig. 4.2a is ill formed because it is cyclic. It corresponds to an assignment statem ent x[l]:= x[0]+ x[l]*r, and the com putation of x[l] is dependent on itself. The graph in Fig. 4.2b is ill formed, since both outputs of the addition node and the sub traction node have the same label w[l]. In fact, this graph corresponds to the following program: 51 w 52 w z[l] + r ; y[l] - r ; S3 x[l] : = y [0] * w[l] ; Because there are two assignments to w[l] in S i and S2, it is not clear which will be chosen as an operand in S3. L em m a 4.1. A forpipe loop defines a pure function if and only if its depen dence graph is well formed. P roof. If a dependence graph is well formed, it can be viewed as a pure, race-free, combinational circuit. Thus when any operand is applied to its inputs, a unique result will be generated at the output ports. In a cyclic dependence graph, the com putation of some value is dependent on itself. If a graph contains two nodes w ith the same output label, the two outputs will merge in an arbi trary way. In both cases, the result is not unique. Q.E.D. 53 r x MPY ADD x[l] (a) An ill formed graph due to cyclic dependence ADD y[o] SU w[l] MPY x[l] (b) An ill formed graph due to conflict dependence Figure 4.2, Examples of ill formed dependence graphs Figure 4.3. The assignment graph in Example 4.1 54 T h eorem 4.1. A forpipe loop defines a pure function if and only if it satisfies the weak single-assignment rule. P roof. Suppose a forpipe loop satisfies the weak single-assignment rule. A fter expanding, we have (a) the left-hand sides of all assignment statem ent are different, and (b) the data item appearing in the left-hand side of any assign ment statem ent is different from those in the right hand side of the same or pre vious statem ents. Thus its dependence graph m ust be well formed. Now sup pose a forpipe loop does not satisfy the weak single-assignment rule. Then after expanding, (a) or (b) m ust be false. The failure of (a) implies th a t multiple node outputs of the dependence graph have the same label. The failure of (b) leads to a cyclic dependence graph. Q.E.D. Since weak single-assignment is such an im portant property, we need an algorithm to detect it for any forpipe loop. We can always detect weak single assignment by checking Definition 3.3. T hat, however, requires expanding the forpipe loop and would lead to a com putational complexity proportional to the vector length n . An algorithm is given below th a t is independent of the vector length. First, we need some definitions. Suppose a forpipe loop is given in the following form (II): forp ipe k : = 1 to n do begin 51 a i[k] := e 1 ; 52 a 2[k] := e2 ; (II) S m um [k] . em , end 55 D efin ition 4.2. The assignment matrix of a forpipe loop is defined as an m Xm m atrix M = (Mg y), where where a l5 a2, ■ ■ ■ , am are array names, and elf e2, ..., em are expressions. D efin ition 4.3. The assignment graph of a forpipe loop is an edge-weighted directed graph Ga = ( A ,E , W). The set of nodes contains all array names in the forpipe loop, i.e., A —{av a2,...,am}. 
There is an edge eg ;- from ai to ay, if the corresponding element in the assignment m atrix is not infinity. T hat is E = { e s y | M i}- =^oo}. The weight function is defined as W(eiy) = M iy. E xam p le 4.2. To clarify the above concepts, let us consider the forpipe loop specified in Example 4.1. This loop has only two statem ents in its body. The array names are a j = w and a2=x. Its assignment graph has two nodes as drawn in Fig. 4.3. Its 2X2 assignment m atrix is shown below: D efin ition 4.4. Given an m X p m atrix R = (rty) and a p X n m atrix S = ( s 4 y), their cross product is an m X n m atrix T = R X S = (tij-), where The power m atrix is defined recursively as T q = T q l XT. The minimum m atrix U = Min (R , S ) is defined by = Min ( r ^ , s{j- ). if ag € ej if otherwise oo 0 tii = Min { rik + ski \k=l,...,p} 56 A lgorith m 4.1. Detecting weak single-assignment in a forpipe loop. Input: A forpipe loop in the form of (II). Output: Yes or no, depending on whether the forpipe loop satisfies the weak single-assignment rule. Steps: (1)Compare the left-hand side array names. If « ,■ = aj for some i j , then return "no" and stop. (2)Construct the assignment m atrix M and compute the m atrix C —M in (M ,M 2, . . . , M m ). (3)If cj 8 - > 0 for i — then return "yes", otherwise return "no". Stop. L em m a 4 .2 . Suppose a forpipe loop is given in the form of (II) and its C m atrix is computed in Algorithm 4.1. Then for any i = 1 ,2 ...,m , the computa tion of ag -[& ] does not depend on itself, for k = 1,2,...,n, if and only if c! g > 0 . P roof. It can be proven by induction on p, th a t the if element of M p is the minimum weight of any path from a ,- to O y th a t travels p edges. Thus ci}- is the minimum weight of any path from a,- to af. Supposing c,y = 5 ^ 00, the compu tation of aj[k\ is dependent on a,-[A r— c], where c > c g y. Thus c8 l - = 0 if and only if at- depends on itself. T h eorem 4.2. Algorithm 4.1 correctly detects weak single-assignment for any forpipe loop with time complexity 0 (m 4), where m is the num ber of state m ents in the forpipe loop body. P roof. From Theorem 4.1, a forpipe loop satisfies the weak single assignment rule if and only if its dependence graph is well formed. By Definition 4.2, a dependence graph is well formed if it is (l) acyclic and (2) all node out puts are distinctly labeled. The correctness of Algorithm 4.1 is proven from the facts th a t Step 1 returns a "yes" if and only if condition (2) is true in the corresponding dependence graph, and th a t Step 3 returns a "yes" if and only if the dependence graph is acyclic, as implied by Lemma 4.1. AS for tim e complexity, the dominate com putation is the calculation of to— 1 cross products of toX t o matrices, which has a complexity of 0 (to4). Q.E.D. E xam p le 4.3. Now let us apply Algorithm 4.1 to the forpipe loop in Example 4.1. Step 1 passes successfully since w and x are different. The relevant matrices are computed below: oo 0 2 1 2 0 M = 2 1 , m 2 = 3 2 , and C = 2 1 Because both diagonal elements are greater than zero, this forpipe loop satisfies the weak single-assignment rule. And by Theorem 4.1, it defines a pure function x ~ f (r fac^ > f°r n = 2, this loop is another way of writing the follow ing function: x, = w 0 r —x. if x 0=O if a:0#Q Xc r —x_x if XjMD r — x0 if x 1= fO 58 4.2. S om e E xten sion s The forpipe loop presented in the last section is not very expressive. 
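Before turning to these extensions, the detection machinery of Algorithm 4.1 can be sketched compactly. The cross product and minimum matrix are those of Definition 4.4; the assignment matrix below is written out by hand for the loop of Example 4.1, following the reading of Definition 4.2 suggested by Example 4.3 (M_ij is the smallest c such that a_i[k-c] occurs in the right-hand side of the statement assigning a_j, and infinity when a_i does not occur there). Python is again used as executable pseudo-code.

INF = float("inf")

def cross(R, S):
    # Min-plus cross product of Definition 4.4: t_ij = min over k of (r_ik + s_kj).
    m, p, n = len(R), len(S), len(S[0])
    return [[min(R[i][k] + S[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]

def minimum(R, S):
    # Element-wise minimum matrix of Definition 4.4.
    return [[min(x, y) for x, y in zip(r, s)] for r, s in zip(R, S)]

def weak_single_assignment(M):
    # Steps 2 and 3 of Algorithm 4.1: C = Min(M, M^2, ..., M^m), then check the diagonal.
    C, P = M, M
    for _ in range(len(M) - 1):
        P = cross(P, M)
        C = minimum(C, P)
    return all(C[i][i] > 0 for i in range(len(M)))

# The loop of Example 4.1 (a_1 = w, a_2 = x):
M = [[INF, 0],    # w is absent from e_1; w[i] and w[i-1] occur in e_2
     [2,   1]]    # x[i-2] occurs in e_1; x[i-1] occurs in e_2
print(weak_single_assignment(M))    # True: the weak rule is satisfied

The diagonal test reports that the example loop satisfies the weak single-assignment rule, in agreement with the conclusion reached above.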
M any restrictions are made to simplify the definition of its semantics and, as shown later, its transform ation to pipeline net. In this section, I present some exten sions th a t augm ent the expressive power of forpipe loop. These extensions are purely syntactical sugar, any extended forpipe loop can be rewrite as a sem anti cally equivalent basic forpipe loop. First, I now require th a t the index of an array variable in an assignment statem ent be i-c, where c can be either 0, positive or negative. Thus the follow ing program is legal: tem p w ; forp ip e i : = 1 to n do begin w[i] a[i+l] * x[i-3] ; x[i+l] := ' x[i] + w[i+2] ; end By introducing two new array variables y[i]= x[i+ l] and b[i]=a[i+3], this loop can be rew ritten as follows: forp ipe i : = 1 to n do begin y[i] :== y[i-l] + b[i] * y[i-2] ; end Second, I allow the for statem ent to appear in a forpipe loop. For example, a m atrix addition can be specified as follows: 59 forp ip e i : = 1 to n do begin for j : = 1 to n do begin e[i,j] : = a[i,j] + b[i,j] ; end end where the for loop is viewed as an abbreviation of n assignment statem ents Thus the above program can be rew ritten as forp ipe i : = 1 to n do begin c[i,j] : = a[i,j] + b[i,j] ; c[i,n] : = a[i,n] + b[i,n] ; end Third, we can declare an array as a m erge variable to deal w ith vector reduction operations[64]. In a forpipe loop, if a result array is declared as a m erge variable, only the last array element need be stored, all other elements will be discarded. Thus we only need to allocate memory space for a scalar, instead of an array. For instance, an inner product com putation can be specified as follows: m erge c ; c[0] : = 0 ; forp ipe i : = 1 to n do begin e[i] : = c[i-l] + a[i] * b[i] ; end W e are only interested in c[n], all other elements only serve as tem porary values. Finally, the loop head can be extended from 60 fo rp ip e i : = 1 t o n d o to forp ipe 1 : = ej to e 2 ste p e3 do where the values of the three expressions have the following relations: e1< e 2 and e3 > 0 . If these relations do not hold, no com putation is performed. For instance, the following loop forp ipe i : = 2 to m step 2 do begin c[i] : = c[i-2] + a[i-l] * b[i] ; end can be rew ritten as forp ipe j : = 1 to n do begin c[j] : = c p -l] + A[j] * B[j] ; end where C[j]=c[2j], A [j]=a[2j-l], B[j]==b[2j], and n = (m -2 )/2 + 1. 4 .3. T h e P ip elin e N et A pipeline net is constructed from interconnecting m ultiple functional pipe lines (FPs) through a buffered crossbar network, which is itself pipe l in e d ^ , 44,45,46,80]. One can view a pipeline net as a two-level pipelined and dynamically reconfigurable systolic array. By reconfiguring a single physical pipeline net, we can generate m any virtual systolic arrays to execute different CFs at different times. The pipeline net evaluates CFs based on the pipeline networking concept, 61 which is a natural generalization of pipeline chaining in Cray-1 and Cray X- MP[68]. The idea is th a t after the operand data are loaded into the register file, a pipeline net consisting of vector registers, functional pipelines, and crossbar networks is dynamically set up. Operand blocks pass m ultiple pipelines via the networks, before it finally returns to registers. Interm ediate results flow directly from a pipeline to another w ithout passing through registers. 
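The networking idea can be made concrete with a toy simulation: each functional pipeline is modelled as a shift register whose depth equals its stage count, and the product stream flows straight from the multiplier into the adder without returning to a register. The stage counts, the operand values, and the computed expression (a[i]*b[i] + c[i]) are all illustrative; the point is only the overlap.

# A toy model of multipipeline chaining.
from collections import deque

class FP:
    """A functional pipeline: a shift register as deep as its stage count."""
    def __init__(self, stages, op):
        self.op = op
        self.pipe = deque([None] * stages, maxlen=stages)
    def clock(self, operands):
        done = self.pipe[0]                  # the result leaving the last stage
        self.pipe.append(self.op(*operands) if operands is not None else None)
        return done

mpy, add = FP(4, lambda x, y: x * y), FP(2, lambda x, y: x + y)
a, b, c = [1, 2, 3, 4], [5, 6, 7, 8], [9, 9, 9, 9]
results, n = [], len(a)
for t in range(n + 4 + 2):                   # n wavefronts plus pipeline drain
    prod = mpy.clock((a[t], b[t]) if t < n else None)
    # c[i] is taken four cycles late so that it meets its product: this is
    # the delay matching a real net programs into the crossbar buffers.
    s = add.clock((prod, c[t - 4]) if prod is not None else None)
    if s is not None:
        results.append(s)
print(results)                               # [14, 21, 30, 41]

The first sum emerges six cycles (four multiplier stages plus two adder stages) after the first wavefront enters, and one new result is produced every cycle thereafter.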
The pipeline net best m atches the dataflow pattern of the CF to be evaluated, thus operand fetchings, arithm etic/logic operations, interm ediate result routing, and final result storing are executed concurrently in an overlapped fashion. As illustrated in Fig. 4.4, the pipeline networking concept originates from the notion of internal forwarding used in IBM 360/91 and the dynamic linking of functional units in CDC 7600[4l]. Pipeline chaining has been practiced in Cray-1 and Cray X-MP[68]. However, only a small num ber of vector operations can be linearly linked in these vector machines. In the Cyberplus[24] fifteen functional pipelines are linked by a crossbar network. In FPS-164/m ax and FPS T-Series, dot-product operations are executed by a cascade of a m ultiplier and an adder pipelines[3l]. Our pipeline networking generalizes the chaining practice to a two-level, reconfigurable systolic approach. The pipeline net is different in structure and operation from other parallel computing architectures. It uses a central controller to synchronize the entire network operations, thus it is not an asynchronous MIMD machine or a dataflow wavefront array[56]. It is not an SIMD machine either, because m ulti ple pipelines used in a net can perform different operations at the same time. IBM 360/91 (Internal Forwarding) CDC 7600 (Linking of Multiple Functional Units Cray 1 (Pipeline Chaining) Systolic Array (One-Level) Cray X-MP (Multiple Chaining) Systolic Array (Two-Level) Warp Computer (LINC Chips) Cyberplus (Linked Vector Operations) FPS 164/Max (Vector and Matrix Accelerator) Pipeline Nets (Multipipeline Networking) Figure 4.4. Historical evolution of the concept of pipeline networking 63 Pipeline nets are compared in Table 4.2 with the systolic arrays[54, 55] and pro grammable switch lattices[75]. (1)Pipeline nets can offer more flexible connectivities th an systolic arrays (which have fixed connections) or switch lattices (which have limited switch connections) in evaluating various CFs. (2)A pipeline net is not restricted by local connectivity as required in systolic arrays or in the switch lattices. (3)A pipeline net uses two-level pipelining and could be driven by a faster clock rate th an th a t used in a single-level systolic array. A pipeline net processor, as illustrated in Fig. 4.5, is m ade of a control unit, m ultiple functional pipelines (FPs), two buffered crossbar networks (BCNs), and a set of data registers (Rs). For simplicity, all FP s are assumed identical and m ultifunctional. Different operations can be performed by the same F P at different times. The registers are used to hold operands and results. The buffered crossbar networks are used to provide dynamic connecting paths among FP s and registers. A collection of fetch/store pipes will be used to transfer data between main memory and the registers, similar to those memory access pipes implemented in F ujitsu VP-200 and in Cray X-MP. The buffered crossbar networks are used to provide dynamic interconnec tion paths among the FP s and the registers. I choose the crossbar over multis tage netw orks[14], due to the demand of full connectivity in pipeline nets. The crossbar network supports arbitrary 1-to-l and 1-to-many mappings. Many-to- 1 mappings are not allowed in a pipeline net. 6 4 T ab le 4.2. Comparisons of three parallel processing architectures: systolic arrays, switch lattices, and pipeline nets. 
F eatures Systolic Arrays (Kung and Leiserson[54]) Switch Lattices (Snyder[75]) Pipeline Nets (Hwang and Xu[44]) Connectivity Local Local Local or Global Topology Fixed Reconfigurable Reconfigurable Pipelining One-Level [54] or Two-Level[55] One-Level Two-level Application Special Purpose Semi General Purpose General Purpose 65 Buffered « Crossbar ii Network with Programmable Delays (BCN1) Buffered i Crossbar Network with Programmable Delays (BCN2) B - Buffer (Programmable Delays) FP - Functional Pipeline M PX- M ultiplexer Figure 4.5. The logical architecture o f a pipeline net 66 A part from full connectivity, crossbar switching is easy to set up, which implies a small interconnection overhead. The m ain problem of crossbar network is its high hardw are complexity, 0 (m 2) for a very large network of size m. How ever, a crossbar switch of size 8X8 has been made w ith today’s VLSI/CMOS technology, such as the LINC chip[37,59,60] w ith buffers at all input ports. The progress in optical interconnection technology promises for the future the construction of larger crossbar networks[72]. The pipeline net supports arbitrary connections among the FP s. Local con nections necessary in a systolic array are no longer a structural constraint in a pipeline net. However, the systolic flow of data through the pipeline net is preserved. For example, when two operand stream s arrive at a certain pipeline unit, they m ay have traversed through different d ata paths w ith unequal delays. These p ath delays m ust be equalized in order to have the correct operand pairs arriving at the right place at the right time. In a pipeline net processor, delay m atching is handled by the program m able crossbar networks. Some program m able buffers are provided at each output port of the crossbar network. Using these buffers, noncompute delays can be inserted at any d ata path to be con nected. In the LINC ehip[37], at most 32 delays can be program m ed at each I/O buffer. A block of operands flowing through the pipeline net at the same time is called a wavef ront [56]. A pipeline net is expected to receive successive wave fronts of operands. Interm ediate results flow directly from an F P to another FP w ithout passing through the registers. The connections set up in a pipeline net 67 can best m atch w ith the dataflow pattern of a CF, thus operand fetchings, arithm etic/logic operations, interm ediate result routing, and storing final result are all executed concurrently in an overlapped fashion. 4.4. V ector In stru ction s In designing the instruction set of the pipeline net processor, I have fol lowed a simplicity principle. The instruction set of the pipeline net processor consists of conventional scalar operations plus vector instructions. Only two vector instructions are used to specify pipeline networking. The SET instruction shown below is needed tq select the function of the F P or the connection pattern (including noncompute delays) of the BCN to be used: SET unit, value The unit could be any F P or any BCN. If it is an F P , the value indicates an arithm etic/logic operation. If it is a BCN, the value indicates a connection pat tern w ith proper noncompute delays on connecting paths. 
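The noncompute delays that such a connection pattern must include can be computed mechanically once the stage counts are known: every operand entering a pipeline must arrive in the same cycle, so the earlier path is padded by the difference in arrival times. The small graph below is a hypothetical fragment with illustrative stage counts, not a net taken from the text.

# Computing the programmable delays for a small, hypothetical net.
stages = {"MPY1": 4, "MPY2": 4, "ADD1": 2, "DIV": 6}
inputs = {                        # consumer -> its producers ("reg" = a register)
    "MPY1": ["reg", "reg"],
    "MPY2": ["reg", "reg"],
    "ADD1": ["MPY1", "MPY2"],
    "DIV":  ["ADD1", "MPY2"],
}

ready, delays = {"reg": 0}, {}
for node in ["MPY1", "MPY2", "ADD1", "DIV"]:           # topological order
    arrive = [ready[src] for src in inputs[node]]
    start = max(arrive)                                # the latest operand sets the pace
    for src, when in zip(inputs[node], arrive):
        if start - when:
            delays[(src, node)] = start - when         # pad the early path
    ready[node] = start + stages[node]

print(delays)     # {('MPY2', 'DIV'): 2}: that result must idle two cycles
print(ready)      # cycle at which each pipeline's output becomes available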
For instance, the fol lowing instructions set up a pipeline net to perform a vector addition: SET FP1, + ; /* Use F P l to ADD SET B C N l, 125; /* Set B C N l to assume connection pattern 125 SET BCN2, 3; /* Set BCN2 to assume connection pattern 3 A fter a pipeline net is set up, a START instruction broadcasts a start sig nal to all relevant units to enable the pipelined operations: START n, a /* Feed n wavefronts of operand to the net, /* one wavefront every a clock periods where, n and a are called the vector length and the wavefront latency, respec 68 tively. The startu p tim e creates an overhead in the pipelined net. A residual con trol concept[5] is used to reduce this overhead. For a processor w ith m FPs, there are m + 4 residual control registers R C lr : . . , R C m+4. The first m regis ters are associated w ith the m FPs. R C m+1 and R C m+2 are associated w ith the two buffered crossbar networks B C N l and BCN2, respectively. R C m+s is the vector length counter (VLR), and R C m+4 is the latency counter (LC). A CF is evaluated in two steps: (1) the configuration of a pipeline net with a sequence of SET instructions; and (2) the actual execution w ith a START instruction enabling the operations at a particular cycle. Before the configuration step begins, these instructions are packed into a very long instruc tion, and loaded into a configuration register, as shown in Fig. 4.6. This register has m -H fields, containing control inform ation to be sent to the m + 4 residual registers. More specificly, field i, i = l,...,m , contains the operation to be per formed by FP,-. Fields m + 1 and m + 2 contain connection patterns for BCN l and BCN2. A nd fields m + 3 and m + 4 contain the vector length n and the latency a, as specified in the START instruction. During the configuration step, the control unit sends appropriate control values to relevant residual registers. During the execution step, the residual registers m aintain the control signals needed for execution, the control unit is freed to execute scalar and fetch/store operations, or to pack the long instruc tion for the next CF evaluation. This residual control scheme reduces the hardw are complexity of the control unit and the pipeline net startup time. 69 m m +1 m + 2 m + 3 m +4 F P 1 F P 2 F P m BCN1 BCN2 LC VLR Figure 4.6. The configuration register 70 E xam p le 4.4. Consider the following forpipe loop which specifies the operations to be performed in a CF. forp ipe I : = 1 to 400 do begin e[i] : = (a[i]*b[i] + b[i]*c[i]) / ((b[i]*c[i])*(e[i]+d[i])); end Suppose th a t the F P s used for Add, Multiply, and Divide have 2, 4, and 6 pipe line stages, respectively. This forpipe loop is m apped into an equivalent pipeline net shown in Fig. 4.7a. Note th at non compute delays are inserted into the data paths connecting F P 3 to FP5 and FP4 to F P 6. This com putation is compiled into the following assembly language program: SET F l, * ; SET F2, * ; SET F3, + ; SET F4, + ; SET F5, * ; SET F 6, / ; SET B C N l, 7 ; SET BCN2, (3 ; START 400, 1 ; These instructions are packed and loaded into the configuration register as shown in Fig. 4.7b. Figure 4.7c shows the im plem entation of the pipeline net obtained after the configuration step. The two num bers a and (3 specify partic ular connection patterns needed in the B C N l and BCN2 networks respectively. The connection pattern /? for BCN2 also includes two noncompute delays of 2 units each. 
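The configuration step amounts to packing this SET/START sequence into the long instruction of Fig. 4.6, one field per residual register. The sketch below only performs that bookkeeping; the symbols alpha and beta stand for the two connection-pattern numbers, which the example leaves unspecified.

# Packing the SET/START sequence of Example 4.4 into the configuration
# register: m fields for the FPs, two for the BCNs, and the VLR and LC
# counters filled from the START instruction.
SETS = [("FP1", "*"), ("FP2", "*"), ("FP3", "+"), ("FP4", "+"),
        ("FP5", "*"), ("FP6", "/"), ("BCN1", "alpha"), ("BCN2", "beta")]
START = (400, 1)                            # vector length n, wavefront latency a

def pack(sets, start, m=6):
    word = {f"FP{i}": None for i in range(1, m + 1)}
    word.update({"BCN1": None, "BCN2": None})
    for unit, value in sets:                # one SET instruction per field
        word[unit] = value
    word["VLR"], word["LC"] = start         # the START instruction fills the counters
    return word

print(pack(SETS, START))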
A fter the pipeline net is configured, the CF is evaluated by passing 400 wavefronts through the net from vector registers A, B, C, and D. The final results will be stored in a vector register E. 71 DIV Delay MPY ADD Delay MPY MPY ADD — a[i] b[i] d[i] (a)- The p ip eline net 1 2 3 4 5 6 7 8 9 10 MPYMPY ADD ADD MPY div y p i 400 i ' t \r 1 i \ i ’ i \ i ' Ff*l FP2 FP3 FP4 FP5 FP6 BC N l BCN2 LC VLR (b) The content of the configuration register Figure 4.7. An example of multipipeline networking 72 A BCN1 FP1 BCN2 a[i] MPY F P 2 b[i] MPY F P 3 ADD F P 4 d[i] ADD F P 5 e[i] MPY 7 FP6 DIV (c)- The crossbar network im plem entation Figure 4.7. (Continued). An example of m ultipipeline networking 73 C h a p te r 5 P a r a lle l P ip e lin e d C o m p u ta tio n s In the previous chapter, I defined a class of parallel pipelined computations, called CFs, and gave the language-level and the architecture-level representa tions. A m athem atical model, called computation system, is defined in this chapter to facilitate the form ulation and solution of the following im portant analysis and synthesis problems: (1)Modeling pipeline nets: A CF is represented in a pipeline net processor by a num ber of SET instructions followed by a START instruction. These SET or START instructions should be recast into the m athem atical model. (2)Modeling forpipe loop'. Give an operational semantics for any forpipe loop, th a t is consistent w ith the functional semantics given in C hapter 4. (3)Termination: Give necessary and sufficient conditions whereby a CF com putation stops in finite am ount of time. (4)Determinacy: Derive necessary and sufficient conditions whereby a CF com putation always produces a unique result. (5)Equivalence: Define the equivalence between two systems. Give a number of useful transform ations th a t preserve equivalence. (6)Performance evaluation: Derive the total tim e needed for a CF computa tion. Derive the throughput and the speedup of a pipeline net. (7)Synthesis: Give algorithm s th a t transform a forpipe loop into an equivalent, optim al pipeline net. 74 5.1. C o m p u ta tio n S ystem s Let N denote the set of all natural num bers {0,1,2,...}, Z the set of all integers 2,— 1,0,1,2,...}, Z + the set of all positive integers {l,2,...}. Let D be a set of d ata values. The exact definition of D is not im portant, it could be any set suitable for our discussion. For example D can the set of real num bers plus a special value d, where all reals are called valid values and d is called the invalid or garbage value. A path of length I in a graph G is a sequence of nodes and edges v 1e 1v 2-..vi ei vi+l ■ ■ ■ vt+l such th a t edge eg - connects w g to vg+1. A cycle is a path w ith v 1 = v[+1. A graph is called acyclic if it contains no cycles; otherwise it is called cyclic. D efin ition 5.1. A Petri net is a bipartite graph P N = (P ,V ,E ), where: (1 )P is a finite set of places; (2 )F is a finite set of operators', (3)E C ( P X F ) ^ j( F X F ) is a finite set of edges. D efin ition 5.2. For any operator v G P , its input place set and output place set, denoted by IN(v) and OUT(v), respectively, are defined by IN(v) = {p | (p,v)£E } and OUT(v) = {p | (v,p)£E}. A place is called an incoming (outgoing) place of PN, if it is not an output (input) place of any operator. Places other than incoming or outgoing ones are called intermediate places. D efin ition 5.3. 
A synchronous Petri net (SP-net for short) G is a four tuple G = { P , V ,E ,c ) , where: 75 (1)(jP ,V ,E ) is a Petri net such th a t each place is an input or an output of at most one operator. (2)The Petri net has exactly one incoming place (denoted by pin) and one out going place (denoted by pout). And every node is on a path from pin to Pout' (3)c is a function c :P — >-iV(j{oo} such th at c(pin)=oo and c(pout)=0. For any p &P, c(p) is called the capacity of place p . D e fin itio n 5.4. Let F be a set of basic functions. An interpretation a of a SP-net G is a m apping < r: V —>F th at assigns a function to each operator. E xam p le 5.1. Consider Fig. 5.1, which is an interpreted SP-net derived from the forpipe loop in Example 4.1. It has two operator nodes (uj and v 2) for subtraction and selection (if-then-else), respectively. There are four interm ediate places (p l5 p 2, p 3, and p 4), one incoming place (p,-n ), and one outgoing place (Pout)• The capacities of these places are c(pin) = oo, c (p x) = 2, c (P2) = c (P4) = an^ c {Pz) — c {Pout) = 0- The input and output place sets of the operator node v 2 are IN(v2) = {p2, p 3, p 4}, and OUT(v2) = {p1} p4, pout). D e fin itio n 5.5. The set of tokens Q = D X ( ^ +(j{ ^} ), where D is the set of all d ata values, Z + the set of positive integers used as tags, and h is a special halting tag. A token is represented as a two-tuple < value,tag> . 76 out m (IF) (SUB) Figure 5.1. An interpreted SP-net derived from Exam ple 4.1 77 D efin ition 5.6. A system S is a 5-tuple S = (G ,cr,S0,n ,a), where G is SP-net and a is an interpretation of G. S 0 is a function from P — {Pin ,pout} to the power set of Q , defining the initial states of the system. For any intermedi ate place p , S 0(p) = {(«1,l), • • • ,(«e(p),c(p))}, where and c{p) is the capacity of place p . The two positive integers, n and a indicate the vector length and the latency of the operand sequence. An system operates synchronously under the control of a central clock. A t the beginning of clock cycle t, the contents of all places define the marking M t of the system at clock cycle t . A t the end of clock cycle t , th a t is also the begin ning of the t 4-1 clock cycle, the system changes its m arking to M t+1. This change is denoted by M t —*-Mt+v Initially (i.e., in clock cycle 0), interm ediate places contain tokens as specified by S 0. The incoming place pin contains n tokens w ith tags 1, aH-1, ..., ( n — 2)a+l, and h, respectively. (When n — 1, the only token in the incoming place has a tag h .) The corresponding values form an operand sequence of length n . The outgoing place pout is empty. This defines the initial marking M 0. A system stops at the end of the t-th clock cycle, if there is a token in M t th a t contains the halting tag. In clock cycle t (£=1,2,...), every operator v fires exactly once by removing a token w ith tag t from each of its input place and adding a token to each of its output place. Suppose IN(v) = {p 1,...,pr} and OUT(v) = {ql,...,qs}. The token <Uj,lj> added to each output place qj, where Uj is a value and lj is a tag, can be computed according to the following four firing rules: 78 (1)If every input place ps - has a token < « ,-,£ > , then < u j,lj> = <a{v)(uv . . . , ur), t+c(q{)>. (2)In the case th a t an input place pi has no token w ith tag t but it has a token <U;,h > , where h is the halting tag, then < Uj ,lj > = < o{v )[uv . . . , ur ), h > . 
(3)If there is an input place p; which has no token w ith tag t or h, then < u j,lj> = < d ,t + c { q i ) > , where d is the garbage value. (4)After every operator has fired, all O-capacity interm ediate places m ust be empty. Some explanation of rule (4) is in order. Consider the system shown in Fig. 5.2 which has a O-capacity interm ediate place p x and two operator nodes for square operation. Figure 5.2a shows the initial m arking in clock cycle 0. The incoming place contains a token < 2,1 > , other places are em pty. A t the first clock cycle, operator v 1 can only fire by rule (1), but v2 m ight fire by either rule (l) or rule (3). If v2 fires before p x receives the result token from firing v v rule (3) applies and Fig. 5.2b is produced. Otherwise rule (1) applies and Fig. 5.2c results. Rule (4) is used to resolve such am biguity. In fact, only Fig. 5.2c is the legal result of a firing, since it leaves p x empty. out m (MPY) (MPY) (a) The initial marking out m (MPY) (MPY) (b) The marking from firing v l and v2 by rules (1) and (3) 16,1) out m (MPY) (MPY) (c) The marking from firing both v l and v2 by rule (1) Figure 5.2. Examples of a system firing 80 E xam p le 5.2. To clarify the above definitions, I give in Fig. 5.3 the opera tional details of a com putation system. This system S = (G ,a,S0,1,3) is derived from the forpipe loop in Exam ple 4.1, w ith the assum ption th a t the vector length n — 3, the latency a= l, and the operand r = 5. Its interpreted SP-net (G ,a) is described in Example 5.1. The initial state S 0 of the system at clock cycle 0 is shown in Fig. 5.3a. The incoming place contains three operand tokens w ith tags 1, 2, h and a constant value r= 5 . A t clock cycle t = 1 , both operators v lf v2 fire and Fig. 5.3b results. The operator v 1 fires according to rule (l) by absorbing a token < 2,1 > from place Pi and a token < 5 ,1 > from place pin. The result value is 5— 2=3. The output tokens put into p 2 and p 3 have tags £ + c (p 2) = 1 + 1 = 2 and t + c ( p 3) = 14-0 = 1, respectively. By firing rule (4), v2 can fire only after v 1 fires. The tokens < 2,1 > and < 3,1 > are absorbed from p 2 and p 3, and a result value 2 is produced. The tokens output to pi, p4, and pout have tags 3, 2, and 1, respectively, according to firing rule (1). A t clock cycle t — 2, the two operators fire according to rules (l) and (4), in a way sim ilar to clock cycle 1. The result is shown in Fig. 5.3c. A t clock cycle t — 3, however, the two operators fire by rules (2) and (3), since the operand tokens have the halting tag h. The system finishes com putation in three clock cycles. The values of the tokens in pout form a result sequence (2, 5, 3). 5,2] 5,h (2,1) out m (IF) (SUB) The initial marking 5,2) 5,h) (3,2) out m (IF) (SUB) The marking after the first cycle Figure S.3. Operations of an example system 82 (5,3) out m (IF) (SUB) (c) The marking after the second cycle (3,h) 2,1 5,2 out m (IF) (SUB) (d) The marking after the third cycle Figure 5.3 (Continued). Operations of an example system 83 In the system model defined, I assume a system has exactly one incoming place and one outgoing place. This restriction is proposed to simplify the presen tation of the main concepts and results. It does not limit the power of the model. In fact, for any m ulti-input, m ulti-output (MIMO) system, we can always create an equivalent single-input, single-output (SISO) system, by adding a decouple and a couple operators, as shown in Fig. 5.4. Figure 5.4a illustrates an MIMO system. 
Each incoming place initially con tains an operand token sequence as shown on the top. Each outgoing place will eventually contain a result token sequence as shown at the bottom . By adding two operators couple and decouple, we create an equivalent SISO system shown in Fig. 5.4b. Now we have only one operand sequence and one result sequence, but each data value is a tuple of data values. The decouple node simply decom pose a tuple sequence into m ultiple individual sequences, and the couple node performs the reverse function. 5.2. P ro p erties o f C o m p u ta tio n System s The system model defined in the previous section provides a framework whereby m any analysis problems can be precisely form ulated and solved. In this section, we study three such problems: termination, determinacy, and equivalence. D efin ition 5.7. A system S is terminating, if at each clock cycle, every operator can fire exactly once, according to either rule (1), (2), or (3), so th a t the resulting system marking satisfy rule (4). (A ln,h) (Arn,h) (A ll,l) (A^L1) (Bln,h) (Bsn,h) ( B ll,l) (B sl,l) (a) An MIMO system ((Aln,...,Arn) ,h) ((All,...,Arl),l) Decouple Node Couple Node (b) An equivalent SISO system Figure 5.4. Transforming an MIMO system into an SISO one 85 If a system is term inating, then for any finite operand sequence I of length n , the outgoing place will receive a token w ith tag h after T clock cycles. We say the system terminates at the end of the T -th clock cycle. W hen a system term inates, the outgoing place will contain a sequence of token, of which the last n valid values form a result sequence O. For instance, the system in Example 5.2 term inates in three clock cycles. The operand sequence I = (5,5,5) produces a; result sequence O = (2,5,3). D efin ition 5.8. A system S = (G ,a,S0,n ,a) is systolic, if c ( p ) > 0 for every interm ediate place p. S is semisystolic, if for any cycle C of S , the sum of capa cities of all the places on C is greater than zero. T h eorem 5.1. A system is term inating if and only if it is semisystolic. P roof. Suppose a system is not semisystolic, then it has at least one cycle on which all the places have 0 capacity, as shown in Fig. 5.5. A t each clock cycle, u{ can fire only after Uj does, for any j > i , in order to satisfy the firing rule (4). Since u 1 = ur, this leads to a contradiction: u x can fire only after u 1 does. Thus no firing of u x can satisfy rule (4), the system is not term inating. Now suppose the system is semisystolic. Then all operator nodes can fire as follows so th a t the result m arking satisfy rule (4): First, all operators th a t has no O-capacity input place fire. Second, those remaining operators fire, if all 0- capacity input places of which contain a token. The second step repeats until all operators have fired. Since the system is semisystolic, each execution of the second step will fire at least one operator. 86 r-1 Figure 5.5. A system with a O-capacity cycle 87 Because a system has only a finite num ber of operators, eventually all operator will fire. Thus the system is term inating. Q.E.D. D efin ition 5.9. A three-tuple C = {S ,1 ,0 ) is a computation on S, if O is a finite result sequence obtained by applying a finite operand sequence I to a ter m inating system S. D efin ition 5.10. A term inating system S is determinate, if for any two com putations C = ( 5 ,7 ,0 ) and C ' —(S ,1 ' ,0 ' ), I = / ’ implies 0 = 0 ' . T h eorem 5.2. Any term inating system is also determinacy. P ro o f. 
I will prove a stronger result: If a system is term inating and is fed w ith an operand sequence I of length n , then there exists a unique marking sequence —► ■ • • —*MT. First, it is obvious th a t M 0 is unique. Now I prove th a t if M t is unique, so is M t+i. Since the system is term inating, every operator fires exactly once at clock cycle t. The output tokens produced by each operator are uniquely determined by the input tokens in its input places. (The tag of an output token also depends on the capacity of an output place, which for a given system is fixed.) Note th a t the input token of a non-zero-capacity input place is part of marking M t . Suppose an operator v has a single O-capacity input place p , which is also the output place of an operator u. The node v may have other non-zero capa city input places. The output tokens of v are uniquely determined by the output of u and M t . This argum ent is recursively applied to u . Since the system is ter 88 m inating, it m ust be semisystolic (Theorem 5.1.) Thus eventually one encounters a non-zero-capacity input place. If an operator has multiple O-capacity input places, the above reasoning can be applied to each branch of input. Thus the output tokens of every operator are uniquely determined by M t . By the induc tion hypothesis, M t+1 is unique. Q.E.D. From Theorems 5.1 and 5.2, the two semantical properties, term inating and determinacy, of a system are completely decided by one of its syntactical properties: whether it is semisystolic. We can determine these two semantical properties by checking the much simpler semisystolic property. This can be done using a technique similar to the one used in C hapter 4 to check weak single assignment. D efin ition 5.11. Given a system w ith a SP-net G = ( F ,V ,E ,c), where | V | = m , its capacity matrix is a m X m m atrix A = (ag -y), where g.„- = tj Min {c (p ) | p £ OUT(v{ )\JIN{vj)} if OUT{v{ ) ( JIN (Vj) ¥= < ! > oo if otherwise T h eorem 5.3. Let C = Min ( A , A 2, . . . , A m ), where the minimum and the power operations are defined in Definition 4.4. Then a system is semisystolic if and only if every diagonal element of its C m atrix is greater th an zero. P roof. By induction on p, it can be shown th a t the ij element of A p is the m inimum capacity of any path from operator vi to operator G y th a t travels p places. Thus c{j- is the minimum capacity of any path from to Vj. Conse quently, cf ! - is the minimum capacity of any cycle containing vg -. There is a 0- 89 capacity cycle in the system if and only if cg y = 0 . Q.E.D. In scientific and engineering computing, people are most often interested in those com putations th a t always stops in finite am ount of time, and always give unique results, within reasonable numerical errors. The above investigations regarding term ination and determinaey give us a precise characterization of those systems th a t satisfy these requirements. In the rest of this section, I will study another property: equivalence. This property is im portant because if two systems are equivalent, we can replace the more expensive (slower) system with a cheaper (faster) one, to minimize cost or to maximize performance. D efin ition 5.12. Two determ inate systems S and S ' are equivalent, if for any tw o com putations C —{S , 1 ,0 ) and C ' = (S 1,1' ,0 '), I = I > implies 0 = 0 '. 
Although it is of academic interest to investigate the equivalence of any systems, I am most interested in special equivalence-preserving transform ations th a t are useful for solving synthesis problems. Given a determ inate system S = (G ,a,S0,n ,a), where the SP-net G —(P,V,E,c), I w ant to transform it into an equivalent system by modifying the initial state S 0, the latencyo', and the capacity function c only, leaving everything else intact. The following three lemmas provide such transform ations. I will discuss their applications in C hapter 6. L em m a 5.1. Given a determ inate system S =(G ,cr,S0,n ,a). Suppose O ' is obtained from G by m ultiplying the capacity of every place by a positive 90 integer (scaling constant) 8. Then there exists S 0 such th a t S and S ' = (G ' ,cr,S0' ,n,8a) are equivalent. P roof. W ithout losing generality, I assume every interm ediate place of S can have capacity at most 1. A place w ith capacity c > 1 can be broken into a linear cascade of c unit-capacity places, w ith an identity operator between each pair of places. Now suppose S starts w ith initial state £ 0. The initial state S 0r of S ' is determined as follows. For any interm ediate place p such th a t c(p)—0, we set S0(p) — S 0' (p ) = (f> . Otherwise we set S 0' (p) = { < u ,l> ,< d ,2 > ,...,< d ,8 > }, where d is the garbage value and * S ’ o(p) = { < « ,1 > } . I claim th a t when the same operand sequence I is applied to S and S ' , the output value of any operator u in S at tim e t is the same as th a t of v in S ' at tim e 8(t—l )+ l. This claim can be proven by m athem atical induction on t. From the above specification of 5 0 and S 0' , the claim is obvious for t= l. Now suppose the claim is true for t = k, I prove th a t it m ust also hold for t = k + l. If the claim does not hold for some operator v, then v will produce different values in S at k + l and in S ' at 8(k+l—1)+1 = £ & + l. This implies th a t the token removed from some input place p of v contains different values for S and S ' . (If p has zero-capacity, the argum ent is recursively applied to the input operator of p. Because both systems are term inating, they are semisystolic, by Theorem 5.1. 1 can eventually find a non-zero capacity place on each input path of v .) These values are computed in S at clock cycle k and in S ' at clock cycle 8(k—1)+1, respectively. Their being different contradicts the induction hypothesis. 91 Since the claim is true for any operator node, it is true for the operator th a t sends tokens to the outgoing place pout. Thus the *-th token in pout of S has the same value as the {8(i — l)+ l)-th token of pout in S ' . From the definition of result sequence, both systems produce the same result sequence for the same operand sequence. Thus they are equivalent. Q.E.D. L em m a 5.2 (Retiming Lemma, Leiserson and Saxe, 1983). Given a deter m inate system S = (G ,a,S0,n ,a). Let lag be a function th a t maps the merge node to zero and all other operator node to an integer. Suppose th a t for every place p the value c (p) + lag(v) — lag < u ) is nonnegative, where u and v are the input and the output operators of p , respectively. Let G ' be a SP-net obtained by replacing the capacity c (p ) of every interm ediate place p by c(p) + lag(v)—lag<u). Then there exists S 0 such th a t S and S ' = (G ' ,cr,S0r ,n,a) are equivalent. P roof. The proof is given in [58]. L em m a 5.3. Given a determ inate system S —(G ,a,S0,n ,a). 
Suppose a new SP-net G ' is obtained from G by the following operation: For some opera to r node v th a t has no O-capacity input place, decrement the capacity of every input place by 1 and increment the capacity of every output place by 1. Then there exists S 0 such th a t S and S ' = (G r ,a,S 0' ,n ,a ) are equivalent. P ro o f. This is a consequence of Lemma 5.2. In fact, we can achieve the desired transform ation by define the lag function as follows: lag(v^) = 1 for 92 v{ = v, and lag(vi ) = 0 for u * =£v. Q.E.D. 5.3. A n O p eration al S em an tics o f th e forp ipe C on stru ct In chapter 4, I defined the forpipe loop construct and explained the seman tics by its dependence graph. In this section, I define an operational semantics for forpipe loop via the concept of systems and show th a t these two semantics are consistent. First, I present an algorithm th a t transform s a CF into a system representation. A lg o rith m 5.1 (for Compiling Vector Compound Functions). Input: A CF L as defined in section 4.1. Output: A system S =(G,cr, S 0,n, 1) Steps: (1) C onstruct a syntax tree for each assignment statem ent in the loop body. D ata items on the right-hand side of each statem ent are translated into incoming places. The left-hand side array element is converted into an out going place. Arithm etic/logic operations are transform ed into operator nodes. (2) If an incoming place is an array element w ith index expression i-c (as in y[i-c]) then its capacity is set to be c. (3) If an incoming place p has the same array name as an outgoing place of an operator node v , then add a backward arc from v to p. 93 (4) Delete all outgoing places which holds tem porary arrays. The remaining outgoing places are assigned capacity 0. Insert a 0-capacity place on each arc th a t connect two operators. Now we obtain an interpreted Petri net, w ith each operator weighted by its function and each place weighted by its capacity. (5) A dd an incoming place pin and a decouple operator to the Petri net obtained, if there are more th an one incoming places or if any incoming place has non-zero capacity; draw an arc from the decouple node to every old incoming place. Assign the incoming place a capacity of oo. Add a cou ple operator and an outgoing place if there are more th an one outgoing places in the Petri net; draw an arc from every old outgoing place to the couple operator. (6) The initial state S 0 is determined as follows: If an interm ediate place p has capacity c —c [p )=0, then SQ (p) = < f> . If p has capacity c = c ( p ) > 0 and array name a, then 5 0(p) = { < a [ c — 1],1> , < a [c— 2 ],2 > ,...,< a [0],c > } . E xam p le 5.3. Consider the CF specified by a forpipe loop below: tem p w ; x[0] : = x[-l] : = x[-2] : = x[-3] : = w[0] : = 1 ; forp ipe i : = 1 to n do begin 51 w[i] :== y[i] / x[i-4] ; 52 x[i] : = w[i-l] * w[i-l] + if x[i-3]=0 th en y[i-l] else w[i-l] ; end Applying Algorithm 5.1, one obtains two syntax trees in Step 1, as shown in Fig. 5.6a. The left tree is derived from statem ent S i and the right tree from statem ent S2. The resulting graph has four operators v2,v3,v4,v5, seven 94 incoming places p v ...,p7, and two outgoing places Pi0,pu . In Step 2, capacities are assigned to the seven incoming places: c (p 1) = 0, c(p2) = 4, °{P2) = C{P ^) = c{Pb)~ c{Pb)~^y and c (p 7) = 3. In Step 3, backward arcs are added. For instance, an arc is attached from the division operator to place p3, since its output place p n has the same array name w w ith p 3. 
Similarly, two arcs are added from the addition operator to p 2 and p 7. The resulting graph is shown in Fig. 5.6b. In Step 4, the outgoing place p n is removed, since it corresponds to a tem porary array w . The remaining outgoing place p 10 is assigned a capacity of 0. Two 0-capacity places p 8 and pg are inserted on the arcs from v3 and v4 to v5. The resulting interpreted Petri net is shown in Fig. 5.6c. In Step 5, an decouple operator v0 and an incoming place pin of capacity oo are added to the net, since there are two incoming places p x and p 6 in Fig. 5.6c. No couple operator is added, since there is only one outgoing place P\o = pont in. Fig. 5.6c. The result ing interpreted SP-net is shown in Fig. 5.6d. In Step 6, the initial state * S 0 is determined. Places p v p g, p g, and p 10 are em pty, since their capacity is 0. Place p 2 contains contains token set {< ar[0],4> ,< a:[— 1 ] ,3 > ,< a;[— 2 ] ,2 > ,< z [ — 3],1 > } , since its capacity is 4 and array name is x. Similarly, place p7 contains the token set { < a:[0],3 > , < a;[— 1],2> , < a:[— 2],1 > }; places p 3, p 4 and p 5 contain a single token < w [0 ] ,l> ; and place p6 a token < ? /[0 ],l> . [i-1] i-4 w MPY J2DL 1 1 (a) Generate syntax trees in Step 1 w[i-!] X w [ i- 1] ^/Vswji-l] MPY ADD (b) Add capacities and backward axes in Steps 2 and 3 Figure 5.6. Transforming the forpipe loop in Example 5.3 into a system 96 MPY)PC (c) Remove a temporary output place and insert O-capacity places in Step 4 MPYjFJ = P out (d) Add the decouple node in Step 5 Figure 5.6. (Continued) Transforming the forpipe loop in Example 5.3 into a system 97 To summarize, the system S = (G ,a,S0,n ,a) obtained from transform ing the forpipe loop is as follows: The SP-net G is as shown in Fig. 5.6d; a is defined by = decouple, a(u2) = devide, 0(^ 3) = multiply, C f(w 4) = if— then— else, and 0( ^ 5) = add. The initial state S 0 is as described in the last paragraph. The vector length n is as given in the forpipe loop. Finally, the latency a; is 1. The operand sequence is specified by the array y, w ith the incom ing token sequence being < y [ l ] ,l > , < y[ 2],2 > , ..., < y [n— l ] ,n — 1> , and <y[n],h > . T h e o re m 5.4. Suppose I is a finite operand sequence, O a finite result sequence, L a CF satisfying the weak single-assignment rule, and S a semisys tolic system obtained by applying Algorithm 5.1 to a forpipe loop L th a t satisfies the weak single-assignment rule. Then 0 = L { I) if and only if (S , I , 0 ) is a com putation. P ro o f. The proof is based on a m athem atical induction on the vector length n. I will not give the detail, which is tedious but straightforw ard. The line of reasoning follows. For n = 1, I can prove the theorem by another induc tion on the num ber of statem ents in the forpipe loop. Suppose the theorem holds for n = k , I can prove it for n =& -|-l in the following way: Expand the obtained system as a fc+l-stage iterative array, sim ilar to the expansion of a sequential circuit into a combinational circuit. The dependence graph is parti tioned into k + 1 subgraphs, each corresponding to a iteration. By the nature of Algorithm 5.1, the fc+l-th stage of the iterative array is the same as the & +l-th subgraph. Now by the induction hypothesis, the output of the &-th subgraph is 98 the same as th a t of the fc-th stage. Because the dependence graph is well formed and the system is semisystolic, both will produce the same unique result at the & + l-th stage (subgraph). Q.E.D. T h e o re m 5.5. 
The system S obtained in Algorithm 5.1 is semisystolic, if and only if the CF L satisfies the weak single-assignment rule. S is acyclic, if and only if L satisfies the strong single-assignment rule. P ro o f. The first statem ent is a direct consequence of Theorems 4.1, 5.1, 5.2, and 5.4. To prove the second statem ent, note th a t in Algorithm 5.1, the only step where an cycle could be formed is Step 3, where backward arcs are added. If L satisfies the strong single-assignment rule, the com putation of any array does not depend on the same array itself. A fter Step 3, if there is a path from some incoming place p to some outgoing place q, p and q m ust correspond to different array names. Thus no cycle exists in the obtained SP-net G , the system S is acyclic. If the system S is cyclic, then after Step 3 there exist a path from p to q and both places correspond to the same array name x . Thus the com putation of x depends on x itself, the loop does not satisfy the strong single-assignment rule. Q.E.D. As an example, consider the system in Example 5.3. The CF satisfies the weak single assignment rule, since it has the following assignment m atrix and C matrix: 99 oo 0 4 3 4 0 M — 4 3 m 2 = 7 4 C = Min (M ,M 2) = 4 3 and both diagonal elements of C are greater th an 0. The resulting system is semisystolic, as shown in Fig. 5.6d. 5.4. M o d e lin g P ip e lin e N e ts A CF evaluation is performed by a pipeline net processor w ith the execu tion of an assembly language program, th a t contains a num ber of SET instruc tions followed by a START instruction. This section study modeling pipeline nets, or more precisely, assembly programs, by com putation systems. I will give a procedure th a t transform an assembly program into a system . An example is also given to illustrate the procedure. Note th a t the vector length n and the latency a are already specified in the START instruction. A lg o rith m 5.2 (for Modeling Pipeline Nets.) Input: An assembly language program. Output: A system S = {G ,a,S0,n ,a). Steps: (1)F or each SET FPop, where the operation op requires k pipeline stages and has r inputs, construct an operator node vt- w ith o(vi) = op, an output place of 0-capacity, and r input places of capacity k. (2) For each operand (result) vector register R i , construct an incoming (outgo ing) place Pi of 0-capacity. 100 (3) Connect proper nodes w ith arcs according to the two SET instructions th a t determine the interconnection patterns of BCN1 and BCN2. If a non compute buffer of c units of delay is on an arc, insert a place of capacity c on th a t arc. (4) If the resulting system is MIMO, convert it into an SISO system by adding a decouple and a couple operators. (5) If there are two connected places w ith capacities c x and c2, merge them into one place w ith capacity c x + c 2. E x a m p le 5.4. Consider the assembly language program in Example 4.4. It sets up a pipeline net shown in Fig. 4.7a. The vector length n and the latency a are specified in the START instruction as 400 and 1, respectively. The inter preted SP-net is shown in Fig. 5.7. It has seven operator nodes, corresponding to the six FP s and a decouple node. The two input places of each addition opera tor have capacity 2, because an addition requires 2 pipeline stages. One of the input place of the division operator has capacity 8, because division requires 6 stages and there are two units of delay on the d ata p ath from FP4 to F P 6. 
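To make the construction in Algorithm 5.2 concrete, here is a minimal Python sketch of the bookkeeping it performs; the Place and Operator classes and the tuple encoding of the SET instructions are illustrative assumptions, not the dissertation's notation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Place:
    capacity: int                      # token capacity c(p); 0 means a direct arc

@dataclass
class Operator:
    op: str                            # the function sigma(v), e.g. "DIV" or "ADD"
    inputs: List[Place] = field(default_factory=list)
    output: Place = field(default_factory=lambda: Place(0))   # Step 1: 0-capacity output

def model_pipeline_net(fps):
    """fps: one (operation, number of pipeline stages, per-input extra buffer delay)
    tuple per SET FP instruction.  Each input place of a k-stage FP gets capacity k
    (Step 1); a non-compute buffer of c delay units on an arc adds c to the place on
    that arc, since adjacent places are merged by summing capacities (Steps 3 and 5)."""
    operators = []
    for op, stages, extras in fps:
        node = Operator(op)
        for extra in extras:
            node.inputs.append(Place(stages + extra))
        operators.append(node)
    return operators

# Example 5.4 in miniature: a 6-stage divider whose second input arrives through
# two units of non-compute delay gets input-place capacities 6 and 8.
net = model_pipeline_net([("DIV", 6, [0, 2])])
assert [p.capacity for p in net[0].inputs] == [6, 8]
```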
Note th a t in the above procedure, a pipeline of k stages is modeled by an operator w ith all its input places having capacity k. A system S is called a k- system, if every interm ediate place in S has a capacity larger than or equal to k. A pipeline net with all FPs having k stages can be modeled by a A:-system. DIV ADD MPY ADD M PY DECOUPLE Figure 5.7. A system modeling the pipeline net in Fig.4.7 102 C h a p te r 6 S y n th e sis a n d P e r fo r m a n c e o f P ip e lin e N e ts Three synthesis problems arise for pipeline net com putation, (l) Compile forpipe loops into com putation systems. This problem has been solved in C hapter 5, where I develop a procedure (Algorithm 5.1) to transform a forpipe loop into an equivalent system w ith latency 1. (2) Transform com putation sys tem s into pipeline net specifications. This will be the topic of Section 6.1. (3) partition a large system into smaller ones to suit a small pipeline net. This will be dealt w ith in Section 6.2. Davidson [19] has developed a reservation table approach to sequencing the execution of successive instructions in a pipeline processor. For a given pipeline organization, his algorithm determines the minimum latency cycle by which the pipeline is scheduled to yield a maximum throughput rate. In this paper, I study a different problem: transform a com putation system into a pipeline net w ith a minimum constant latency cycle. Here the minimum is optimized from all equivalent pipeline nets. I restrict my study to the sequencing of pipeline nets w ith constant cycles, in order to reduce the control complexity of pipeline nets. In fact, variable length latency cycles have not been implemented in real machines. All commercial pipelines choose the constant cycle scheme for its sim plicity in control. Leiserson and Saxe [58] have proposed a technique called retiming to minimize the clock period of any semisystolic circuit. This technique allows one to transform com putation systems into systolic arrays. Similar work is also 103 reported in Kung[56], where a systolization procedure is used to transform signal-flow graphs to systolic arrays. These two approaches convert a given sys tem into an one-level systolic array, in which each cell is not pipelined. I w ant to convert parallel algorithms expressed as forpipe loops into assem bly language program s th a t are executable in a two-level pipeline net. In Fig. 6.1, various m ethods for converting scientific loops into systolic arrays or into pipeline nets are compared. The techniques being presented are potentially use ful for designing efficient vector supercomputers th a t can evaluate vector com pound functions directly in hardware. 6 .1 . N e tw o rk in g A fter applying Algorithm 5.1, a CF is represented by a com putation on a system S w ith a latency of 1. We have to transform this com putation into an equivalent pipeline net characterized by a ^-system Sk . The resulting system Sk has the same topology as S, but the capacities and the latency m ay be modified. If the original system does not have enough capacities, we have to m ultiply the capacities by a scaling constant 8 using Lemma 5.1. This implies th a t the latency will be increased by the same am ount, and the throughput of the pipe line net will decrease. Thus we m ust find a minimum scaling factor such th at enough capacities are obtained after applying Lemma 5.1. This factor is deter mined by two metrics of a system. 
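The two metrics in question are a cycle's length and its capacity, formalized in the definition that follows. As a concrete illustration, the Python sketch below (an illustration only; it assumes the cycles of the system have already been enumerated as length/capacity pairs) computes the resulting minimum scaling factor anticipated by Definition 6.1 and Lemma 6.1 below.

```python
from math import ceil

def min_scaling_factor(cycles, k):
    """Minimum scaling factor delta(S) for converting a latency-1 system into a
    k-system: 1 for an acyclic system, otherwise the smallest integer delta such
    that every cycle C, after its place capacities are multiplied by delta, holds
    at least k tokens per operator on the cycle (cf. Definition 6.1 and Lemma 6.1).
    `cycles` is a list of (length l(C), capacity c(C)) pairs."""
    if not cycles:
        return 1
    return max(ceil(k * length / capacity) for length, capacity in cycles)

# A single 5-operator cycle holding only 3 tokens must be scaled by
# ceil(4 * 5 / 3) = 7 before each of its operators can be a 4-stage pipeline.
print(min_scaling_factor([(5, 3)], k=4))   # -> 7
```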
For any cycle C of a system, its length, denoted by l(C), is the number of operators in the cycle; its capacity, denoted by c(C), is the sum of the capacities of all the places in the cycle.

Figure 6.1. Various methods for converting forpipe loops into systolic arrays or pipeline nets, via computation systems: Algorithm 5.1 for converting compound functions, Algorithm 6.2 for pipeline networking, Algorithm 6.3 for system partitioning, the systolization procedure (Kung, 1984), and the systolic conversion theorem (Leiserson and Saxe, 1983).

Definition 6.1. Suppose S is a system converted by Algorithm 5.1 from a forpipe loop. The minimum scaling factor δ(S) of S is defined as follows:

δ(S) = 1, if S is acyclic;
δ(S) = ⌈ Max { k·l(C) / c(C) | C is a cycle in S } ⌉, otherwise;

where ⌈x⌉ denotes the smallest integer greater than or equal to x.

Lemma 6.1. Suppose S is a system transformed from a forpipe loop by Algorithm 5.1. If Sk is a k-system transformed from S using Lemmas 5.1 and 5.2, then Sk has a latency larger than or equal to δ(S).

Proof. First, note that retiming does not change the capacity of any cycle in S. When S is transformed into a pipeline net represented by Sk, any cycle C of length l(C) in Sk must have a capacity of at least k×l(C). However, the same cycle in S has only a capacity of c(C). Thus, for this particular cycle to have enough capacity, we have to scale S up using Lemma 5.1 by multiplying by ⌈k·l(C)/c(C)⌉. Considering all cycles in S, we obtain the minimum scaling factor δ(S). Q.E.D.

There are many pipeline nets equivalent to S. We are interested in finding the one that is transformed from S with the minimum scaling constant δ(S), and thus has the minimum latency. Such a pipeline net with minimum constant latency is called optimal. To simplify the presentation, all pipelines are assumed to have an equal number of stages k; such a pipeline net is characterized by a k-system. The key concepts and techniques also apply to the general case, where the pipelines may have unequal numbers of stages.

The following procedure converts a given system S into an optimal pipeline net characterized by a k-system Sk. The procedure works as follows. For a given system S, we retime it by defining a proper lag function through some operations on its capacity matrix. We then check whether the retimed system is already a k-system. If so, we stop. Otherwise, we modify the capacity and the latency of the system using Lemma 5.1 by multiplying by a scaling factor δ, and apply retiming again. Since 1 ≤ δ(S) ≤ km, we can find the minimum scaling factor in O(log m + log k) iterations through binary search.

Algorithm 6.1 (for Pipeline Networking).
Input: A semisystolic, m-operator system S with latency 1.
Output: A k-system Sk with wavefront latency δ(S).
Steps:
(1) Let L = 1 and U = km.
(2) Let δ = ⌈(U − L)/2 + L⌉. Apply Lemma 5.1 by multiplying all capacities and the latency of S by δ. We obtain a new system S' with latency δ.
(3) Let P be the capacity matrix of S'. Let Q = P − X, where X is a matrix with Xij = k, and let R be the last column vector of Q. Compute the vector H = Min(R, Q×R, ..., Q^(m−1)×R).
(4) Apply the Retiming Lemma to obtain a system Sk from system S' by defining the lag function to be lag(vm) = 0 and lag(vi) = hi for i = 1, 2, ..., m.
(5) If Sk is a k-system, then let U = δ; otherwise let L = δ.
(6) If U − L > 1, then go to Step 2; otherwise stop: Sk is the desired k-system with latency δ(S).

Example 6.1.
Consider the conversion of the com putation system 5 shown in Fig. 5.6d to an equivalent 3-system. Initially, the latency is 1. In Step 1, I set the lower bound L = 1 and the upper bound U = km = 1 5 . A fter three iterations of Steps 2, 3, 4, 5, and 6, I return to Step 2 from Step 6 w ith L = 1 and U — 3. (These three iterations perform operations similar to the last itera tion, thus I have om itted details here.) In Step 2, I compute 8= T {U— L)/2+L 1=2. After applying Lemma 5.1 to 5 w ith 5 = 2, I obtain a system 5 1 as shown in Fig. 6.2a. In Step 3, I find the matrices P , Q, and H as shown below: P = oo 0 oo 1 oo oo - 3 oo - 1 oo oo oo 1 1 oo oo oo - 1 —1 oo oo oo oo oo 0 Q = oo oo oo oo - 3 oo oo oo oo 0 oo oo oo oo - 3 oo 4 oo 3 oo oo 5 oo 3 oo Q R = - 4 - 7 - 4 - 7 — 4 oo - 4 - 3 oo q 2r = - 3 QSR = - 2 Q4R = - 3 oo - 3 - 2 - 3 0 1 0 1 H = Min { - 4 - 7 - 4 - 4 oo - 4 oo ! - 3 * - 2 oo - 3 - 2 0 1 0 - 7 - 3 - 3 - 3 1 } = - 7 - 4 - 3 - 3 0 108 In Step 4, by applying the Retiming Lemma to S in Fig. 6.2a, w ith the defined lag function, I obtained an equivalent system Sk as shown in Fig. 6.2b. Since this is a 3-system, I set U = 8=2 in Step 5. The algorithm stops at Step 6, because U —L = 2 — 1 = 1. The resulting Sk is the desired 3-system with a minimum latency 2, which is redrawn as a pipeline net in Fig. 6.2c. T h e o re m 6.1. The pipeline networking procedure (Algorithm 6.1) produces an equivalent, optim al pipeline net. P ro o f: In all steps of Algorithm 6.1, only equivalent system transform a tions are m ade by applying Lemmas 5.1 and 5.2. Thus Sk is equivalent to S. Furtherm ore, the latency in the final A;-system is obtained by searching all possi- * ble latency values and selecting the minimum one th a t enable S to transform into a ^-system Sk. Thus it m ust be the minimal scaling factor 8(S). Therefore, the resulting pipeline net m ust be optimal. Q.E.D. T h e o re m 6.2. Any forpipe loop, satisfying the weak single-assignment rule, can be optim ally transform ed into an equivalent pipeline net. P ro o f: This is a direct consequence of Theorems 5.1 and 6.1. Q.E.D. A pipeline net w ith unequal numbers of pipeline stages in F P s can not be characterized as a A:-system. However, such a net can be represented as a K- system, S(K ), where K = (k1, k 2, ■ ■ ■ , km ) is a vector and A ; , - is the num ber of pipeline stages in the F P represented by node . In converting a com putation (a) The system S’ obtained after scaling up Fig.5.6d by 2 v5 \ A out (A D D f (b) The k-system obtained by retiming S’ r DIV M PY (c) The pipeline net obtained from (b) figure 6.2. An example of transforming a system into a pipeline net 110 system into such a pipeline net, Algorithm 6.1 needs to be modified slightly by changing the m atrix X to Y in Step 3, where where 1" is a m atrix with , for i ,j —1,2,...,m . The modified algorithm transform s a semisystolic system S into an equivalent A-system S ( K ) for a given vector K . This modification gen eralizes the multipipeline networking procedure to any pipeline net constrained by a given vector K . A lg o rith m 6.2. Input'. A semisystolic, m -operator system S w ith latency 1. Output: A A -system S (K ) w ith latency 8(S), where A = (&!, k2, , km). Steps: (1)Let L —1 and U —km. (2) Let 8— T (U—L)/2-\rL 1. Apply Lemma 5.1 by multiplying all capacities and the latency of S by 8. We obtain a new system 5 1 w ith latency 8. (3) Let P be the capacity m atrix of S ' . 
Let Q = P — Y where Y is a m atrix w ith = k i , and R be the last column vector of Q . Com pute the vector H = M in ( R ,Q X R ,..., Q m~1XR ). (4) Apply the Retiming Lemma to obtain a system Sk from system S 1 by defining the lag function to be lag{ym)= 0 and lag(Vj ) = hi for i = l , 2,...,m . (5) If Sk is a k-system, then let U =8; otherwise let L =8. (6) If U—L > 1, then go to Step 2; otherwise stop, Sk is the desired fc-system : I l l w ith latency 8(S). 6.2. P a r titio n in g A pipeline net processor has only m pipelines, for a small value of m . W hen we have to process CFs w ith more than m operations in a forpipe loop, we m ust provide a means to partition a large com putation system obtained from a forpipe loop. The partitioning problem can be stated as follows: D efin itio n 6 .2 . A com putation system S is said to be m-partitionable, if it can be decomposed into a num ber of disjoint subsystems S lf S 2,...,Sr, such th a t (1) the num ber of operators in each subsystem does not exceed m ; and (2) no two subsystems are m utually dependentf. A strong connected component (SCC) is a subgraph of G in which every node pair is connected by a path. A maximally strongly connected component (MSC) is an SCC containing no other SCC as a subgraph. The m - partitionability of a system is directly related to the sizes of its M SC’s. T h e o re m 6.3. A com putation system S is m -partitionable, if and only if every MSC of S has at most m operators. P ro o f: Suppose a system S contains an MSC which has more th an m operators. Since all operators inside an MSC are m utually dependent, this MSC can not be partitioned. Thus S is not m -partitionable. If every MSC of S has t Two subsystems Si and S% ol‘ a system S are mutually dependent if there exist edges e j and e2 in S such that e j is from a node in 5 j to a node in S 2, and e2 is from a node in S 2 to a node in S j. 112 at most m operators, The m ethod shown in Algorithm 6.3 will produce a parti tion. Q.E.D. The problem of finding an optim al partition, which minimizes the number of subsystems to be produced, is NP-complete, as it contains m any NP-complete scheduling subproblems. The following algorithm first identifies all MSCs of a system and then adopts a heuristic list schedule scheme to find a near optimal partition. The list scheduling was originally introduced in [30]. A lg o rith m 6.3 (for System Partitioning). Input A com putation system S w ith n > m operators. Output: Subsystems S S 2,—,Sr th a t satisfying (1) and (2). Steps: (1) Identify all MSCs of S. This can be done w ith any known algorithm , say, Algorithm 5.4 in [6]. For each MSC Mir denote the num ber of operators by »,•. If % > m , then the system is not partitionable. (2) Merging each MSC into a single operator. Now S is reduced to an acyclic graph G w ith p nodes M 1 , , Mp . (3) Com pute the level of Mj, denoted by L(M i), as follows: Let N (M i ) be the children set of JW ,- in G. If iV(Mj) = < f> , then L{Mi ) = 1. Otherwise, L ) - 1 + max {L (M j) | Mj e (4) Label the nodes of G from 1 to p as follows: The highest level first; within 113 the same level the leftmost node first. Create a list T of all these nodes in ascending order. (5) List-schedule: i : = 0; w hile T is not em pty do i : = i + 1; let Si be the em pty set; w hile | | < m do search T in ascending order for a node M y such th at M y has no father and | S t - | + ny < m ; if found then remove M y from G and T and add it to S i ; otherwise exit; end end E xam p le 6.2. 
Given a system S, only the topology of its underlying Petri net is of interest as far as partitioning is concerned. Consider the system S in Fig. 6.3a. Only its operators and their interconnections are shown; other factors are irrelevant here. This system is not m-partitionable for m < 4, since it contains an MSC (denoted by M5) which has four nodes. It is m-partitionable for any m ≥ 4. Now suppose m = 5. Algorithm 6.3 is used to find a partition as follows. After Steps 1 and 2, nine MSCs are identified, shown by the nine cycles in Fig. 6.3a, and merged, giving the acyclic graph G of Fig. 6.3b. After Steps 3 and 4, the nodes of G are grouped into four levels and labeled, as shown in Fig. 6.3c, where the number inside each node denotes the label of that node. From this labeled graph, I obtain a list T of the nine nodes ordered by ascending label. Finally, after three iterations in Step 5, I obtain three subsystems, S1 = {M1, M2, M3, M4}, S2 = {M5, M7}, and S3 = {M6, M8, M9}, each having at most 5 operators.

Figure 6.3. Partitioning a large system into several subsystems using Algorithm 6.3: (a) a system with 9 MSCs; (b) the MSC graph; (c) levels and labels.

6.3. Performance Analysis

Denote by P(m,k) a pipeline net processor with m FPs, each FP having exactly k stages. In this section, I compare the performance of such a processor, P(m,k), with that of a processor with a single FP, P(1,k). A CF is evaluated as a computation in a pipeline net characterized by a system S = (G, σ, S0, n, α), where the SP-net G = (P, V, E, c) and the capacity c(p) ≥ k for any intermediate place p. Assume that the system S has m operator nodes, i.e., |V| = m, and let l be the length of the critical path (i.e., the number of operator nodes on the longest simple path from pin to pout) in G. Then, for the purpose of performance evaluation, the desired computation can be characterized as C = (k, l, m, n, α), where α is the latency and n is the number of wavefronts entering the pipeline net.

The computation C is implemented with a number of SET instructions followed by a START instruction. Let γ be the time needed for the execution of all SET instructions. Let β be the number of pipeline stages in each crossbar network (BCN1 and BCN2). The total computation time is derived below. First, a pipeline net is set up in γ clock periods. A wavefront enters the pipeline net every α clock periods. The first wavefront needs to traverse l FPs and passes through the crossbar network BCN1 once and BCN2 l times. This implies that (l+1)(k+β) clock periods are needed to fill up the net. After that, a result wavefront leaves the net every α clock periods. Thus the total time required is:

Tm = γ + (l+1)(k+β) + α(n−1)     (6.1)

The total number of arithmetic/logic operations performed is mn. Thus the throughput of the pipeline net, Hm = mn / Tm, is equal to:

Hm = mn / [γ + (l+1)(k+β) + α(n−1)]     (6.2)

Figure 6.4 plots the throughput Hm against the vector length n for various values of the wavefront latency α, under the assumptions m = 16, β = 1, and k = c = 4. From this figure, we see that the throughput is inversely proportional to the latency α.
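Equations (6.1) and (6.2) are easy to evaluate mechanically. The following Python sketch is illustrative only; the values gamma = 50 (borrowed from the assumptions later used for Fig. 6.6) and l = 4 are assumptions, since Fig. 6.4 states only m = 16, beta = 1, and k = c = 4.

```python
def total_time(gamma, l, k, beta, alpha, n):
    """Eq. (6.1): setup time, fill-up through l FPs and the crossbar passes,
    then one result wavefront every alpha clock periods."""
    return gamma + (l + 1) * (k + beta) + alpha * (n - 1)

def throughput(m, gamma, l, k, beta, alpha, n):
    """Eq. (6.2): m*n operations divided by the total time Tm."""
    return m * n / total_time(gamma, l, k, beta, alpha, n)

# Reproducing the trend of Fig. 6.4: throughput roughly m/alpha for long vectors.
for alpha in (1, 2, 4):
    h = throughput(m=16, gamma=50, l=4, k=4, beta=1, alpha=alpha, n=1000)
    print(f"latency {alpha}: {h:.1f} operations per clock period")
```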
Hockney and Jesshope [35] have proposed two parameters (r∞, n1/2) to measure the performance of a vector pipeline, where r∞ is the maximum throughput, obtained when the vector length n approaches infinity, and n1/2 is the minimum vector length n at which half of the maximum throughput is achieved. For the pipeline net processor, I can derive these two measures from Eq. (6.2):

r∞ = m / α     (6.3)

n1/2 = γ + (l+1)(k+β) − α     (6.4)

Equations (6.3) and (6.4) have a number of interesting implications. The hardware designer should minimize the crossbar network delay β and the setup time γ; this is especially crucial for short vector operations. The user should write forpipe loops in such a way that the transformed systems have small latencies and short critical paths. Most importantly, to achieve the highest throughput, the compiler should keep the latency α as small as possible. That is why, in transforming a forpipe loop into a pipeline net specification, we optimize the resulting system by minimizing the latency α.

Figure 6.4. The throughput performance of pipeline nets with respect to vector length (throughput Hm versus vector length n, plotted for latencies α = 1, 2, and 4).

Next, I analyze the relative speedup of P(m,k) over a single pipeline, P(1,k). I call an operator node of S a cyclic node if it belongs to some cycle of nodes; otherwise it is called an acyclic node. In a single pipeline, each acyclic node can be processed in vector mode, but all cyclic nodes must be executed in scalar mode, since they mutually depend on each other. In a pipeline net, however, operations represented by cyclic nodes are executed in vector mode with a latency α, the same as those represented by acyclic nodes. Suppose the system S representing a CF has M1 cyclic nodes and M2 acyclic nodes, where M1 + M2 = m. Then the time needed for P(1,k) to evaluate the CF is equal to:

T1 = M1·k·n + M2(γ1 + k + n − 1)     (6.5)

where γ1 is the setup time for P(1,k). From Eqs. (6.1) and (6.5), I derive the following speedup:

Sp = T1 / Tm = [M1·k·n + M2(γ1 + k + n − 1)] / [γ + (l+1)(k+β) + α(n−1)]     (6.6)

The asymptotic speedup S∞ = Sp (when n → ∞) is:

S∞ = (k·M1 + M2) / α     (6.7)

Example 6.3. Consider the system in Fig. 6.5, which corresponds to the following forpipe loop:

forpipe i := 1 to n do
begin
    S1   x[i] := w[i] * w[i] ;
    S2   y[i] := x[i] + z[i-5] ;
    S3   z[i] := 1 / y[i-3] ;
end

Figure 6.5. The system for Example 6.3: (a) the system; (b) an equivalent pipeline net characterized by a 4-system.

There are m = 3 operators in the system, corresponding to the three statements in the forpipe loop. The operator v1 is an acyclic node, while v2 and v3 are cyclic nodes. Thus M1 = 2 and M2 = 1. In a single pipeline processor P(1,k), only S1 can be vectorized, since the other two statements are mutually dependent. Assume all pipelines have four stages, i.e., k = 4. The total time for P(1,k) to execute this forpipe loop is then T1 = M1·k·n + M2(γ1 + k + n − 1) = γ1 + 9n + 3. However, using the pipeline networking technique provided in Algorithm 6.1, the system in Fig. 6.5a can be transformed into an equivalent pipeline net as shown in Fig. 6.5b. All three statements can then be executed simultaneously in a parallel pipelined fashion with latency α = 1.
Thus the total tim e for P (m ,k ) to execute the forpipe loop is: Tm = T+(l+l)(k+0)+a{n - 1 ) = i+4(3+n 4-15 and the speedup is T x 714-9 n 4-3 Sp — ------= ------- ► 9 as n —► oo Tm 7+4/?4-n4-15 The speedup equation (6.6) is plotted as a function of the vector length n for a fixed net size m = 8 in Fig. 6.6a, and as a function of m for a fixed vector length n = 6 4 in Fig. 6.6b. Both plots are made under the assum ption th a t a=/3=l, 7=50, and 'y1=l=k=4. In these plots, the ratio, R = M X / m , represents the percentage of nodes which are acyclic in the system. The larger the value of R , the better the speedup performance is observed in these plots. L S (Speedup) 2 4 20 R = 1/2 R = 0 1 2 8 8 16 3 2 6 4 N (a) Speedup versus vector length 2 4 S (Speedup) 20 R = 1 /2 R = 0 m (b) Speedup versus size o f pipeline net Figure 6.6. The speedup perform ance o f p ip eline nets _______________ sizes and different cyclic graph structures 121 various 122 It is interesting to note th a t when the system is cyclic, a net of m pipelines may achieve a speedup larger than m . This is because P (l,k ) has to process all cyclic nodes in scalar mode, while P (m ,k ) may process them in vector mode. This minimum scaling factor 5(5) is directly related to the performance of a forpipe loop L , when it is implemented by a pipeline net. In fact, we can now give some speedup performance estimations for different groups of loops with very long vector length. A system is called systolizable, if it can be transformed into a systolic system, using only Lemma 5.2. T h eorem 6.4. W hen a forpipe loop L is executed by P (m ,k ) and by P (l,k ), the asym ptotical speedup 5 ^ has the following values: (1) S q q = m , if L satisfies the strong single assignment rule. (2) 1 <5,30 <.mk, if L satisfies the weak single assignment rule. (3) m < 5 ^ < m k , if L is transform ed into a systolizable system by Algorithm 5.1. P roof. Suppose L satisfies the strong single-assignment rule. Then the sys tem 5 transform ed from Algorithm 5.1 is acyclic (Theorem 5.5.) By Definition 6.1, 5(5) = 1. Since all operators are acyclic, = 0 and M 2 = m . Thus from Eq. (6.7), S 0 Q = {kM1+ M 2) / 8 = m . Suppose L satisfies the weak single-assignment rule. Then the system 5 transform ed from Algorithm 5.1 is semisystolic (Theorem 5.5). In the worst case, all operators of 5 are in a common cycle which has unit capacity. By 123 Definition 6.1, 8{S) = km. Since all operators are cyclic, M 1 = m and M 2 — 0. Thus from Eq. (6.7), S O 0 = (kM 1+ M 2) / 8 — 1. In the best case, all operators of S are cyclic, but there are enough capacity in every cycle of S for transform ing into a pipeline net. T hat is, no scaling up is needed, and <$(£) = 1. Thus S q q — (kM j + M 2) / 8= k m . Suppose L is transform ed by Algorithm 5.1 into a systolizable system S. In the worst case, all operators of S are in a common cycle which has a capacity of m . By Definition 6.1, 8(S) = k. Since all operators are cyclic, M 1= m and M 2 = 0 . Thus from Eq. (6.7), S 00 = (kM l + M 2) / 8 = m . In the best case, all operators of S are cyclic, but there are enough capacity in every cycle of S for transform ing into a pipeline net. T hat is, no scaling up is needed, and 8{S) = 1. Thus S 00 — (kMi + M 2) / 8 = km. Q.E.D. Livermore Loops[69] are 14 F ortran DO-loops frequently used as bench m arks to evaluate the performance of m any computer system s[69]. For readers’ convenience, I list the 12 Livermore Loops which can be w ritten as forpipe loops in Appendix B. 
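The speedup curves reported below for these benchmark loops follow directly from Eq. (6.6). A minimal Python sketch of that evaluation is given here, using the operator counts of Example 6.3; the values gamma = 50 and gamma1 = 4 are the assumptions stated for the plots of Fig. 6.6, not parameters fixed by the example itself.

```python
def single_pipe_time(M1, M2, k, gamma1, n):
    """Eq. (6.5): M1 cyclic nodes run in scalar mode, M2 acyclic nodes in vector mode."""
    return M1 * k * n + M2 * (gamma1 + k + n - 1)

def net_time(gamma, l, k, beta, alpha, n):
    """Eq. (6.1), repeated for convenience."""
    return gamma + (l + 1) * (k + beta) + alpha * (n - 1)

def speedup(M1, M2, l, k, gamma, gamma1, alpha, beta, n):
    """Eq. (6.6): time on the single pipeline P(1,k) over time on the net P(m,k)."""
    return single_pipe_time(M1, M2, k, gamma1, n) / net_time(gamma, l, k, beta, alpha, n)

# Example 6.3: M1 = 2 cyclic and M2 = 1 acyclic operators, k = 4, alpha = beta = 1,
# and a critical path of l = 3 (the three FPs lie on one path).
for n in (64, 1024, 10**6):
    print(n, round(speedup(2, 1, 3, 4, 50, 4, 1, 1, n), 2))
# The ratio approaches (k*M1 + M2)/alpha = 9, the asymptotic speedup of Eq. (6.7).
```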
Using Kogge’s cyclic doubling method[52], we can rewrite all these loops to produce pipeline nets having a latency of 1 [46]. Appendix C con tains two loops (No. 13 and No. 14) th a t are not representable by forpipe loops. The difficulty is due to indirect array addressing in Loop No.13 and unknown index calculations in Loop No.14. Incidentally, these two loops can not be vec torized in any pipelined computers [69]. 124 Applying Eq. (6.6) to the Livermore Loops in Appendix B, we obtain some interesting results as plotted in Fig. 6.7. Loops th at correspond to acyclic sys tem s (N o.l, 7, 8, 9, 10, 12) are plotted in Fig. 6.7a. The remaining loops (No.2, 3, 4, 5, 6, 10) corresponding to cyclic systems are plotted in Fig. 6.7b. In both figures, an almost linear speedup observed for each loop, if the size of pipeline net, m , does not exceed the num ber of operations in th a t loop body. For loops Nos. 3, 4, and 11 in Fig. 6.7b, we can even achieve a speedup > m when m < 4 . The theoretical form ula (Theorem 6.4) and the benchmarking results (Fig ure 6.7) verify the efficiency of pipeline nets in evaluation compound functions. Compared w ith a single pipeline processor, almost linear speedup can be achieved if the vector length is not too short (say, > 6 4 ), and the size of the pipeline net is not too large (for Livermore loops, a pipeline net w ith 8 F P s is appropriate.) W hen the operations in a loop exhibit cyclic dependence, even superlinear speedup m ay be expected. Sp (Speedup) No. 8 12 No. 9 10 No. 10 No. 7 8 6 No. 1 4 2 No. 12 m 125 8 10 12 14 (Size of Pipeline Net) 12 10 8 6 4 (a) Speedups for acyclic loops Sp (Speedup) No. 2 Nos. 5 and 6 No. 3 No. 4 No. 11 m 2 4 6 8 10 12 14 (Size of Pipeline Net) (b) Speedups for cyclic loops Figure 6.7. The speedup performance of pipeline nets _________________ for implementing Livermore loops 126 C h a p te r 7 O p tic a l I m p le m e n ta tio n o f P ip e lin e N e ts Over the past 20 years there has been a great deal of research in the area of optical com puting[1, 2] This has comprised work in optical materials, devices, architectures, and algorithms. The vast m ajority of the work on architectures and algorithm s has been centered around special-purpose, analog optical com puting systems. These systems are restricted to low-to-medium accuracy, per form only a limited range of operations, but can have extremely high throughput. More recently, advances in materials and devices are opening up the possibility of digital optical computing, w ith the corresponding promises of accuracy, program m ability, and potentially high throughputs. Recent research has considered physical and to some degree logical architectures for digital opti cal com puting[38, 82,70] While some of these systems have the capability of implementing general purpose machines, very little published work to date specifically addresses this issue. In this Chapter, I investigate the possibility of applying the pipeline net concept to digital optical computing. It may be difficult to implement the pipe line net concept to a large scale by electronic technology because of the intercon nection problem. A pipeline net processor needs a nonblocking network with small remarking time. To implement such a network, the only way is to use a crossbar. It is well known th at crossbar switching networks of large size are cost-prohibitive in electronic environment. 
On the other hand, recent advances in the research of optical interconnections indicate th at it m ay be cost-effective 127 to implement large optical crossbar networks[15, 29,82, 72]. An optical device can implement fixed interconnections between millions of inputs and outputs via holograms. And an optical crossbar network can be designed to realize reconfigurable interconnections between hundreds of inputs and outputs. F urth ermore, data pass through such devices essentially at light speed, thus the net work delay could be very small. This inspires an investigation on optical pipeline nets ', In this chapter, I will present a pipeline net architecture, called Opcom, th a t could potentially map onto optical hardware. The Opcom takes advantage of m any of the nice features offered by optical technology, such as massive parallelism, gate-level pipelining, flexible and global interconnections. A t the same tim e, the Opcom architecture partially circumvents some problems existing in the current optical technology, like large reconfiguration time. This Opcom architecture has the fol lowing features: • Technology compatibility. The pipeline net architecture could m ap onto opt ical hardware. • Generality. The system is proved to be general purpose in th a t it can simu late any Turing machine. • Uniformity. The system can be implemented from only a few types of dev ices. It has no separate memory and processing units. Instead, a homogene ous optical array of cells serves both purposes. • Concurrency. The operation is basically parallel and pipelined processing at the gate level. 128 The hardw are organization and the instruction set of Opcom will be presented in Section 7.1. A proof is given in Section 7.2 of the generality of Opcom. I show th a t it can sim ulate any Turing machine program. Section 7.3 discusses the concurrency aspect of the computer. A detailed analysis is con ducted for various vector addition schemes. Section 7.4 discusses potential opti cal implem entations of two key subsystems: the array of cells and the intercon nect unit. A ttention is paid to techniques of substructuring the system in order to reduce the interconnection overhead. 7.1. T he A rch itectu re o f O pcom The proposed optical computer, called Opcom, has the logical structure as shown in Fig. 7.1. It has four m ajor subsystems: A host, a control unit, an interconnect unit, and an array of cells. Their organizations and functions are explained below. T h e h ost. The host could be implemented using any electronic micro- or mini-computer. It could also include electronic or optical peripheral devices. It has three basic functions: providing a man-machine interface (like loading pro gram codes and operand d ata into the Opcom com puter and transferring result d ata back); providing some secondary storage space; and performing some preprocessing (like program development and compilation of high-level language program s into Opcom's machine codes). 129 Array of Cells Figure 7.1. The architecture of Opcom new q = "cq + c(x-fy) new q = c^f + c(x+y) Figure 7.2. The operation of a cell Host Interconnect Unit Control Unit 130 T h e array o f cells. The array of cells provides the basic processing and storage facilities of the Opcom computer. It consists of m Xn cells arranged as a rectangular array. Each cell provides one bit storage and processing power. Cell d ,- can be viewed as a D flip flop w ith two d ata inputs a r ,- and yi , and a control input c,-. The operation of a cell is shown in Fig. 7.2. 
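Since Fig. 7.2 is not reproduced here, the following Python sketch models the cell behaviour spelled out in the next paragraph: a clocked NOR latch that loads the NOR of its two data inputs when enabled and holds its state otherwise. The class is an illustrative model only, not an Opcom hardware description.

```python
class Cell:
    """Behavioural model of one Opcom cell: a D-type latch that, when its control
    input c is 1, loads the NOR of its two data inputs and otherwise holds its
    state.  q_bar is always the complement of q."""

    def __init__(self, q: int = 0):
        self.q = q

    @property
    def q_bar(self) -> int:
        return 1 - self.q

    def clock(self, x: int, y: int, c: int) -> int:
        if c:                           # enabled: q <- NOR(x, y)
            self.q = 0 if (x or y) else 1
        return self.q                   # disabled: q keeps its old value

# NOR is logically complete, so any Boolean function can be built from such cells.
cell = Cell()
assert cell.clock(0, 0, 1) == 1         # NOR(0,0) = 1
assert cell.clock(1, 0, 1) == 0         # NOR(1,0) = 0
assert cell.clock(1, 1, 0) == 0         # disabled: holds the previous value
```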
All the cells are synchron ized by a global clock. A t any time, when the control line ci has a logical value "0", the cell is disabled and the two m utually complemented outputs and q} m aintain their original values after a clock cycle. W hen c8= " l" , however, the cell is enabled and the new values and x~7y; are assigned to qt - and after a clock cycle. Because the NOR operation is a logically complete operation, any desirable Boolean function, or equivalently, any combinational circuit, can be realized by connecting a num ber of cells together. Because each cell is a flip flop, it can also be used as a memory cell. Thus the array of cells can implement any combinational or sequential circuit, and therefore, it can construct both process ing elements and memory units. If we num ber the cells from 1 to m X n , we can form the following six sets to describe the array of cells. x = i xi * = l,2,...,ran Y = m i = l , 2,...,mn Q = m . I I I — * l V = m < s > , I I 1 — * F5 1 C — \ ci * = 1,2,...,ran D = k * = 1,2,...,ran X ^ j T forms the operand plane of the array of cells and Q{jQ the result plane of the array. Both planes are connected to the interconnect unit. C is the control plane of the array of cells and is connected to the control unit. Finally, 131 D is ju st the set of all cells. T h e in tercon n ect unit. This unit is implemented as an optical intercon nection network. For the tim e being, we can regard it as a black box with an V input bit plane I, an output bit plane O, and a reconfiguration signal plane R. W ith proper setting of the signals in R, the unit can realize any one-to-one and one-to-m any mappings from I to O. More precisely, these planes can be specified as follows. R is the reconfiguration register in the control unit, its content con sists of a connect command. And the other two bit planes are / = O = X ] j Y \ j O U T \ j F F \ j A IN and OUT are the input and the output flip flops in the I/O device respectively. Here for simplicity, I impose the restriction th a t the Opcom com puter transfers data w ith the outside world via the host one bit at a time. In practical implementations, a large block of data could be transferred simultane ously. FF is a flip flop in the control unit which is used for testing the value of a certain cell. A is an address register in the control unit which is used for indirect addressing. The elements of I and O are numbered as I x, J 2, ..., / 2m»+i and O v 0 2, ..., 0 2 mn+fc+2’ where mn is the size of the array of cells and k is the length of the address register A. The i in /,■ and O ,- is called the index of th a t element. T h e con trol un it. The control unit has a structure as shown in Fig. 7.3. 132 i ii 111 PC Program Memory Control Signal Generator Connect Command Memory Timing Command Memory Decoding Circuitry FF To Array of Cells From To Interconnect unit Interconnect unit A: Address Register FF: A Flip-Flop PC: Program Counter R: Reconfiguration Command Register T: Timing Command Register Figure 7.3. The control unit used in Opcom 133 Its function is best explained by examining how the Opcom, computer executes the basic instructions and programs. The instruction set contains 9 instructions, each of which may be prefixed w ith a positive integer, called its label. Their semantics are discussed below. CON a,b; where a £ 1 and 6 G O . The execution of this instruction will reconfigure the interconnect unit so th a t a is connected to 6. Note th at the connection is unidirectional. 
Thus th at a is connected to b does not imply th a t b is connected to a. CO N l b; where b G O . W hen this instruction is executed, the control unit generates a connect command which reconfigures the interconnect unit so th a t element I{ whose index i is given by the value of the address register A will be connected to b. CON2 a; where a G I. W hen this instruction is executed, the control unit generates an interconnect command which reconfigures the interconnect unit so th a t c will be connected to element 0 8 - whose index i is given by the value of the address register A. CON3 c,d; where c G {0,l} and d GO • The execution of this instruction will set cell d into a constant value as specified by c. T hat is, after the execu tion, the two output bits q and ~qof d will become c and c " . JUM P b; where b is a positive integer indicating a label. The execution of this instruction simply changes the content of the program counter (PC) to b, so th a t the next instruction to be executed will be the one w ith label b. 134 JZERO b; where b is a positive integer indicating a label. The execution of this instruction will change the content of PC to b, if the value of FF is zero. Otherwise, it does nothing. HALT. This instruction denotes the term ination of a program . After execu tion, the program is stopped and a program completion message will be sent to the host. ENABLE n,fc,d; where n , k are positive integers, and d is a cell. This instruction determines the values of the control bit c of cell d as follows: W hen the next sta rt instruction is executed, c takes a logical value "l" every k clock cycles for n times; and it is "0" all the other clock cycles. A num ber of enable instructions will determine the patterns of the control plane C which are generated by the control signal generator. START m; where m is a positive integer. Upon the execution of this instruction, the control patterns of C are applied to the array of cells for the next m clock cycles. The above instructions can be clustered into four groups: the control-type instructions (JUM P, JZERO, and STOP) determine the control flow of a pro gram; the connect-type instructions (CON, C O N l, CON2, and CON3) establish the necessary interconnections between cells; the enable instructions determine patterns of the control plane; and the START instruction performs actual operations. Note th a t d ata transfer and arithm etic/logic operations are done by first connecting a num ber of cells to form a desired circuit using some connect- type instructions, setting up necessary control patterns using a num ber of enable m instructions, and then completing the actual operation w ith a start instruction. A num ber of connect instructions m ay be combined into a single connect instruction, which will accomplish all the necessary connections. For example, we can use CON (1,2),(3,4),(5,6) to replace CON 1,2; CON 3,4; CON 5,6; The same is true for the enable instructions. A t compile tim e, a high-level language construct (such as the forpipe con struct described in C hapter 4) is transform ed into a connect instruction and an enable instruction followed by a start instruction. The connect instruction is then translated into ah connect command and loaded into the connect command memory. The enable instruction is translated into a tim ing command and loaded into the tim ing command memory. 
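One part of that compile-time grouping, combining consecutive CON instructions into a single connect instruction as in the CON (1,2),(3,4),(5,6) example above, can be sketched as follows; the tuple encoding of instructions is an illustrative assumption, not Opcom's machine format.

```python
def combine_connects(instructions):
    """Group runs of consecutive CON instructions into one combined connect
    command, as in 'CON (1,2),(3,4),(5,6)' replacing three separate CONs.
    Instructions are (opcode, operand) pairs."""
    combined, pending = [], []
    for opcode, operand in instructions:
        if opcode == "CON":
            pending.append(operand)              # operand is an (a, b) pair
        else:
            if pending:
                combined.append(("CON", pending))
                pending = []
            combined.append((opcode, operand))
    if pending:
        combined.append(("CON", pending))
    return combined

prog = [("CON", (1, 2)), ("CON", (3, 4)), ("CON", (5, 6)),
        ("ENABLE", (32, 2, 6)), ("START", 70)]
print(combine_connects(prog))
# -> [('CON', [(1, 2), (3, 4), (5, 6)]), ('ENABLE', (32, 2, 6)), ('START', 70)]
```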
A t run time, when a connect instruc tion is executed, an interconnect command is fetched from the connect memory to the register R, the signals of R will then reconfigure the interconnect unit so th a t all desired connections are realized in parallel. Similarly, when an enable instruction is executed, the corresponding timing command is fetched from the timing command memory to the register T, which is then sent to the control signal generator to generate all desired tim ing sequences in parallel. 7.2. S im u lation o f T u ring M achines b y O pcom In this section, I shall prove th a t the Opcom computer is a general purpose system by showing th a t it can simulate any Turing machine. Since every Turing 136 machine can be sim ulated by a Random Access Machine (RAM)[6], I will prove below th a t the Opcom can sim ulate an RAM, which in turn, can simulate a Turing machine. For this purpose, I can restrict the RAM instruction set as fol lows: An instruction consists of two parts, an operation code and an address. The possible operations are {Load, Store, Add, Read, W rite, Jum p, Jgtz, Jzero, Halt}. The address can be a label or an operand. Three types of operand are possible which correspond to immediate, direct, and indirect addressing modes. Note th a t these instructions are quite sufficient for the RAM to simulate any Turing machine. Before the actual proof of the main result, I first make some assumptions. Note th a t in the Opcom computer, a bit is the basic unit of processing, whereas in a RAM, a word is the basic unit. Suppose the word length is w, we first map part of the array of cells to RAM ’s memory by connecting w consecutive cells into a linear array to serve as a memory word. I assume the uniform cost cri terion for RAM, th a t is, each RAM instruction takes one unit of tim e (step). 1 also assume each Opcom instruction takes one step. Since each Opcom instruc tion essentially performs an one-bit operation, our assum ption is conservative. For example, to transfer a word from a memory location to the accumulator, the RAM machine needs only one step by executing a Load instruction. But u steps are needed for the Opcom. T h e o re m 7.1. Any RAM instruction can be sim ulated by the Opcom com puter in at most 0{w) steps. Thus any RAM program of complexity 0(n) can be sim ulated by the Opcom computer in 0(wn) steps. 137 P ro o f. Firstly, the array of cells is divided into a finite processing part and a potentially infinite memory part. The memory consists of a num ber of registers M x, M 2, where M l is the accumulator and some registers (e.g., M 2 up to Mg) are used as scratch pad registers. First note th at HALT and JUM P b are the same in both RAM and Opcom. Read, W rite, Load, and Store are d ata transfer instructions, and each of which can be realized by 0 ( w ) "CON3" instructions if the immediate addressing mode is used. For direct addressing, I can sim ulate each of them by using 0 ( w ) con nect and enable instructions and a "START m" instruction where m = 0 (w ). For indirect addressing mode, I first transfer the effective address to the address ing register A in the control unit, and then use a "CONl" or "CON2" instruction to establish the desired connections and 0(w) enable instructions and a start instructions to complete the indirect data transfer. To sim ulate Jzero b of RAM by Opcom, I can first connect O(w) cells into a zero-test circuit (which is essentially an OR-tree for tw o’s complement integers). 
The inputs of the circuit are then connected to the accum ulator and the output to FF. All these can be done by executing 0 ( w ) connect instructions. Because only O(w) cells are involved, I need only O(w) enable instructions to set up the desired timing sequences. Then the execution of START m, where m = 0 (log w ), will set FF to "0" when the accumulator contains a zero. Finally, I execute a JZERO b instruction to complete the sim ulation. The total time steps needed for the sim ulation of this instruction is 0 (w )+ 0 (u ;)+ 0 (lo g u ;)= 0 (w ). Similarly, I can sim ulate Jgtz b in the same 138 am ount of time. Now I tu rn to the Add operation. First I consider the direct addressing case, i.e., A dd L We first construct an add circuit as shown in Fig. 7.4 and then connect the accumulator and Mi to it. Since altogether only 2 w + ll cells are involved, I need only 0 ( w ) enable instructions. Then a "START 2w-|-2" instruc tion will complete the simulation. Suppose I have immediate addressing, th a t is, Add =i. Let the binary representation of i be ■ ■ • iw. Then I can first use w CON3 instructions to transfer i into a memory word, say M 2. Then I sim ulate Add 2. Finally I consider indirect addressing, i.e., Add *i. We can first transfer the content of Mg - (assume it is j) to the addressing register A by O(w) connect and enable instructions and a "START l" instruction. And I use a C O N l, w enable instructions and a "START w" to transfer the content of Mj to M 2. Then I sim ulate A dd 2. Q.E.D. 7.3. C o n c u rre n t O p e ra tio n s in O p c o m The proposition in the previous section m ight be interpreted m istakenly to indicate th a t Opcom would be w times slower th an an RAM machine. However, this is not the case. In fact, the basic processing mode of Opcom is massively parallel and pipelined processing based on the pipeline net concept. In this sec tion, I shall dem onstrate how pipeline net com putation is performed in Opcom. 139 In designing Opcom, I have considered four facts observed from optical computing research: (1) Clock-skew may not be a serious problem in optical sys tems. This implies th a t it is easier to synchronize a large optical system via a global clock. (2) In constructing gates using bistable optical devices, each gate can be naturally associated w ith a latch because of the bistability. (3) Recent advances in optical bistable devices promise optical gates w ith very fast switch tim e in fraction of a nanosecond[74]. However, such gates are still expensive. (4) Optical random-aecess-memory is not yet available. Although it can be implemented by bistable devices, the cost m ay be high. In addition, optical sys tems perm it high bandw idth, highly parallel I/O . In this research, I assume I/O is lim ited as in electronics and do not consider the effects of increased I/O com m unication. Based on facts (1) and (2), I extend the pipeline net concept down to the cell (gate) level in the Opcom design. Thus each cell is a pipeline stage and the pipeline cycle tim e is reduced to the cell (gate) delay. Facts (3) and (4) indicate th a t we need to provide schemes to implement memory and to reduce the am ount of hardw are used for processing and storage units. In response to these needs, I use a homogeneous array of cells in Opcom to serve both storage and processing purposes. Thus memory and processing units can share the same opt ical hardware. 
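A small Python sketch of this sharing is given below; it borrows the 14-cell bit-serial adder and the 100K-cell array used later in Section 7.3, and the CellArray class and its methods are illustrative assumptions rather than part of the Opcom design.

```python
class CellArray:
    """The same pool of cells is carved into w-bit memory words and into
    logic cells on demand, so storage and processing share the hardware."""

    def __init__(self, total_cells: int, word_length: int):
        self.free = total_cells
        self.w = word_length

    def reserve_words(self, count: int) -> None:
        """Claim `count` memory words of w cells each (e.g. operand vectors)."""
        needed = count * self.w
        if needed > self.free:
            raise MemoryError("not enough cells for storage")
        self.free -= needed

    def adders_buildable(self, cells_per_adder: int = 14) -> int:
        """How many 14-cell bit-serial adders fit in the cells left over."""
        return self.free // cells_per_adder

# 100K cells, w = 32: storing two 1000-element operand vectors and the result
# vector still leaves room for a few hundred adders, so the additions can run
# word-parallel; a much longer vector forces the word-sequential scheme.
arr = CellArray(100_000, 32)
arr.reserve_words(3 * 1000)
print(arr.adders_buildable())   # -> 285
```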
This idea of merging storage and processing hardware has been noted in the Connection Machine design[33] To further reduce the demand for processing hardware, and to exploit the advantages of pipelining, I adopt the bit-sequential-word-parallel processing 140 scheme as described in [41]. This scheme has at least two nice implications. A bit-sliced processing unit usually requires far less hardw are th an a bit-parallel unit. For example, the bit sequential adder in Fig. 7.4 needs only 14 cells, independent of the word length. However, the num ber of cells needed for a bit- parallel adder is at least linearly proportional to the word length. Bit-sliced operations also implicitly increase the vector length in vector operations. Sup pose we w ant to add N-dimensional vectors of integers w ith word length w. For bit-parallel hardware, the vector length is ju st N ; but it becomes wN for bit- sequential hardware. And of course long vectors are more suitable for pipelined processing because the reconfiguration overhead and the pipeline fill-up tim e are better masked. Most com putations in Opcom are performed in the following way. Suppose the com putation is form ulated by a high-level loop construct (like a Fortran DO-loop or a PAL forpipe loop). A compiler first translates the construct into a pipeline net specification. This specification usually consists of a num ber of con nect and enable instructions and a single start instruction. A t run time, a required pipeline net, which in Opcom is essentially a synchronous sequential cir cuit, is constructed using connect instructions. Necessary timing sequences of the cells (the control patterns) are then prepared w ith enable instructions. We call all this collection of activities a reconfiguration. The reconfiguration overhead is denoted as 7 . An instruction "START t " is then executed, which will feed the operands into the pipeline net in a bit-sequential fashion, to com plete the com putation. Thus to perform a com putation, we need 7 + t clock I 141 cycles in total. The num ber t , which is the minimal num ber of cycles needed for the reconfigured circuit to perform the desired com putation, is a function of three term s which I now define. For any circuit (i.e., pipeline net), the maximum num ber of cells an operand bit flows through from input to output w ithout repeating is called the circuit delay, denoted as r. For most com putations, the control patterns, which are set up by the enable instructions, are the same for all the involved memory cells. For example, the enable instructions for the adder in Fig. 7.4 are '‘ ENABLE n,a,d", where n = w and k — 2 for every cell d of A, B, and S. I call n the vector length and e x . the latency. It takes r cycles for the first operand bit (more precisely, bit-group) to traverse the circuit to form a result bit. And after th a t, a result bit emits from the circuit every a cycles. Thus t —r+cx(n— 1). The total time needed for a com putation is equal tc ^ 7 + r + a (n — l)), where 7 , r, a, n, r are the reconfiguration overhead, circuit delay, latency, vector length, and clock cycle time, respectively. The cycle time r, which is determined by the cell delay tim e plus the time optical signals pass through the interconnect unit, could be lower than one nanosecond. The reconfiguration tim e 7 , which is dom inated by the tim e tc reconfigure the interconnect unit, is currently much longer. To perform a single reconfiguration of the interconnect unit, two operations are needed. 
We must first generate the desired reconfiguration command. And we m ust physically apply this command to set up the interconnect unit. The time needed to per form the latter operation depends on the available optical technology; thus a 142 computer architect can do nothing to reduce it. The best observed result for this kind of operation is a few microseconds [61, 72]., On the other hand, one can minimize the time needed to generate the reconfiguration commands through software and architecture means. In Opcom, we precompute all connect commands at compile time. T hat is, when a high- level language program is translated into Opcom's machine code, the connect and the enable instructions are grouped together and compiled into connect commands and tim ing commands, respectively. These commands are then pre stored in the connect command memory and the tim ing command memory dur ing the program load time. A t run time, we only need to fetch the memories to generate the desired command, which can be done in principle in a single clock cycle. This technique can be effective when the high-level language used has a static allocation mechanism. F ortran is such a language. Now let us consider the com putation of a vector addition in Opcom. This example is chosen because it is easy to understand, and it demonstrates most salient features of the Opcom architecture. The following is a forpipe formula tion of the algorithm , which adds iV-dimensional vectors A and B to form the sum m ation vector S. forp ipe I : = 1 to N do begin S(I) = A(I) +B(I); end We need to perform N scalar additions. Suppose the numbers involved are w-bit tw o’s complement integers. Then a scalar addition S = A -\-B, where 143 S = s wsw_ 1...s1, A = awaw_ 1...a 1, and B = bwbw_ 1...b1, can be form ulated as follows: c[0] : = 0; fo r p ip e i : = 1 t o w d o b e g in i] I c[i] s[i] e n d a a g P i] © b [i]; i]/Vb[i]; i]v(p[i]Ac[i-l]); i]®c[i-lj; where 0 denotes the exclusive-or operation. To perform a scalar addition in Opcom, I first construct a bit-serial, tw o’s complement adder by connecting 14 cells as shown in Fig. 7.4. In the above pro gram, A and B are operands, S is the sum, and G, P and C are tem porary data. In compiling the loop into Opcom’s machine code, memory space is reserved only for operand and result data, the tem porary variables will be mapped to the out put of certain cells in the constructed pipeline net. This is illustrated in Fig. 7.4. The resulting pipeline net com putation in Fig. 7.4 has a delay r = 6, spacing a = 2, and vector length (which is just the word length) n— w. Thus the total com putation tim e is (a + 2u> + 2 )r. If we w ant to add two A^dimensional vectors of integers, we can adopt m any processing methods. Here I only discuss two extreme cases. Suppose I have a small array of cells, or equivalently, the vectors are very long. A fter allocating cells for storing operands and results, we may end up w ith very few cells left for processing. Suppose only 15 cells are left. In this case, I can only build one adder and all additions are performed sequentially in a bit-serial, pipelined fashion. 144 The total tim e needed is ('7 + 2m /V +4)r. We observe th a t in this case, the effect of reconfiguration overhead and circuit delay will be insignificant for large N, since 2wN will become the dom inant term . The total num ber of cells needed is 3 w N + 1 7 . 
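The propagate/generate/carry recurrence used by the scalar addition loop above can be checked in software. The following Python sketch is an illustration only (it is not Opcom code); it evaluates the same p, g, c, s equations bit-serially, least-significant bit first.

```python
def bit_serial_add(a_bits, b_bits):
    """Evaluate the recurrence of the scalar addition loop: p = a xor b,
    g = a and b, s = p xor c_prev, c = g or (p and c_prev), with c[0] = 0."""
    c = 0
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b
        g = a & b
        s_bits.append(p ^ c)
        c = g | (p & c)
    return s_bits

def to_bits(value, w):
    return [(value >> i) & 1 for i in range(w)]          # LSB first

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

w = 8
assert from_bits(bit_serial_add(to_bits(75, w), to_bits(38, w))) == (75 + 38) % 2**w
```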
If we want to add two N-dimensional vectors of integers, we can adopt many processing methods. Here I only discuss two extreme cases. Suppose I have a small array of cells, or equivalently, the vectors are very long. After allocating cells for storing operands and results, we may end up with very few cells left for processing. Suppose only 15 cells are left. In this case, I can only build one adder, and all additions are performed sequentially in a bit-serial, pipelined fashion. The total time needed is (γ + 2wN + 4)τ. We observe that in this case the effect of the reconfiguration overhead and the circuit delay is insignificant for large N, since 2wN becomes the dominant term. The total number of cells needed is 3wN + 17.

The other extreme case is when the array of cells is sufficiently large, so that we can perform the N additions in parallel by constructing N adders. It is easy to see that the time needed becomes (γ + 2w + 4)τ, and the total number of cells is now wN + 14N. Note that now the reconfiguration overhead can no longer be ignored, since the vector length is just the word length.

It should be noted that, unlike most parallel computer architectures, Opcom does not have a fixed number of processing elements. One can dynamically construct as many processing elements as needed, so long as the array of cells is sufficiently large. Furthermore, one can trade processing cells for memory whenever this is needed.

Suppose I have a small Opcom computer with 100K cells (such a small cell array is roughly equivalent to an electronic microprocessor chip in terms of gate count). Also assume the cycle time τ = 1 nsec and the reconfiguration time γ = 1 microsecond. Suppose I want to add two N-dimensional vectors of word length w = 32. The throughput (number of operations per second) is plotted in Fig. 7.5 as a function of N.

Figure 7.4. The logical circuit of a bit-slice integer adder.

Figure 7.5. The throughput performance of vector addition (throughput in MOPS versus vector length N), where t is a threshold limited by the array size (m*n), the word length w, and the adder complexity.

The increasing section of the throughput curve indicates linear growth as N increases. In this region, there are enough cells for both processing and memory usage. The word-parallel scheme is used to perform all element additions in parallel; the larger the vector length N, the higher the parallelism. A peak throughput of 1000 million operations per second (MOPS) is observed when N increases to around 1K. Here an operation is defined as an integer add. The decreasing section of the figure indicates that I must trade some processing cells for memory as the vector length N becomes too large for the full word-parallel scheme. In the worst case (when N increases to about 1.333K), most cells are used to hold data, and the remaining cells are sufficient to construct just one adder. We can still perform the vector addition using the word-sequential scheme and reach a throughput of about 1.3 MOPS. It should be noted that the increased memory is required because of the limited I/O to the array of cells under our assumptions. Taking the highly parallel I/O nature of optical systems into account may eliminate this decrease in throughput.
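The shape of Fig. 7.5 follows from the two timing formulas just derived. The short Python sketch below (illustrative only; the helper names and the printed example are mine, not the dissertation's) evaluates both regimes with the stated parameters (τ = 1 ns, γ = 1 μs = 1000 cycles, w = 32); the word-parallel figure of roughly 940 MOPS at N = 1000 is consistent with the peak of about 1000 MOPS quoted above.

    def time_word_parallel(N, w, gamma, tau):
        # N adders working in parallel: (gamma + 2w + 4) * tau seconds
        return (gamma + 2 * w + 4) * tau

    def time_word_serial(N, w, gamma, tau):
        # one bit-serial adder handling all N additions: (gamma + 2wN + 4) * tau
        return (gamma + 2 * w * N + 4) * tau

    gamma, tau, w, N = 1000, 1e-9, 32, 1000
    t_par = time_word_parallel(N, w, gamma, tau)
    t_ser = time_word_serial(N, w, gamma, tau)
    print(round(N / t_par / 1e6), "MOPS word-parallel")          # about 936 MOPS
    print(round(t_ser * 1e6, 1), "microseconds word-serial for the whole vector")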
7.4. Implementation Issues of Opcom

Up to now, I have deliberately concentrated on the logical or functional architecture of the Opcom computer. Little has been said about the physical structure or implementation. In this section, I discuss some possible methods of implementing the array of cells and the interconnect unit.

The array of cells. The array of cells consists of homogeneous cells arranged in a two-dimensional rectangular array. Since all cells are identical, it suffices to discuss how a single cell can be implemented. There are many ways to implement the cell as defined in Fig. 7.2. For instance, since each cell can be viewed as a simple three-input, two-output sequential circuit, any scheme capable of realizing sequential circuits can be used to implement the cell. An example of such a scheme is the "sequential optical processor" described by Jenkins et al. in [49]. In this system, a spatial light modulator is used as a 2-D array of independently acting NOR gates. These gates are interconnected via an optical system that utilizes a computer-generated hologram. Storage capability is provided via optical feedback. The most promising technology for implementing these gates is that of optical bistable devices.

Another method of realizing the cells is to implement each cell directly with an optical bistable device instead of connecting together multiple gates. This will reduce the number of active elements (cells or gates) required. From a device point of view, this means the device resolution or overall size requirements are reduced, with the tradeoff being increased requirements on each resolution element. Optical bistable devices are discussed in [18, 66, 70, 74, 76]. A typical way to achieve optical bistability is to insert a nonlinear medium inside a Fabry-Perot etalon. The index of refraction in the medium depends on the incident light intensity, and the transmitted intensity (I_out) through the device can be plotted as a function of the incident intensity (I_in) as shown in Fig. 7.6.

Figure 7.6. The bistability phenomenon in an optical bistable device (output intensity versus input intensity).

Figure 7.7. Implementing a cell by an optical bistable device (inputs x, y, p, c, with a shutter in the paths of x and y; output q).

The transmission of the device remains low until I_in is increased beyond a critical value I_3. The transmission then increases rapidly towards a high value. It remains high thereafter, even as I_in decreases, until I_2 is reached; it then suddenly jumps back to a low value. Optical bistable devices have been demonstrated for gate operations at subnanosecond speeds and switching energies approaching those of electronic gates. Additional progress is needed in further reducing the switching energies and in making 2-D arrays of these devices. Research toward these goals is currently under way.

A cell of our machine can be made of such devices as shown in Fig. 7.7. The input beam is obtained by superimposing the data inputs x and y, the control input c, and the power input p. The data output q is just the transmitted beam I_out, and the complement of q could be obtained using the reflected beam. Note that a shutter is placed in the optical paths of x and y, which can be implemented using the same type of optical device. By properly defining the physical intensities corresponding to logical values, I can implement a logic cell with one or two bistable devices.

Consider Fig. 7.7 again. In our design, p has a constant intensity of I_1. The physical intensities of x and y are assigned as I_4 − I_1 for the logical value "1", and 0 for logical "0". The implementation of the control bit c is a little more complicated. If c has the logical value "0" in a certain clock cycle, the shutter is closed and the physical intensity of c is I for the entire cycle. Now suppose the logical value of c is "1". Then during the first half cycle the shutter is open and the intensity of c is 0; in the second half cycle the shutter is closed and the intensity of c is I. With this logical-to-physical mapping, it is straightforward to verify that this device (Fig. 7.7) has exactly the same (logical) behavior as the cell defined in Fig. 7.2.
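The hysteresis just described can be modeled by a small state-dependent transfer function. The Python sketch below is only an illustration of that qualitative behavior; the threshold values I2 and I3 and the transmission coefficients are arbitrary placeholders, not measured device data.

    class BistableDevice:
        # toy model of Fig. 7.6: transmission stays low until the input rises
        # above I3, then stays high until the input falls below I2 (I2 < I3)
        def __init__(self, i2=2.0, i3=3.0, low=0.1, high=0.9):
            self.i2, self.i3 = i2, i3
            self.low, self.high = low, high
            self.state_high = False

        def transmit(self, i_in):
            if not self.state_high and i_in > self.i3:
                self.state_high = True
            elif self.state_high and i_in < self.i2:
                self.state_high = False
            return i_in * (self.high if self.state_high else self.low)

    d = BistableDevice()
    sweep = [1, 2, 3, 4, 3, 2.5, 1]   # raise the input past I3, then lower it
    print([round(d.transmit(x), 2) for x in sweep])
    # low transmission while rising below I3, high once past it, low again below I2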
The interconnect unit. There are many possible techniques for implementing the interconnect unit. Fixed optical interconnections (e.g., utilizing computer-generated holograms) could be used in conjunction with optical switches (gates or cells) to implement a multistage reconfigurable interconnection network [71]. Reconfiguration could potentially be performed at the speed of the switches, and the network could be circuit switched, since control signals could be sent to all switches in parallel optically. This approach assumes that arrays of fast switching elements will become available. Alternatively, a single-stage crossbar could be implemented using spatial light modulators in an optical matrix-vector or matrix-matrix multiplier [72]. Here I consider only this latter spatial light modulator approach. It should be noted that some of the resulting limitations in the computing system may not be fundamental to optical systems but rather reflect our limiting assumption of using only this type of reconfigurable interconnection.

Examples of these optical crossbars have been described in [15, 71, 72]. Optical crossbars of moderately large size (128 by 128 or 256 by 256) look feasible in the near future. Such a unit will have a delay time of around one nanosecond and a reconfiguration time of about 1 microsecond. One such scheme adopts an acousto-optical deflector to implement an inner-product matrix-vector multiplier. This scheme may offer high optical efficiency and limited broadcasting capability. It should also be noted that cells in the array could be used instead of the deflector to implement crossbars with fast reconfiguration times, at the expense of somewhat increasing the delay through the network.

Despite the fact that large optical crossbars are much easier to implement than electronic ones, they are not large enough to support the tremendous interconnection requirements of the Opcom architecture when the array of cells is only moderately large. For example, a 10^3 x 10^3 array of cells requires a 10^6 x 10^6 crossbar, which is difficult to implement even with optical technology. To alleviate this problem, below we shall present several techniques which reduce the interconnection complexity by decomposing the Opcom system in various ways, providing a tradeoff between flexibility and cost.

Suppose we have n cells in the array. We need an n x n crossbar if full connections among all the cells are desired. The first technique is to divide the interconnections into two classes: long-term interconnections and short-term interconnections. Note that in the Opcom architecture, I must connect a number of cells into a linear array to construct a memory word. These interconnections within a word need not be changed frequently at run time. Also, if the major application is number crunching, I may need a lot of arithmetic units (adders, multipliers, dividers, elementary function generators, FFT butterfly units, etc.). These units can be constructed before the computation, and the interconnections within them can be fixed for a very long time. Thus we see two classes of interconnections: the intra-word and intra-unit ones, and the inter-unit and inter-word ones. The former could be fixed for a long time (hours, days, even months), while the latter may have to be reconfigured frequently.
Computer-generated holograms can support fixed interconnections up to a very large scale (10^6 cells can be connected in a fixed fashion using the hybrid interconnection scheme introduced in [82]). Only the short-term interconnections need to be implemented using the much more costly crossbar. For example, if the cells are grouped into (fixed) units of size p (with a constant number of inputs and outputs per unit), then the required crossbar size is reduced to O(n^2 / p^2).

Another scheme is to decompose the system into several subsystems and to decompose the interconnect unit accordingly. I will explain this scheme by examining an implementation of the pipeline net processor (Fig. 4.5) discussed in Chapter 4. Suppose the processor contains m vector registers with vector length n and word length w, and m identical functional pipelines which perform only integer addition using the circuitry shown in Fig. 7.4. We need an m x 2m optical crossbar network for implementing BCN1 and an m x 3m optical crossbar network for implementing BCN2. Note that the sizes of these two networks are independent of the vector length n and the word length w. Each bit-slice integer adder needs 14 cells. The total number of cells needed to implement the registers and the FPs is mnw + 14m. Suppose w = 16, m = 8, and n = 64. The total number of cells is 8304. Such a pipeline net can be implemented with an 8 x 16 and an 8 x 24 optical crossbar network, and a 100 x 100 array of cells. However, if full connections among all cells were demanded, we would need an optical crossbar network of size 10^4 x 10^4.
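The figures in this example follow directly from the formulas above. The Python sketch below (illustrative only; the function names and the grouping example are mine, not the dissertation's) recomputes the cell count and shows how grouping cells into fixed units of size p, with a constant number of crossbar ports per unit, shrinks the required crossbar.

    def pipeline_net_cells(m, n, w, cells_per_adder=14):
        # cells for m vector registers (n words of w bits each) plus m bit-slice adders
        return m * n * w + cells_per_adder * m

    def grouped_crossbar_ports(n_cells, p, ports_per_unit=1):
        # with fixed intra-unit wiring, only the units appear on the crossbar
        units = -(-n_cells // p)   # ceiling division
        return units * ports_per_unit

    print(pipeline_net_cells(m=8, n=64, w=16))        # 8304, as in the text
    print(grouped_crossbar_ports(10**6, p=1000))      # 1000 ports instead of 10**6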
Chapter 8. Conclusions

With increasing research on and applications of parallel computation, there is a strong need for a coherent theory. Such a theory should provide a unified framework in which many questions regarding the design of parallel algorithms, languages, and computers can be asked and answered. In this dissertation, we present a methodology for the development of such a theory. Three classes of important problems (representation, analysis, and synthesis) have been identified. The main body of a parallel computation theory consists of formulations and solutions of these problems in a computation domain.

8.1. Summary of the Research Results

The main contributions of this research lie in four areas.

Molecule. I have presented a parallel language construct to realize various computation modes for various parallel computer classes. The basic mechanisms of this molecule construct are described with a structured language, PAL. PAL supports many computation modes via the molecule type concept. A user can characterize a particular mode by defining corresponding molecule types with a type construct. Such a language can be used as a common notation for developing, teaching, and implementing parallel algorithms. The flexibility of PAL is demonstrated by defining six molecule types for specifying SIMD array computation, pipelined processing, shared variable multiprocessing, message passing multicomputing, and dataflow computation.

Associated with PAL is a layered software development methodology. Users develop programs using macro dataflow molecule types, which are architecture-transparent and user-friendly in that most of the tedious work in partitioning, communication, synchronization, and allocation is left to a software translator. If a user really wants to take advantage of particular architectural details for better performance, he has the option to use architecture-oriented molecules in fine-tuning the programs. I have demonstrated the advantages of the layered approach to parallel software development with a concrete example of matrix inversion.

Pipeline net. A new parallel architecture has been designed to speed up the evaluation of compound functions. The pipeline net architecture and its operational principle are described. By dynamically reconfiguring its topology, a pipeline net can best match the dataflow patterns in user algorithms. Due to this reconfigurability, a single pipeline net processor can be programmed to serve as many different types of two-level pipelined systolic arrays. This approach is more flexible and cost-effective than systolic arrays for general-purpose scientific computations.

Algorithms are presented for transforming forpipe loops into pipeline net implementations. These algorithms are useful for designing the proposed pipelined processors. Our methods can easily be modified to obtain fixed structures such as systolic arrays and wavefront arrays, if programmability is not required. The performance of the proposed architecture is analyzed with the Livermore loop benchmarks. Compared with a single pipeline processor, a pipeline net can achieve significant speedup, especially for long vector operations. When a forpipe loop satisfies the strong single assignment rule, or equivalently, when the corresponding computation system is acyclic, an almost linear speedup is observed. For cyclic systems, even superlinear speedup is possible.

Computation system. A mathematical model for CF evaluation is defined. It is shown that any pipeline net computation can be recast into this system model. An operational semantics is also given in terms of computation systems. Furthermore, the properties of termination, determinacy, and equivalency are defined and studied. It is shown that these semantic properties can be checked and/or preserved by checking or manipulating their syntactic counterparts.

Optical computing. A digital optical architecture is presented as an initial step towards answering two important questions: What is a good, general-purpose functional architecture for optical technology? What kind of optical devices are needed to build a powerful digital optical computer? The concepts and techniques presented in this dissertation may inspire further research in developing meaningful optical architecture and system concepts.

Our research reveals that the pipeline net is an architectural concept suitable for designing digital optical computer systems. I showed that the Opcom computer based on this concept is quite general purpose in the sense that it can simulate any Turing machine. More important, however, is the fact that such an architecture can take advantage of many of the features offered by optical technology (such as massive parallelism, gate-level pipelining, and flexible and global interconnection) and at the same time partially circumvent some of the problems existing in current optical technology (e.g., the large reconfiguration overhead). I showed by examples and analysis that, because of the massive, global interconnections offered by optical technology, an optical computing system may have a much higher hardware utilization rate than its electronic counterpart.
For instance, an optical system with about the same hardware as an electronic microcomputer may reach a peak throughput of 1000 MOPS. I have also discussed several potential techniques for optical implementation of the Opcom architecture. One promising technique is to use optical bistable devices to implement the array of cells, and an acousto-optic deflector matrix-vector multiplier to implement an optical crossbar interconnect unit. In response to the mismatch between the huge interconnection requirement of Opcom and the limitations of current optical interconnection technology, we also provided several techniques to trade flexibility for interconnection cost.

8.2. Suggestions for Future Research

Parallel computation is like the Pacific in both breadth and depth. This dissertation is only an initial step towards an all-embracing theory. There are many topics that still need to be studied; the following is only a partial list.

Although the molecule construct has been defined and studied in some detail in Chapters 2 and 3, using a hypothetical language PAL, only a practical implementation on a commercial parallel computer can fully reveal its merits and potential problems. The best candidate computers include the Cray X-MP, the Alliant FX/8, and the iPSC VX. Each of these machines utilizes at least two computation modes: multiprocessing and pipelining.

The pipeline net approach presented in this dissertation has several restrictions. I only consider constant-cycle schedules, i.e., wavefronts entering a pipeline net with a constant latency. This restriction is used to simplify the control of a pipeline net. The throughput of a pipeline net could be further enhanced if variable-latency cycles were used to minimize the average latency. However, because of the desired reconfigurability, a pipeline net is set up dynamically after compile time, which makes it difficult to implement variable-cycle schedules. I also restrict my study to computations which can be represented as forpipe loops. Further research is encouraged in developing a precompiler which transforms Fortran loops or other higher-level language specifications into forpipe loops.

This dissertation develops a theory for a class of synchronous parallel computation, based on the model of computation systems. It is interesting and useful to extend this model to cover other classes of synchronous computations (e.g., SIMD array processing, synchronous MIMD computing, etc.) and asynchronous parallel computations. A suggestion is to modify the firing rules and the tagging scheme.

References

1. “Special Issue on Optical Computing,” Proc. IEEE, vol. 72, no. 7, July 1984.
2. “Special Issue on Optical Computing,” Optical Engineering, vol. 23-25, no. 1, Jan. 1984-86.
3. W. B. Ackerman and J. B. Dennis, “VAL--A Value Oriented Algorithmic Language, Preliminary Reference Manual,” Tech. Report 218, Laboratory for Computer Science, MIT, June 1979.
4. W. B. Ackerman, “Data Flow Languages,” IEEE Computer, vol. 15, no. 2, pp. 15-25, Feb. 1982.
5. A. K. Agrawala and T. G. Rauscher, Foundations of Microprogramming, pp. 73-75, Academic Press, New York, 1976.
6. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley Publishing Company, 1974.
7. G. R. Andrews and F. B. Schneider, “Concepts and Notations for Concurrent Programming,” ACM Computing Surveys, vol. 15, no. 1, March 1983.
8. Arvind, K. P. Gostelow, and W. E.
Plouffe, “An Asynchronous Program ming Language and Computing Machine,” Tech. Report 114a, Dept, of Inform ation and Com puter Science, UC Irvine, Dec. 1978. 9. R. G. Babb, II, “Program m ing the HEP w ith Large-Grain Dataflow Tech niques,” in MIMD Computation: H EP Supercomputer and Its Application, ed. Kowalik, The MIT Press, Cambridge, MA, 1985. 10. K. E. Batcher, “Design of a Massive Parallel Processor,” IEEE Trans, on Computers, vol. C-29, no. 9, pp. 836-840, Sept. 1980. 11. J. A. Bergstra and J. W. Klop, “Process Algebra for Synchronous Com m unication,” Information and Control, vol. 60, pp. 109-137, 1984. 12. A. Borodin, “ On Relating Time and Space to Size and D epth,” SIA M J. Comput., vol. 6, pp. 733-744, 1977. 13. A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer, “A lternation,” J. ACM , vol. 28, no. 1, pp. 114-133, 1981. 159 14. C. Y. Chin and K. Hwang, “Packet Switching Networks for Multiproces sors and Dataflow Com puters,” IE EE Trans, on Computers, vol. C-33, no. 11, pp. 991-1003, Nov. 1984. 15. B. Clymer and S. A. Collins, Jr., “ Optical Com puter Switching Network,” Optical Engineering, vol. 24, no. 1, pp. 074-081, Jan. 1985. 16. S. A. Cook, “Towards a Complexity Theory of Synchronous Parallel Com putation,” Enseign. Math., vol. 27, pp. 99-124, 1981. 17. Intel Corporation, iPSC Concurrent Processor Manual, 1986. 18. M. Dagenais and W .F. Sharfin, “Extremely Low Switching Energy Optical Bistable Devices,” Optical Engineering, vol. 25, no. 2, pp. 219-224, Feb. 1986. 19. E. S. Davidson, “ Scheduling for Pipelined Processors,” Proc. 7th Hawaii Conf. on System Sciences, pp. 58-60, 1974. 20. U. S. D epartm ent of Defense, “Program ming Language Ada: Reference M anual,” in Lecture Notes in Computer Science, vol. 106, Springer-Verlag, New York, NY, 1981. 21. J. B. Dennis, “First Version of a D ata Flow Procedural Language,” Lecture Notes in Computer Science, vol. 19, Springer-Verlag, Berlin, 1974. 22. E. W. D ijkstra, “ Cooperating Sequential Processes,” in Programming Languages, ed. F. Genuys, Academic Press, New York, NY, 1968. 23. J. A. Feldman, “High Level Program m ing for D istributed Com puting,” Commun. ACM , vol. 22, no. 6, pp. 353-368, June 1979. 24. M. W . Ferrante, “Cyberplus and Map V Interprocessor Communications for Parallel and A rray Processor Systems,” Multiprocessors and Array Pro cessors, pp. 45-54, Simulation Councils, Inc., San Diego, CA, Jan. 1987. 25. S. Fortune and J. Wyllie, “Parallelism in Random Access M achines,” Tenth A C M Symp. on Theory of Computing, pp. 114-118, 1978. 26. D. D. Gajski, D. A. Padua, D. J. Kuck, and R. H. Kuhn, “A Second Opin ion on D ata Flow Machines and Languages,” IEEE Computer, pp. 58-69, Feb. 1982. 27. D. D. Gajski, D. J. Kuck, D. Lawrie, and A. Sameh, “ Cedar,” COMPCON, pp. 36-39, Spring 1984. 160 28. L. M. Goldschlager, “A Unified Approach to Models of Synchronous Paral lel Machines,” Tenth A C M Symp. on Theory of Computing, pp. 89-94, 1978. 29. J.W . Goodman, F .J. Leonberger, S-Y. Kung, and R.A. A thale, “ Optical Interconnections for VLSI Systems,” Proc. IEEE, vol. 72, no. 7, pp. 850- 866, July 1984. 30. R. L. Graham, “Bounds on Multiprocessing Timing Anomalies,” SIA M J. Appl. Math., vol. 17, pp. 416-429, 1969. 31. S. Hawkinson, “The FPS T Series: A Parallel Vector Super Com puter,” Multiprocessors and Array Processors, pp. 147-156, Simulation Councils, Inc., San Diego, CA, Jan. 1987. 32. F. 
Meyer auf der Heide, “Lower Time Bounds for Solving Linear Diophan- tine Equations on Several Parallel Com putational Models,” Information and Control, vol. 67, pp. 195-211, 1985. 33. W . D. Hillis, The Connection Machine, The MIT Press, Cambridge, MA, 1985. 34. C. A. R. Hoare, “ Communicating Sequential Processes,” Commun. ACM, vol. 21, no. 8, pp. 666-677, Aug. 1978. 35. R. W . Hockney and C. R. Jesshope, Parallel Computers, pp. 51-64, Adam Hilger Ltd., 1981. 36. E. Horowitz, Fundamentals of Programming Languages, Com puter Science Press, Rockville, ML, 1983. 37. F. H. Hsu, H. T. Kung, T. Nishizawa, and A. Sussman, LINC: The Link and Interconnection Chip, Dept, of Computer Science, Carnegie-Mellon Univ., 1984. 38. A. Huang, “Architectural Considerations Involved in the Design of an Opti cal Digital Com puter,” Proc. IEEE, vol. 72j no. 7, pp. 780-786, July 1984. 39. K. Hwang and Y. H. Cheng, “Partitioned M atrix Algorithms for VLSI Arithm etic Systems,” IEEE Trans, on Computers, vol. C-31, no. 12, pp. 1215-1224, Dec. 1982. 40. K. Hwang and S. P . Su, “Priority Scheduling in Event-Driven Dataflow Com puters,” Tech. Kept. TR -E E 83-86, School of Electrical Engineering, Purdue University, 1983. 161 41. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, 1984. 42. K. Hwang and Z. Xu, “Dynamic Systolization for Developing Multiproces sor Supercom puters,” TR -E E 84~ 4%j Dept, of Electrical Engineering, P ur due University, Oct. 1984. 43. K. Hwang and Z. Xu, “Multiprocessor for Evaluating Compound A rith metic Functions,” Proc. 7th Symp. on Computer Arithm etic, pp. 266-275, June 1985. 44. K. Hwang and Z. Xu, “Remps: A Reconfigurable Multiprocessor for Scientific Supercom puting,” Proc. of 1985 I n t’ l. Conf. on Parallel Process ing, pp. 102-111, Aug. 1985. 45. K. Hwang and Z. Xu, “Multipipeline Networking for Fast Evaluation of Vector Compound Functions,” Proc. of 1986 In t’ l. Conf. on Parallel Pro cessing, pp. 495-502, Aug. 1986. 46. K. Hwang and Z. Xu, “Pipeline Nets for Compound Vector Supercomput ing,” IE E E Trans, on Computers, accepted to appear Jan. 1988. 47. D. Jacobs, “Para-Logic Programming: A Technique for Developing Parallel Program s,” Tech. Report CRI-87-37, Com puter Research Institute, Univer sity of Southern California, June 1987. 48. B.K. Jenkins, P. Chavel, R. Forchheimer, A.A. Sawchuk, and T.C. Strand, “Architectural Implications of a Digital Optical Processor,” Applied Optics, vol. 23, no. 19, pp. 3465-3474, Oct. 1984. 49. B.K. Jenkins, A.A. Sawchuk, T.C. Strand, R. Forchheimer, and B.H. Soffer, “ Sequential Optical Logic Im plem entation,” Applied Optics, vol. 23, no. 19, pp. 3455-3464, Jan. 1984. 50. R. M. K arp and R. E. Miller, “Properties of a Model for Parallel Com puta tions: Determinacy, Term ination, and Queueing,” SIA M J. Appl. Math., vol. 14, no. 6, pp. 1390-1411, Nov. 1966. 51. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, 1978. 52. P. M. Kogge, The Architecture of Pipelined Computers, pp. 57-66, McGraw-Hill Book Company, 1981. 53. Kowalik, MIMD Computation: HEP Supercomputer and Its Application, The MIT Press, Cambridge, MA, 1985. 54. H. T. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” Sparse M atrix Proc., pp. 32-63, 1978. 55. H. T. Kung and M. S. Lam, “Wafer-Scale Integration and Two-Level Pipe lined Implementations of Systolic A rrays,” /. Parallel and Distributed Pro cessing, vol. 1, no. 1, Aug. 1984. 56. S. Y. 
Kung, “ On Supercomputing w ith Systolic/W avefront A rray Proces sors,” Proc. of IEEE, vol. 72, no. 7, pp. 867-884, July, 1984. 57. J. Larson, “M ultitasking on the Cray X -M P/2 M ultiprocessor,” IEEE Computer, vol. 17, no. 7, pp. 62-69, July 1984. 58. C. E. Leiserson and J. B. Saxe, “ Optimizing Synchronous Systems,” J. VLSI and Computer Systems, vol. 1, no. 1, pp. 41-68, Spring 1983. 59. W . T. Lin and C. Y. Ho, “A New F F T Mapping Algorithm for Reducing the Traffic in a Processor A rray,” in VLSI Signal Processing II, (Kung, Owen, and Nash, Editors), pp. 328-336, IEEE Press, 1986. 60. W. T. Lin and C. Y. Chin, “A Reconfigurable A rray Using LINC Chip,” in Systolic Arrays, (Moore, McCabe, and U rguhart, Editors), pp. 313-320, Adam Hilger Ltd., Bristal and Boston, 1987. 61. A.D. McAulay, “ Optical Crossbar Interconnected Digital Signal Processor with Basic Algorithm s,” Optical Engineering, vol. 25, no. 1, pp. 082-090, Jan. 1986. 62. R. Milner, “A Calculus of Communicating Systems,” Lecture Notes in Computer Science, no. 92, Springer-Verlag, New York/Berlin, 1980. 63. R. Milner, “ Calculi for Synchrony and Asynchrony,” Theoret. Comput. Sci., vol. 25, pp. 267-310, 1983. 64. L. M. Ni and K. Hwang, “Vector Reduction Techniques for Arithm etic Sys tem s,” IEEE Trans, on Computers, vol. C-34, no. 5, pp. 404-411, May 1985. 65. J. L. Peterson, Petri N et Theory and the Modeling of Systems, Prentice-Hall, Englewood Cliffs, NJ, 1981. 66. N. Peygham barian and H. M. Gibbs, “ Optical Bistability for Optical Signal Processing and Com puting,” Optical Engineering, vol. 24, no. 1, pp. 068- 073, Jan. 1985. 163 67. M. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw- Hill, New York, NY, 1987. 68. Cray Research, Inc., Cray X -M P Series Mainframe Reference Manual, Tech. Note HR-0232, Nov. 1982. 69. J. P. Riganiti and P. B. Schneck, “ Supercom puting,” IEEE Computer, vol. 17, no. 10, pp. 97-113, Oct. 1984. 70. A.A. Sawchuk and T.C. Strand, “Digital Optical C om puting,” Proc. IEEE, vol. 72, no. 7, pp. 758-779, July 1984. 71. A.A. Sawchuk, B.K. Jenkins, C.S. Raghavendra, and A. Varm a, “ Optical Interconnection Networks,” Proc. 1985 In tl. Conf. on Parallel Processing, pp. 388-392, Aug. 1985. 72. A.A. Sawchuk, B.K. Jenkins, C.S. Raghavendra, and A. V arm a, “ Optical Matrix-Vector Im plementation of Crossbar Interconnection Networks,” IEEE Computer, vol. 20, no. 6, pp. 50-60, June 1987. 73. J. T. Schwartz, “U ltracom puters,” A C M Trans. Program. Lang. Systems, vol. 2, no. 4, pp. 484-521, 1980. 74. P.W . Sm ith and W .J. Tomlinson, “Bistable Optical Devices Promise Subpi cosecond Switching,” IEEE Spectrum, vol. 8, pp. 26-33, June 1981. 75. L. Snyder, “Introduction to the Configurable, Highly Parallel Com puter,” IEEE Computer, vol. 15, no. 1, pp. 47-64, Jan. 1982. 76. A. R. Tanguay, “M aterials Requirements for Optical Processing and Com puting Devices,” Optical Engineering, vol. 24, no. 1, pp. 002-018, Jan. 1985. 77. P . C. Treleaven, D. R. Brownbridge, and R. P. Hopkins, “Data-Driven and Demand-Driven Com puter Architectures,” A C M Computing Surveys, vol. 14, no. 1, pp. 93-143, March 1982. 78. N. W irth, “ The Program ming Language Pascal,” A cta Informatica, vol. 1, no. 1, pp. 35-63, 1971. 79. W . W ulf and M. Shaw, “ Global Variable Considered H arm ful,” SIGPLAN Notices, vol. 8, no. 2, pp. 28-34, 1973. 80. Z. Xu, “Dynamic Systolic Arrays for Supercom puting,” Master Thesis, Dept, of Electrical Engineering, Purdue University, Dec. 1984. 81. Z. Xu and K. 
Hwang, “Molecule: A Language Construct for Concurrent 164 Program m ing,” Tech. Report CRI-87-12, Com puter Research Institute, University of Southern California, 1987. 82. Z. Xu, K. Hwang, and B. K. Jenkins, “Opcom: An Architecture for Optical Com puting Based on Pipeline Networking,” Proc. of Twentieth Annual Hawaii I n t’ l. Conf. on System Sciences, vol. 1, pp. 147-156, Jan. 1987. I 1 6 5 Appendices A . PAL Program s for Com plex Inner Product C om putation In this appendix, we demonstrate the flexibility of PAL by presenting five PAL programs for the same computation: calculation of the inner product z — z re + i X ^ T O of two complex vectors x = (x 1,x2,'~,xn ) and y = { y \,y 2,—,yn ), where x8 - = ag - + j'X.bi andyi = ci + j'X .di . Five molecule types are used in these programs to specify SIMD array processing (program inprdctl), pipelined processing (program inprdct2), dataflow computation (program inprdct3), message passing multicomputing (program inprdct4), and shared variable multiprocessing (program inprdct5). inprdctl(a, b, c, d: in; zre, zim: out) oftype simd a,b,c,d: array[l..n] o f real; zre, zim: real; begin i,m,k: integer; r,s: array[l..n] o f real; forall i : = 1 to n do begin r[i] := a[i] * c[i] + b[i] * d[i]; s[i] := b[i] * c[i] - a[i] * d[ij; end m := 2; while m < = n do begin k :== m div 2; forall i := m to n step m do begin r[i] := r[i] + r[i-k]; s[i] :— s[i] + s[i-k]; end m := m * 2; end zre := r[n];- zim := s[n]; end inprdctl 166 inprdct2(a, b, c, d: in; zre, zim: out) oftype pipe a,b,c,d: array[l..n] o f real; zre,zim: real; begin i,m,k: integer; r,s: array[l..n] o f real; t,u: array[0..n] o f real; t[0] := 0; u[0] := 0; forpipe i := 1 to n do begin r[i] := a[i] * c[i] + b[i] * d[i]; s[i] := b[i] * c[i] - a[i] * d[ij; t[i] := t[i-l] + r[ij; u[i] := u[i-l] + s[i]; end zre :— t[n]; zim := u[n]; end inprdct2 inprdct3(a, b, c, d: in; zre, zim: out) oftype dataflow a,b,c,d: array[l..n] o f real; zre,zim: real; begin i,m,k: integer; r,s: array [l>..n] o f real; t,u: array[0..n] o f real; t[0] := 0; u[0] := 0; for i := 1 to n do begin r[i] := a[i] * c[i] + b[i] * d[i]; s[i] := b[i] * c[i] - a[i] * d[i]; t[i] := t[i-l] + r[i]; u[i] := u[i-l] + s[i]; end zre :== t[n]; zim := u[n]; end inprdct3 167 inprdct4() oftype process begin i: integer; a,b,c,d: array[l..n] of real; zre, zim: real; input operands a, b, c, d\ fork for i:= l to n do begin prdct(i); end for i:= l to n do begin send a[i] b[i] c[i] d[i] to prdct(i) end receive zre,zim from prdct(n); join for i:= l to n do begin prdct(i); end output results zre and zim; end inprdct4 prdct(i: in) oftype process i: integer; begin a,b,c,d,r,s,t,u: real; j,m,k: integer; receive a,b,c,d from inprdct4; r:=a*c+b*d; s:=b*c-a*d; m := 2; while m < = n do begin k:=m div 2; for i:—m to n step m do begin if i=j-k th en send r,s to prdct(j); else if i= j then begin receive t,u from prdct(j-k); r:= r+ t; s: =s+u; end end m :=m*2; end if i= n th en send r,s to inprdct4; end prdct 168 a,b,c,d,r,s,t: array[l..n] o f real; sem aphore t = 0; inprdct5() oftype task begin i: integer; fork for i:= l to n do begin pdt(i); end join for i:= l to n do begin pdt(i); end end inprdct5 pdt(i: in) oftype task i: integer; begin j: integer; r[i] := a[i] * c[i] + b[i] * d[i]; s[i] :== b[i] * c[i] - a[i] * d[i]; if even(i) then begin j := i - 1; while 0 < j < i/2 do begin P(t[j]); r[i] := r[i] + r[j]; s[i] := s[i] + s[jj; j := 2*j - i; end end V (t[i]); end pdt B. 
Liverm ore Loops R epresentable in forpipe Loops 1 6 9 No.l DO 1 k = 1,400 1 X(K)=Q+Y(K)*(R*Z(K+10))+T*Z(K+11) No.2 DO 2 K = l,996,5 Q=Q+Z(K)*X(K)+Z(K+1)*X(K+1) +Z(K+2)*X(K+2)+Z(K+3)*X(K+3) 2 +Z(K+4)*X(K+4) No.3 DO 3 k = 1,1000 3 Q=Q+Z(K)*X(K) No. 4 DO 4 J=30,870,5 4 X(L-1)=X(L-1)-X(W)*Y(J) No. 5 DO 5 1=2,998,3 X(I)=Z(I)*Y(I)-X(I-1) X(I+1)=Z (I+ 1)*Y(I+1)-X(I) 5 X(I+2)=Z(I+2)*Y(I+2)-X(I+l) No. 6DO 6 J=3,999,3 I=1000-j+3 X(I)=X(I)-Z(I)*X(I+1) X(I-1)=X (I-1)-Z(I-1)*X(I) 6 X(I-2)+X(I-2)-Z(I-2)*X(I-l) No. 7DO 7 M = l,120 7 X(M)=U(M)+R*(Z(M)+R*Y(M)) +T*(U(M+3)4-R(U(M+2)+R*U(M+l)) +T*(U(M+6)+R(U(M+5)+R*U(M+4)))) No. 8 DO 8 KX=2,3 DO 8 K Y = 2,21 DU1=U1(KX,KY+1,NL1)-U1(KX,KY-1,NL1) DU2=U2(KX,KY+1,NL1)-U2(KX,KY-1,NL1) DU3=U3(KX,KY+1,NL1)-U3(KX,KY-1,NL1) Ul(KX,KY,NL2)==Ul(KX,KY,NLl)+All*DUl+A12*DU2+A13*DU3 +SIG*(U1((KX+1,KY,NL1)-2*U1(KX,KY,NL1)+U1(KX-1,KY.NL1) U2(KX,KY NL2)=U2(KX,KY,NL1)+A21*DU1+A22*DU2+A23*DU3 +SIG*(U2((KX+1,KY,NL1)-2*U2(KX,KY,NL1)+U2(KX-1,KY,NL1) U3(KX,KY,NL2)==U3(KX,KY,NL1)+A31*DU1+A32*DU2+A33*DU3 8 +SIG*(U3((KX+1,KY,NL1)-2*U3(KX,KY,NL1)+U3(KX-1,KY,NL1) 170 No. 9 DO 9 1=1,100 9 PX(1,I)=BM28*PX(13,I)+BM27*PX(12,I)+BM26*PX(11,I)+ BM25*PX(10,I)+BM24*PX(9,I)+ BM22*PX(7,I)+C0*(PX(5,I)+PX(6,I)+PX(3,I) No. 10 DO 10 1=1,100 AR=CX(5,I) BR=AR-PX(5,I) PX(5,I)=AR CR=BR-PX(6,I) PX(6,I)=BR AR=CR-PX(7,I) PX(7,I)=CR BR=AR-PX(8,I) PX(8,I)=AR CR=BR-PX(9,I) PX(9,I)=BR AR=CR-PX(10,I) PX(10,I)=CR BR=AR-PX( 11,1) PX(11,I)=AR CR=BR-PX(12,I) PX(12,I)+BR PX(14,I)=CR-PX(13,I) 10 PX(13,I)=CR NO.11 X(1)= Y (1) DO 11 K=2,1000 11 X(K)=X(K-1)+Y(K) No. 12 DO 12 K =l,999 12 X(K)=X(K-1)+Y(K) C. Liverm ore Loops not R epresentable in forpipe Loops 171 No. 13 DO 13 IP=1,128 11=P(1)IP) J l= P ( 2,IP) P(3,IP)=P{3,IP)+B(I1,J1) P(4) IP)==P(4,IP)+C(IlJJl) P(1,IP)—P(1,IP)+P(3,IP) P(2)IP)=P(2,IP)+P(4,IP) I2=P(1,IP) J2=P(2,IP) P (1 ,IP)=P( 1 ,IP)+Y(I2+32) P(2,IP)=P(2,IP)+Z(J2+32) I2=I2+E(I2+32) J2=J2-fF (J2+32) 13 H(I2,J2)=H (I2, J2)+10 No. 14 DO 14 K =l,150 IX=GRD(K) XI=IX VX(K)—VX(K)+EX(LX)+(XX(K)-XI)*DEX(LX) XX(K)=XX(K)+VX(K)+FLX IR=XX(K) R I=IR RXI=XX(K)-RI IR=IR-(IR/64)*64 XX(K)—RI+RXI RH(IR)=RH(IR)+1.0-RXI 14 RH(IR+1)=RH(IR+1)+RXI