DREAM MACHINE - A PLATFORM FOR EFFICIENT IMPLEMENTATION OF NEURAL NETWORKS WITH ARBITRARILY COMPLEX INTERCONNECTION STRUCTURES

by

Soheil Shams

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1992

Copyright 1992 Soheil Shams

UMI Number: DP22854. Published by ProQuest LLC (2014). Copyright in the dissertation held by the author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Soheil Shams under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: August 12, 1992

Dedication

To my grandparents, Davoud & Sima Shams and Nasrollah & Maryam Javanshir.

Acknowledgments

I would like to thank Dr. Jean-Luc Gaudiot for serving as my advisor and for his encouragement and assistance in completing my Ph.D. studies. I would also like to thank Drs. Shahram Ghandeharizadeh, Keith Jenkins, and Viktor Prasanna for serving on my dissertation committee. A special thanks to Dr. Christof von der Malsburg for many hours of stimulating discussions during the early periods of my Ph.D. work. I am also grateful to Dr. Petar Simic for his collaboration on the optimization portion of my research. The numerous hours of discussion on optimization methods and parallel processing were one of the highlights of my research work. I would like to thank Scott Toborg for being a good friend and colleague. Many thanks to Dr. Wojtek Przytula for his support and collaboration. I would also like to thank Drs. Greg Nash and David Schwartz for their encouragement, support, and helpful comments on my research. I am grateful to the Hughes Aircraft Company for supporting my studies through their fellowship program. A special thanks to Barbara Dover for helping me with text editing and creating some of the figures in this dissertation.

Completion of this dissertation would not have been possible without the love and support of my family throughout all these years. I would like to specially thank my parents, Iraj and Sorour, for always being there to give love, support, and encouragement. I would also like to thank my brothers Fariborz and Sepehr for their love and support. Most of all, I would like to thank my love, Ladan, who, in addition to helping me with many aspects of this dissertation, has been the main source of my energy in completing my degree.

Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Problem Statement
    1.1.1 The Architecture Design Problem
    1.1.2 The Mapping Problem
  1.2 Motivation
    1.2.1 Why Parallel Implementations Are Necessary
    1.2.2 Deficiencies of Current Implementations
  1.3 Approach
  1.4 Summary of Contributions
  1.5 Organization of the Dissertation
2 Background
  2.1 Neural Network Models
    2.1.1 Neural Computation Overview
    2.1.2 Representation of Neural Computation as Vector-Matrix Operations
    2.1.3 Neural Network Interconnection Structures
    2.1.4 Implementing Neural Network Learning Algorithms
  2.2 Inherent Parallelisms of Neural Networks
    2.2.1 Network Level Parallelism
    2.2.2 Layer Level Parallelism
    2.2.3 Neuron Level Parallelism
    2.2.4 Synapse Level Parallelism
  2.3 Parallel Implementation Methods
    2.3.1 Implementations Applicable to Dense and Regularly Interconnected Neural Networks
    2.3.2 Implementations Applicable to Sparsely Interconnected Neural Networks
    2.3.3 Implementations Applicable to Arbitrarily Interconnected Neural Networks
3 The DREAM Machine
  3.1 System Level Overview
    3.1.1 System Organization
    3.1.2 Execution Paradigm
  3.2 Processing Element Design Description
    3.2.1 Internals of the Processing Elements
    3.2.2 Processor-Memory Interface
    3.2.3 Instruction-Word Description
  3.3 Implementing a Table Lookup Mechanism on the DREAM Machine
  3.4 Interprocessor Communication Network
    3.4.1 Nearest Neighbor Communications
    3.4.2 Global Communications
4 Mapping Method Preliminaries
  4.1 Mapping Principles
    4.1.1 The General Mapping Method
    4.1.2 The Algorithmic and Optimized Mapping Approaches
  4.2 Mapping Neural Networks onto Fixed-Size Ring-Connected Architectures
    4.2.1 System Utilization Characteristics of the Mapping onto the Fixed-Size Ring Architecture
    4.2.2 Execution Rate Characteristics of the Mapping onto the Fixed-Size Ring Architecture
    4.2.3 Mapping Multilayer Neural Networks onto the Fixed-Size Ring Architecture
    4.2.4 Deficiencies of the Mapping onto the Fixed-Size Ring Architecture
  4.3 Initial Mapping Method for 2-D Connected Architectures
    4.3.1 Computational Aspects of the Back-Propagation Algorithm
    4.3.2 Mapping Details
    4.3.3 Implementing Neural Networks Larger than the Processor Array Size
    4.3.4 Implementing Nettalk on the Hughes SCAP Architecture
    4.3.5 Performance Evaluation of the Mapping Method
    4.3.6 Applicability of the Mapping Method
    4.3.7 Deficiencies and Shortcomings of the Mapping Method
5 The Algorithmic Mapping Method
  5.1 Applicability of the Algorithmic Method
  5.2 Implementation of Variable-Size Processing Rings on the DREAM Machine
  5.3 Using Variable Length Rings to Process a Single Layer
  5.4 Implementing Multilayer Neural Networks
  5.5 Implementing Back-Propagation Algorithms
  5.6 Implementing Block Connected Neural Networks
    5.6.1 Implementing Data-Fusion Style Neural Networks
    5.6.2 Implementing Feature-Detection Style Neural Networks
  5.7 Implementing Neural Networks Larger than the Processor Array
  5.8 Batch-Mode Implementation
  5.9 Implementing Competitive Learning
  5.10 Implementation Examples and Performance Evaluation
    5.10.1 Performance Metric
    5.10.2 Implementing a Fully Connected Multilayer Neural Network
    5.10.3 Implementing a Block-Connected Multilayer Neural Network
    5.10.4 Implementing a Fully Connected Single Layer Network
  5.11 Analysis of Results
6 Optimization Based Mapping Method
  6.1 Problem Overview
    6.1.1 The Mapping Problem
    6.1.2 Complexity of the Mapping Problem
    6.1.3 Analogies to Other Combinatorial Optimization Problems
  6.2 General Approach
    6.2.1 Mapping Parallel Algorithms onto Parallel Architectures
    6.2.2 Addressing the Clustering Problem
  6.3 Solving the Assignment Problem
    6.3.1 Problem Representation
    6.3.2 Assignment Cost Function Formulation
    6.3.3 Constraints on the Assignment Matrix
    6.3.4 The Complete Assignment Cost Function
    6.3.5 Assignment of Neurons to Processors
  6.4 Solving the Scheduling Problem
    6.4.1 General Approach
    6.4.2 Specific Problem Representation
    6.4.3 Scheduling Cost Function Formulation
    6.4.4 Constraint Terms of the Scheduling Cost Function
      6.4.4.1 Architecture Dependent Constraints
      6.4.4.2 Constraints on Path Variables
    6.4.5 The Complete Scheduling Cost Function
    6.4.6 Use of the Scheduling Procedure for Mapping Neural Networks
  6.5 Constraint Nets Used for Optimization
  6.6 Optimization Procedure
    6.6.1 Update Equations for the Assignment Problem
    6.6.2 Update Equations for the Scheduling Problem
    6.6.3 Implementing the Optimization Procedure
7 Results of the Optimization Procedure
  7.1 Implementation of the Optimization Algorithm
    7.1.1 Control File Format
    7.1.2 The Assignment and Scheduling Optimization Routines
  7.2 Results of the Assignment Procedure
    7.2.1 Results from Implementing the Bokhari Example
      7.2.1.1 Problem Description
    7.2.2 Results from Implementing the Everstine Example
  7.3 Results of the Scheduling Procedure
    7.3.1 Receptive-Field Example
    7.3.2 Randomly Sparse Matrix Example
8 Asymptotic Performance Analysis
  8.1 Implementation Complexity of Neural Network Processing
    8.1.1 Mapping Method Comparisons
    8.1.2 Area-Time Complexity of Neural Network Implementation Methods
      8.1.2.1 Complexity of the Fixed-Size Ring Implementation
      8.1.2.2 Complexity of the Variable-Length Ring Implementation
      8.1.2.3 Complexity of the Optimization Based Implementation
  8.2 Complexity Analysis of the Optimization Based Mapping Method
    8.2.1 Time Complexity of the Assignment Problem
    8.2.2 Time Complexity of the Scheduling Problem
    8.2.3 Tradeoffs on the Use of the Optimization Procedure
  8.3 Performance Comparison
9 Conclusion
  9.1 Summary
  9.2 Future Research Directions
References

List of Figures

Figure 1-1 - Basic structure of a neural network
Figure 2-1 - Computation performed by neuron i
Figure 2-2 - Typical transfer functions used by neural networks. (a) Threshold function. (b) Ramp function. (c) Sigmoid function
Figure 2-3 - Examples of neural network interconnection structures. (a) One layer fully connected network. (b) Two layer network. (c) Two layer network with limited receptive field interconnections
Figure 2-4 - Exploiting network level parallelism by mapping complete networks to individual PEs
Figure 2-5 - Exploiting layer level parallelism by mapping consecutive layers of the neural network to consecutive PEs of the processing pipeline
Figure 2-6 - Exploiting neuron level parallelism by mapping individual neurons to distinct PEs
Figure 2-7 - Exploiting synapse level parallelism by mapping single synapses to each PE. In this figure s denotes the number of synaptic connections used in the neural network
Figure 2-8 - Implementing a fully connected neural network with 16 neurons on a ring-connected systolic array. (a) Assignment of neurons to processors and flow of data. (b) Computation performed in each PE at time t
Figure 2-9 - Using network level parallelism for implementing neural networks on the Warp machine
Figure 2-10 - Implementation of neural networks on the Connection Machine, from [79]
Figure 3-1 - The top level design of the DREAM Machine
Figure 3-2 - Top level design of the Processing Elements
Figure 3-3 - Processor and memory detail diagram. Associated with each data value in the local memory of each PE is a switch setting value used to configure the communication topology of the machine
Figure 3-4 - A single processor and its associated interprocessor communication switches
Figure 3-5 - Schematic of the reconfigurable interprocessor communication switch
Figure 4-1 - Mapping neurons to PEs and constructing a computation path between the PEs
Figure 4-2 - Mapping a fully connected neural network onto a ring systolic processing array
Figure 4-3 - Exploiting network level and neuron level parallelism on a 2-D mesh connected systolic architecture
Figure 4-4 - Recall phase (forward pass) processing of neuron i on layer l of the multilayer perceptron neural network
Figure 4-5 - Learning phase (error back-propagation) processing for neuron i on layer l of the back-propagation algorithm
Figure 4-6 - Data organization in the processor array during the first cycle of the recall phase. Only processors in the first row are enabled
Figure 4-7 - Processing in each processor of the array during the recall phase
Figure 4-8 - Next two cycles of the data flow through the array after the initial cycle shown in Figure 4-6
Figure 4-9 - Data organization in the processor array during the first cycle of the learning phase. Only processors in the first row are enabled
Figure 4-10 - Processing in each enabled processor of the array during the learning phase
Figure 4-11 - Next two cycles of the data flow through the array after the initial cycle shown in Figure 4-9. Shaded processors are disabled from performing calculations
Figure 4-12 - Network partitioning for implementation on a 3 x 3 processor array
Figure 4-13 - System architecture of the Hughes Systolic/Cellular Coprocessor
Figure 5-1 - Using the reconfigurable switches of the DREAM Machine to construct circular rings on the processing array. (a) A single ring of size 59. (b) Three disjoint rings of different sizes executed in parallel
Figure 5-2 - Plot of Rt vs. Nt with P = 256 processors
Figure 5-3 - Plot of Tt vs. Nt for the fixed-size ring mapping defined by equation (4-5), and for the variable-size ring mapping defined by equation (5-3). The parameters k1 and k2 are assumed to be 1 in order to simplify the comparison
Figure 5-4 - A three layer neural network with N1 > N2 > N3
Figure 5-5 - An embedded ring structure containing a ring of size N2 embedded in another ring of size N1
Figure 5-6 - A block structured neural network utilizing 3 disjoint blocks between the input and hidden layers and its associated mapping on 3 processing rings
Figure 5-7 - Memory locations of synaptic weights on the inputs of neuron i
Figure 5-8 - A neural network structure for image compression and decompression with a regularly blocked structure [56]
Figure 5-9 - The ring structure associated with the image compression and decompression network of Figure 5-8
Figure 6-1 - Mapping of the nodes of a process graph D onto a processor graph G using the assignment matrix
Figure 6-3 - An example data dependence graph for a simple arithmetic calculation
Figure 6-4 - Several different dependence graphs performing the same computation due to the commutative nature of the addition operation
Figure 6-5 - A 3-D representation of 3 non-intersecting path traversals in time. The left hand plane represents the 2-D processor array
Figure 6-6 - An example of a path traversal and its associated penalty term contributions due to repeated crossings of a neuron m required for its computation. The value of the corresponding path variable can be any value in the range [0, 1] in the shaded region
Figure 7-1 - Interconnection topology of a 6x6 Finite Element Machine (FEM)
Figure 7-2 - The interconnection matrix describing the 6x6 FEM topology
Figure 7-3 - A 33 node finite element structure used as the process graph
Figure 7-4 - The process graph interconnection matrix D
Figure 7-5 - The interprocessor communication distance matrix obtained from the architecture topology graph
Figure 7-6 - The histogram associated with 1000 randomly generated assignments for the 33 node Bokhari example
Figure 7-7 - Histogram of solution cardinalities for the 1231 runs described in Table 7-3. The global optimum solution is at cardinality 78, which is found the largest percentage of times
Figure 7-8 - Graph of system temperature and cardinality value vs. sweep number
Figure 7-9 - Graph of system temperature, total cost (given by equation (6-13)), and mismatch cost (given by equation (6-1)) vs. sweep number
Figure 7-10 - Graph of system temperature and the various penalty terms (given by equations (6-6), (6-7), (6-11), and (6-12)) vs. sweep number
Figure 7-11 - Histogram of the solution cardinalities when mapping the 25 node process graph to a FEM of size 5x5, taken from a collection of 736 runs
Figure 7-12 - The histogram associated with solution cardinalities of 1000 random assignments of the Everstine 59 node example mapped to a 4-nearest neighbor connected 8x8 parallel processing architecture
Figure 7-13 - Neural network with a receptive field type interconnection structure
Figure 7-14 - Scheduling flow pattern for the 4 paths associated with Figure 7-13. The shaded boxes indicate the required neurons to be traversed by each path
Figure 7-15 - A second scheduling flow pattern for the 4 paths of Figure 7-13, with path length M = 11. The shaded boxes indicate the required neurons to be traversed by each path. A loop in the path indicates that the path stayed in the processor for the corresponding cycle
Figure 7-15 - Scheduling flow pattern for the 4 paths of Figure 7-13, with optimum path length (M = 9)
Figure 7-16 - Interconnection matrix associated with a sparse iterative system
Figure 7-17 - Four of the 16 paths associated with the sparse iterative system shown in Figure 7-16. The shaded boxes indicate neurons which must be traversed by the associated paths. The boxes with thick outlines indicate start and end processors of each path

List of Tables

Table 4-1 - Comparison of different implementations of the Nettalk neural network based on throughput and local memory size requirements
Table 5-1 - Performance comparison of the variable-size ring mapping vs. the fixed-size ring mapping based on the MCPS metric
Table 5-2 - Performance comparison of the variable-size ring mapping vs. the fixed-size ring mapping based on the optimality ratio
Table 7-1 - Control file format used for the assignment routine
Table 7-2 - Control file format used for the scheduling routine
Table 7-3 - Statistics collected from executing the assignment optimization routine on the 33 node Bokhari example. The shaded row shows the results of a Monte-Carlo search algorithm used for comparison
Table 7-4 - Statistics collected from executing the assignment optimization routine on the 59 node Everstine example. The shaded row shows the results of a Monte-Carlo search algorithm used for comparison
Table 8-1 - Comparison of several implementation methods used for neural network processing
Table 8-2 - Comparison of several implementation methods used for neural network processing, assuming a constant number of connections per neuron

Abstract

In this dissertation, the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) Machine is presented for efficient implementation of neural network models. We identify several sources of parallelism inherently available in neural network models. We show that the Single Instruction stream Multiple Data stream (SIMD) execution paradigm can be an effective and efficient method for parallel implementation of neural networks. We argue that current implementation methods are only applicable to neural networks with specific characteristics in their interconnection structures, such as being dense and regularly interconnected. In light of this argument, we present a general processing scheme for implementing neural network models on systolic parallel computers. This processing scheme is realized using two different mapping methods which require a limited degree of communication autonomy at each processor, as offered by the DREAM Machine architecture. The first is an algorithmic method based on constructing variable-length processing rings on the DREAM Machine architecture. The second method involves an optimization procedure in which efficient mappings of neural networks with arbitrary interconnections onto the DREAM Machine are automatically generated. We show, both analytically and empirically, that these mapping methods allow the DREAM Machine to be an effective architecture for implementing neural network models with a variety of different interconnection structures.

Chapter 1

Introduction

Performing complex numerical calculations, such as evaluating π to the 100th significant digit, could take a human several months or even years, whereas a simple digital computer can perform the task in a few seconds. On the other hand, recognizing a familiar face in a crowd is a task that a human can perform rather effortlessly but that cannot be performed effectively even with the most powerful computers. This kind of problem is at the center of the current surge in neural network research. Neural network models are intended to solve problems that are readily performed by humans (e.g., vision and speech), but that have been difficult to implement using conventional algorithms. This dissertation presents the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) Machine as a platform for efficiently implementing the computation associated with processing a wide range of neural network models.
1.1 Problem Statement

As the term "neural network" implies, these models are based, to varying degrees, on the biological nervous system. A great deal of effort has been put into developing a general model for neural computation. Unfortunately, no single model, or even a class of models, has proven to be effective in addressing the broad range of "human-like" abilities. This has resulted in a lack of a clear-cut definition of what constitutes a neural network model. The fundamental principles that make up a substantial number of current neural network models, and that are expected to be consistent with further developments in neural modeling research, can be described as follows. Neural network models consist of two major components: processing units (referred to as neurons) and connections (referred to as synapses). The synapses are used for transferring and possibly modifying the information exchanged between neurons, see Figure 1-1.

Figure 1-1 - Basic structure of a neural network.

The problem being addressed in this dissertation is how to implement these models so that the required computation can be completed in the shortest amount of time. An obvious solution is to use the inherent parallelism available in neural networks to perform multiple operations in parallel. This solution, in turn, poses two major questions: what type of parallel processing architecture is best suited for neural network processing, and how to map the computation associated with neural networks onto a parallel processing architecture. These two problems are the primary issues addressed in this dissertation.

1.1.1 The Architecture Design Problem

The difficulty in solving the implementation problems lies in the fact that neural network models cover a wide range of algorithms, each having specific computational requirements. In addition to the differences in computational requirements among various neural networks, the significant diversity in the complex interconnection structures of these models, even between different applications of the same model, makes efficient general-purpose implementations difficult to realize. There is an inherent trade-off between execution speed and the degree of programmability offered by an implementation. High speed implementations can be developed which take advantage of all the available peculiarities of a particular neural network to optimize the design of the architecture. Such an implementation would be of limited use, since it can only be used for implementing a single neural network. On the other hand, allowing for flexibility in the implementation reduces its execution speed. Achieving a good compromise between flexibility and speed in implementing neural network models is a prime objective of this dissertation.

1.1.2 The Mapping Problem

A major issue in the efficient utilization of any parallel processing architecture is the mapping of the processing modules of a specific computational task to individual processors of the architecture in an optimal manner. The optimality criteria could be based on various parameters, such as execution rate, memory utilization, and interprocessor communication. The problem of mapping neural network models onto parallel architectures involves three basic operations.
First, we must identify the individual processes involved in the computation of neural networks which can be executed in parallel. The goal here is to identify those processes which require a minimal amount of communication and are sufficiently fundamental to the computation as to be applicable to a wide range of models. The second part of the mapping problem is to arrive at an efficient method for assigning each process to a unique processor in order to achieve efficient processing. Finally, a schedule for the sequence of computation and communication operations associated with the implementation of the neural network model must be generated.

1.2 Motivation

The computational paradigm used by neural networks is quite different from that used by conventional algorithms. A conventional algorithm specifies the exact order and the specific operation to be performed at each time step in order to implement a given task. A neural network, on the other hand, "computes" based on the specific characteristics of the neurons and the specific interconnection structure of the network. This fact highlights a major difference between the method used for programming neural networks and that used for conventional algorithms. This rather radical method of programming forces an application developer to think in terms of simple processing units and their interactions as opposed to a specific flow of operations.

Numerous advantages have been cited for applying neural network models to complex problems that do not lend themselves well to procedural implementations [1, 26]. These advantages include: the ability of neural networks to be used as robust pattern associators or pattern classifiers, the ability to self-organize or learn with or without explicit information from an outside supervisor, the ability to generalize based on previously learned information, and inherent fault-tolerance achieved by using distributed representation and processing. In addition, an attractive attribute of neural networks that has generated much enthusiasm for expecting high implementation speeds is their inherently parallel computational structure. It is desirable to efficiently harness this parallelism by implementing neural network algorithms on parallel processing computers.

1.2.1 Why Parallel Implementations Are Necessary

Clearly, neural networks employ a significantly different type of computational model than that used by conventional algorithms. This fact has generated a need for special architectural designs that can best support the distributed processing required by such models. Although neural networks can solve computationally difficult problems, such as the Traveling Salesman Problem (TSP), in a relatively small number of iterations [13, 33], each iteration of a neural network consists of updating a large number of neurons and synaptic connections in parallel. Implementing these neural network models on conventional uniprocessors enforces serialization of an inherently parallel computation and thus leads to low throughput rates. With continual research in neural network model development, and their application to an ever increasing number of real-world problems, the number of neurons and interconnections employed by these models is increasing rapidly.
For these systems to be used in real-world applications, an efficient hardware platform is required to utilize the inherent parallelism available in these models to its fullest extent.

1.2.2 Deficiencies of Current Implementations

Although conventional parallel architectures can, to a certain degree, help in speeding up neural network processing, as described in Chapter 2, they lack many features required for efficient implementation of neural networks. Furthermore, many special-purpose parallel machines lack the flexibility needed to be applicable to the wide range of neural network applications currently being developed.

Two main factors have limited the amount of parallelism that can realistically and efficiently be utilized by parallel implementations. The first is the high communication requirement between individual neurons in the network. A neuron requires input from many other neurons during each processing cycle in order to produce an output value. If each neuron is assigned to a unique processing element (PE), high interprocessor communication bandwidth is required to accommodate this exchange of output values among the various neurons. The second problem facing parallel implementations of neural networks is attributed to the uncertainty in the specific neural computational requirements. Any practical implementation method should offer sufficient flexibility so that a wide array of neural networks with diverse interconnection structures and computational needs can be efficiently processed. Without such flexibility, the applicability of the implementation and its useful life expectancy will be greatly limited.

1.3 Approach

The approach taken in this dissertation to solving the problems outlined in Section 1.1 involves three parts. First, we organize and represent all neural computation using a uniform representation scheme. We determine the specific areas where parallelism can efficiently be exploited to attain maximum speedup. At the same time, we represent neural computation in a sufficiently general form so that it may be applicable to a wide range of neural network models.

Second, we propose a special-purpose architecture which can be effectively utilized for neural network processing. The design of this architecture is accomplished by identifying the most compute intensive operations performed by neural network algorithms. This information can be used to guide the design of the architecture such that these operations are performed more efficiently. This may be accomplished through special-purpose hardware mechanisms, or through special mapping techniques.

Finally, we propose a mapping method for efficiently implementing various neural network models on this architecture. This mapping method is based on the fundamental operations performed by most neural network models. Therefore, it can be applied to a large class of neural network models. In addition to implementing different neural network models, the implementation will be able to efficiently handle networks with arbitrary interconnection structures. The approach here is to introduce flexibility into the communication network of the architecture so that it can support irregular and dynamically changing flow patterns. Such a capability can be efficiently utilized to match the neural network interconnection structure to the topology of the computer architecture.

1.4 Summary of Contributions
The contributions made by the research reported in this dissertation are listed below:

• Various sources of parallelism available in neural network computation have been identified.

• A mapping method for implementing neural network computation on 2-D mesh-connected systolic arrays has been developed and demonstrated.

• The DREAM Machine, a reconfigurable parallel processing architecture with specialized features for efficient implementation of neural networks with arbitrary interconnection structures, has been proposed.

• An algorithmic mapping method has been introduced to efficiently implement neural networks with regular interconnection structures on the DREAM Machine.

• A general approach for mapping parallel algorithms onto parallel processing architectures, based on combinatorial optimization techniques, has been developed.

• An optimization based mapping approach has been used to automatically generate efficient mappings of neural network structures onto the DREAM Machine architecture.

1.5 Organization of the Dissertation

In Chapter 2, we present background material relating to neural network computation and the associated parallel processing implementations. In this chapter we identify and present the various sources of parallelism available in neural computation. A taxonomy of the interconnection structures employed by neural network models is also presented in Chapter 2. In Chapter 3, we present the DREAM Machine architecture. The specific architectural features of this machine, which have been designed specifically for neural network processing, are described in detail. We show the basic mapping method used for implementing neural network models on parallel systolic architectures in Chapter 4. A detailed analysis of a mapping method proposed for implementing neural networks on ring-connected systolic architectures is performed in this chapter. In Chapter 5, an algorithmic mapping method is introduced to implement regularly interconnected neural networks on the DREAM Machine. The effectiveness of this mapping algorithm is demonstrated through several examples. The performance of this method is analyzed and compared to that of the linear ring implementation. An analytical measure for evaluating the goodness of a particular mapping is presented in Chapter 6. This is done by deriving two cost functions, one associated with the problem of assigning neurons to processors, and another associated with generating the necessary schedule of operations required for implementation of a particular neural network. The basic concepts of Constraint nets (Cnet) [78], used for solving combinatorial optimization problems, are also given in Chapter 6, along with the use of the Cnet method for solving the assignment and scheduling optimization problems.

Empirical results of applying the optimization based mapping to several benchmark examples are given in Chapter 7. The asymptotic performance of our implementation method, along with the time complexity of our optimization based mapping procedure, are analyzed and compared to other methods in Chapter 8. The conclusions are gathered in Chapter 9.

Chapter 2

Background

In this chapter we describe neural network processing viewed from a computation/communication requirements perspective.
This approach enables us to effectively highlight the various inherently parallel structures available in neural networks. We will outline several levels of parallelism that can be used for efficient implementations on parallel architectures, and will demonstrate their usage through several examples.

2.1 Neural Network Models

Neural networks cover a wide spectrum of models with varying computational and structural characteristics. In order to design a general-purpose neurocomputer which can efficiently implement a large number of these models, fundamental operations which are common to all neural computations must be identified. Below, we describe the principal computations performed by neural network models and define a taxonomy of neural network interconnection structures.

2.1.1 Neural Computation Overview

As described in the introductory chapter, neural network models are comprised of two basic components, neurons and synapses. Neurons perform simple computations while synapses communicate the results of these computations among the neurons. A general formula for the computation performed by a neuron can be stated as

$a_i = f(u_i + \theta_i)$,   (2-1)

where

$u_i = \sum_j w_{ij} a_j$,   (2-2)

a_i is the neuron output value, θ_i is the neuron threshold value, w_ij is the synaptic connection weight between neuron i and neuron j, and f is the neuron transfer function. This computation is graphically depicted in Figure 2-1. Neural computation is comprised of two classes of operations: local and distributed. Local operations are those which require data values locally available to each neuron, such as the application of the transfer function f, the addition of the threshold value θ_i, and other model specific operations involving local parameter values. On the other hand, distributed operations require the transfer of neuron output values among the neurons in the network by means of the synaptic interconnection network. The calculation of the weighted sum of neuron output values received by a particular neuron, designated by the summation operator in equation (2-2), is an example of a distributed operation.

Figure 2-1 - Computation performed by neuron i.

In addition to the neural computation described by equation (2-1), neural network models employ various learning algorithms. Learning algorithms are used to modify the synaptic interconnection weights, and possibly the neuron threshold values, in order to produce a desired response from the network. The computational requirements of these algorithms can also be divided into local and distributed operations. As with neural computation, the distributed operations of the learning algorithm are of greatest interest to us, since they require communication among the parallel operations performed by neurons. In this section we will describe the computational characteristics of a few common neural network learning algorithms.

2.1.2 Representation of Neural Computation as Vector-Matrix Operations

Vector and matrix notation can be utilized in order to obtain a systematic and uniform representation of neural network computation. Neural network models and various learning algorithms can be formulated using vector-matrix operations [42]. In such a scheme a vector a is used to represent the state of the neurons in the network and a matrix W is used to denote the synaptic interconnection network. Equation (2-1) can be rewritten using this notation as a matrix-vector-product operation

$\mathbf{a}^{(t+1)} = f(\mathbf{W}\,\mathbf{a}^{(t)})$,   (2-3)

where the superscript (t) denotes the iteration index.
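To make the notation concrete, the following short NumPy sketch (illustrative only, not part of the original text) evaluates equations (2-1) through (2-3) for a small network; the sigmoid transfer function, the random weights, and the 4-neuron size are assumptions chosen purely for the example.

```python
import numpy as np

def sigmoid(u):
    # One common choice of transfer function f; the model itself does not fix it.
    return 1.0 / (1.0 + np.exp(-u))

def neuron_output(i, a, W, theta, f=sigmoid):
    """Per-neuron form of equations (2-1) and (2-2)."""
    u_i = W[i] @ a                 # distributed operation: weighted input sum (2-2)
    return f(u_i + theta[i])       # local operation: threshold and transfer function (2-1)

def network_update(a, W, f=sigmoid):
    """Vector-matrix form of equation (2-3): a(t+1) = f(W a(t)).
    The threshold can be folded into W as an extra weight from a
    constant-output neuron, as noted in the text."""
    return f(W @ a)

# Illustrative 4-neuron fully connected network with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
a = rng.random(4)
print(network_update(a, W))
```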
In such a representation, the threshold value associated with each neuron can be implemented as an additional synaptic weight received from a neuron with a constant output value of one. A great many neural network models can be represented in this form, depending on the choice of the iteration index (t), the activation function f, and the specific restrictions imposed on the interconnection structure W [42]. Similar approaches involving vector-matrix-product and inner-product operations can be used to implement various learning algorithms.

It can be observed from equation (2-1) that the two primary operations performed by neural networks are the application of the transfer function f and the calculation of the weighted input sum to a neuron. The application of the transfer function is most often not dependent on the interconnection structure of the network. Examples of some common transfer functions used by neural networks are the threshold function, the ramp function, and the sigmoid function (see Figure 2-2). All of these functions can be implemented locally by each neuron. A widely used approach for fast evaluation of complex transfer functions (such as the sigmoid) is through the use of a table lookup mechanism. Depending on the amount of quantization tolerated by the neural network model, an appropriately sized portion of memory is allocated as a lookup table which each neuron uses to determine its output value.

Figure 2-2 - Typical transfer functions used by neural networks. (a) Threshold function. (b) Ramp function. (c) Sigmoid function.

From a parallel processing point of view, the min/max function is a more challenging type of transfer function to implement. The min/max function is used to identify an output neuron with either the largest or the smallest output value, depending on the neural network model. These functions are used by competitive models employing a "winner-take-all" scheme, such as the Bidirectional Associative Memory (BAM) [38] and the Adaptive Resonance Theory (ART) [10]. A min/max transfer function cannot be implemented based solely on the local information in each neuron, since the final state of the network is a function of all the neurons competing with each other. Support for parallel implementation of such functions has been rather limited and will be discussed in more detail later in this chapter. A method for efficiently supporting min/max type operations on the DREAM Machine is presented in Chapter 3.
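The table lookup and winner-take-all mechanisms just described can be illustrated with a small sketch; the table size, the input range, and the simplified winner-take-all form are illustrative assumptions and do not reproduce the DREAM Machine mechanism presented in Chapter 3.

```python
import numpy as np

# Precomputed sigmoid lookup table over a clipped input range; the table size
# and range reflect how much quantization the model is assumed to tolerate.
TABLE_SIZE, U_MIN, U_MAX = 1024, -8.0, 8.0
_grid = np.linspace(U_MIN, U_MAX, TABLE_SIZE)
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def sigmoid_lookup(u):
    """Table lookup evaluation of the sigmoid: quantize u to an index, read the table."""
    idx = np.clip(np.round((u - U_MIN) / (U_MAX - U_MIN) * (TABLE_SIZE - 1)),
                  0, TABLE_SIZE - 1).astype(int)
    return SIGMOID_TABLE[idx]

def winner_take_all(a):
    """A min/max style 'transfer function': only the neuron with the largest
    output stays active. Unlike the local functions above, it requires global
    information about all competing neurons."""
    out = np.zeros_like(a, dtype=float)
    out[np.argmax(a)] = 1.0
    return out
```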
2.1.3 Neural Network Interconnection Structures

A major obstacle in devising a general method for efficient implementation of neural networks on parallel architectures is the complex and diverse range of neural network interconnection structures. A general taxonomy of interconnection structures is presented here in order to systematically analyze various methods for their realization on parallel hardware. First, let us define some commonly used terminology. A layer refers to a set of neurons that are only connected to the neurons of the adjacent layers. An example of a two layer network is depicted in Figure 2-3(b). It should be noted that others might refer to such a structure (Figure 2-3(b)) as a one layer network, by counting the layers of synapses rather than neurons.

The simplest interconnection structure is one in which every neuron is connected to all the other neurons in the network, see Figure 2-3(a). Such an interconnection structure can be represented as a matrix W with all non-zero off-diagonal elements. Generally, such networks perform iterative update calculations with the index (t) of equation (2-3) denoting time. An example of such a model with a dense and symmetric interconnection structure is the Hopfield network [32]. An extension of this interconnection structure is used in two layer iterative models (see Figure 2-3(b)) such as the Bidirectional Associative Memory (BAM) [38] and the Adaptive Resonance Theory (ART) [10] models. In these models, information flows from one layer to the other and back until the system converges to a stable state. Therefore, the iteration index (t) of equation (2-3) refers to an alternating layer index switching between the first and second layers.

However, a number of two layer neural network models do not iterate between the layers; rather, input data is presented to the first layer and the network produces a corresponding response in the second layer. The Kohonen self-organizing feature maps [37] and the Perceptron [65] are examples of such models. This type of layered neural network model can be extended to include models with multiple layers of neurons. The widely used Multilayer Perceptron [66] and the neocognitron [19] are examples of models with such multilayer interconnection structures. These layered neural networks employ "feed-forward" processing. That is, the iteration index of equation (2-3) is used as a layer index ranging from the input layer number to the output layer number.

In the discussion of the various interconnection structures above, we did not consider the interconnection density and possible restrictions on the synaptic weight values imposed by the neural network model. Most neural networks have been formulated with full interconnections between neurons of adjacent layers. Several weight pruning techniques have been proposed to eliminate a large number of these connections in order to improve the generalization characteristics of the model [46]. As neural network models continue to be applied to real-world applications, a priori knowledge about the problem will increasingly be used to eliminate a great many interconnections, resulting in a more sparse and better structured interconnection network, such as the one shown in Figure 2-3(c). Furthermore, a priori knowledge about the problem can enforce certain restrictions on the weight values in the interconnection network. For example, the neocognitron network contains many feature sensitive neurons. These neurons have a limited number of connections to the previous layer, such that each neuron is connected to only a small neighborhood (or receptive field) of neurons in the previous layer. The model also requires that all the neurons associated with a specific feature type have the same weight value distribution. This method has also been termed weight sharing, since a single weight value is shared among a number of different neuron pairs. Each feature detecting neuron can thus be tuned to have a maximum output value when a particular activity pattern is presented in its input receptive field area.

Figure 2-3 - Examples of neural network interconnection structures. (a) One layer fully connected network. (b) Two layer network. (c) Two layer network with limited receptive field interconnections.
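As an illustration of a sparse, receptive-field interconnection structure with weight sharing, the sketch below builds such a weight matrix; the layer sizes, window width, stride rule, and kernel values are illustrative assumptions, not values taken from the dissertation.

```python
import numpy as np

def receptive_field_weights(n_prev, n_next, field, shared_kernel):
    """Weight matrix for a two-layer network with limited receptive fields and
    weight sharing (in the style of Figure 2-3(c)): each next-layer neuron
    connects only to a window of `field` neurons in the previous layer, and
    all neurons of the same feature type reuse one shared kernel of weights."""
    W = np.zeros((n_next, n_prev))
    stride = max((n_prev - field) // max(n_next - 1, 1), 1)
    for i in range(n_next):
        start = min(i * stride, n_prev - field)
        W[i, start:start + field] = shared_kernel   # same weights for every neuron
    return W

# Example: 16 input neurons, 6 feature-detecting neurons, a width-5 shared kernel.
kernel = np.array([0.1, 0.3, 0.5, 0.3, 0.1])
W = receptive_field_weights(16, 6, 5, kernel)
print((W != 0).astype(int))   # the sparsity pattern of the interconnection matrix
```

A fully connected single layer (Figure 2-3(a)) corresponds to the opposite extreme: a dense W with non-zero off-diagonal entries.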
2.1.4 Implementing Neural Network Learning Algorithms

In addition to parallel implementations appropriate for neural computation, we are also interested in methods for efficient implementation of learning algorithms. Neural network learning can be classified into three classes. The first is called one-time learning. In this method, the synaptic interconnection weights are determined without the use of neural computation and are based only on a specific equation (as in the case of the Hopfield network [33]) or on the actual patterns to be memorized (as in the case of the BAM [38] and PNN [80] models). The second class of learning algorithms is called unsupervised. Models utilizing unsupervised learning generate appropriate weight modification values based on the input pattern received and the current state of the network. Unsupervised learning algorithms generally employ a competitive layer, as described in Section 2.1.2, where a number of neurons compete to find the maximum or minimum output activation value. The third type of learning algorithm is the supervised learning method. This method requires an outside teacher to advise the network of the correct response for a given input pattern. This information is consequently used by the neural network to generate appropriate weight modification values.

From a computational point of view, one-time learning algorithms can be implemented "off-line" using any available method, since the calculation is performed only once and does not cause a computational bottleneck. The unsupervised learning strategies employing competitive learning, such as the Kohonen SFM [37], require a method for selectively modifying the synaptic weights associated with the winning neuron and possibly a number of other neurons in its local neighborhood. Since only the weights associated with one or a few neurons in the network are updated, the amount of available parallelism in such algorithms is fairly limited. Supervised learning methods, such as the back-propagation algorithm [66], generally propagate error values through the neural network using the synaptic interconnections. In most instances, an efficient implementation of the forward propagation of data, used for neural computation, can efficiently be applied in the learning phase.
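For the competitive (unsupervised) case, a minimal sketch of one winner-take-all learning step is shown below; the learning rate and the omission of a neighborhood update are simplifying assumptions and do not reproduce the full SFM algorithm.

```python
import numpy as np

def competitive_learning_step(W, x, lr=0.1):
    """One simplified winner-take-all learning step in the spirit of Kohonen's
    SFM: the neuron whose weight vector is closest to the input wins, and only
    its weights are moved toward the input (the full model also updates a
    shrinking neighborhood around the winner)."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # competition
    W[winner] += lr * (x - W[winner])                        # selective weight update
    return winner
```

The selective nature of this update is exactly what limits the available parallelism: only one (or a few) of the weight rows changes per presentation.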
Employing this type of parallelism increases the total throughput of the machine by executing a number of networks in parallel, but the execution rate for a single network is not increased. Network level parallelism can be used efficiently for various applications. For example, several neural network models use a number of parameters that are determined empirically through a trial-and-error procedure. In a machine with P processing elements, P different parameter settings can be evaluated in parallel by executing one neural network per processor, using network level parallelism. The solution of many iterative neural network models used in optimization applications, such as the Hopfield net [33], the Constrained net [78], and the Elastic net [13], is dependent on the initial state of the network. Identical networks with different initial states can be implemented on a parallel processing architecture utilizing network level parallelism. This type of mapping usually leads to a linear speedup factor since there is no need for communication between processors of the architecture.

Figure 2-4 - Exploiting network level parallelism by mapping complete networks to individual PEs.

Network level parallelism can also be used to increase the throughput of some neural network learning algorithms. In particular, learning algorithms which allow for so-called batch-mode training, such as the back-propagation algorithm, modify the network's synaptic weights after summing the individual weight contributions of a number of different training patterns. Parallel implementation of this technique can be accomplished by assigning a different input pattern to each of the identical neural networks implemented in different processors. The weight change contributions of the individual input patterns are summed together across the interprocessor communication network to arrive at the final weight modification values. Each processor in the system uses the summed modification values to update its copy of the network. Unlike the previously described uses of network level parallelism, this approach requires communication of weight modification values among all the processors in the machine. Therefore, efficient utilization of this approach requires high bandwidth communications between the PEs of the target architecture. An example of such an implementation is presented later in this chapter, and a brief sketch of the batch-mode update appears below. Network level parallelism in general is applicable to coarse grain parallel processing architectures that offer powerful processing elements with a large amount of memory.
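The batch-mode weight-update summation described above can be sketched in a few lines of plain Python. This is an illustration, not DREAM Machine code: the list of per-pattern contributions stands in for identical network copies held in different PEs, the per-pattern gradient is an illustrative delta rule for a single linear layer, and the all-reduce over the interprocessor network is simulated by an ordinary sum.

```python
import numpy as np

def local_weight_change(W, pattern, target, lr=0.1):
    """Weight-change contribution of one training pattern on one PE
    (illustrative delta rule for a single linear layer)."""
    error = target - W @ pattern
    return lr * np.outer(error, pattern)

def batch_update(W, patterns, targets):
    """Network level parallelism: each PE holds a copy of W and one pattern;
    the per-PE contributions are summed across the interprocessor network
    (here simulated by a plain Python sum) and applied to every copy."""
    contributions = [local_weight_change(W, p, t) for p, t in zip(patterns, targets)]
    total = sum(contributions)        # stands in for the all-reduce over PEs
    return W + total                  # every PE applies the same summed update

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(2, 4))
patterns = rng.normal(size=(8, 4))    # one pattern per (virtual) PE
targets = rng.normal(size=(8, 2))
print(batch_update(W, patterns, targets))
```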
2.2.2 Layer Level Parallelism

Layer level parallelism involves mapping each layer of a neural network to an individual processor of the architecture, see Figure 2-5. It is obvious that this type of parallelism is applicable to layered neural network models. Furthermore, the amount of parallelism gained through the use of this method is directly related to the number of layers in the network. Since most neural network models contain rather few (fewer than 10) layers, parallel implementations exploiting only layer level parallelism offer very limited speedup over uniprocessor implementations. Therefore, implementation methods using layer level parallelism are often utilized in conjunction with one or more of the other types of parallelism.

Figure 2-5 - Exploiting layer level parallelism by mapping consecutive layers of the neural network to consecutive PEs of the processing pipeline.

When implementing non-iterative feed-forward neural network models, such as the Multilayer Perceptron (MLP) network [66], the iteration index (t) of equation (2-3) is used as an index over the layer numbers; thus the computation is performed sequentially from one layer to the next. In order to take advantage of layer level parallelism, it is necessary to process a number of different input patterns concurrently in a pipelined fashion. In effect, layer level parallelism implements L complete networks concurrently, where L is the number of layers in the network. The restrictions regarding the applicability of this method for implementing neural networks are similar to those imposed by network level parallelism. The major difference here is that the effective throughput of the system is determined by the time associated with processing the largest layer in the network, whereas the throughput of implementations utilizing network level parallelism is determined by the time required to process a complete network.

As defined in Section 2.1.3, the neurons of a given layer in a multilayer neural network are only connected to neurons of the adjacent layers and possibly to each other. This type of interconnection structure requires the computer architecture to have a high communication bandwidth between adjacent PEs. The amount of information that is to be transferred between two adjacent processors is proportional to the number of neurons in the layers that are mapped to those processors. The efficiency of implementations utilizing layer level parallelism is related to the match between the neural network structure (number of layers in the network) and the computer system's size and topology. This fact imposes severe restrictions on using this mapping approach for implementing neural networks of varying structures.

2.2.3 Neuron Level Parallelism

By far, most parallel implementations of neural networks have taken advantage of neuron level parallelism. This approach involves mapping each of the neurons in the neural network to a distinct processing element of the architecture, see Figure 2-6. Neurons constitute the fundamental parallel operations in all neural network models. Therefore, the mapping method based on neuron level parallelism can be applied to a wide range of neural network models. This is the major advantage of employing neuron level parallelism over the previously mentioned methods. The main difficulty in achieving practical implementations based on neuron level parallelism is in accommodating the complex and demanding interconnection structures employed by neural networks on the limited physical topology of the target computer architecture.

Figure 2-6 - Exploiting neuron level parallelism by mapping individual neurons to distinct PEs.

Due to the large number of neurons in a typical neural network, mapping methods utilizing neuron level parallelism are applicable to medium and fine grain parallel processing architectures. Each processor in the system requires only a small amount of memory, since only the synaptic weights associated with a neuron's input or output connections are stored in the local memory of each processor.
Since it is impractical to build fine grain systems with fully interconnected topologies, these mapping methods must use multiplexing techniques to map the large number of synaptic connections employed by neural networks onto the limited physical connections of the architecture.

The interprocessor communication bandwidth required by systems utilizing neuron level parallelism can be considerably large. Each neuron, mapped to a specific processor, must communicate its output value to other neurons in the network, each mapped to a different processor in the system. The number of data items transferred among the various processors is thus proportional to the number of synaptic connections used by the neural network. Two primary methods can be used for realizing the synaptic interconnection structures on parallel processing architectures. The first method is based on systolic nearest neighbor communications between processors. In this approach, the regularity of the neural network structures is exploited to efficiently communicate information among neurons mapped to various processors. The second method involves the use of a message-passing network to transfer the neuron output values to their appropriate destination processors. In general, the message-passing approach requires a longer latency for data transfers between neurons but offers greater flexibility in implementing neural networks with arbitrary interconnection structures.

2.2.4 Synapse Level Parallelism

The finest grain parallel operations that are performed by neural networks are those performed by individual synaptic connections of the model. As mentioned in the introductory chapter, the most common operation performed by a synapse is to multiply its weight value by the associated neuron's output value. By mapping each synaptic connection to a unique PE, all of the operations performed by the synapses can be executed in parallel, see Figure 2-7. As with neuron level parallelism, synapse level parallelism is applicable to a large variety of models, since synapses are also a fundamental part of each neural network model.

Figure 2-7 - Exploiting synapse level parallelism by mapping single synapses to each PE. In this figure S denotes the number of synaptic connections used in the neural network.

In order to efficiently exploit all the available parallelism at the synapse level, the supporting computer architecture should be very fine grain, offering a large number of simple processors. In order to implement the neural computation of equation (2-1), each neuron output value must be broadcast to all the PEs holding its output synaptic weights. The appropriate weighting function can be applied to the neuron output values by all the processors in unison. Calculating the weighted sum input to a neuron involves communicating and accumulating the values held in each processor holding an input synaptic weight of a specific neuron. This scatter-gather type operation is performed in parallel for all the neurons in the network. Such operations require sophisticated routing and communication support. Implementations on architectures that efficiently support synapse level parallelism are rather robust to the neural network interconnection structure, since no specific assumption about the network structure is made by the mapping.
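The scatter-gather pattern that synapse level parallelism relies on can be illustrated with a short sketch. It computes the weighted sums of equation (2-1) in three bulk steps: broadcast (scatter) each neuron output to the PEs holding its outgoing weights, multiply in all synapse PEs in unison, and accumulate (gather) the products per destination neuron. The flat synapse list and the helper name are assumptions made for this example only.

```python
import numpy as np

# Each entry models one synapse PE: (source neuron, destination neuron, weight).
synapses = [(0, 2, 0.5), (1, 2, -0.3), (0, 3, 1.2), (2, 3, 0.7), (3, 0, -1.0)]
activations = np.array([1.0, 0.5, -0.2, 0.8])

def synapse_level_weighted_sums(synapses, activations, n_neurons):
    # Scatter: every synapse PE receives the output of its source neuron.
    received = np.array([activations[src] for src, _, _ in synapses])
    # Multiply: all synapse PEs apply their weight in unison (one SIMD step).
    products = received * np.array([w for _, _, w in synapses])
    # Gather: products are accumulated per destination neuron.
    sums = np.zeros(n_neurons)
    for (_, dst, _), p in zip(synapses, products):
        sums[dst] += p
    return sums

print(synapse_level_weighted_sums(synapses, activations, n_neurons=4))
```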
2.3 Parallel Implementation Methods

In this section, several methods for implementing neural network models on parallel processing architectures are described. The range of methods used for parallel implementation of neural networks is highly diverse and covers many different levels of design granularity as well as implementation technologies. Among the various parallel processing methods used for implementation of neural networks, two principal approaches have been pursued extensively: one in the development of dedicated hardware devices [22, 24, 29, 31, 53, 54, 67, 68], and another in the design of parallel computer architectures and associated mapping methods [5, 9, 11, 12, 18, 20, 25, 30, 34, 35, 41, 43, 50-52, 55, 59-61, 64, 73, 79, 87-89]. The approach of directly mimicking the processing of neurons and connections in hardware, either via analog or digital computation, is limited in applicability, since these devices offer little flexibility in the type of algorithm and the network interconnectivity structure. These devices have also been difficult to use as building blocks for larger neurocomputer systems, since physical pin-out limitations create a bottleneck for transfer of data between various parts of the neural network. Nevertheless, this approach can lead to high processing rates and a relatively compact size suitable for special-purpose implementations. Due to their limited applicability, specific device level implementations are not described in this chapter. However, the various parallel processing approaches and examples described in this chapter can be used as a basis, or to point out viable directions, for further work in physical device implementations. Among these approaches, implementations on programmable digital parallel processing systems offer the best compromise between flexibility and speed for processing a great variety of neural network models. This approach offers high throughput rates by employing a large number of processing elements. It also offers sufficient programmability for implementing a wide array of neural network models. These systems can be used both for implementing time sensitive processing, and as simulators and development tools for designing new neural network models. A major challenge in the design of such systems is to achieve the maximum amount of flexibility and speed at the minimum cost.

Our presentation in this chapter is mainly focused on methods for implementing neural networks on programmable parallel architectures. In addition to concentrating on digital parallel processing systems, we further limit the discussion to implementations on the Single Instruction stream Multiple Data stream (SIMD) class of parallel processing architectures. The SIMD execution paradigm is an ideal match for implementing neural network models, primarily due to the homogeneous nature of the operations performed by neurons. SIMD machines are efficient since there is only a single copy of the program being executed by all the processors. The SIMD paradigm allows for simple program development on fine grain parallel machines with minimal overhead in implementing interprocessor synchronization mechanisms. Moreover, limited autonomy in each PE, such as conditional masking and support for a local table lookup mechanism, can significantly increase the efficiency of these architectures for implementing neural networks.
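The SIMD execution model with conditional masking, as referred to above, can be summarized in a few lines of ordinary Python. In this toy sketch one instruction (a Python function) is broadcast to all PEs, each PE applies it to its own local data, and a per-PE mask bit decides whether the result is actually stored; none of the names below correspond to actual machine instructions.

```python
import numpy as np

class SimdArray:
    """Toy SIMD array: one broadcast instruction, many local data streams."""
    def __init__(self, local_data):
        self.data = np.array(local_data, dtype=float)    # one value per PE
        self.mask = np.ones_like(self.data, dtype=bool)  # all PEs enabled

    def broadcast(self, op, operand):
        """All enabled PEs apply the same operation to their local data."""
        result = op(self.data, operand)
        self.data = np.where(self.mask, result, self.data)

    def set_mask(self, predicate):
        """Conditional masking: each PE enables/disables itself from a local test."""
        self.mask = predicate(self.data)

pa = SimdArray([1.0, -2.0, 3.0, -4.0])
pa.set_mask(lambda x: x < 0)              # only PEs holding negative values stay enabled
pa.broadcast(lambda x, c: x * c, -1.0)    # controller broadcasts "multiply by -1"
print(pa.data)                            # [1. 2. 3. 4.]
```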
Implementing neural network models on parallel processing architectures requires a method for efficiently mapping neural computations to the PEs of the target architecture. In general, the objective of this mapping is to distribute the required computation among the PEs such that the amount of interprocessor communication is minimized and, at the same time, the number of PEs performing useful operations (i.e., those operations that directly contribute to the neural network processing) is maximized. Many approaches for mapping neural networks onto parallel SIMD architectures have been proposed [5, 9, 11, 23, 25, 42, 55, 60, 61, 64, 73, 88, 89]. The performance of any implementation is strongly related to its efficiency in performing the weighted sum operation of equation (2-2) in a distributed fashion. This requires accumulating the various weighted neuron output values that are distributed across a number of different PEs in the processing array. The implementation of computations requiring only data local to the neuron, such as the application of the transfer function to the partial sum value, does not require communication among neurons and can therefore be processed efficiently in parallel regardless of the interprocessor communication network topology. The presentation below is organized according to each mapping method's efficiency in implementing neural networks with specific types of interconnection structures. These include mappings applicable to dense and regularly interconnected networks, sparsely interconnected networks, and arbitrarily interconnected networks.

2.3.1 Implementations Applicable to Dense and Regularly Interconnected Neural Networks

Many neural network models are comprised of densely and regularly interconnected neurons, as described in Section 2.1.3. This regularity in the interconnection structure can be used to arrive at efficient implementation methods on regularly structured parallel processing architectures. These methods can be applied to fine grain parallel processors by exploiting neuron level parallelism. The degree of parallelism at this level is equal to the number of neurons in the network (N). Ideally, such a scheme can lead to a comparable speedup factor of N. This speedup can only be attained if the interprocessor communication delays of the mapping amount to zero. Two distinct methods can be used to achieve zero communication overhead costs. One method is to assume full interconnections among the processors of the processing array. Since this approach is prohibitively costly for fine grain architectures, it can be eliminated as a viable option. The second method involves scheduling the flow of data through the processing array such that the appropriate neuron output values arrive at a processor at the correct instant to perform the input sum accumulation operation. This accumulation operation involves summing up a number of neuron output values distributed among the various processors. This type of processing yields zero communication overhead since each processor can perform only a single multiply-accumulate operation during an instruction cycle, and thus the communication steps can be overlapped entirely by computations.
A very efficient mapping method for implementing neural computation using this approach on systolic array processors has been proposed in [42]. In this method the neurons and the synaptic interconnection network are represented as vectors and matrices, respectively, and the neural computation is formulated as in equation (2-3). This approach treats all neural computations as operations on vectors and matrices. The mapping method takes advantage of the regularities in vector-matrix operations to synchronize the operations of the systolic array. It exploits neuron level parallelism by assigning each neuron to a unique processor in the processing array. The processors are arranged in a 1-D ring topology, with the number of PEs in the machine being equal to the number of neurons in the network. The operations involved in implementing equation (2-3) are depicted in Figure 2-8, where u_i(t) represents the partial weighted sum input to a neuron at time t. The neurons are mapped to processors, and the synaptic weights are organized and accessed in such a fashion that the input value w_ij*a_j needed by neuron i is added to the partial sum value u_i at the instant that both w_ij and u_i arrive at the processor holding a_j.

Figure 2-8 - Implementing a fully connected neural network with 16 neurons on a ring-connected systolic array. (a) Assignment of neurons to processors and flow of data. (b) Computation performed in each PE at time t: u_i(t+1) = u_i(t) + a_j*w_ij.

By utilizing N processors, such a processing scheme can execute the N^2 operations (associated with the N^2 synapses) in O(N) time steps. Therefore, properly structured neural networks can achieve optimal system utilization and consequently generate very high throughput rates. Unfortunately, the mapping approach places a number of restrictions on the interconnection network structure in order to achieve maximum efficiency. One such restriction is that the interconnection matrix is required to be square, having an equal number of rows and columns, and fully interconnected. Zero-weighted synapses and neurons with zeroed output values are introduced to satisfy these constraints, should the network structure be sparse and/or non-square. The efficiency of this implementation method is directly related to the density of the interconnection network, since the addition of filler synapses and neurons does not contribute to the actual computation and is used solely to ensure proper synchronization of data movement through the array.

This mapping method implements equation (2-3) for a single iteration or a single layer of interconnections. For implementing multilayer neural networks, the processing can be performed sequentially using the entire processing array for each of the layers in the network, starting from the input layer and propagating towards the output layer. This approach leads to a processing rate of O(L*N*), where N* = max{N_l}, L is the number of layers in the network, and N_l is the number of neurons in layer l. This approach assumes an equal number of neurons per layer and can lead to considerable inefficiencies in implementing neural network models with large discrepancies between the numbers of neurons in the various layers.

Another approach to implementing multilayer neural networks, proposed in [41], involves the use of layer level parallelism in addition to neuron level parallelism.
In this approach, each layer is mapped onto a row of a 2-dimensional processing array with wrap-around connections in the East-West directions. Each row of the processing array can then perform the computation associated with one layer, and the complete network computation moves in the vertical direction between the rows of the processing array. Using pipelining techniques, L networks can be implemented concurrently to achieve an effective throughput rate of O(N*). This approach can impose even more stringent requirements on the architecture, requiring that the number of rows in the processing array be equal to the number of layers in the neural network.

2.3.2 Implementations Applicable to Sparsely Interconnected Neural Networks

Although most neural network models are formulated with an assumption of a dense interconnection structure, recent findings have demonstrated that performance improvements can be achieved by eliminating a large number of unnecessary synaptic connections [46, 82]. In addition to algorithmic approaches used to prune the neural network structure, a priori knowledge about the specific application can be used to design a neural network with structured but sparse interconnections [70]. Two methods for efficient implementation of sparsely interconnected neural networks, without placing specific demands on the regularity of the network interconnection structure, are described here.

The first method, proposed in [61], exploits synapse level parallelism by assigning individual synapses as well as neurons to unique processors of a 2-D mesh-connected parallel processing architecture. The computation method is based on an algorithm for efficient implementation of sparse vector-matrix multiplication operations on 2-dimensional, mesh-connected, SIMD architectures. The algorithm involves distributing the neuron output values, allocated to various PEs in the machine, to those processors holding their corresponding output synaptic weights. This distribution procedure is performed in parallel via a congestion-free three-phase routing algorithm using O(√P) cycles, where P is the number of processors in the array. When all the neuron output values have been received by their associated synapse PEs, a multiplication operation is performed in unison in all the synapse processors, taking O(1) time steps. The next step in the processing involves summing the weighted input values of a given neuron, where each value is assigned to a different PE of the machine. The same three-phase congestion-free routing algorithm is used to perform this gathering operation. The execution rate of this mapping method is thus O(√P), with a constant multiplicative factor of approximately 24.

Since the number of processors in the array is equal to the number of neurons and synapses of the neural network, the throughput of this mapping method is O(√(N + E)), where E denotes the number of synapses in the network. For a fully interconnected neural network, where E = N^2, the asymptotic performance of this mapping is O(N). On the other hand, assuming that in large scale neural network models each neuron is connected to only a constant number (k*) of other neurons in the network (as is the case with biological neural networks), the number of connections in the network will be E = k*N. Consequently, the throughput rate of this mapping method would be O(√(N + k*N)), i.e., O(√N). It is apparent that the performance of this implementation method is directly related to the sparsity of the neural network.
Although this implementation method and the one described for ring-systolic architectures [42] in Section 2.3.1 achieve the same asymptotic performance for implementing fully interconnected neural networks, the much smaller multiplicative constant of the ring-systolic method makes it a much more attractive solution for implementation of densely interconnected networks. On the other hand, the three-phase implementation method [61] is more efficient for implementing neural network models with sparse interconnectivity.

Another approach for efficient implementation of neural networks with sparse interconnection structures has been proposed in [84]. This approach exploits neuron level parallelism by assigning every neuron to a unique processor of the system. The computer architecture used for this mapping utilizes the SIMD execution paradigm with a nearest neighbor interconnection topology. The fundamental premise of the method is based on having local autonomy such that each processor can independently select one of its neighbors for communication. The architecture is thus similar to that of the DREAM Machine proposed in this dissertation (see Chapter 3). This autonomy in interprocessor communication is achieved by having dynamically reconfigurable interconnection switches in the nearest neighbor topology.

With each neuron mapped to a different processor in the system, implementing equation (2-2) requires communication among all the processors participating in the evaluation of a single neuron output value. The output value of each neuron is communicated to other processors by a message-passing mechanism using intermediate PEs for routing. In order to assure congestion-free routing, the algorithm creates incrementally longer communication paths at each iteration. This mapping method has been shown to be effective only for extremely sparse networks [12], since the number of useless communication steps grows exponentially with respect to the interconnection density of the network. In addition, with longer paths, the amount of memory required at each PE to support the routing table also grows exponentially.

2.3.3 Implementations Applicable to Arbitrarily Interconnected Neural Networks

The mapping method described in Section 2.3.1 is efficient for implementing densely connected neural networks but inefficient for implementing sparsely connected networks. The methods described in Section 2.3.2 are efficient at implementing sparsely connected neural networks but not efficient for densely connected networks. In this section, two drastically different mappings are described that are efficient regardless of the interconnection density or structure of the neural network being implemented.

The first method, reported in [60], utilizes only network level parallelism to achieve speedup. A complete neural network is mapped to each of the processors of the machine. In the actual implementation on the Warp machine, 10 networks were implemented in each PE to match the computation and communication speeds of the architecture. Weight values were stored in a global memory area and were circularly passed among the various PEs, see Figure 2-9. Since all the values necessary for implementing a given network can be stored locally in each PE, the efficiency of this implementation method is not dependent on the interconnection network structure or density. The throughput rate of this mapping method is thus O(N^2/P).
As mentioned previously, network level parallelism is of limited use since it is only applicable to coarse grain architectures, where most often P << N. Consequently, the amount of speedup achieved by this method is negligible for implementing large neural network models.

Figure 2-9 - Using network level parallelism for implementing neural networks on the Warp machine.

The second implementation method applicable to arbitrarily connected neural networks has been proposed in [5]. This mapping is in principle very similar to the one reported in [61], described in Section 2.3.2. This mapping method has been used for implementing neural networks on the Connection Machine [27], a massively parallel processor. The method exploits neuron and synapse level parallelism by assigning neurons and synapses to distinct processors, see Figure 2-10. Due to the high bandwidth interconnection network of the Connection Machine, the scatter and gather operations involved in communicating the neuron output values and summing the weighted input values can be performed efficiently regardless of the number and specific locations of the PEs holding the synaptic weight values. Nevertheless, communication overheads associated with this mapping become apparent when the performance of this approach is compared to that of [89], which utilizes both neuron and network level parallelism on the same target machine [79].

Figure 2-10 - Implementation of neural networks on the Connection Machine, from [79]; the array contains neuron processors, fan-out synapse processors, and fan-in synapse processors.

Chapter 3

The DREAM Machine

In this chapter, we present the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) Machine. The DREAM Machine is a parallel processor designed specifically for efficient implementation of neural network models. Due to the considerable diversity in the structure and computational requirements of neural network models and the wide range of applications utilizing these models, the DREAM Machine architecture is designed to be scalable and to offer programmability in both computation and communication operations. Its computational flexibility is obtained through the use of digital programmable Processing Elements (PEs). The communication flexibility is achieved through the use of programmable interconnection switches used for communicating data between neighboring PEs.

The description of the DREAM Machine architecture presented in the following sections is intended as a general guideline for the design and implementation of the machine. Therefore, specific details of the design, such as clock rates, the number of words in the register file, arithmetic function unit design, etc., are left to be determined based on the implementation technology and the allotted cost of the final product. In the following sections we elaborate on special details of the DREAM Machine architecture which are unique to this design and are necessary for efficient implementation of neural network models.

3.1 System Level Overview

The DREAM Machine can roughly be classified as a Single Instruction stream Multiple Data streams (SIMD) [17] medium- or fine-grain parallel computer. The Processing Elements (PEs) are arranged on a 2-D lattice where each PE is connected to eight of its closest neighbors through 4 programmable switches.
The top level architecture of the DREAM Machine is depicted in Figure 3-1.

Figure 3-1 - The top level design of the DREAM Machine.

3.1.1 System Organization

The DREAM Machine system consists of three major units: the host computer, the controller, and the Processor Array (PA). The DREAM Machine can be viewed as an attached coprocessor of the host computer. Its memory can be accessed by the host computer through the use of a high speed Direct Memory Access (DMA) channel. The host computer needs only to specify the memory location of the data block to be accessed in the DREAM Machine memory space and the number of words to be transferred. The DMA controller can transfer the data without using any additional cycles from the host computer or the coprocessor. This feature permits a simple programming interface between the host and the DREAM Machine. The host computer is primarily used for properly formatting the input data, long term storage of data, and as a visual interface between the user and the DREAM Machine.

The controller unit interfaces to both the host computer and the PA. The controller contains a microprogram memory area that can be accessed by the host. High-level programs can be written and compiled on the host, and the generated control information can be downloaded from the host to the microprogram memory of the controller. The controller broadcasts an instruction and a memory address to the PA during each processing cycle. The processors in the PA perform the operations received from the controller on their local data. Each processor can selectively be masked from performing computation based on a mask flag available in each PE. The DREAM Machine architecture allows for conditionally modifying the status of the mask bit in each processor, giving an additional degree of autonomy to each processor.

The Processor Array (PA) unit contains all the processing elements and the supporting interconnection switches. Each Processing Element (PE) in the PA has direct access to its local column of memory within the DREAM Machine memory space. Due to this distributed memory organization, memory conflicts are eliminated, which consequently simplifies both the hardware and the software designs.

3.1.2 Execution Paradigm

The choice of an SIMD execution paradigm for neural computation can be justified by considering the fact that neural network models are built of a homogeneous array of neurons, with each neuron performing the same basic operation on a different data set. This type of computation is a perfect match for implementation on SIMD architectures, since all processors perform the same computation on their locally stored data. Certain neural network models, such as the Adaptive Resonance Theory (ART) [10] and the neocognitron [19], utilize different neuron types in different layers. In general, the number of different neuron types is small (two in the case of ART and the neocognitron). Furthermore, the majority of operations performed by either type of neuron involve identical processing steps. Therefore, these models can be effectively implemented on an SIMD machine by time-multiplexing the operations that are specific to each neuron type, without a significant loss in throughput. In addition, the use of the SIMD execution paradigm leads to a considerable saving in instruction memory requirements, since only a single copy of the program is stored for the entire system.
The tightly coupled, distributed memory structure of the DREAM Machine offers a simple scheme for interprocessor communication. Communication between processors is performed one word at a time, requiring only a single instruction cycle of each processor. Communication and computation operations can be initiated in parallel by a single instruction. When implementing systolic algorithms, the communication operations can be completely overlapped by computation and, consequently, attain very high execution efficiency.

3.2 Processing Element Design Description

The Processing Elements of the DREAM Machine make up the computational engine of the system. As mentioned earlier, the PEs are part of the Processor Array subsystem. All PEs in the PA receive the same instruction stream but perform the required operations on their own local data streams. Each PE is comprised of a number of Functional Units (FUs), a small register file, interprocessor communication ports, and a mask flag, see Figure 3-2.

Figure 3-2 - Top level design of the Processing Elements.

3.2.1 Internals of the Processing Elements

The functional units in each PE include an adder, a multiplier, and a shift/logic unit. Depending on the specific implementation, additional FUs can be added to the design. Similar to many RISC type processors, each PE has an internal data bus to transfer data between the various units. For example, data could move from a register in the register file to one of the operand registers of the adder unit, or data can be transferred from the output register of the multiplier to the I/O output port. A mask bit is used to enable/disable the FUs from performing the operation instructed by the controller. The status of the mask bit can be set unconditionally by the controller, or determined based on a logical operation performed by one of the FUs. The mask bit can be modified during each instruction cycle.

The G-bit register in each PE holds the Most Significant Bit (MSB) of the shift register output. This bit is used to implement two separate functions. The output value of this bit is directly connected to the G-bus, see Section 3.5, and is also buffered and used as input to the address-modifying shift register in the PE's local memory area. The G-bus connection performs global type operations, such as global broadcasts and finding the maximum or minimum activation value. It is also used to modify the memory address value received from the controller according to a locally stored value. This function is used to implement a variable accuracy lookup table, see Section 3.3.

Each PE communicates with its neighbors through the I/O ports. Only one input and one output port are required in each PE, since during each systolic cycle only a single data value is received and transmitted by a PE. The output of each of the I/O registers is connected to the four X-net switches surrounding each PE. The selection of the destination PE for an outgoing data value and the source PE for an incoming data value is made by the specific switch settings of the X-net switches.

3.2.2 Processor-Memory Interface

Each processor of the DREAM Machine has read/write access to its own local memory area, see Figure 3-3. The memory is kept off chip in order to allow for simple memory expansion.
During each processing cycle, the memory address associated with each instruction is broadcast by the controller unit to all the PEs. In this fashion, all the processors access a single plane of memory at each time step. The memory access speed is matched with the computation rate of each processor so that each memory access can be completely overlapped with computation. This feature allows for efficient systolic processing.

Each word in the PE's local memory area is comprised of two distinct fields. The first is the data field, used to store and retrieve the actual data associated with the computation, such as neuron activation values, synaptic weight values, etc. The second field holds 4 bits which indicate the switch setting for the X-net switch associated with each PE; two bits are used for selecting an input port and two bits for selecting an output port. These switch setting values are determined prior to the start of computation and are pre-loaded by the host. A new switch setting value can be read during each instruction cycle.

Figure 3-3 - Processor and memory detail diagram. Associated with each data value in the local memory of each PE is a switch setting value used to configure the communication topology of the machine.

3.2.3 Instruction-Word Description

The microword instruction used by the DREAM Machine contains two distinct fields, one for specifying the memory address and another for specifying the operation to be performed by the processors. The instruction set of this machine consists of the conventional arithmetic, logical, and shift operations (e.g. addition, multiplication, AND, OR, left and right shifts, etc.). In addition, the DREAM Machine instruction set also supports global broadcasting of data from the controller to all the PEs in the processing array. This is accomplished using an Instruction/Data bit in the instruction word. If this bit is set to zero, the remaining bits of the instruction word are treated as a single data value. If it is set to one, the remaining bits of the instruction word are decoded and used for their appropriate functions within the PE.

In addition to supplying memory addresses and control instructions to the PA, each instruction word contains a specific field to control the loading and shifting of data into the memory address modifying register. This field is used when the memory address supplied by the instruction needs to be uniquely modified based on some local information in each processor, as in the case of a table lookup.

3.3 Implementing a Table Lookup Mechanism on the DREAM Machine

A novel feature of the DREAM Machine architecture is its hardware support mechanism for implementing a variable accuracy lookup table. Neural network models use a variety of non-linear transfer functions (represented as f in equation (2-1)), such as the sigmoid, the ramp, and the threshold functions (see Figure 2-2). These functions can be efficiently implemented through the use of a lookup table. Implementation of a table lookup mechanism on an SIMD architecture requires a method for generating or modifying the memory address supplied by the controller, based on some local value in each PE. The BLITZEN Machine [6] performs this task by logically ORing the 10 most significant bits of the memory address supplied by the controller with a local register value.
Such a scheme does not offer sufficient flexibility for a general-purpose neurocomputer design. The accuracy, or level of quantization, of the neuron output values tolerated by neural networks can vary significantly (from 2 to 16 bits) among different neural network models and different applications of each model. In order to accommodate lookup tables of varying sizes, the DREAM Machine incorporates two shift registers that are used to modify the address supplied by the controller. One shift register is associated with the PE and holds the data value used for addressing the lookup table. The other shift register is associated with the PE's local memory and is used to modify the address received from the controller, see Figure 3-3. The table lookup procedure is initiated when the controller loads the base address of the table into each of the shift registers associated with each PE's local memory. The offset value is then shifted into this register one bit at a time from the local register in the PE, starting from the most significant bit and entering through the least significant bit of the memory address register. The control signals for this shifting operation are generated by the controller and are broadcast to all PEs as part of the microinstruction word. With this procedure, an address for a table of size 2^k can be generated in k time steps by each processor.
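The address generation step can be illustrated with a short, purely functional sketch. It mimics the shift-based scheme described above with ordinary Python integers rather than hardware registers; the page-number framing of the base address, the 8-bit table size, and the quantized sigmoid contents are assumptions made only for this example.

```python
import math

def lookup_address(page, local_value, k):
    """Build a table address in k shift steps: the controller supplies the table's
    base (here its page number, i.e. base address / 2**k), and each PE shifts its
    k-bit local value in MSB first, one bit per instruction cycle."""
    address = page
    for step in range(k - 1, -1, -1):
        bit = (local_value >> step) & 1
        address = (address << 1) | bit
    return address

k = 8                                   # 2**8-entry table; a smaller k gives a coarser table
page = 0x12                             # table occupies addresses 0x1200 .. 0x12FF
memory = {lookup_address(page, v, k): 1.0 / (1.0 + math.exp(-(v - 128) / 16.0))
          for v in range(2 ** k)}       # quantized sigmoid stored once per PE

local_value = 200                       # quantized partial-sum value held in one PE's register
print(memory[lookup_address(page, local_value, k)])
```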
3.4 Interprocessor Communication Network

Two methods of interprocessor communication are supported by the DREAM Machine architecture. The processors in the PA are arranged on a 2-D lattice with eight nearest-neighbor connections, implemented through dynamically reconfigurable switches. In addition to the nearest neighbor communications performed through these switches, a single-bit-wide global bus, called the G-bus, is used to perform bit-serial global communications among the PEs.

3.4.1 Nearest Neighbor Communications

Nearest neighbor communications in the PA are performed through dynamically reconfigurable switches, each connecting one PE to three of its eight nearest neighbors. There are four switches connected to each PE, see Figure 3-4. The basic switch design is similar to the X-net switch used in other systems, such as the MasPar and the BLITZEN machine [6]. The X-net switch allows for communication between eight nearest-neighbor PEs while requiring only 4 I/O connections to each PE. The communication bandwidth between adjacent PEs is kept equal to the memory access bandwidth to assure efficient systolic processing. The most unique feature of the DREAM Machine architecture is that it allows each switch in the interconnection network to be distinctly configured.

Figure 3-4 - A single processor and its associated interprocessor communication switches.

The switch settings are stored in the local memory of each switch and accessed at the beginning of each processing cycle. The switch memory address is supplied by the controller at each cycle. This design allows for a dynamically changing flow pattern of data through the processing array. In other 2-D connected parallel SIMD architectures, such as the Hughes SCAP [63] and the AMT DAP [57] architectures, all the PEs perform the same communication operation. In other words, the instruction word broadcast by the controller specifies which direction the data is to be moved in the processing array, e.g. North to South, East to West, West to South, etc. In the DREAM Machine, on the other hand, one PE can receive data from its North neighbor while another is receiving data from its West neighbor, depending on the local switch settings. An example implementation of this switch is shown in Figure 3-5, using a multiplexer and a demultiplexer to select one of four inputs and route the data to one of four outputs. This flexibility in interprocessor communication is essential for efficiently implementing neural networks with sparse or block structured interconnections, as will be demonstrated in Chapters 5 and 6.

Figure 3-5 - Schematic of the reconfigurable interprocessor communication switch (1:4 multiplexer and 4:1 demultiplexer).

3.4.2 Global Communications

Another method of interprocessor communication supported by the DREAM Machine is through a single bit global bus, called the G-bus, electrically connecting all the PEs together, see Figure 3-3. The principal use of the G-bus is to perform global broadcast communications among all processors. The G-bus can be used to efficiently implement competitive neural network models, see Section 5.9. A number of neural network models, such as the self-organizing feature maps [37] and Adaptive Resonance Theory (ART) [10], utilize various versions of the competitive learning algorithm [39]. Competitive learning requires that the neuron with the maximum activity value be identified. In a distributed system where each neuron is mapped to a different PE, this task might require O(N) time, where N is the number of neurons in the network. Since in general N can be quite large, a more efficient method for determining the neuron with the maximum or minimum output value is required.

The G-bus is designed as a wired-OR circuit such that each PE can pull down the bus by setting a local flag register to zero. It can thus be used to implement a global minimum or maximum operation in k time steps, where the data to be evaluated is k bits, using the procedure described in [75, 86]. This procedure involves simultaneous broadcasting of the data values over the G-bus by all the PEs, one bit at a time, starting from the most significant bit. After the broadcast of each bit is completed, all the processors compare their local data value with the wired-OR value on the G-bus. This allows each PE to selectively mask itself out of future broadcasts if another PE in the system has a greater or smaller data value, associated with the max or min function respectively. A useful side effect of this method of finding the global min/max value is that after the completion of the k processing steps, all the PEs contain the min/max value. In addition to neural network applications, this feature can be useful in implementing other algorithms, such as genetic algorithms [21], where all the PEs require access to the global min/max cost value.
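The bit-serial max/min procedure outlined above can be expressed compactly in software. The snippet below models the wired-OR bus as a logical OR over the bits driven by the still-active PEs and finds the global maximum in k steps; it is an illustration of the general idea from [75, 86], not DREAM Machine microcode, and the sample values are arbitrary.

```python
def gbus_global_max(values, k):
    """Find the maximum of one k-bit value per PE using a simulated wired-OR bus.
    Each step broadcasts one bit (MSB first); a PE whose bit is 0 while the bus
    reads 1 knows some other PE holds a larger value and masks itself out."""
    active = [True] * len(values)          # per-PE participation flags
    result = 0
    for step in range(k - 1, -1, -1):
        bits = [(v >> step) & 1 for v in values]
        bus = any(b for b, a in zip(bits, active) if a)   # wired-OR of the active PEs
        if bus:
            # PEs that drove a 0 drop out of the competition; the rest continue.
            active = [a and b == 1 for a, b in zip(active, bits)]
        result = (result << 1) | int(bus)  # every PE reconstructs the max locally
    return result, [i for i, a in enumerate(active) if a]

values = [93, 200, 157, 200, 12]           # one 8-bit activation value per PE
maximum, winners = gbus_global_max(values, k=8)
print(maximum, winners)                     # 200 [1, 3]
```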
Chapter 4

Mapping Method Preliminaries

In this chapter, we present the basic concepts involved in mapping neural computation onto parallel processing architectures. The discussion here is specifically directed at a particular mapping strategy which takes advantage of neuron level parallelism. We state our reasoning behind selecting such a mapping strategy and discuss the various implications involved in its application. The basic principles of the mapping method, assignment of neurons to processors and formation of computational paths, are described in Section 4.1. As a basic starting point, we analyze the performance characteristics of the mapping method used for implementing neural networks on fixed-size ring systolic architectures [42], described previously in Section 2.3.1. This analysis demonstrates that the mapping method can be optimal, in the sense of full utilization of system resources, in some specific applications. At the end of this chapter, we present an extension to this basic mapping method, utilizing network level as well as neuron level parallelism, in order to enhance the performance of the mapping when applied to 2-D connected parallel processing architectures. In Chapters 5 and 6, we present two mapping methods which are based on a generalized version of the basic mapping described in this chapter. These new mapping methods can be applied with much greater efficiency to implement a larger number of neural network interconnection topologies.

4.1 Mapping Principles

The mapping method described in Section 2.3.1, for implementing neural networks on ring-connected systolic architectures, can be very efficient in principle. Given full interconnections between all of the layers in the neural network and a perfect match between the processor ring size and the neural network size, this mapping can achieve 100% system utilization and therefore a linear speedup factor. This type of utilization is possible since the mapping has the potential to keep all of the processors in the array busy performing useful operations. Inefficiency in the mapping arises in cases where the network structure is not regular and its size does not match the processing array size. In such cases NOP operations are used to synchronize the flow of data to conform the neural network to the processor array. In order to reduce the effects of the neural network structure on the implementation efficiency, we now propose a generalized version of this mapping method. Due to the use of neuron level parallelism, this mapping method can be applied efficiently to a great variety of neural network models.

4.1.1 The General Mapping Method

In our general mapping approach, individual neurons are assigned to distinct processors of the parallel architecture. For the sake of simplicity, at this time we assume that the number of processors in the target machine is always greater than or equal to the number of neurons in the network. The cases for which this assumption is not valid are discussed separately for each mapping method in Chapters 5 and 6. In general, the assignment of neurons to processors can be performed arbitrarily. Performing the neural computation operation of equation (2-1) involves the calculation of a weighted sum of all the inputs to a specific neuron. Since all of these inputs are output values of other neurons in the neural network, each value to be added is located at a different processor in the machine. Thus, the summation operation can be performed in a distributed fashion.

A small segment of an arbitrarily connected neural network and an example mapping are shown in Figure 4-1. The basic computation performed by each processor in the machine is identical to the one described in Section 2.3.1. More specifically, a PE holding the output value of neuron j receives the partial sum value u_i(t) from the previous PE in the path at time t. It then adds its contribution (w_ij*a_j) to the partial sum value u_i(t) and passes the updated partial sum value u_i(t+1) on to the next PE in the path. A small sketch of this distributed accumulation appears below.
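As an illustration of this distributed accumulation, the following sketch arranges the PEs in the simple ring order used in Section 2.3.1 and rotates N partial sums through the array, each picking up one w_ij*a_j term per step. The array sizes and weights are arbitrary, and nothing here is tied to the DREAM Machine instruction set.

```python
import numpy as np

def ring_systolic_weighted_sums(W, a):
    """Distributed evaluation of u_i = sum_j w_ij * a_j on N ring-connected PEs.
    PE p permanently holds a[p]; the partial sum for neuron i starts at PE i and
    visits every PE exactly once, so all N paths proceed concurrently without
    two paths ever needing the same PE at the same time."""
    N = len(a)
    u = np.zeros(N)                       # partial sums, indexed by neuron
    location = np.arange(N)               # location[i] = PE currently holding u_i
    for _ in range(N):                    # N systolic steps around the ring
        for i in range(N):
            p = location[i]
            u[i] += W[i, p] * a[p]        # one multiply-accumulate per PE per step
        location = (location + 1) % N     # every partial sum moves to the next PE
    return u

rng = np.random.default_rng(2)
W = rng.normal(size=(5, 5))
a = rng.normal(size=5)
print(np.allclose(ring_systolic_weighted_sums(W, a), W @ a))   # True
```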
Due to the commutative nature of the addition operation, the computation paths required for evaluation of the partial sum values u_i of equation (2-2) can be constructed without any specific ordering of the processor traversals, as long as each path passes through all the PEs contributing to its partial sum value u_i. Moreover, multiple paths can be traversed concurrently as long as no two paths attempt to cross the same PE at the same time.

Figure 4-1 - Mapping neurons to PEs and constructing a computation path between the PEs.

4.1.2 The Algorithmic and Optimized Mapping Approaches

The mapping method described in the previous section shows how the weighted sum operation of equation (2-2) can be implemented in a distributed fashion. However, this general mapping does not explicitly specify the exact assignment of neurons to processors or the formation of computational paths through the processing array. The simplest method is to linearly assign the neurons to processors, as described in [42]. If the processors are arranged in a linear ring topology, creation of computational paths is simply accomplished by having each path start at a different PE and travel in the same direction around the ring. This guarantees conflict-free communications among the processors. Unfortunately, this method, as mentioned previously, can be very inefficient for implementing irregularly connected and sparse neural network structures.

In this dissertation, we propose two methods for efficiently arriving at assignments of neurons to processors and creating conflict-free computational paths. The first method is called the algorithmic mapping method. This method is similar in spirit to the linear ring mapping, except that it allows for the construction of multiple processing rings of variable size on the 2-D connected processing array of the DREAM Machine. This method is suitable for mapping neural networks with regular interconnection structures. It can also address sparsely connected neural networks efficiently as long as the interconnection structure is comprised of a number of densely connected blocks of neurons. The second mapping method, on the other hand, involves a combinatorial optimization procedure. Since the throughput of a mapping can be measured with respect to the specific assignment of neurons and computational paths, optimization methods can be used to find optimal or close to optimal mapping solutions. Therefore, a method for analytically measuring the throughput rate of a specific mapping is required. In Chapter 6, we present one such measure in the form of an energy cost function. A neural-type optimization method, called Constrained nets, is used to find solution mappings that are close to optimal, or at times optimal.

4.2 Mapping Neural Networks onto Fixed-Size Ring-Connected Architectures

A simple yet efficient mapping strategy for implementing densely interconnected blocks of neurons can be devised by linearly assigning each neuron to a single PE of a ring-connected processing array, as shown in Figure 4-2. In order to simplify the discussion, we describe the mapping for a single block of a two layer network consisting of only an input and an output layer. The extension of this method to multilayer networks will be addressed later in Section 4.2.3.
The basic mapping concept is to assign individual neurons, and all of the weights associated with their output synaptic connections, to unique processors. The mapping method requires that the PE assigned the first neuron of the block be a neighbor of the PE assigned the last neuron of the block. This restriction implies construction of a processing ring of size equal to the number of input layer neurons. In cases where the number of output layer neurons is greater than the number of neurons in the input layer, the input layer is padded with neurons having zero output values so that the input and output layers are of equal size.

For the sake of simplicity, we will assume that the number of processors in the system is greater than or equal to the number of neurons.

Figure 4-2 - Mapping a fully connected neural network onto a ring systolic processing array. Note: superscripts denote layer numbers.

In order to evaluate the output value of each neuron in the output layer, according to equation (2-1), the weighted sum of all the input layer neurons' output values must be calculated. As mentioned in Section 2.1.1, this is a distributed operation requiring data from each of the neurons assigned to different processors. Due to the inherently serial nature of the addition operation, for each output layer neuron we can construct a path through all the PEs participating in this summation operation by starting at a specific PE and traversing the complete ring, arriving back at the same starting PE. Each path is initiated in a different processor and rotates in the same direction around the ring, thus ensuring that no two paths ever cross the same PE at the same time.

At each time step t, the product w_ij*a_j is calculated by the j-th PE in the ring and added to the partial sum value u_i received from the preceding PE. The partial sum is accordingly modified as u_i(t+1) = u_i(t) + w_ij*a_j and passed to the next PE. After a complete traversal of the processing ring, the appropriate threshold value θ_i can be added to the partial sum u_i, followed by the application of the transfer function f. In this mapping, only a single systolic cycle of each processor is needed by each path to perform the multiply-accumulate operation; therefore, multiple paths can utilize the same PE, each at a different time. This feature allows for parallel execution of the various paths in the network. This mapping is equivalent to the one described in Section 2.3.1. Similar mapping approaches have been proposed in the literature for implementing neural network models on various parallel processors [55, 73, 89].

4.2.1 System Utilization Characteristics of the Mapping Onto the Fixed-Size Ring Architecture

Mapping neural networks onto ring-connected processing arrays is highly efficient when the number of neurons in the network is equal to the number of processors in the processing array. Formally, let N_l represent the number of neurons in layer l, and P represent the number of PEs in the processing array. Since P paths are traversed simultaneously, with each path performing P operations, the implementation has the potential for performing P^2 useful calculations.
4.2.1 System Utilization Characteristics of the Mapping onto the Fixed-Size Ring Architecture

Mapping neural networks onto ring-connected processing arrays is highly efficient when the number of neurons in the network equals the number of processors in the processing array. Formally, let N_ℓ represent the number of neurons in layer ℓ, and P the number of PEs in the processing array. Since P paths are traversed simultaneously, each path performing P operations, the implementation has the potential for performing P² useful calculations.

A two-layer neural network requires μ N_ℓ N_{ℓ-1} operations, where μ is the interconnection density between layers ℓ and ℓ-1:

    μ = (number of non-zero connections between layer ℓ and layer ℓ-1) / (N_ℓ N_{ℓ-1})          (4-1)

Assuming that the number of neurons in each layer of the neural network is less than or equal to the number of PEs in the processing ring, the system utilization factor η, the ratio of useful operations to total possible operations, is

    η = μ N_ℓ N_{ℓ-1} / P²          (4-2)

where μ N_ℓ N_{ℓ-1} is the number of non-zero synaptic connection weights in the neural network. The maximum system utilization (η = 1) is achieved when μ = 100% and N_ℓ = N_{ℓ-1} = P. System utilization η scales linearly with both the interconnection density and the ratio of neurons to PEs.

Should the number of neurons exceed the number of processors, virtual processors can be created using time-multiplexing: a single PE simulates multiple processors by devoting consecutive time slices to the processing associated with each virtual PE. The general form of the system utilization for a ring systolic array can thus be written as

    η = μ N_ℓ N_{ℓ-1} / (P² ⌈N_ℓ / P⌉ ⌈N_{ℓ-1} / P⌉)          (4-3)

where ⌈N_ℓ / P⌉ indicates the number of neurons of layer ℓ assigned to one physical processor.

The round-off factor introduced by the ceiling function can be minimized either by keeping the number of neurons per layer N_ℓ equal to an integer multiple of the processing array size P, or by having far fewer processors than neurons. With the round-off effects removed under these conditions, the system utilization becomes

    η = μ          (4-4)

This result indicates that, to ensure good system utilization regardless of the exact size of the neural network, we require dense interconnections between layers of neurons and a small number of processors. At the same time, we would like to exploit all the available parallelism in the network by having a large number of processors. This creates a dilemma in choosing the number of PEs for the system architecture: a large number of processors is required to effectively implement large neural networks, but their use causes great inefficiency when implementing networks of different sizes.

4.2.2 Execution Rate Characteristics of the Mapping onto the Fixed-Size Ring Architecture

Although system utilization is of interest in evaluating the efficiency of a particular implementation, the primary measure of performance is the execution rate. We now analyze the execution rate of the mapping and the effect of processor array size on system throughput. The number of cycles required to process a neural network using the above mapping on a ring systolic architecture equals the number of cycles needed to traverse a path (a complete circle around the processing ring) plus the additional cycles used to implement the neuron activation function. The total time for implementing a single block is

    T_ℓ = ⌈N_ℓ / P⌉ ⌈N_{ℓ-1} / P⌉ P k₁ + ⌈N_ℓ / P⌉ k₂          (4-5)

where k₁ is the time required to perform a multiply-accumulate operation and k₂ is the time required to implement the neuron transfer function.
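As a quick sanity check on equations (4-3) and (4-5), the sketch below evaluates the utilization and the per-block execution time for a few hypothetical layer sizes on a fixed ring of P processors; the values of k₁ and k₂ are placeholders, not measured cycle counts.

```python
from math import ceil

def utilization(mu, n_out, n_in, P):
    """System utilization of the fixed-size ring mapping, equation (4-3)."""
    return (mu * n_out * n_in) / (P * P * ceil(n_out / P) * ceil(n_in / P))

def block_time(n_out, n_in, P, k1=1, k2=1):
    """Cycles to process one block on a fixed ring of P PEs, equation (4-5)."""
    return ceil(n_out / P) * ceil(n_in / P) * P * k1 + ceil(n_out / P) * k2

P = 256
for n in (256, 257, 512, 1000):          # both layers of size n, fully connected
    print(n, round(utilization(1.0, n, n, P), 3), block_time(n, n, P))
# A fully connected 256-neuron layer uses the ring perfectly (eta = 1);
# at 257 neurons the ceiling terms drop utilization to roughly 0.25.
```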
Equation (4-5) also accounts for the time-multiplexing of multiple neurons on a single PE in cases where the processing array is smaller than the neural network. If P ≥ N_ℓ and P ≥ N_{ℓ-1}, the execution time is O(P). This implies that it is possible to slow down an implementation by increasing the number of PEs in the system. As the number of neurons in the network increases, we would like to increase the number of PEs to maintain high throughput. Unfortunately, with a large number of processors, the degrading effect of an unequal number of neurons and PEs on system utilization and throughput is proportionally larger. In Chapter 5, we outline a method for constructing variable-length rings on the DREAM Machine, which alleviates a number of these problems by matching the ring size to the problem size.

4.2.3 Mapping Multilayer Neural Networks onto the Fixed-Size Ring Architecture

Multilayer neural networks can also be implemented on 1-D ring systolic arrays using the mapping technique described above, by time-multiplexing the processing associated with each layer on the processor array. The computation associated with each layer of the network is performed consecutively on the ring array as the data flows from the input layer toward the output layer. This type of implementation leads to an execution time of O(LP), where L is the number of layers in the neural network structure. The exact execution time is

    T = Σ_{ℓ=2}^{L} T_ℓ          (4-6)

where T_ℓ is defined by equation (4-5). Pipelining techniques can be used, as proposed in [Kung, 1989 #14], to reduce the execution time back to O(P), given L ring arrays organized in a 2-dimensional mesh topology.

The efficiency of this type of implementation is strongly tied to the match between the number of neurons and layers and the size and structure of the processor array. The pipelined mapping requires the number of layers in the neural network to equal the number of rows (i.e., pipeline stages) in the processor array. Furthermore, the mapping assumes an equal number of neurons in all the layers. These are rather strong restrictions on the neural network interconnection structure for general-purpose implementations.

4.2.4 Deficiencies of the Mapping onto the Fixed-Size Ring Architecture

As mentioned previously, a major deficiency of the linear mapping method on fixed-size ring systolic architectures is the strong dependence of its efficiency on the match between the processor array size and the sizes of the neural networks being implemented. We have shown through equations (4-5) and (4-6) that the execution time of this mapping is O(P) when the processing array and the neural network are of the same order of magnitude. This scaling characteristic is not well suited to neural network applications, since the large number of neurons required for real-world applications will in turn require a large number of processors, which finally yields a long execution time.

If we assume full interconnection between all the neurons of a large real-world network, this mapping would remain viable, as it yields maximum utilization. However, current evidence, both from biological examples [3] and artificial neural networks [46], indicates that real-world systems are constructed of modular and sparsely interconnected networks.
This indicates that the fixed-ring mapping method, although useful for relatively small and densely interconnected neural networks, is not suitable as an implementation method for large-scale, general-purpose neural processing.

In addition to the low system utilization of the fixed-size ring systolic mapping, the interconnection topology of the processing array is also problematic. The ring systolic architecture uses a 1-D interconnection topology, which does not lend itself well to applications such as image processing and vision, where neural networks have been used extensively [19, 83]. A planar topology is a more natural architecture for such applications, where data is structured and manipulated in a 2-D format. In addition, locally connected 2-D planar architectures, such as the mesh, are a more natural match for VLSI implementation than the 1-D topology, due to the inherently two-dimensional nature of integrated circuits.

4.3 Initial Mapping Method for 2-D Connected Architectures

A mapping method suitable for 2-D VLSI implementations, based on the fixed-size ring method, has been demonstrated in [72]. The basic motivation for this mapping method was to extend the ring systolic mapping to mesh-connected processing arrays. In addition, we required the mapping method to have small local memory requirements so that it could be used with fine-grain parallel architectures.

The basic concept of this mapping involves the use of batch-mode processing, where different instantiations of a complete network are mapped onto different rows of a 2-D processor array; see Figure 4-3. Each row of the array is treated as a single 1-D ring of processors employing a mapping similar to the one described above in Section 4.2. The computation performed in each row is similar to that of the mapping in [42], except that the weight values are communicated between the various networks through the vertical connections of the processing array.

Figure 4-3 - Exploiting network-level and neuron-level parallelism on a 2-D mesh-connected systolic architecture.

This mapping is memory efficient since only a single copy of the interconnection weights is stored and used by all the networks in the various rows of the array. Computation associated with each layer of the neural network is performed sequentially in each row. This imposes no constraints on the number of layers in the neural network, but the mapping nevertheless requires an equal number of neurons per layer and full interconnection between neurons of adjacent layers. Very high execution rates have been reported using this type of mapping approach for densely connected neural networks [73, 89].

In comparison to the fixed-size 1-D ring systolic mapping described in Section 4.2, the mapping method presented here is suitable for a large class of neural networks and for fine- to medium-grain, two-dimensional, mesh-connected SIMD arrays. However, it is described in detail here for implementing the multilayer perceptron neural network with back-propagation learning. The performance of the method is illustrated using the Nettalk neural network and the Hughes Systolic/Cellular coprocessor [62], in order to allow comparison with other implementations.
In this section we describe the implementation of the data processing in the recall and learning phases of the multilayer perceptron (MLP) network with back-propagation learning (also called the back-prop model) on a 2-D mesh array architecture. The MLP network consists of several layers of neurons: the first layer is called the input layer, the last is called the output layer, and all the layers between them are called hidden layers. There are no synaptic connections between neurons on the same layer, but each neuron is assumed to be connected to all neurons in the adjacent layers.

4.3.1 Computational Aspects of the Back-Propagation Algorithm

The back-prop model operates in two distinct phases. One is the recall phase, in which a training pattern is presented to the input layer of the network and a corresponding output is recalled at the output layer. The other is the learning phase, in which the network adjusts its synaptic weights in order to minimize the error between the recalled pattern and the correct pattern supplied by a supervisor. The operation of each neuron during the recall phase is shown in Figure 4-4, where a denotes the neuron activation value, w the synaptic weight value, θ the neuron threshold value, and f the nonlinear neuron activation function. In short, the function of each neuron during the recall phase is to calculate the weighted sum of all its inputs and apply the nonlinear activation function to this sum to generate the neuron's output value:

    U_i(ℓ) = Σ_j a_j(ℓ-1) w_ij(ℓ-1) + θ_i(ℓ),    a_i(ℓ) = f(U_i(ℓ))

Figure 4-4 - Recall phase (forward pass) processing of neuron i on layer ℓ of the multilayer perceptron neural network.

Similarly, the processing involved in the learning phase is depicted in Figure 4-5, where δ denotes the error term and f' is the first derivative of the neuron activation function. The error terms for neurons on the output layer (ℓ = L) are calculated as

    δ_i(L) = f'(U_i(L)) (t_i - a_i(L))          (4-7)

where t_i denotes the desired output value of neuron i supplied by the supervisor. Whereas in the recall phase the neuron activation values are propagated forward through the network, the error values are propagated backwards during the learning phase. During the backward propagation of errors, the synaptic weight values are modified according to

    w_ij(ℓ) ← w_ij(ℓ) + η δ_i(ℓ+1) a_j(ℓ)          (4-8)

Figure 4-5 - Learning phase (error back-propagation) processing for neuron i on layer ℓ of the back-propagation algorithm.

    δ_i(ℓ) = f'(U_i(ℓ)) Σ_j δ_j(ℓ+1) w_ji(ℓ)    for ℓ < L          (4-9)

In the remainder of this section we introduce a mapping method for efficient implementation of the operations in the recall and learning phases of the back-prop model on a 2-D mesh-connected SIMD architecture. For the sake of simplicity, let us assume for now that the neural network under consideration has the same number of neurons in each layer, and that this number is equal to the number of processors in a single row of the processor array. This assumption will be dropped later.
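A compact reference implementation of equations (4-7) through (4-9) is sketched below for a small, hypothetical MLP with two weight layers. It is a plain sequential version of the mathematics, useful for checking the output of a parallel mapping; it is not a description of the systolic implementation itself. As in the mapping described later, the error of a lower layer is computed from the old (pre-update) weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recall(a_in, Ws, thetas):
    """Forward pass: U_i = sum_j a_j w_ij + theta_i, a_i = f(U_i)."""
    acts = [a_in]
    for W, theta in zip(Ws, thetas):
        acts.append(sigmoid(W @ acts[-1] + theta))
    return acts

def learn(Ws, acts, target, eta=0.1):
    """Backward pass implementing equations (4-7) through (4-9) in place."""
    a_out = acts[-1]
    delta = a_out * (1 - a_out) * (target - a_out)       # (4-7), f'(U) = a(1-a)
    for k in reversed(range(len(Ws))):
        if k > 0:
            a = acts[k]
            delta_below = a * (1 - a) * (Ws[k].T @ delta)  # (4-9), old weights
        Ws[k] += eta * np.outer(delta, acts[k])            # (4-8)
        if k > 0:
            delta = delta_below

# Hypothetical 4-3-2 network, exercised for one recall/learning cycle.
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
thetas = [np.zeros(3), np.zeros(2)]
acts = recall(rng.normal(size=4), Ws, thetas)
learn(Ws, acts, target=np.array([0.0, 1.0]))
```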
4.3.2 Mapping Details

In our method, a different input pattern for the entire neural network is mapped onto each row of the processor array, as shown in Figure 4-3. Thus, each row processes data for one complete network, starting from the first layer and continuing sequentially through consecutive layers until the outputs are obtained. At the beginning of the recall phase, the processors contain, in their local memory, the inputs for the network, one input pattern per row and one item per processor; see Figure 4-6. The indices of the three different sets of inputs/activation values appear as superscripts, subscripts denote different neurons in a given layer, and the layer index is given in parentheses. The shaded squares represent processors that are disabled (masked) in the given cycle.

Figure 4-6 - Data organization in the processor array during the first cycle of the recall phase. Only processors in the first row are enabled.

The different input patterns are transformed into outputs of the first hidden layer of the networks by processing that involves a vertical (North to South) flow of the synaptic weights w_ij and a horizontal (West to East) flow of the partial sums U_i. A single shift of the weights to the South is followed by a single shift of the partial sums to the East, until the partial sums make a full circle around the row. One processing cycle, for the jth processor, consists of the following operations: (1) receiving the weight value w_ij from the North neighbor while sending the weight value received in the previous cycle to the South neighbor; (2) receiving the partial sum U_i from the West neighbor while sending the old U_i to the East neighbor; (3) multiplying the incoming w_ij by a_j, which is stored locally, and adding the product to the incoming U_i; see Figure 4-7:

    U_i(t+1) = U_i(t) + a_j w_ij(t),    w_ij(t+1) = w_ij(t)

Figure 4-7 - Processing in each processor of the array during the recall phase.

The weight values are organized in a skewed fashion in the global memory. This organization ensures that the correct weight value w_ij and partial sum U_i meet in the same processor at the proper time. Figure 4-8 shows the two iterations following the initial cycle of Figure 4-6.

Figure 4-8 - Next two cycles of the data flow through the array after the initial cycle shown in Figure 4-6.

In this procedure, the vertical flow brings the same weights into contact with the different input patterns in consecutive rows of the array. This is possible because the weight values are identical for all the networks in the array. Thus two levels of parallelism are achieved: complete networks are processed concurrently, in pipeline fashion, in the different rows of the processor array, and each network is processed in parallel by the different processors of its row. After all the weight values of a given layer have passed through the processor array and the threshold values θ_i have been added to the partial sums U_i in each row, the computation of the activation function takes place in unison in all the processors. At the completion of this operation, the activation values a(ℓ) for layer ℓ are available in each processor for computing the activation values of the next higher layer ℓ+1.
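The per-cycle operation of Figure 4-7 can be checked with a short row-level simulation. The skew used below (the partial sum with index i = (j - t) mod N resides at processor j in cycle t) is one consistent choice, assumed for illustration; it reproduces the weighted sums of the recall phase for each row of patterns independently.

```python
import numpy as np

def row_recall(W, a, theta, f):
    """Simulate one row of the mesh during the recall phase (Figure 4-7).

    Processor j holds a[j]; partial sums shift one PE per cycle along the row
    while a skewed stream of weights arrives from the north, so that w_ij
    reaches processor j exactly when U_i is resident there.
    """
    N = len(a)
    U = np.zeros(N)
    for t in range(N):
        for j in range(N):
            i = (j - t) % N          # index of the partial sum now at PE j
            U[i] += a[j] * W[i, j]   # U_i(t+1) = U_i(t) + a_j * w_ij
    return f(U + theta)

rng = np.random.default_rng(2)
W, theta = rng.normal(size=(5, 5)), rng.normal(size=5)
patterns = rng.normal(size=(3, 5))   # three rows, i.e. three input patterns
outs = np.array([row_recall(W, a, theta, np.tanh) for a in patterns])
assert np.allclose(outs, np.tanh(patterns @ W.T + theta))
# In the real array the three rows run one cycle apart, sharing the same
# weight stream as it moves south; here each row is simulated separately.
```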
Concurrent with the propagation of weights through the array, the final U_i values in each row are propagated through the array and stored in the global memory for later use during the learning phase. This process is repeated until the activation values of the output-layer neurons have been evaluated.

It is apparent from Figures 4-6 and 4-8 that the processing of multiple networks in the processor array is performed in a pipelined fashion. The organization of data in memory is structured such that the pipeline flushing time is overlapped with the loading of values for the next operation. This method is used to efficiently implement networks larger than the processor array, by loading and unloading different segments of the network for processing.

Figure 4-9 - Data organization in the processor array during the first cycle of the learning phase. Only processors in the first row are enabled.

During the learning phase, we also assign the computation of an entire network to each row of the processor array. This leads to a similar data flow and an identical organization of the activation and weight values, so the transition from the recall phase to the learning phase does not require data reorganization. The activation values a_j, the accumulated sums Acc_j, and the learning rate η are stored in each processor's local memory; see Figure 4-9.

The weights w_ij travel from North to South, as before, and the error values δ_i move from West to East; see Figure 4-9. Two sets of weights are transferred concurrently, the old and the new, where the new weights ω_ij are computed on the fly according to the learning (weight-update) formula of Figure 4-10. The old weights w_ij are the same weight values used during the recall phase. Treating the new and old weights separately in this fashion ensures consistency in learning between the multiple networks executing in different rows of the processing array.

Figure 4-10 - Processing in each enabled processor of the array during the learning phase.

The processing produces the modified synaptic weights ω and the error values δ for each consecutive layer of the network, starting from the output layer down to the input layer. At each cycle, the old weight value w_ij, received from the North port, is multiplied by the error term δ_i, received from the West port, and added to the partially accumulated sum Acc_j held in the processor's local memory; see Figure 4-10. The error term δ_i is also multiplied by the neuron activation value a_j and the learning rate η, both stored in the processor's local memory, to calculate the weight modification value. The new weight value ω_ij, received from the North port, is updated by adding to it the weight modification value just produced in the processor:

    Acc_j(t+1) = Acc_j(t) + δ_i w_ij(t),    ω_ij(t+1) = ω_ij(t) + η δ_i a_j

This procedure implements the generalized delta rule learning law (as shown in Figure 4-5). The next two cycles of the data flow following the initial cycle of Figure 4-9 are illustrated in Figure 4-11. At the completion of updating all the weight values between layers ℓ and ℓ+1, the error term δ(ℓ) for layer ℓ is calculated according to

    δ_i(ℓ) = f'(U_i(ℓ)) Acc_i    for ℓ ≠ L
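The learning-phase cycle of Figure 4-10 can be sketched in the same style as the recall simulation above. Again, the skew (the error term with index i = (j - t) mod N is at processor j in cycle t) is an assumed, illustrative schedule; the final check confirms that the accumulated sums and the on-the-fly weight updates match equations (4-8) and (4-9).

```python
import numpy as np

def row_learn(W, a, delta, eta=0.1):
    """Simulate one row of the mesh during the learning phase (Figure 4-10).

    Processor j keeps a[j] and Acc_j locally; old weights w_ij stream from the
    north while the error terms delta_i circulate along the row.  New weights
    omega_ij are produced on the fly and passed along with the old ones.
    """
    N = len(a)
    acc = np.zeros(N)               # Acc_j, one per processor
    omega = W.copy()                # new weights, updated as they stream by
    for t in range(N):
        for j in range(N):
            i = (j - t) % N                       # error term now at PE j
            acc[j] += delta[i] * W[i, j]          # Acc_j += delta_i * w_ij
            omega[i, j] += eta * delta[i] * a[j]  # omega_ij += eta*delta_i*a_j
    return acc, omega

rng = np.random.default_rng(3)
W, a, delta = rng.normal(size=(4, 4)), rng.normal(size=4), rng.normal(size=4)
acc, omega = row_learn(W, a, delta)
assert np.allclose(acc, W.T @ delta)                     # Acc_j of equation (4-9)
assert np.allclose(omega, W + 0.1 * np.outer(delta, a))  # update of equation (4-8)
```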
Figure 4-11 - Next two cycles of the data flow through the array after the initial cycle shown in Figure 4-9. Shaded processors are disabled from performing calculations.

The accumulated sum Acc_i is calculated concurrently while the weights are being updated. The term f'(U_i(ℓ)) is computed by receiving the partial sum U_i(ℓ) values from the North port and applying the f' function in unison in all the processors. The partial sum values are those calculated during the recall phase and stored in the global memory for use in the learning phase.

4.3.3 Implementing Neural Networks Larger than the Processor Array Size

The neural networks used in practice have layers of different sizes, and a layer often contains more neurons than there are processors in a row of a medium-grain processor array. It is therefore essential to address data partitioning as part of this mapping method. Partitioning in our method is implemented in such a way that the required local memory of the processors can be very small and independent of the network size.

Figure 4-12 - Network partitioning for implementation on a 3 x 3 processor array.

If a given layer is smaller than the rows of the processor array, some of the processors contain phantom neurons with activation values that are always zero. The processing in this case is unchanged, except that the weights between the real neurons and the phantom ones are set to zero. Layers that are larger than the rows of the processor array must be processed in fragments of size smaller than or equal to the row size. This procedure essentially creates virtual processors through time-multiplexing. The order in which each fragment is processed in the recall phase is shown in Figure 4-12. In this example the processor array is 3 x 3 (as in Figures 4-6 and 4-9) and there are 7 neurons in layer 1. The activation values of the first 3 neurons of layer 1 (a_1^i(1) through a_3^i(1), where i is the network index) are initially loaded into the processor array. The processing is performed as described earlier by propagating through the array the 9 (i.e., 3²) weight values corresponding to the connections between neurons a_1(1), a_2(1), a_3(1) of layer 1 and neurons a_1(2), a_2(2), a_3(2) of layer 2, during which the partial sum values U_i are accumulated. The activation values of the next 3 neurons of layer 1 (a_4(1) through a_6(1)) are then loaded, concurrently with the tail end of the previous processing cycle, in pipeline fashion. The next batch of 9 weight values, for the connections between neurons a_4(1), a_5(1), a_6(1) of layer 1 and neurons a_1(2), a_2(2), a_3(2) of layer 2, is then propagated through the processor array as before, and the computation of the partial sums continues. Finally, the last neuron of layer 1, a_7(1), is loaded and processed. After all the neurons of layer 1 have been processed in this fashion and the threshold values have been added to the partial sums, the neuron activation function is applied to calculate the activation values of the 3 neurons of layer 2 (a_1(2), a_2(2), and a_3(2)).
If there were more than 3 neurons on layer 2, the activation values just calculated for neurons a_1(2), a_2(2), and a_3(2) would be unloaded to the global memory and the above process would be repeated to evaluate the activation values of the next 3 neurons on this layer. This processing is repeated until the activation values of all the neurons on the output layer have been calculated. A similar partitioning procedure is used to perform the back-propagation of the error values through the network (learning phase). In this case, segmentation starts from the output layer and proceeds toward the input layer, the reverse of the direction used in the recall phase.

The pipelined processing employed in this mapping method allows efficient loading and unloading of activation values from the processor array in this partitioning scheme. Since these operations almost completely overlap with the emptying and refilling of the processing pipeline, only a small overhead results from processing the network in small partitions.

4.3.4 Implementing Nettalk on the Hughes SCAP Architecture

As described earlier, the mapping method of this section is suitable for machines with 2-dimensional, mesh-connected SIMD architectures. An example of such a machine is the Systolic/Cellular coprocessor designed and developed at the Hughes Research Laboratories [62]. The architecture of this machine consists of a 16-by-16 array of processors controlled by a single controller in SIMD fashion; see Figure 4-13.

Figure 4-13 - System architecture of the Hughes Systolic/Cellular Coprocessor.

The processors are connected to four of their nearest neighbors. Processors on the boundary columns are connected to each other through wrap-around connections. Each processor contains a small local memory (24 words) and seven 32-bit, fixed-point functional units (two multipliers, two adders, a divider, and a comparator), which can all compute in parallel. A 2K-word dual-ported memory is used as a data queue; this dual-ported data memory can be accessed in parallel by all the processors in the top and bottom rows of the array.

The mapping method was tested by implementing the well-known Nettalk neural network [Sejnowski, 1987 #85] on the Systolic/Cellular machine. The Nettalk network is a good example of a working neural network: it is large enough to require partitioning for our target machine, and it has been used by others to benchmark their implementations of neural network mappings [5, 60]. Nettalk is a three-layer feed-forward model which learns to pronounce written English text through the back-propagation learning algorithm. The network receives seven letters as input, and its task is to generate the correct phoneme corresponding to the middle letter. The input to the network is through 203 neurons, 7 groups of 29 possible input characters. The hidden layer contains 60 neurons and the output layer contains 29 neurons. Each layer is fully connected to its neighboring layers, which yields a total of 13,826 connections in the network.

In our implementation, 16 input patterns execute concurrently, one per row of the processor array. Two neurons were mapped onto a single processor to take advantage of the multiple functional units available in each processor.
For example, the double-add and double-multiply instructions were used to multiply two activation values by their corresponding weights and add the results to the partial sum values in parallel within a single processor. It is apparent from the mapping algorithm that there are approximately equal numbers of computation and communication operations. This characteristic allows efficient and simple overlapping of computation and communication. In our implementation of the Nettalk network on the Hughes Systolic/Cellular processor, only 26.9% of the total processing time was attributable to communication operations that were not overlapped by computation.

Additional parallelism was exploited by overlapping computation operations that require more than a single instruction cycle with communication operations. For example, a North-to-South data transfer in all 16 processors of a row requires 3 cycles (when accessing global memory), and a multiplication operation takes 7 cycles. A multiplication was initiated in the multiplier and, while it was being processed, the two operands for the next instruction were fetched from the North neighbor concurrently. In this fashion, the two communication operations contributed no additional time to the implementation.

The implementation of cellular operations, which are executed in unison across all the processors of the array, is not specified by this mapping; these operations may be implemented by whatever method is most suitable for a given target machine. A major portion of these operations is the implementation of the neuron activation function. The activation function used in most back-prop implementations, including Nettalk, is the sigmoid function

    f(x) = 1 / (1 + e^(-x))          (4-9)

In our implementation, this function was realized using the exp(x) function, which in turn was calculated by means of a range-reduction technique [16]. This implementation of the activation function was chosen after an extensive evaluation of other methods. One such method is a table look-up scheme. Table look-up offers fast response (one memory-read instruction) but is impractical on the Hughes Systolic/Cellular machine, since it requires indirect addressing within each processor's local memory and a relatively large local memory, neither of which is available on the target machine.

4.3.5 Performance Evaluation of the Mapping Method

A measure of performance which is becoming a standard for neural networks is Million Connections Per Second (MCPS). This metric measures the feed-forward processing rate of a given implementation for a particular neural network. Another measure, Million Connection Updates Per Second (MCUPS), is used to evaluate the performance of an implementation processing both the feed-forward (recall) and feed-back (learning) phases of a neural network. The execution time of a complete recall and learning cycle, including all data loading and unloading operations, was used to arrive at a 100 MCUPS performance for our implementation of Nettalk on the Hughes Systolic/Cellular system. Considering only the recall processing, our implementation achieves 240 MCPS.
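The two metrics are straightforward to compute. The helper below does so, and also inverts the reported Nettalk figures to show the per-pattern times they imply; the timing values printed here are derived from the quoted rates, not separately measured.

```python
def mcps(connections, recall_seconds):
    """Million connections processed per second (recall only)."""
    return connections / recall_seconds / 1e6

def mcups(connections, recall_plus_learn_seconds):
    """Million connection updates per second (recall plus learning)."""
    return connections / recall_plus_learn_seconds / 1e6

NETTALK_CONNECTIONS = 13_826

# Per-pattern times implied by the reported 240 MCPS and 100 MCUPS figures.
recall_time = NETTALK_CONNECTIONS / (240 * 1e6)      # about 58 microseconds
full_cycle_time = NETTALK_CONNECTIONS / (100 * 1e6)  # about 138 microseconds
print(f"recall: {recall_time * 1e6:.1f} us -> {mcps(NETTALK_CONNECTIONS, recall_time):.0f} MCPS")
print(f"cycle : {full_cycle_time * 1e6:.1f} us -> {mcups(NETTALK_CONNECTIONS, full_cycle_time):.0f} MCUPS")
```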
Table 4-1 combines the comparative results for this mapping method on the Systolic/Cellular Coprocessor with the mappings from [41] implemented on the same machine, two other special-purpose implementations on parallel machines, the Warp [60] and the Connection Machine [5], and a workstation, a Sun 3/160 with a floating-point accelerator, as reported in [5]. The minimum local memory requirements for systems other than the Hughes Systolic/Cellular are estimates arrived at under the assumption that no loading/unloading operations from an external memory are allowed. The performance of the Connection Machine implementation was estimated by multiplying the performance reported in [5] by two, to account for expected improvements due to code optimization. The Warp implementation used 32-bit floating-point data with a floating-point adder and multiplier and an integer ALU. The Connection Machine implementation used a bit-serial ALU per processor. The data type used in our implementation was 32-bit integer with integer arithmetic functional units.

                                  SYS/CELL   SYS/CELL      WARP    CM-1    SUN 3/160
                                  (HRL)      (S.Y. Kung)           (16K)   + F.P.A.
    Processing rate (MCPS)        100        18            17      7       0.034
    Minimum local memory (words)  10         29            6182    5       14000

Table 4-1 - Comparison of different implementations of the Nettalk neural network based on throughput and local memory size requirements.

4.3.6 Applicability of the Mapping Method

The execution procedure described in this section can be applied to many layered feed-forward neural networks. Even though this mapping method uses a SIMD architecture, the activation function used for the neurons can vary from layer to layer. Moreover, limited variability is also possible within a layer. For example, if a polynomial approximation is used to compute the neuron activation function, different polynomials can be implemented in different neurons at the same time by storing different coefficients in each processor's local memory. If the network requires significantly different activation functions within the same layer, this can be accomplished by sequentially applying the desired functions and disabling/enabling the appropriate neurons. Since the learning rate η is stored in the local memory of each processor, different neurons in the network can, if desired, use different learning rates with no effect on the performance of our mapping.

Neural networks with feedback architectures (such as the Bidirectional Associative Memory [38] and the Hopfield net [33]) can be implemented with the same type of architecture and style of mapping. The basic computation involved in processing these networks can be represented as matrix and vector calculations, as described in Chapter 2. The data organization and movement through the array in our mapping implement these operations very efficiently. The cellular operations in the algorithm, such as the activation function evaluation, can be changed to any other type of cellular operation without any effect on the inter-processor communications. This allows great flexibility in the models that can be implemented with this type of mapping.

4.3.7 Deficiencies and Shortcomings of the Mapping Method

Although this mapping method can achieve very high processing speeds, as demonstrated in Section 4.3.5, it nevertheless has shortcomings similar to those of the mapping onto fixed-size 1-D ring architectures discussed in Section 4.2.4.
Namely, the mapping requires a good match between the number of columns in the processing array and the number of neurons per layer of the neural network. In addition, although the processing array is organized in a 2-D topology, the mapping method treats the processors as a pipeline of 1-D ring processors. Such a mapping cannot take advantage of the 2-D nature of the processor array when manipulating 2-D data structures such as images. Because of these limitations, the DREAM Machine architecture has been proposed, and various mapping methods have been developed to remedy these problems. The mapping methods are described in detail in the following two chapters.

Chapter 5
The Algorithmic Mapping Method

Mapping methods for implementing regularly structured neural network models can take advantage of the regularity in their interconnection structures and use a systematic, direct approach to solving the assignment and scheduling problems mentioned in Section 4.1.2. In this chapter we describe such an algorithmic approach, used to arrive at efficient mappings of regularly connected neural networks with dense interconnection structures onto the DREAM Machine architecture. This mapping method is based on the processing principles described in Section 4.1 and can be considered an extension of the mapping methods for fixed-size ring architectures described in Sections 4.2 and 4.3.

5.1 Applicability of the Algorithmic Method

As mentioned in Chapter 4, a major problem with implementations on 1-D ring architectures is the requirement of having an equal number of neurons and processors in the ring to achieve maximum efficiency. The flexibility of the DREAM Machine's communication network allows variable-length 1-D rings to be embedded in the machine's 2-D processor array topology [71]. Therefore, the size of the processing ring can be varied to match the number of processors in the ring to the number of neurons in the specific layer of the neural network being processed. With such an approach, fine-grain implementations with large numbers of processors can be used to implement neural networks of any size. Furthermore, neural network models comprised of multiple densely connected blocks of neurons, with limited interconnections between the blocks, can be implemented efficiently by processing multiple rings on the machine concurrently.

In summary, the mapping method is applicable to neural network structures with single or multiple layers of neurons, where each layer can be comprised of one or more blocks. The mapping method produces the necessary assignment of neurons to processors and establishes the associated computational paths through the processing array. This is accomplished by employing a processing-ring approach, similar to the one described in Chapter 4, linearly assigning neurons to the PEs of a ring and scheduling the paths as complete traversals of the ring.

5.2 Implementation of Variable-Size Processing Rings on the DREAM Machine

With the ability to vary the ring size to match the neural network size, the system throughput of the DREAM Machine can be maintained at a high level compared to that of the fixed-size ring array.
With a processing array of P processors, a ring of size R can be embedded onto the 2-D interconnection topology by folding the ring into a "snake-like" shape; see Figure 5-1. This is accomplished by setting the X-net switches of each processor in the DREAM Machine into the specific configuration required by the folding. Figure 5-1(a) shows how a ring topology can be created on the DREAM Machine when the ring length is less than the number of PEs. Three disjoint rings of differing lengths are depicted in Figure 5-1(b). Because of the SIMD processing employed by the DREAM Machine, when multiple rings of varying sizes are implemented concurrently, the processors assigned to a specific ring are enabled only during the cycles required to complete that ring's traversal. Thus, in such a scheme, the number of cycles required to process all the rings is determined by the length of the longest processing ring.

Figure 5-1 - Using the reconfigurable switches of the DREAM Machine to construct circular rings on the processing array. (a) A single ring of size 59. (b) Three disjoint rings of different sizes executed in parallel.

In cases where the neural network size exceeds the number of processors in the processing array, time-multiplexing techniques can be employed, as before, to create virtual PEs. A further advantage of the variable-size ring mapping can be demonstrated by examining the effect of time-multiplexing on system efficiency. Creating virtual processors on the fixed-size ring architecture involves assigning multiple neurons to different time slices of a single processor. For example, if a neural network with 257 neurons in a particular layer is to be processed on a processing array of 256 processors, two consecutive time slices of each processor must be used, creating a virtual ring of 512 PEs. The execution time of the implementation is correspondingly O(512), requiring 255 unnecessary operations. With the variable-size ring approach of the DREAM Machine, on the other hand, a virtual ring of 258 PEs can be constructed by using two consecutive time slices of each processor in a ring of 129 physical PEs. This mapping introduces only one unnecessary operation due to round-off, roughly halving the execution time relative to the fixed-size ring.
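The arithmetic behind this comparison is captured by the small helper below. The rule used for the variable ring, R = ⌈N / ⌈N/P⌉⌉, is one natural way to realize the scheme just described (it is consistent with the 257-neuron example); the helper reports the virtual-ring length and the wasted slots for both approaches.

```python
from math import ceil

def fixed_ring(N, P):
    """Fixed-size ring of P PEs: virtual ring length and wasted slots."""
    slices = ceil(N / P)
    virtual = slices * P
    return virtual, virtual - N

def variable_ring(N, P):
    """Variable-size ring: pick the shortest ring whose virtual length covers N."""
    slices = ceil(N / P)            # time slices needed per PE
    R = ceil(N / slices)            # physical ring length, R <= P
    virtual = slices * R
    return R, virtual, virtual - N

P, N = 256, 257
print(fixed_ring(N, P))       # (512, 255): 255 wasted multiply-accumulates
print(variable_ring(N, P))    # (129, 258, 1): a 129-PE ring, one wasted slot
```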
5.3 Using Variable-Length Rings to Process a Single Layer

The ability to construct variable-length rings on the 2-D processing array can be used efficiently for neural network processing. In Chapter 2 we described the basic mapping method for implementing neural networks on 1-D ring-connected architectures, and we noted how a mismatch between the number of processors and the number of neurons can lead to inefficient processing. In this section we analyze the performance of the mapping method given the ability to adjust the processing ring length.

First, let us discuss the use of variable-length processing rings for implementing one- and two-layer neural networks. Extensions of this approach to multilayer and block-structured networks are presented later in this chapter.

Let P represent the total number of processors in the machine, and define R_ℓ as the number of PEs assigned to the processing ring used to compute the neuron output values of layer ℓ. For example, in Figure 5-1(a), P = 64 and R = 59. We further define v_ℓ as the number of output-layer neurons assigned to each PE of the ring of length R_ℓ, and ω_ℓ as the number of input-layer neurons assigned to each PE of the ring. In the case of one-layer neural networks, such as the Hopfield net [33], all neurons are treated as belonging to both the output and the input layer. Formally, v_ℓ and ω_ℓ are defined as

    v_ℓ = ⌈N_ℓ / R_ℓ⌉          (5-1)

    ω_ℓ = ⌈N_{ℓ-1} / R_ℓ⌉          (5-2)

The v_ℓ and ω_ℓ values account for the need to time-multiplex multiple neurons on a single processor. In general, if the number of processors in the machine is larger than the number of neurons, both v_ℓ and ω_ℓ equal one and their multiplicative effects can be removed from our equations. Using this notation, equation (4-5), which gives the time required to evaluate the output values of the neurons of layer ℓ, can be rewritten for the variable-size ring implementation as

    T_ℓ = v_ℓ ω_ℓ R_ℓ k₁ + v_ℓ k₂,    where 1 ≤ R_ℓ ≤ P          (5-3)

Note that the execution time is no longer a function of the processor array size; it depends only on the number of processors used in the ring and the number of neurons assigned to each processor of the ring. For simplicity, in the examples in this chapter, unless otherwise specified, we assume that N_ℓ ≤ R_ℓ and N_{ℓ-1} ≤ P, so that v_ℓ = 1 and ω_ℓ = 1.

In general, for neural network structures with no regular sparse structure between two layers of neurons, the appropriate ring length can be determined as

    R_ℓ = ⌈N_ℓ / ⌈N_ℓ / P⌉⌉          (5-4)

This equation takes into account the need for time-multiplexing multiple neurons on a single processor when the neural network size is larger than the machine size. Figure 5-2 shows R_ℓ as a function of the number of neurons for a fixed processor array size of 256 PEs. As the ratio of the number of neurons to the number of processors increases, R_ℓ asymptotically approaches P. Formally,

    N_ℓ / P ≫ 1  ⟹  R_ℓ → P          (5-5)

This indicates that, in order to benefit from the variable-length ring feature of the architecture, the machine size should be of the same order as the neural network size. Consequently, this architecture favors fine-grain implementations with large numbers of processors.

Figure 5-2 - Plot of R_ℓ vs. N_ℓ with P = 256 processors.

As mentioned previously in Section 4.2.4, the mapping onto fixed-size ring architectures can cause great inefficiency if the number of processors in the machine is considerably larger than the neural network size. Such mappings are therefore more applicable to medium-grain architectures, where the inefficiency in implementing networks smaller than the machine and the limited speedup from not exploiting all the available parallelism in large networks are somewhat balanced. Figure 5-3 shows a plot of the execution times of both the fixed-size ring and variable-size ring implementations.
It can be observed that, as long as the number of neurons is smaller than the machine size, the variable-length ring approach maintains performance close to optimal, whereas the fixed-size ring approach exhibits a constant, worst-case execution time.

Figure 5-3 - Plot of T_ℓ vs. N_ℓ for the fixed-size ring mapping defined by equation (4-5) and for the variable-size ring mapping defined by equation (5-3). The parameters k₁ and k₂ are assumed to be 1 in order to simplify the comparison.

5.4 Implementing Multilayer Neural Networks

For one- and two-layer neural networks with no regular block interconnection structure (e.g., the Hopfield net [32] and the BAM network [38]), a single ring is sufficient to implement the network efficiently. For multilayer neural networks (e.g., the multilayer perceptron model [66]), where neuron output values are propagated through consecutive layers of the network, a more elaborate ring structure can be designed which dynamically matches the number of PEs in the ring to the number of neurons in each layer of the network being processed. This is accomplished by dynamically changing the ring size as the computation moves from layer to layer, with R_ℓ = max(N_ℓ, N_{ℓ-1}) for each layer ℓ. As stated earlier, for simplicity we assume that there are enough processors in the system so that N_ℓ ≤ P and N_{ℓ-1} ≤ P.

As an example, a three-layer neural network with N₁ > N₂ > N₃ (see Figure 5-4) can be mapped efficiently onto a ring of length N₂ embedded in a longer ring of length N₁, as shown in Figure 5-5. In the first N₁ cycles of processing, the partial sum values u are propagated around the large loop of length N₁. At the completion of this phase, the final u values for the N₂ neurons of the second layer are available in the first N₂ PEs of the ring. All the processors then apply the neuron activation function in unison to evaluate the second-layer neurons' output values. Since the large ring of length N₁ is folded so that PE #1 and PE #N₂ are adjacent, the communication switch between these PEs can be reconfigured to implement a ring of size N₂; see Figure 5-5.

Figure 5-4 - A three-layer neural network with N₁ > N₂ > N₃.

Figure 5-5 - An embedded ring structure containing a ring of size N₂ embedded in another ring of size N₁.

After N₂ processing cycles, the first N₃ PEs of this ring hold the final u values of the neurons in the output layer. After the application of the transfer function, the network output values are ready to be stored in memory for further processing or to be accessed by the host computer. The total time required to complete the computation of a network with L layers is

    T = Σ_{ℓ=2}^{L} T_ℓ          (5-6)

With v_ℓ = 1 and ω_ℓ = 1, the total time to complete the network computation can be calculated using equation (5-3) as

    T = k₁ Σ_{ℓ=2}^{L} R_ℓ + (L - 1) k₂          (5-7)

In cases where the number of neurons in one layer is less than the number of neurons in the next higher layer (N_{ℓ-1} < N_ℓ), a similar ring-embedding structure can be created. In such cases, layer ℓ-1 can be padded with zero-valued neurons in order to make N_{ℓ-1} = N_ℓ. This results in the creation of only a single ring of size N_ℓ PEs. In general, the ring length needed to implement the processing associated with layer ℓ is

    R_ℓ = max(N_ℓ, N_{ℓ-1})          (5-8)
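A short helper makes the layer-by-layer ring schedule and the total time of equation (5-7) concrete; the three-layer sizes below are hypothetical and simply satisfy N₁ > N₂ > N₃, as in Figure 5-4.

```python
def ring_schedule(layer_sizes):
    """Ring length per processed layer, equation (5-8): R_l = max(N_l, N_{l-1})."""
    return [max(n, n_prev) for n_prev, n in zip(layer_sizes, layer_sizes[1:])]

def total_time(layer_sizes, k1=1, k2=1):
    """Total recall time of equation (5-7), assuming v_l = w_l = 1."""
    rings = ring_schedule(layer_sizes)
    return k1 * sum(rings) + (len(layer_sizes) - 1) * k2

sizes = [200, 120, 40]               # N1 > N2 > N3, hypothetical
print(ring_schedule(sizes))          # [200, 120]: a 120-PE ring inside a 200-PE ring
print(total_time(sizes))             # 200 + 120 multiply-accumulate cycles, plus 2*k2
```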
By means of this ring-embedding technique, various ring structures can be devised for the efficient implementation of each layer of the neural network. A number of different neural network structures and the processing rings used to implement them are illustrated in Section 5.10.

The case where the number of neurons exceeds the number of processors can be addressed by time-multiplexing multiple neurons onto a single PE. The number of time slices required for implementing each individual neuron on a single PE is given by v_ℓ and ω_ℓ, defined in (5-1) and (5-2). In multilayer processing, the outputs of layer ℓ-1 are used as inputs to layer ℓ. Thus, if v_{ℓ-1} neurons of layer ℓ-1 are mapped to each PE, there must be that many input neurons per processor for layer ℓ; in other words, ω_ℓ = v_{ℓ-1}. This restriction disqualifies the simple use of equations (5-1) and (5-2) for calculating v_ℓ and ω_ℓ for all layers when the neural network size exceeds the processing array size. Because of the nonlinear nature of equation (5-3), which gives the execution time of the mapping, determining the optimum values of v_ℓ and ω_ℓ is nontrivial and requires a heuristic search method.

5.5 Implementing Back-Propagation Algorithms

Some neural network learning algorithms, such as error back-propagation [66], require that error values be calculated and propagated from the output layer back to the input layer after the evaluation of the output-layer neurons. During the backward propagation of the error values, the synaptic weights between the neurons of consecutive layers are adjusted according to the specific learning algorithm being employed. The computational aspects of the most commonly used learning algorithm, the back-prop algorithm [66], were described in Section 4.3.1. This computation can be implemented with the variable-size ring mapping method by keeping the partial sums involved in the calculation of the error contributions (Acc_i in Figure 4-5) and the neuron activation values a_i local in each PE, while rotating the error terms δ_i through the ring structure. This approach is similar to that described in Section 4.3.2 and [Shams, 1991 #105], except that the ring length can now be adjusted to fit the neural network size for increased efficiency and throughput.

In this approach, the construction and traversal of the embedded ring structure is performed in the reverse order: the ring of size R_ℓ is traversed, followed by the ring of size R_{ℓ-1}, and so on. This procedure continues until processing for layer 2 is completed and all the weights in the neural network have been modified. In this mapping, a complete network update is executed before the next one is initiated. Other mapping methods [72, 89] take advantage of network-level parallelism and execute multiple training patterns in parallel; the use of this approach is described in detail in Section 5.8.

The execution time for the backward error-propagation algorithm has the same order of complexity as that of the forward neural computation described by equation (5-3). However, during each systolic cycle of the algorithm, the number of algebraic operations is increased by two multiplications and one addition. Similarly, the time required to implement the neuron transfer function is replaced by the time required to implement the first derivative of this function.
If the sigmoid function, equation (4-9), is used as the neuron transfer function, its first derivative can be computed simply as

    f'(U_i) = a_i (1 - a_i)          (5-9)

which requires only one subtraction and one multiplication. Such a scheme is only useful if the architecture provides enough local memory to store all the activation values of the neurons of consecutive layers mapped to a single PE.

5.6 Implementing Block-Connected Neural Networks

In addition to layered interconnection structures, a number of neural network models use, or allow the use of, block-connected interconnections. In our treatment here, a block consists of two disjoint sets of neurons with full interconnection between the sets and no connections within each set. Such interconnection structures are more general than layered networks, since each layer can be represented as a single block; a more complex structure might use several blocks within each layer. We can treat the mapping method described above as the single-block-per-layer case and extend it to structures with multiple blocks per layer.

Block structures are commonly used for two reasons. First, block-structured networks can be used to perform data fusion by combining the outputs of several disjoint portions of a neural network; an example of such a structure is shown in Figure 5-6. The second type of block structure is usually employed to perform some form of feature detection using a concept called weight sharing [46]. These networks generally perform a convolution-type operation on the previous layer's neurons, using the synaptic weights as a mask filter. The interconnection structure of such networks contains many small, overlapping blocks; an example is the neocognitron model used for invariant object recognition [19]. In this section we describe the mapping methods best suited to implementing each style of interconnection structure on the DREAM Machine using the variable-length ring processing scheme.

Figure 5-6 - A block-structured neural network utilizing 3 disjoint blocks between the input and hidden layers, and its associated mapping onto 3 processing rings.

5.6.1 Implementing Data-Fusion Style Neural Networks

The simplest case of implementing block-connected networks is when each block is completely disjoint from the other blocks in the network. In this case, ring structures such as those described earlier can be used to process each block on a different part of the processing array. This type of mapping leads to ring structures similar to that shown in Figure 5-1(b). If the block-structured network belongs to the class of models used for data fusion, the outputs of a number of blocks are combined and used as inputs to a single block, and possibly processed by additional layers of the network (as in the structure depicted in Figure 5-6). Let N_ℓ^b denote the number of neurons in block b of layer ℓ, and R_ℓ^b the length of the processing ring associated with that block. If the outputs of several blocks of layer ℓ are to be treated as inputs to layer ℓ+1, the processing rings associated with the blocks of layer ℓ are placed next to one another to form another ring for the computation associated with layer ℓ+1.
For example, in Figure 5-6, after the 200 cycles needed to complete the longest ring, the switch settings between the PEs of adjacent rings are changed to form a processing ring of size 13. After 13 computation cycles in this configuration, the final values of the output neurons are available in the first three PEs of this ring.

The number of cycles required to implement one layer of a block-connected neural network using this mapping method is

    T_ℓ = v_ℓ ω_ℓ R_ℓ k₁ + v_ℓ k₂          (5-10)

where

    v_ℓ = max_{b∈B_ℓ} {v_ℓ^b},    ω_ℓ = max_{b∈B_ℓ} {ω_ℓ^b},    R_ℓ = max_{b∈B_ℓ} {R_ℓ^b}          (5-11)

In equation (5-11), B_ℓ represents the set of all blocks that comprise layer ℓ of the network. Since multiple blocks are executed concurrently on different processing rings of the machine, the time required to complete the computation associated with all the blocks equals the time required to complete the longest ring; thus R_ℓ is set to the longest ring of the layer. The DREAM Machine's masking capability is used to inhibit the processors of all rings that have completed their computation before the longest ring has been completely traversed. Because of the SIMD execution paradigm used by this mapping, the number of layer-ℓ neurons assigned to each PE of a ring must be equal, or treated as equal, for all the different blocks in that layer. Therefore, v_ℓ and ω_ℓ are set equal to the largest values of v_ℓ^b and ω_ℓ^b, respectively, over all the blocks of layer ℓ, as expressed in equations (5-11).
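The layer time of equations (5-10) and (5-11) is easy to evaluate once each block's ring length and multiplexing factors are known. The block sizes below are illustrative values in the spirit of the Figure 5-6 example (three disjoint input blocks processed concurrently), not measurements.

```python
from math import ceil

def block_layer_time(blocks, k1=1, k2=1):
    """Cycles for one block-connected layer, equations (5-10) and (5-11).

    Each block is a tuple (n_out, n_in, ring_len) describing that block's ring.
    """
    v = max(ceil(n_out / R) for n_out, n_in, R in blocks)   # (5-1) per block
    w = max(ceil(n_in / R) for n_out, n_in, R in blocks)    # (5-2) per block
    R = max(R for _, _, R in blocks)                        # longest ring governs
    return v * w * R * k1 + v * k2

# Three disjoint blocks of the hidden layer, processed on rings of
# length 200, 75 and 40; enough PEs are assumed, so v = w = 1.
hidden = [(6, 200, 200), (4, 75, 75), (3, 40, 40)]
print(block_layer_time(hidden))    # 200*k1 + k2: the 200-PE ring dominates
```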
The formalism given in equations (5-3) and (5-10) for arriving at the execution rate of this mapping incorporates the effects of the required virtual PEs through the v_ℓ and ω_ℓ terms.

The goal of an optimal mapping is to determine appropriate values of R_ℓ such that the total execution time (given by equation (5-7)) is minimized. Due to the discontinuities introduced into the execution time equations by the ceiling functions, we cannot directly solve for the best R_ℓ values. In many cases the network structure can give good hints as to approximate values for the ring sizes. Since the execution rate of a specific mapping can be evaluated analytically using equation (5-7), optimization techniques, such as simulated annealing [36], can be utilized to arrive at efficient mapping solutions.

5.8 Batch-Mode Implementation

As described earlier, batch-mode processing utilizes network-level parallelism by simultaneously implementing multiple instances of a single neural network on a parallel processor. This type of mapping has been extensively used to increase throughput rates for a number of different parallel implementations [49, 60, 72, 88, 89]. Batch-mode processing can be used to improve the system utilization of the mapping method in cases where the number of neurons in the network is smaller than the number of processors in the processor array. The DREAM Machine's processor array can be configured into many small regions of size equal to the size of the neural network being implemented. Each of these regions can independently implement the complete network, assuming that the input patterns associated with each network are stored locally in each region. This type of processing requires that a number of different input patterns be available to the system before processing is initiated.

Batch-mode processing can also be used to implement neural network learning algorithms. There are two major disadvantages to batch-mode learning. The first is the lack of models that allow for batch-mode learning. The second problem is associated with the effectiveness of this training approach. In gradient-descent-based learning algorithms, such as back-propagation [66], the mathematically correct algorithm requires weight updates after each pattern presentation. By using batch-mode learning, true gradient descent is not implemented, since weight values are updated according to the sum of the weight updates associated with a number of different training patterns. Another attribute of this approach is that, by combining several weight contributions together, the learning process might be slowed in proportion to the batch size. Therefore, the speed gained by running multiple networks in parallel is lost by having to increase the number of learning cycles by a factor close to the batch size. In Section 5.10 we compare a number of different neural network implementations based on their throughput. Although a considerable amount of increased utilization and throughput can be obtained, in these comparisons we will not consider the use of batch-mode processing, due to the problems mentioned here.

5.9 Implementing Competitive Learning

Until now we have assumed a neuron transfer function that can be implemented based on data stored locally at each processor, such as the sigmoid function and the threshold function (see Figure 2-2).
In this section we examine methods for efficiently implementing neural network models utilizing competitive learning algorithms on the DREAM Machine. These algorithms update the weight values associated with synaptic connections arriving at the inputs of the neuron with the largest activation value, called the winner neuron, in a specific layer. Thus, the transfer function associated with the neurons in these models depends on the activation values of a number of other neurons scattered across the processing array. Certain models, such as the self-organizing feature maps [37], modify the weight values of neurons in a local neighborhood of the winning neuron in addition to the connections arriving at the winning neuron.

In Chapter 3 we described how the G-bus of the DREAM Machine architecture can be utilized to efficiently determine the winning neuron. Using the conditional masking instruction, all the PEs except the one assigned the winning neuron can be masked out from executing the weight modification procedure. In our mapping, the weight values associated with the inputs of the winning neuron are distributed across the different PEs in the ring. Nevertheless, the learning algorithm can be implemented by rotating the mask bit through the PEs of the ring, starting from the winner PE. In this fashion, the appropriate weight associated with the winning neuron is modified in the properly enabled processor. This procedure requires O(R_ℓ) time steps, where R_ℓ is the number of neurons in the ring being evaluated.

A more efficient weight updating method can be devised using the local address modification capability of the DREAM Machine (described in Section 3.3) in addition to the G-bus. In this approach, the local memory of each processor holds a neuron ID value indicating the neuron's relative location in the ring. The input weights associated with neuron i are stored in memory location (W_Base + offset) of the processor assigned neuron j, where W_Base represents the base address of the weight memory space and offset is calculated as

    offset = j - i            if j >= i
    offset = R_ℓ + j - i      if j < i                               (5-12)

(see Figure 5-7).

Figure 5-7 - Memory locations of synaptic weights on the inputs of neuron i.

In order to achieve maximum parallelism in the weight updating phase, after the determination of the winning neuron the locally stored neuron ID value (ID_max) is broadcast to all the PEs in the machine by the controller, using the Instruction/Data broadcast mechanism described in Section 3.2. All the processors can then calculate their corresponding memory offset values according to equation (5-12). The controller can then load each PE's local address register with W_Base and shift the offset value into this register in O(log R_ℓ) time steps. At this point all processors can access the appropriate weight values and perform the update function in unison. For large values of R_ℓ, this approach yields a higher performance, requiring only O(log R_ℓ) steps compared with the earlier method, which requires O(R_ℓ) steps.
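The offset arithmetic of equation (5-12) amounts to a modular shift around the ring. The sketch below illustrates it in Python; the function names and the ring length are illustrative only, and the j = i case is taken as offset 0.

    def weight_offset(j, i, R):
        """Offset of neuron i's input weight in the memory of the PE that
        holds neuron j (equation (5-12)): effectively (j - i) mod R for a
        ring of length R."""
        return j - i if j >= i else R + j - i

    def winner_update_offsets(winner, R):
        """After the winner ID is broadcast over the G-bus, every PE j of
        the ring computes where the winner's input weight sits in its own
        memory, so all R weight updates can then proceed in unison."""
        return [weight_offset(j, winner, R) for j in range(R)]

    print(winner_update_offsets(winner=5, R=8))   # PE 5 itself gets offset 0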
5.10 Implementation Examples and Performance Evaluation

In this section we demonstrate the performance of the DREAM Machine through the use of several "real-world" example mappings. We compare these results with those obtained from mappings onto a ring systolic architecture. Due to the nature of the mapping method, other more complex interconnection topologies (e.g., hypercube, plain mesh) do not offer additional advantages over the 1-D ring and thus are not included in this evaluation.

5.10.1 Performance Metric

A commonly used and quoted measure of implementation performance for neural networks is the Million Connection Updates Per Second (MCUPS) metric. This measure is calculated by dividing the total number of connections in the neural network by the amount of time required to perform a weight updating procedure on the complete network. The execution time is measured from the point where the input data are presented to the network to the point where all synaptic weight updates have been completed. The main problem with using this measure as a performance metric is that the weight values can be updated using a number of different learning algorithms, each having a different computational complexity. For evaluating and comparing the performance of our implementation we use a less restrictive metric based on recall, or feed-forward, processing only. The metric used here is Million Connections Per Second (MCPS). This measure is evaluated by dividing the number of connections in the neural network by the total time between the presentation of the input pattern and the generation of all the output values. Use of "real-world" example networks offers a better assessment of the system's ability to efficiently implement varying network structures than the often-quoted peak execution rate.

Optimally, the number of operations that can be performed in parallel by any implementation taking advantage of neuron-level parallelism (assigning individual neurons to distinct PEs) is equal to min(N_ℓ, N_{ℓ-1}) when processing layer ℓ. The optimal execution rate for implementing the processing associated with layer ℓ can therefore be represented as

    T_ℓ^* = ( N_ℓ N_{ℓ-1} / min(N_ℓ, N_{ℓ-1}) ) k_1 .                (5-13)

A relative measure of performance of any mapping method can be evaluated as the ratio of the mapping's execution rate to the optimal. Three different neural network structures have been selected as benchmarks for evaluating our mapping method applied to the DREAM Machine architecture. We use both the MCPS and the percentage-of-optimality measures in the evaluations.

5.10.2 Implementing a Fully Connected Multilayer Neural Network

A commonly used neural network model for comparing the relative performance of parallel implementations of neural networks is the Nettalk network [69]. Nettalk is a simple three-layer neural network with full interconnections between adjacent layers which utilizes the back-propagation learning algorithm to produce speech from printed text. The network consists of 203 input neurons, 60 hidden neurons (neurons in the second layer), and 29 output neurons. Using the algorithmic mapping method described earlier in this chapter, we can construct a ring structure (similar to Figure 5-5) with ring sizes R_1 = 203 and R_2 = 60. Assuming a DREAM Machine architecture with 256 processing elements, the implementation execution rate can be found using equations (5-3) and (5-6) to be T = 263 k_1 + 2 k_2 seconds, where k_1 is the time to execute a single systolic cycle of the algorithm and k_2 is the time required to perform the table lookup operation.
Since the DREAM Machine architecture allows memory access, interprocessor data transfer, and arithmetic operations to be executed in parallel, k_1 will be equal to the time required to perform the most time-consuming of these three operations. Assuming current technology for the implementation of the DREAM Machine, we can safely assume a value of k_1 = 100 ns. Assuming an 8-bit quantization level for the neuron activation function, the time required to implement the table lookup operation will be 9 memory access cycles (8 for loading the shift register and 1 for reading the final value). Allowing for a 50 ns memory access time, k_2 = 450 ns.

Using these values, we arrive at a performance measure of 512 MCPS for implementing the Nettalk network on the DREAM Machine. The implementation of the same neural network model on a fixed ring systolic array, following the mapping in [41], with a ring size of 256 PEs, requires an execution time of 512 k_1 + 2 k_2 seconds. Allowing for the same implementation technology and setting k_1 = 100 ns and k_2 = 450 ns, we arrive at a 267 MCPS throughput rate. This assumes that the ring architecture supports the same type of table lookup mechanism as the DREAM Machine. If the neuron activation value is to be determined analytically, the performance will be further reduced.

The relatively similar execution rates obtained from the fixed ring mapping and the DREAM Machine mapping are due to the limited degree of parallelism available in the Nettalk network and its simple fully interconnected structure. This point becomes evident if we consider the percentage-of-optimality factor for both implementations. The optimal execution rate for implementing the Nettalk network is T* = 263 k_1 + 2 k_2 seconds, according to (5-13). This leads to an optimality factor of 100% for the DREAM Machine implementation vs. 52% for the fixed ring implementation.

5.10.3 Implementing a Block-Connected Multilayer Neural Network

A neural network model which reflects the current trend in structured network design has been proposed in [56] for an image compression application. The structure of this network (shown in Figure 5-8) is a good example of a block-connected neural network used for data fusion, as described in Section 5.6. This network consists of 5 layers. The input layer is comprised of 8 disjoint blocks of 64 neurons each. Each 64-neuron block in the input layer is fully connected to a unique 8-neuron block in the second layer. All 64 neurons in the second layer are fully connected to n neurons in the third layer. The number of neurons in this layer (n) determines the amount of compression performed by the network. A symmetrically identical interconnection structure is constructed to decompress the information from the third through the fifth layers.

Figure 5-8 - A neural network structure for image compression and decompression with a regularly blocked structure [56].

The ring structure for implementing this network on the DREAM Machine is shown in Figure 5-9. This ring structure operates in two configurations: Configuration 1 contains 8 non-intersecting rings of length 32, used for processing layers 2 and 5; thus R_2 = R_5 = 32. Configuration 2 is used for processing layers 3 and 4 and consists of a single ring with 64 processors; thus R_3 = R_4 = 64. Virtual PEs must be used to implement the required 512 neurons in the input and output layers, since the processor array is assumed to have only 256 processors.
The use of virtual PEs to implement layers 1 and 5 leads to ω_2 = ω_5 = 2. No other layers require virtual processors, and therefore ω_3 = ω_4 = 1 and v_2 = v_3 = v_4 = v_5 = 1. Using equations (5-6) and (5-10), the execution time required for implementing this network is T = 256 k_1 + 4 k_2, corresponding to a 598 MCPS throughput, assuming the compression factor n = 64. Using a fixed ring architecture, the expected execution time is T = 1536 k_1 + 4 k_2, yielding a throughput rate of only 105 MCPS. We can also calculate the optimal execution time for this network, taking advantage of all the available neuron-level parallelism. This leads to T* = 256 k_1 + 4 k_2. We notice that, as with the Nettalk mapping, the DREAM Machine implementation achieves a 100% level of optimality, whereas the fixed-ring approach yields a performance of 18% of optimum.

Figure 5-9 - The ring structure associated with the image compression and decompression network of Figure 5-8 (Configuration I: eight disjoint rings of 32 PEs; Configuration II: a single ring of 64 PEs).

5.10.4 Implementing a Fully Connected Single Layer Network

The peak performance of the algorithmic mapping method on the DREAM Machine can be evaluated by mapping a neural network structure which offers the greatest amount of parallelism. One such network has been proposed in [33], where a fully connected neural network is used to solve the Traveling Salesman Problem (TSP). The neural network requires N^2 neurons, and thus N^4 synapses, to implement an N-city TSP. A 16-city TSP can efficiently utilize all the PEs of the DREAM Machine to arrive at a peak performance of 2,516 MCPS. The same type of performance can be expected from a fixed ring architecture when implementing this network, since the ring size matches perfectly the number of available PEs in the machine.

To implement the 30-city problem described in [33], we can construct a ring of size 225 (obtained using equation (5-4)) for processing the required 900 neurons. The amount of time required for each update of the neuron output values can be found via equation (5-3) to be T = (4 * 4 * 225 k_1 + 4 k_2) = 361.8 us, or 2,239 MCPS. Implementing this network on a 256-element fixed ring architecture leads to a 1,969 MCPS performance. This illustrates well the capability of the DREAM Machine and the variable ring mapping strategy to maintain a performance close to the peak value for neural network structures of varying sizes.

The measure of optimality can also be evaluated for this example using equation (5-13). This leads to T* = 900 k_1 + 1 k_2 = 90.45 us, which assumes having 900 processors. In order to adjust for the actual number of processors, we can multiply T* by a factor of 900/256 = 3.52. This leads to a new T* value of 318 us. Now the optimality ratio for the variable-length ring mapping can be determined to be 88%, and for the fixed-length ring it is 77%.
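The arithmetic behind the three benchmark figures above (and summarized in Tables 5-1 and 5-2 below) is straightforward to reproduce; the following Python sketch uses the k_1 and k_2 values assumed in the text and the cycle counts derived from equations (5-3), (5-6), and (5-10).

    k1, k2 = 100e-9, 450e-9       # assumed systolic-cycle and table-lookup times

    # Nettalk: 203-60-29 fully connected layers, 13,920 synapses.
    nettalk = 203 * 60 + 60 * 29
    print(nettalk / (263 * k1 + 2 * k2) / 1e6)      # DREAM Machine   -> ~512 MCPS
    print(nettalk / (512 * k1 + 2 * k2) / 1e6)      # 256-PE fixed ring -> ~267 MCPS

    # Image-compression network of Figure 5-8 (n = 64): 16,384 synapses.
    comp = 2 * (8 * 64 * 8) + 2 * (64 * 64)
    print(comp / (256 * k1 + 4 * k2) / 1e6)         # DREAM Machine   -> ~598 MCPS
    print(comp / (1536 * k1 + 4 * k2) / 1e6)        # fixed ring      -> ~105 MCPS

    # 30-city Hopfield TSP: 900 neurons on a 225-PE ring with v = omega = 4.
    tsp = 900 * 900
    print(tsp / (4 * 4 * 225 * k1 + 4 * k2) / 1e6)  # -> ~2,239 MCPS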
5.11 Analysis of Results

In this chapter we described a mapping method for efficient implementation of neural network models with regularly structured interconnections. We described how various features of the DREAM Machine architecture can be used to efficiently implement specific requirements imposed by neural network processing. The mechanism used for construction of processing rings of arbitrary size (less than the processing array size) was shown to be useful for matching the size of the problem to the architecture. The use of embedded ring structures, described in Section 5.4 and demonstrated in Section 5.10, shows the effectiveness of this method in significantly reducing the restrictions on the size and shape of the interconnection structure of multilayer neural networks for efficient and fast processing. A summary of the performance comparison based on throughput rates is shown in Table 5-1, and a comparison based on the optimality ratio is given in Table 5-2.

                                 Nettalk [69]   Compression Net [56]   Hopfield 30-City TSP [33]
    Fixed-Size Ring Mapping      267 MCPS       105 MCPS               1,969 MCPS
    Variable-Size Ring Mapping   512 MCPS       598 MCPS               2,239 MCPS

Table 5-1 - Performance comparison of the variable-size ring mapping vs. the fixed-size ring mapping based on the MCPS metric.

The performance of the DREAM Machine architecture was compared to the ring-connected systolic architecture in this chapter. Since both architectures use nearest-neighbor communications, the extra communication hardware required by the DREAM Machine is only a constant factor greater than that of the 1-D ring architecture. On the other hand, the execution rate of the fixed ring was shown to be O(P), where P is the number of processors in the network (assuming the number of neurons per layer is always less than P). Using the variable ring mapping method on the DREAM Machine, the execution rate of this implementation is O(R_b), where R_b is the size of the largest block in the network. As neural networks continue to be applied to more demanding applications requiring specialized structures for processing various portions of the input data, the number of blocks in a specific neural network will increase, but the size of each block should remain relatively small and constant. In order to exploit all the available parallelism in such networks, the number of processors in the system should increase. The DREAM Machine architecture and the variable-ring-size mapping method support these demands very efficiently, since the execution rate scales with the small and constant R_b value rather than with the large and growing processor array size.

                                 Nettalk [69]   Compression Net [56]   Hopfield 30-City TSP [33]
    Fixed-Size Ring Mapping      52%            18%                    77%
    Variable-Size Ring Mapping   100%           100%                   88%

Table 5-2 - Performance comparison of the variable-size ring mapping vs. the fixed-size ring mapping based on the optimality ratio.

Chapter 6

Optimization Based Mapping Method

Note: the work reported in this chapter and the associated results presented in Chapter 7 have been performed in collaboration with Dr. Petar Simic [74, 76].

In this chapter, we present a general method for mapping parallel algorithms onto parallel processing architectures. This mapping is performed in such a fashion as to optimize a specific objective function. The basic framework of this method is presented in a general form in order to be applicable to a wide range of problems in parallel processing. However, in this dissertation we specifically demonstrate the use of this method for implementing neural network models on the DREAM Machine architecture. The method is based on using neural computation techniques to arrive at good solutions to the mapping problem.
The basic principles of this method and the specific cost functions used to measure the optimality of a specific mapping are described in detail. Simulation results of the algorithm are presented in Chapter 7.

6.1 Problem Overview

In general, parallel algorithms with specific regularities in their computation and interconnection patterns can be mapped onto parallel architectures manually by exploiting the inherent regularities of the algorithm. This type of approach has been successfully used to arrive at efficient methods for implementing regularly structured algorithms, such as the Fast Fourier Transform (FFT) and the matrix-matrix product, on systolic parallel architectures [40]. In Chapter 5, we presented one such method for mapping neural network computation onto the DREAM Machine architecture. This method utilized the regularity of the neural network interconnection structure to construct ring-shaped paths specifying the assignment of neurons to PEs and generating a simple flow pattern for transferring partial results between the various processors.

6.1.1 The Mapping Problem

In mapping parallel algorithms where no particular regularity in the interconnection patterns is apparent, two distinct approaches can be taken. One method is to enforce a certain regularity in the interconnection structure by introducing phantom processes and connections, which do not contribute to the actual computation but are used strictly for correctly synchronizing the operations being performed by the various processors. The analogy of this method for neural network processing is the use of phantom neurons and synapses with zero output values and zero connection weights, mentioned in Chapters 4 and 5. This method simplifies the mapping problem, since the artificially created regularity can be used to ensure proper computation while maintaining the basic, straightforward mapping algorithm. On the other hand, the introduction of phantom connections and processes reduces the system efficiency by dissipating computational cycles on useless operations, such as multiplication and addition by zero.

The second method of addressing the problem of mapping irregularly structured algorithms is to abandon the use of regularity in the interconnection structure altogether. The mapping problem can then be stated generally as the problem of assigning the various processes of a given algorithm to the various processors of a parallel machine, and of constructing conflict-free computational paths, in such a fashion as to optimize a specific objective, such as minimum execution time.

6.1.2 Complexity of the Mapping Problem

In general, the number of possible assignments of processes to processors, and of the corresponding communication flow patterns, is combinatorially high. For example, given a neural network with N neurons and a parallel architecture with P processors, there are P!/(P-N)! different assignments that can be used to assign the neurons to the processors, assuming P >= N and that each processor can hold no more than one neuron. The space of possible configurations would be much larger if we allowed multiple neurons per processor. Generally, there are N computational paths associated with neural network computation (one per neuron). There exist P!N!/(P-M)! different configurations that can be used to construct these paths traversing the various PEs in M communication steps.
Of course, most of these assignments and flow patterns lead to very low system utilization and throughput rates. Only a handful of possible assignments and communication schedules lead to efficient, high-throughput implementations.

6.1.3 Analogies to Other Combinatorial Optimization Problems

The problem of optimally mapping parallel processes onto parallel processing architectures is analogous to many other combinatorial optimization problems. The problem of assigning processes of a process graph to processors of an architecture, defined by an interconnection topology graph, is similar in nature to the sub-graph isomorphism problem. The relation between the assignment problem and sparse matrix bandwidth reduction has been illustrated in [7]. It has been shown that all of these problems fall into the class of NP-complete problems, for which no polynomial-time algorithm has been devised that can guarantee a solution to an arbitrary-size problem.

As discussed in Section 6.1.2, the scheduling problem for implementing commutative computation according to the mapping method of Chapter 4 is of even larger complexity than the assignment problem. The construction of each computational path over the processing array can be viewed as a slight variant of the Traveling Salesman Problem (TSP), which is also known to be NP-complete [44]. Having to construct multiple paths while minimizing the longest path length is analogous to running multiple interdependent traveling salesman problems concurrently while optimizing the longest tour length.

Solving such combinatorially complex optimization problems has generally involved devising heuristic and iterative approaches which attempt to find a "good" but not necessarily optimal solution [4, 7]. Starting from a random solution, these methods sequentially and slowly vary the system configuration until, at the completion of the algorithm, no better solution can be found. Due to the serial nature of these algorithms, their usage has been limited to small problems. In this chapter, we present a highly parallel method for efficiently searching the complex space of mapping possibilities in order to find good solutions to the assignment and scheduling problems [78]. Similar but less general approaches have been proposed previously using various neural computation techniques [15, 33, 58].

6.2 General Approach

We presented the general mapping approach used for implementation of neural networks on systolic parallel architectures in Chapter 4. This mapping method describes the general concept of the computation process by specifying the computational operations to be performed at each processor and the associated interprocessor communication requirements. However, this general mapping method does not specify the exact assignment of neurons to processors or the subsequent scheduling of the flow of data between the various processors. We now describe how we attempt to address each of these problems.

6.2.1 Mapping Parallel Algorithms Onto Parallel Architectures

The implementation of parallel algorithms on parallel processing architectures involves solving two different but interrelated problems. First, we need to determine what part of the total computation associated with a specific algorithm is to be performed by each of the processors in the target machine. Second, we need to establish how, and in what order, these processors communicate with one another in order to implement the desired algorithm.
In general, this problem can be broken down into three separate but again interrelated problems. The first one is sometimes referred to as the clustering problem. Given a specific algorithm as a collection of intercommunicating processes, the clustering problem involves grouping a number of processes into individual clusters such that inter-cluster communications are minimized while computational parallelism is maximized. Of course, there is a trade-off between these two objectives, and the best trade-off depends on the specific characteristics of the target architecture. With a specific grouping of processes into clusters, the second problem one must deal with involves finding an optimal assignment of clusters to processors of the target architecture, such that the logical communications between clusters can be realized efficiently by the physical interconnection topology of the architecture. This problem is referred to as the assignment problem [8]. The third and final problem, called the scheduling problem, involves specifying the order in which each process is executed and how the interprocessor communications are synchronized. The clustering and assignment problems must be solved before the scheduling task can begin. The objective of the scheduling problem is to order the computation and communication operations in such a manner that all the necessary computation is completed in the shortest amount of time, while observing specific physical limitations of the architecture, such as conflict-free communications.

Although there is a sequential order in which each of these problems must be solved, the clustering, assignment, and scheduling tasks are all interdependent, and thus the best implementation might require concurrent optimization of all three. In the following sections we describe our approach for addressing each problem separately. However, our method is formulated such that concurrently solving for the best solution to all problems can also be performed. Below, we describe how the clustering problem can be addressed. The assignment and scheduling problems are treated separately in Sections 6.3 and 6.4, respectively.

6.2.2 Addressing the Clustering Problem

The need for solving the clustering problem arises in two different circumstances. First, if the number of processes of a given algorithm is larger than the number of processors in the target machine architecture, a method for clustering multiple processes into a single cluster is required. The second reason for performing the clustering operation is due to the specific architectural characteristics of the target machine. If the target machine's architecture is rather inhomogeneous, having computation and communication properties that vary across the machine, clustering can be performed to group together similar processes requiring similar architectural characteristics. For example, with a machine architecture comprised of a group of analog processors and a group of digital processors, processes requiring high-precision computation might be grouped together and assigned to the digital processors. Other processes, with characteristics better matched to analog hardware, can be clustered and assigned to the analog processors.

The clustering problem has the same computational complexity as the assignment problem described in Section 6.1.2. Therefore, solving this problem for the optimal solution involves searching a combinatorially large space of possibilities.
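The scale of that space, given by the counting argument of Section 6.1.2, can be made concrete with a few lines of Python; the specific values of P and N below are illustrative only.

    from math import factorial, log10

    def assignment_count(P, N):
        """Number of ways to place N neurons on P processors with at most
        one neuron per processor: P!/(P-N)! (Section 6.1.2)."""
        return factorial(P) // factorial(P - N)

    # Even a modest case is far too large to enumerate, which is why an
    # optimization-based search is used instead of exhaustive evaluation.
    print(f"P=256, N=64: about 10^{log10(assignment_count(256, 64)):.0f} assignments")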
In this dissertation we do not attempt to solve the clustering problem for the general case of grouping neurons into clusters before performing the assignment procedure. As mentioned above, the clustering problem arises in cases of inhomogeneous algorithms and architectures, and in cases where the number of processes is larger than the number of processors. For implementation of neural networks on systolic SIMD parallel architectures, both the algorithm and the architecture are rather homogeneous, eliminating the need for clustering due to inhomogeneity. For the case where the number of neurons exceeds the number of processors, we use a trivial clustering method involving the generation of virtual processors via time multiplexing. With a machine having P processors, virtual processor P_v is processed in time slot (P_v div P) + 1 of processor (P_v mod P). Although this method does not guarantee any goodness of clustering, it is simple to implement. Optimization routines similar to those described for the assignment and scheduling problems later in this chapter can also be developed to address the clustering problem.

6.3 Solving the Assignment Problem

In this section we present a general formalism for solving the assignment problem. In general, any computation can be represented as a collection of processes performing specific operations on their local data items while communicating an arbitrary amount of information over a set of logical communication links. Similarly, a parallel processing architecture can be viewed as a collection of processing elements, operating on and manipulating data items, capable of communicating with one another over a physical interconnection network. The objective of the assignment procedure is to generate a mapping which assigns each process of the algorithm to a particular processor in the parallel architecture such that a given criterion function is optimized. One such criterion function is to maximize the number of logical communication links mapped onto physical communication channels.

6.3.1 Problem Representation

A computational algorithm can be represented as a process graph D, with nodes n, m = 0, 1, 2, ..., N-1 representing the individual processes. These nodes are connected via links D^{nm} denoting the logical data dependence between processes, where D^{nm} = 1 denotes a dependence between process n and process m. Similarly, the interconnection topology of a parallel processing architecture can be represented as a graph G, with nodes A, B = 0, 1, 2, ..., P-1 denoting the various PEs of the architecture, and links G^{AB} representing the physical communication channels between the processors. Any arbitrary interconnection topology can be specified by setting G^{AB} = 1 to indicate the presence of a physical connection between processors A and B. With such a representation scheme, the assignment of process nodes to processor nodes can be represented by an assignment matrix ω, having binary-valued elements ω_n^A, with ω_n^A = 1 if and only if process n is assigned to processor A. A graphical representation of this approach is shown in Figure 6-1.

Figure 6-1 - Mapping of the nodes of a process graph D onto a processor graph G using the assignment matrix ω.

Let us further define a distance matrix Ḡ, with elements Ḡ^{AB} representing the communication cost associated with transferring data from processor A to processor B.
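A minimal Python sketch of this representation is given below, using a 2-D mesh of processors and the city-block distance as one possible concrete choice for Ḡ; the helper names and the mesh example are illustrative, not part of the original implementation.

    import numpy as np
    from itertools import product

    def mesh_graphs(rows, cols):
        """Adjacency matrix G and city-block distance matrix (the Ḡ of the
        text) for a rows x cols mesh of processors."""
        coords = list(product(range(rows), range(cols)))
        P = len(coords)
        G = np.zeros((P, P), int)
        G_dist = np.zeros((P, P), int)
        for A, (r1, c1) in enumerate(coords):
            for B, (r2, c2) in enumerate(coords):
                d = abs(r1 - r2) + abs(c1 - c2)
                G_dist[A, B] = d
                G[A, B] = 1 if d == 1 else 0
        return G, G_dist

    def assignment_matrix(placement, P):
        """Binary assignment matrix omega: omega[n, A] = 1 iff process n is
        placed on processor A (placement is a list of processor indices)."""
        omega = np.zeros((len(placement), P), int)
        for n, A in enumerate(placement):
            omega[n, A] = 1
        return omega

    G, G_dist = mesh_graphs(4, 4)
    omega = assignment_matrix([0, 1, 5], P=16)   # three processes on nearby PEs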
The distance measure can be arbitrarily specified according to the specific architectural features of the target architecture. For example, the measure used in our implementation is the city-block distance between the various PEs. Other distance measures, based on the specific capacities of the various communication channels used by an architecture, can also be represented through the use of this matrix.

6.3.2 Assignment Cost Function Formulation

With the representation scheme described in the previous section, we can now proceed to design a specific cost function that assigns the various nodes of the process graph to the nodes of the processor graph in such a way as to minimize the communication cost between the PEs. The ideal assignment would assign each process to a particular processor such that each of the logical connections in the process graph D^{nm} can be represented by a physical communication channel G^{AB} of the architecture. In general, we cannot guarantee that mappings which satisfy this condition exist. For example, if a particular node of the process graph is connected to 5 other nodes of the graph, and the target computer architecture uses a four-nearest-neighbor interconnection topology, at best only four of the logical connections in the process graph can be directly mapped onto physical communication channels of the architecture. In such cases, we would like to assign the fifth node to a processor located at the shortest possible distance from its ideal position, hence the need for the distance matrix Ḡ.

Given the graphical representation of the computation and of the target architecture through the use of the D and G matrices, respectively, we can measure the amount of overlap between the links of the process graph and the links of the architecture graph for a particular assignment matrix ω. Specifically, we are interested in minimizing the number of mismatches between these graphs under a particular assignment. The extent of mismatch can be evaluated as

    \sum_{A,B} \sum_{n,m} ( G^{AB} - \omega_n^A D^{nm} \omega_m^B )^2 ,     (6-1)

where ω_n^A D^{nm} ω_m^B = 1 when process n is mapped onto processor A, process m is mapped onto processor B, and there exists a link between processes n and m. Thus, a match in the assignment of logical process-graph edges to physical processor-graph edges does not contribute to this sum. As mentioned earlier, in cases where the links of the two graphs cannot be completely matched, we would like to map processes to processors physically close to the optimal location. This can be accomplished by weighting the cost of each mismatched link by the distance between the two selected PEs, as

    \sum_{A,B} \sum_{n,m} \bar{G}^{AB} ( G^{AB} - \omega_n^A D^{nm} \omega_m^B )^2 .     (6-2)

In general D^{nm} ≠ D^{mn}, corresponding to a directed process graph. However, we do require that the matrix D be quadratic of size N×N. By utilizing the binary nature of the G and D matrices, equation (6-2) can be simplified to

    \sum_{A,B} \sum_{n,m} [ (1 - G^{AB}) - G^{AB} ] \, \bar{G}^{AB} \bar{D}^{nm} \omega_n^A \omega_m^B ,     (6-3)

where D̄^{nm} is defined as

    \bar{D}^{nm} = \tfrac{1}{2} ( D^{nm} + D^{mn} ).                 (6-4)

In this form, G^{AB} is used to add or subtract the quantity Ḡ^{AB} D̄^{nm} ω_n^A ω_m^B from the total cost, depending on its state. This can be simply implemented using a conditional statement requiring a minimal amount of computation. In order to guarantee correct computation of this cost function, we require that both the distance matrix Ḡ and the process graph D be traceless, having elements Ḡ^{AA} = 0 and D^{nn} = 0.
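The following sketch is a literal, unoptimized Python transcription of the weighted mismatch cost of equation (6-2); the ring-shaped toy example and the function name are illustrative assumptions.

    import numpy as np

    def assignment_mismatch_cost(G, G_dist, D, omega):
        """Weighted mismatch cost of equation (6-2).  G: P x P processor
        adjacency, G_dist: P x P distance matrix (Ḡ), D: N x N process
        graph, omega: N x P binary assignment matrix (omega[n, A] = 1 iff
        process n is on processor A)."""
        prod = np.einsum('na,nm,mb->abnm', omega, D, omega)   # ω_n^A D^{nm} ω_m^B
        diff = G[:, :, None, None] - prod
        return float(np.sum(G_dist[:, :, None, None] * diff ** 2))

    # Toy example: a 3-process chain mapped onto adjacent PEs of a 4-PE ring.
    P, N = 4, 3
    G = np.zeros((P, P), int)
    for a in range(P):
        G[a, (a + 1) % P] = G[(a + 1) % P, a] = 1
    G_dist = np.array([[min(abs(a - b), P - abs(a - b)) for b in range(P)]
                       for a in range(P)])
    D = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
    omega = np.zeros((N, P), int)
    for n, A in enumerate([0, 1, 2]):
        omega[n, A] = 1
    print(assignment_mismatch_cost(G, G_dist, D, omega))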
6.3.3 Constraints on the Assignment Matrix

There are a number of constraints placed on the structure of the assignment matrix which are required to ensure physically realizable solutions. In our formulation, we assume that the number of processors in the system is always greater than or equal to the number of individual processes being implemented, P ≥ N. We have already described how time-multiplexing techniques can be used to increase the number of virtual PEs to the desired amount in order to satisfy this condition.

A major constraint on the assignment matrix is to have each process assigned to "some" processor in the architecture. This constraint can be written as

    \sum_A \omega_n^A = 1 \quad \forall n .                          (6-5)

Later, in Section 6.6.1, we show how this condition can be strongly enforced in arriving at solutions to the assignment problem.

A second constraint is introduced by requiring that each processor not be assigned more than one process. This constraint can be softly enforced by adding the penalty term

    \sum_A \sum_{n,m} d^{nm} \omega_n^A \omega_m^A                   (6-6)

to the cost function. In equation (6-6), the parameter d^{nm} is used to adjust the weight of this penalty term in the total cost of the assignment. Additional penalty terms can be added to ensure compliance with other constraints. These include a term for discouraging mappings where one process is assigned to more than one processor, obtained by adding

    \sum_n \sum_{A,B} \gamma^{AB} \omega_n^A \omega_n^B              (6-7)

to the total cost. In both equations (6-6) and (6-7) we always assume that d^{nn} = 0 and γ^{AA} = 0. This ensures proper computation of the penalty terms.

If we assume that the number of processors in the network is equal to the number of processes in the process graph, then an additional penalty term of the form

    \gamma_\omega \sum_A \Big( 1 - \sum_n \omega_n^A \Big)^2         (6-8)

can be added to the cost function. This term requires that each processor be assigned a single process. The parameters γ_ω and γ_η are used to adjust the contribution weight of each term to the total cost. Equation (6-8) is only useful for the case where P = N. Otherwise, if P > N, the cost associated with this term can never reach zero, since some processors are not assigned any process.

In order to formulate equation (6-8) in a more general form, applicable to the case where P > N, we introduce a new binary-valued variable η_A which indicates which of the P processors in the system are assigned the N processes. By selectively including only those PEs that are assigned a process, we can replace equation (6-8) by

    \gamma_\omega \sum_A \eta_A \Big( 1 - \sum_n \omega_n^A \Big)^2 .     (6-9)

We add an additional penalty term to the total cost function,

    \gamma_\eta \Big( N - \sum_A \eta_A \Big)^2 ,                    (6-10)

in order to ensure that exactly N processors are selected to participate in the calculation of equation (6-9). In both equations (6-9) and (6-10), the diagonal elements of the quadratic operations, ω_n^A ω_n^A and η_A η_A respectively, must be removed from the computation in order to ensure proper convergence of the algorithm. This is accomplished by expanding each equation and individually subtracting these diagonal terms. The resultant forms of equations (6-9) and (6-10) can be written as

    \gamma_\omega \sum_A \eta_A \Big[ \Big( 1 - \sum_n \omega_n^A \Big)^2 + \sum_n \omega_n^A ( 1 - \omega_n^A ) \Big]     (6-11)

and

    \gamma_\eta \Big[ \Big( N - \sum_A \eta_A \Big)^2 + \sum_A \eta_A ( 1 - \eta_A ) \Big],     (6-12)

respectively.

6.3.4 The Complete Assignment Cost Function

The complete cost function used as the criterion of our assignment procedure can be assembled by combining the penalty terms given in Section 6.3.3 with the weighted mismatch cost given by equation (6-2).
The final form of this cost function, with appropriate weighting constants, is given by

    E_{assign} = \sum_{A,B=0}^{P-1} \sum_{n,m=0}^{N-1} [ (1 - G^{AB}) - G^{AB} ] \, \bar{G}^{AB} \bar{D}^{nm} \omega_n^A \omega_m^B
               + \sum_{A=0}^{P-1} \sum_{n,m=0}^{N-1} d^{nm} \omega_n^A \omega_m^A
               + \sum_{n=0}^{N-1} \sum_{A,B=0}^{P-1} \gamma^{AB} \omega_n^A \omega_n^B
               + \gamma_\omega \sum_{A=0}^{P-1} \eta_A \Big[ \Big( 1 - \sum_{n=0}^{N-1} \omega_n^A \Big)^2 + \sum_{n=0}^{N-1} \omega_n^A ( 1 - \omega_n^A ) \Big]
               + \gamma_\eta \Big[ \Big( N - \sum_{A=0}^{P-1} \eta_A \Big)^2 + \sum_{A=0}^{P-1} \eta_A ( 1 - \eta_A ) \Big] .     (6-13)

This formulation of the assignment cost function can be viewed as a method for solving the sub-graph isomorphism problem, with the additional twist of having a weight factor for the mismatched nodes whose magnitude is proportional to the distance between the ideal and the actual assignment. Other methods based on solving the sub-graph isomorphism problem for assigning parallel processes onto parallel architectures have been explored previously using heuristic techniques, such as [7]. A comparison of our method with the method of [7] is given in Chapter 7.

The basic function performed by this assignment cost function is to assign the processes that are connected to each other, as defined by the matrix D, to a cluster of processors located in a small neighborhood of one another. For example, let us assume that process n is assigned to processor A, ω_n^A = 1. Let S_n = { m | D^{nm} ≠ 0 } represent the set of all processes that have a logical connection to process n, and define K_n to be the number of elements in the set S_n. If each of the K_n elements of set S_n is assigned to processors in the local neighborhood of processor A, then the cost associated with this mapping is zero, according to the first term in equation (6-13). Otherwise, each of the processes mapped to a non-neighboring processor of processor A, call this processor C, increases the assignment cost by a value proportional to the distance between processors A and C, given as Ḡ^{AC}.

There are several cases where such non-optimal mappings occur. The most obvious one is when K_n is greater than the number of physical connections associated with each processor of the architecture. The second case is related to the complexity of the process graph. If a single process is logically connected to a large number of otherwise disjoint process clusters, the competition between these clusters will cause the particular process to be assigned to a compromise processor that keeps the total cost to a minimum. Of course, there are cases where no optimal assignment can be achieved due to the intrinsic differences between the process and processor graphs. In such cases, it is impossible to find a mapping which completely satisfies the first term in equation (6-13). An example of such a case is given in Section 7.2.

6.3.5 Assignment of Neurons to Processors

The assignment cost function described in the preceding sections can be utilized to find good mapping solutions for assigning the neurons of a neural network to the processors of a parallel processor. This can be accomplished by treating each neuron as a single process of the process graph and each synaptic weight in the W matrix as an edge of this graph. In this fashion, each non-zero synaptic connection between neurons n and m is represented by a D^{nm} element set to one. This representation scheme can be directly applied to single-layer neural network models, such as the Hopfield net [33]. Application of this approach to layer-structured neural networks requires a relabeling of neurons as processing propagates through the various layers of the network.
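Constructing the process graph from a synaptic weight matrix, as described above, is a one-line operation; the sketch below shows it in Python, with a toy Hopfield-style weight matrix that is purely illustrative.

    import numpy as np

    def process_graph_from_weights(W):
        """Build the process graph D of Section 6.3.5 from a synaptic weight
        matrix W: D[n, m] = 1 wherever a non-zero synapse connects neurons n
        and m, with the diagonal forced to zero (the traceless requirement)."""
        D = (W != 0).astype(int)
        np.fill_diagonal(D, 0)
        return D

    def symmetrized(D):
        """The symmetrized dependence matrix of equation (6-4)."""
        return 0.5 * (D + D.T)

    W = np.array([[ 0.0, 0.3, 0.0, -0.1],
                  [ 0.3, 0.0, 0.7,  0.0],
                  [ 0.0, 0.7, 0.0,  0.2],
                  [-0.1, 0.0, 0.2,  0.0]])
    print(process_graph_from_weights(W))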
The formulation of the assignment cost function as presented in equation (6-13) attempts to assign the neurons m ∈ S_n to processors in the local neighborhood of neuron n. In mapping schemes where layer-level parallelism is not utilized, it is inefficient to have neurons of different layers mapped to different processors, since the processing associated with each layer is performed in series. Therefore, neuron n of layer ℓ, which is connected to neurons m ∈ S_n of layer ℓ-1, is assigned the same processor as neuron n of layer ℓ-1. This restriction is not significant for iterative neural network models, such as the BAM [38] and ART [10] models, but it can cause inefficiency in layered networks with large differences between the numbers of neurons in the various layers. However, this inefficiency is primarily due to the computation associated with the implementation of the assignment procedure and not the actual assignment solution. In any case, it is possible to introduce yet another variable into the cost function to automatically determine the best assignment of neurons in layered structures.

6.4 Solving the Scheduling Problem

As mentioned previously in Section 6.1, the scheduling problem involves determining an optimal sequence of operations and data movements among processors such that a specific criterion is optimized. The most common criterion used for the scheduling problem is the execution time, which is to be minimized. More complex criterion functions, based on the specific needs and architectural characteristics of a particular problem, can also be devised using an approach similar to the one described here.

6.4.1 General Approach

Given a particular assignment of processes to the processors of a parallel architecture, we are interested in generating a schedule of operations, and of the flow of data between the processors, such that the total execution time of the algorithm is minimized. A data dependence diagram (or data-flow graph) is required in order to generate a computation and communication schedule. Figure 6-3 depicts a typical data dependence graph for a simple calculation.

Figure 6-3 - An example data dependence graph for a simple arithmetic calculation.

A general approach for evaluating a cost associated with the time necessary to traverse a data dependence graph can be derived in a manner similar to the one used for solving the assignment problem. The goal of such a cost function would be to schedule as many operations as possible to perform their operations in parallel. A more complicated scheduling problem is encountered when implementing associative or commutative operations, such as multiplication and addition, in a distributed fashion. In such cases, the order in which each particular summation or multiplication operation is performed is irrelevant, and therefore the number of possible schedules for the computation is combinatorially large. Figure 6-4 shows multiple dependence graphs that accomplish the same commutative computation.

Figure 6-4 - Several different dependence graphs performing the same computation (A + B + C + D + E + F) due to the commutative nature of the addition operation.

In this dissertation, we limit our discussion to the details of the scheduling problem for this special case of commutative operations.
This is primarily due to the fact that neural computation involves the calculation of the weighted input sum of a neuron, and the mapping method we have selected for parallel implementation of neural networks, described in Section 4.1, takes advantage of the commutative nature of the addition operation.

6.4.2 Specific Problem Representation

In Section 4.1, we presented the general mapping method used for implementing neural networks on systolic parallel architectures. The scheduling problem for implementing commutative operations on a systolic parallel architecture, where the result of the computation does not depend on the specific order in which each of the operations is executed, involves determining the specific sequence in which the computation flows from one processor to the next, regardless of the order in which each processor is traversed. In addition to neural network computation, this approach can also be applied to other similar algorithms, such as solving linear systems of equations [74].

There are a certain number of specific assumptions made in our formulation of the scheduling problem. First, the processes mapped to processors are assumed to require only unit time to perform their computation. Second, each processor can autonomously select which of its neighboring processors to communicate with during each communication cycle. Finally, each processor can only perform operations associated with a single computation during each processing cycle. Given these assumptions and the mapping principle presented in Section 4.1, the scheduling problem involves the formation of computational paths, each path associated with the computation of the weighted input sum of a specific neuron. Paths are constructed such that no two paths cross the same processor at the same time, and data movement is limited to processors physically connected to one another through a shared communication channel.

Figure 6-5 - A 3-D representation of 3 non-intersecting path traversals in time. The left-hand plane represents the 2-D processor array, and the remaining axis represents time.

The criterion used to arrive at these computational paths is to schedule the flow pattern such that the length of the longest path is minimized, while all the restrictions on legal data movement between processors are adhered to. Figure 6-5 graphically displays an example of 3 non-intersecting computational paths being traversed concurrently. The 2-D processing array is augmented by the time axis for better representation. The computational paths depicted in this figure follow a simple flow pattern, where all the paths move in the same direction and are offset by a few processors in the space dimension.

6.4.3 Scheduling Cost Function Formulation

An approach similar to that of the assignment cost function formulation is used here to solve the scheduling problem. The symbol α = 0, 1, 2, ..., Λ-1 (Λ ≤ N) is used to denote one of the Λ computational paths in the neural network. The G matrix is again used to specify the interprocessor communication topology of the target architecture of the mapping. The dependence graph used for the scheduling problem is specified via the D matrix. However, in the formulation of the scheduling cost function, each element D^{αn} of the D matrix represents the need for computational path α to traverse neuron n. In other words, D^{αn} = 1 denotes that neuron n participates in the computation associated with path α.
For implementing commutative operations in an iterative system, such as single-layer feedback neural network models, D^{αn} is equivalent to the D^{nm} matrix used for the assignment problem, since Λ = N. Although the case where Λ < N can also be handled by the same procedure, by augmenting the matrix with zeros, in the remainder of this section we assume that the matrix D is quadratic of size N×N.

The objective of solving the scheduling problem is to arrive at a solution which specifies, for each path, the time order in which each processor is to be traversed. This can be represented by a 3-D matrix θ with binary-valued elements θ_n^i(α). This variable has a value of one when, at time i, path α traverses neuron n. In order to avoid the use of non-linear functions in the calculation of the scheduling cost function, no direct method for measuring the length of each path, passing through all of its necessary processors, can be constructed. Therefore, we have elected to use an iterative approach, finding good solutions that perform all the necessary processor traversals in a prespecified number of time steps, M. The bounds on the value of M can be derived by considering that the lower bound on M corresponds to the number of neurons (or processors) to be traversed by the path α*, where α* denotes the path having the largest number K* of neurons to be traversed. The upper bound on M is obviously equal to the maximum number of neurons in the network, N. Therefore, we can set the value of M to be in the range K* ≤ M ≤ N. The actual value of M depends strongly on the complexity of the neural network interconnection structure and the number of physical communication channels per processor. In practice, we can choose M to be fairly close to K*, for example 1.2 K*.

This formulation leads to a θ matrix of size Λ×N×M. As stated earlier, assuming iterative algorithms, path α must end at the processor assigned neuron α. This can simply be implemented in our calculation by having

    \theta_n^{M-1}(\alpha) = \delta_{\alpha n} ,  where              (6-14)

    \delta_{ij} = 1 if i = j, and 0 otherwise.                       (6-15)

Equation (6-14) indicates that at the final time step (M-1), path α is in the processor assigned neuron n = α. This condition establishes the end point of each path; however, the initial point where path α begins, θ_n^0(α), is not fixed and is left to be determined dynamically by the optimization algorithm.

According to our representation scheme, the scheduling algorithm logically specifies data movement from one neuron to the next. The specific assignment of neurons to processors must be considered in order to determine whether there is a physical connection between two neurons, that is, whether they have been assigned to two neighboring processors. This can be accomplished through the use of the matrix

    \Gamma_{nm}(\omega) = \sum_{A,B=0}^{P-1} \omega_n^A G^{AB} \omega_m^B , \quad n, m = 0, 1, \ldots, P-1 ,     (6-16)

assuming that no processor is connected to itself (G^{AA} = 0 for all A = 0, 1, ..., P-1). In order to ensure proper calculation of the scheduling cost function, we require that the dependence graph be traceless, having all diagonal elements set to zero (D^{αα} = 0 for all α = 0, 1, ..., Λ-1). In case a neuron has a feedback connection to itself, the operation associated with this connection can be executed during the last cycle of processing and treated as a local operation. This approach is possible since each path must end in the processor holding its neuron output value, according to equation (6-14).
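The path-variable representation and the Γ(ω) matrix of equation (6-16) are easy to set up in a few lines; the following Python sketch is illustrative (the array layout, the ring example, and the helper names are assumptions of this sketch, not of the dissertation).

    import numpy as np

    def gamma_from_assignment(omega, G):
        """Equation (6-16): Gamma[n, m] = sum_{A,B} omega[n,A] G[A,B] omega[m,B],
        i.e. neurons n and m are 'adjacent' exactly when their assigned
        processors share a physical communication channel."""
        return omega @ G @ omega.T

    def empty_schedule(num_paths, P, M):
        """Path variables theta[alpha, n, i] of Section 6.4.3, initialised
        to zero except for the fixed end-point condition of equation (6-14):
        theta[alpha, alpha, M-1] = 1."""
        theta = np.zeros((num_paths, P, M), dtype=int)
        for alpha in range(num_paths):
            theta[alpha, alpha, M - 1] = 1
        return theta

    # Toy setup: 4 processors on a ring, identity assignment, 4 paths, M = 6.
    P, M = 4, 6
    G = np.zeros((P, P), int)
    for a in range(P):
        G[a, (a + 1) % P] = G[(a + 1) % P, a] = 1
    omega = np.eye(P, dtype=int)
    print(gamma_from_assignment(omega, G))
    print(empty_schedule(P, P, M).shape)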
A particular path α must pass through all the processors holding the neuron output values specified by D^{αn}, in M-1 time steps. We can define a cost function which measures the number of times path α passes through processors that are not necessary for the computation of neuron α. This is given by

    \sum_{i=0}^{M-2} \sum_{n,m=0}^{P-1} \Gamma_{nm}(\omega) ( 1 - D^{\alpha m} ) \, \theta_n^i(\alpha) \, \theta_m^{i+1}(\alpha) .     (6-17)

The minimum of this cost function for a specific path α represents a path which traverses all the necessary neurons required for its evaluation. Equation (6-17) counts only those jumps between neurons connected via a physical communication channel, as specified by Γ(ω). Each time path α moves from neuron n at time i to neuron m at time i+1, a constant value of one is added to the cost if neuron m is not necessary for the computation of path α. Otherwise, if neuron m is required for the computation of path α, no cost is added for going to that neuron.

Equation (6-17), however, is not concerned with where a particular path starts its traversal. We would like to introduce a term into the cost function so that paths starting in a processor which is assigned a neuron contributing to the path's computation have a lower cost value. This can simply be done by adding the term

    \lambda_0 \sum_{n=0}^{P-1} ( 1 - D^{\alpha n} ) \, \theta_n^0(\alpha)     (6-18)

to the cost function. Here the parameter λ_0 is used to adjust the weight of this term in the total cost of the scheduling function.

As mentioned in the previous section, equation (6-17) measures the number of times path α passes through unnecessary processors. As it stands, this cost function assigns the same cost to a schedule in which a path remains in one of the processors required for its computation during the entire M cycles as to a path which traverses all of the K_α neurons required by path α. Clearly, the computation associated with path α requiring neuron n can be performed during the first cycle in which path α crosses the processor assigned neuron n. All other instances where path α crosses the processor of neuron n should be considered equivalent to passing through a processor which is not required by path α. In order to accomplish this, we augment the cost function with a penalty term counting the number of times path α crosses the same neuron. This is given by

    C_{+} = \sum_{i=0}^{M-2} \sum_{m=0}^{P-1} D^{\alpha m} \, \theta_m^{i+1}(\alpha) \Big( \sum_{l=0}^{i} \theta_m^l(\alpha) \Big) .     (6-19)

We can notice that equation (6-19) contributes zero to the cost the first time path α crosses neuron m. During subsequent crossings of neuron m by path α, equation (6-19) adds a penalty equal to the number of additional crossings to the total cost. In this fashion, each extra crossing of path α through neuron m contributes a larger value to the cost. This is not exactly the desired effect of having subsequent crossings be equivalent to passing through a non-contributing neuron. In order to address this problem, we introduce a new term which subtracts the extra penalty incurred by more than two crossings of neuron m by path α. This term is formulated as

    C_{-} = - \sum_{i=0}^{M-2} \sum_{m=0}^{P-1} D^{\alpha m} \, \eta_m^i(\alpha) \, \theta_m^{i+1}(\alpha) \Big( \sum_{l=0}^{i} \theta_m^l(\alpha) - 1 \Big) .     (6-20)

In equation (6-20) we have introduced a new binary-valued variable, η_m^i(α), which has a value of one after the second crossing of neuron m by path α. Figure 6-6 shows the scheduling flow of a path α which crosses a neuron m multiple times, assuming that neuron m is required for the computation of path α (D^{αm} = 1).
It can be seen through this example that the total cost that would be added to the scheduling cost function will ■ remain a constant value, with multiple crossings, when Cx and C- values are added together. In order to find good solutions to the scheduling problem, we would like to minimize the total cost associated with the terms described by equations (6-17) through (6-20). There are a number of constraints that must also be satisfied in order to correctly accomplish this task. In the following section we describe each of these constraints j individually. i 109 Time (i) m 0, -1 _ C rL - 2 - :* + c * U i_ o, Time 2_ ► Time Time Time Figure 6-6 - An example of a path traversal and its associated penalty term I contributions due to repeated crossings of a neuron m required for its computation, i The value of r\‘ m(a) can be any value in the range [0,1] in the shaded region. i 110 6.4.4 Constraints Terms of the Scheduling Cost Function The constraint terms associated with the optimization procedure, used for solving the j scheduling problem, can be grouped into two categories. First, there are a number of constraints imposed by the physical realization of the scheduling routine, such as limiting the flow of data between processors that are physically connected to each other and I I allowing only a single process to be executed in each processor during a single time slot.; The second type of constraints are used to enforce the legality of solutions produce by the optimization routine, such as not allowing a single path to be in two different processors j at the same time. We will now give a detail description of these various constraints. j i 6.4.4.1 Architecture Dependent Constraints j I I It can be seen from Figure 6-6 that can take on any value during the time between j i the first instance that a path crosses a given neuron to just before the third time the path , crosses the same neuron again. In the actual implementation of the optimization routine, j T }‘ m ( a ) is a continuous valued variable in the range [0,1]. In order to keep the value of j T jl m(a) to a zero or one value, we introduce the following penalty term: M -2 P -1 P v 'L y L Dam*i, M 1- 1 L™)> < 6-21) i— O m = 0 ( where f5v is a scaling parameter. In equation (6-17) we assumed that all data transfers between neuron n and neuron m occur when there is a physical communication channel between the two processors having been assigned the values of these neurons. In actuality, it is not possible to directly guarantee that this condition is always met. In fact, if path a moves from neuron n to neuron m, where there is no physical connection between these two neurons (Tnm (<o) = 0), no cost would be incurred by the cost function of equation (6-17). In | order to ensure that paths only move between processors that are physically connected, we introduce a penalty term which adds a strong penalty to the total cost function j whenever "illegal jumps" are made. This penalty term is similar to equation (6-17) and is j formulated as I I M - 2 P - 1 I A . X X (i-r„(® ))ei(«)ejr,««,. (6-22); / = 0 n,m - 0 ' The parameter A^is used to adjust the magnitude of the amount of contribution to the total cost due to this term. This value should generally be set relatively high, since scheduling solutions with illegal jumps are not physically realizable. A similar constraint, which must be satisfied to ensure realizable solutions, is to allow only a single path to cross a given processor during each cycle. 
In other words, we would want to ensure that at each time step no two paths attempt to jump to the same i neuron processor. This constraint can also be represented through a penalty term which ! contributes a large cost value to the scheduling cost if more than one path cross the same j processor at the same time. The form of this penalty term is written as A -1 M -2 P - 1 Z Z ] L 7 “''0 > > 0 > . (6-23) a,P » ' = 0 n = 0 where y a /5 is a scaling parameter having y aa = 0. Similar to the parameter in equation (2-22), the value of y°^ should be high relative to the other terms in the cost function. Furthermore, we weight all path crossings equally, and therefore can define y ^ S yA ( ! - $ « ) . 6.4.4.2 Constraints on Path Variables Earlier in this section we introduced the binary valued variable 0‘ n(a > indicating th e 1 crossing of path a at neuron n at time i. In order to correctly calculate the desired cost 1 function of equation (6-17), we must ensure that at each time step, each path is in one and only one processor, associated with some neuron n. This can be accomplished by having = (6-24) n= 0 We will show how this constraint can be directly built into our optimization procedure in ! i Section 6.5. The solutions generated by this procedure will always guarantee the validity j of equation (6-24). Another constraints that is placed on the path variable d‘ a«x) is to not allow the same path a to be in two different locations at the same time i. This constraint can be enforced similar to equation (6-6) used for the assignment problem, by adding the term 112 M -2 P -X (6-25) i s s O m — 0 to the scheduling cost function. Here again, the parameter d^ is a scaling parameter with dln = 0. A complementary constraint for ensuring that A processors are occupied , with A paths during each cycle can be accomplished in the same manner used for solving J a similar constraint of the assignment problem via equations (6-11) and (6-12). The : 1 corresponding equations for the scheduling problem are j 7e M -2 P -X J=0 n = 0 A-X M —2P— X a = 0 i=0 «=0 M -2 P -X and M ~ 2 f P ~ 1 \ 2 z + 2 Z £»(i - £i) i = 0 V n = 0 J i = 0 « = 0 (6-26) (6-27) where the last term in each equation is required to remove the diagonal elements of the 0 | and e matrices. In equations (6-26) and (6-27), y g and ye are parameters used to adjust | the degree of the contribution made by each term to the total cost function. | i The constraint associated with having each path passing through some neuron | i processor at each time step was given earlier through equation (6-24). As stated j j previously, our optimization procedure is design so that this condition is guaranteed at j | the completion of the procedure. We can also add a corresponding penalty term, of small j j magnitude, to the total cost function so that the restrictions imposed by this constraint are softly enforced during formation of a solution. This can be accomplished by adding the term M-2/ P-l i= 0 n = 0 \ i= 0 n = 0 (6-28) to the total cost function, where the second term is again used to remove the extra i I diagonal elements from the computation. ■ i i 6.4.5 The Complete Scheduling Cost Function j i The cost function associated with scheduling the flow of concurrently flowing computational paths through a processing array, with a specific interconnection topology j and a predefined assignment of processes to processors, can be assembled by putting; 113 together the various terms described in the previous section. 
Given a specific assignment matrix co a good solution to the scheduling problem can be found by minimizing the cost function C9(0|o>) = + + + + + + + + + with respect to the variables (a), T ] l n(a), and The complete scheduling cost; function described by equation (6-29) involves a large number of terms and a significant amount of associated computations. In Chapter 8 we will describe the relative) computational complexity of this function as it relates to solving difficult scheduling problems. A good solution to the scheduling problem is obtained by finding a specific) configuration of the variables 8 ‘ n(a), T ]‘ n(.a), and el n, such that the cost Ca(0|co) is at a i minimum. A good solution refers to a scheduling which ensures that all architectural 114 j P - 1 1 M - 2 £ > - 1 » = 0 ^ '• « i~0 n » m = 0 1 M —2 P — 1 ( i \ l t= 0 m = 0 V M - 2 P - \ f 2 e e ^ j= 0 m = 0 1=0 f am / j i + 1 i X \ V 1=0 )) 1 M - 1 P - 1 i — 0 m = 0 j M — 2 P—l i = 0 n,m~0 A - 1 M -2 P -1 ^ a,P i=0 n = 0 i M -2 P -X ^ i = 0 m=0 ,(a) 1 2 * f M -■2P-1 , 4 - 1 ( = 0 n = 0 a=0 M -2 P -1 E E < 1 - E ^ + 1=0 n = 0 2 M - 2 ( P - 1 \ 2 M - 2 P - 1 l U - S X + E E < ( i - < ) £ — 0 \ n — 0 / i = 0 n = 0 - r P M - 2 / p - 1 M -2 P -1 i = 0 ' n = 0 1=0 n = 0 (6-29) specific constraints and all the constraints on the path variables are meet and, at the same time, all paths are scheduled such that the maximum number of neurons associated with the computation of each path is traversed by that path in a specified amount of time (Af). In Section 6.4.3, we discussed the bounds on the time limit parameter M. Finding optimal solutions to the scheduling problem in practice involves an iterative processes where we attempt to find good solutions in a specified time limit M, and dynamically j varying the value of M between consecutive execution of the optimization procedure. T he, objective here is to find the smallest value of M for which the optimization procedure can j generate legal solutions while all paths completely traverse their necessary neuron processors. When implementing algorithms such as the sparse vector-matrix-product operation, if. the discrepancy between the total number of elements in the vector (N) and the largest | number of elements which participate in a particular path (K*) is large K* « N, then j sufficient improvement in efficiency and throughput can be achieved even with M being j several times the size of K*. 6.4.5 Use of the Scheduling Procedure for Mapping Neural Networks The scheduling cost function presented in the previous section can be used for arriving at • data flow patterns on a systolic architecture for implementation of commutative: calculations, where the solution of the computation is not dependent on the order in; which its component operations are executed. In Chapter 4, we presented a mapping' approach for implementing neural computation on systolic parallel architectures which i required generation of computational paths similar to the ones described above. Since we] I have described the scheduling problem in terms of neural network computation, in this section we discuss the use of this cost function for implementing various neural network structures. The formulation of the scheduling cost function can be directly applied to finding concurrently traversing paths on a distributed processing architecture following the mapping method of Chapter 4. 
In this formulation, distributed aspects of neural computation is viewed as being equivalent to performing a vector-matrix-product j operation. This approach can directly be applied to single layer neural network models with feed-back processing, such as the Hopfield network [33]. Due to the iterative 115 nature of the computation of these models, the computational path a associated with a| neuron n must end in the processor holding the output value of neuron n. This is exactly what is implemented by enforcing the constraint of equation (6-14). In implementing feed-forward neural networks, with a single layer of synaptic interconnections, such as the Perceptron network [65], there is no specific requirement on the paths specifying the processor in which each path terminates. In such cases, the value of variable is not constrained by equation (6-14) and is determined i j dynamically by the optimization procedure. By not imposing the restriction of equation j (6-14) the solution space is increased by a factor of P which can in turn lead to an j increased possibility of finding better scheduling solutions. 1 | 1 Scheduling computational paths for multilayer feed-forward neural network models, i such as the multilayer perceptron network [66], can be performed in a number of different ways. We can attempt to exploit layer level parallelism by assigning an appropriately sized number of processors to each layer of the network and performing the optimization procedure for the assignment problem for each layer separately. Having a ; specific assignment of neurons to processor and having different processors associated j with neurons of each layer, the creation of computational paths is simply performed by j having path a terminate at a processor holding the output value of a neuron in the next I higher layer of the neural network. Although simple in nature, this approach is not1 effective for implementations on 2-D connected parallel architectures, such as the DREAM Machine, since all paths must cross a 1-D boundary between processors assigned to neurons of layer i and terminate in a region assigned to neurons of layer ^+1. On the other hand, the approach can be applied very effectively to architectures' having higher dimensional interconnection structures, such as the 3-D computer of; Hughes Research Labs [48]. ; The second option for implementing multilayer feed-forward networks, which was eluded to earlier, involves the use of relabeling processors as computation moves from one layer to the next. Computational paths can be constructed without any specific restriction on where each path is to begin or terminate for the first layer of the network. Having arrived at a scheduling solution, the processor where each path a terminates in is assigned the neuron of the next higher layer which is associated with path a. Unfortunately this technique might not be sufficiently efficient, in general, since only the assignment of neurons to processors of the first layer are done in an optimal fashion. In 116 order to remedy this problem, the assignment procedure of Section 6.3 can be| generalized to cluster neurons in an optimum sense considering the multilayer structure of i the neural network. Having a specific assignment of neurons to processors for each layer! of the network, path a associated with a specific neuron of layer £+1 can be forced into; terminating in the desired processor associated with that neuron. 
Thus, the scheduling) problem can be done separately for each layer of interconnections in the neural network. ' 6.5 Constraint Nets Used for Optimization I By having an analytical formula for evaluating the goodness of a specific solution, either j j for the assignment problem given by equation (6-13) or the scheduling problem given by I equation (6-29), many different techniques can be used to find such solutions [21, 33, j 36, 37, 78]. The simplest method for finding optimal solutions for this minimization! problem is to exhaustively search the solution space for the solution(s) with the lowest) i cost. Due to the extremely large solution space of our problem, such an approach is| entirely impractical for all but the most trivial cases. Therefore, we need to consider; other approaches, which do not guarantee the optimum solution but can generally find; solutions close to optimum. j A direct method for performing the optimization task is to follow the gradient of thej cost function in the direction of lowest cost value. This can be performed by evaluating j the cost function of an arbitrary random starting state configuration and slightly; modifying the configuration state such that the cost function is lowered. This process can be repeated until no slight modification of the system configuration can be found that would further minimize the cost function. This approach has a problem of getting stuck in local minimas in the cost function configuration space. We can expect a large number of such local minimas for complex problems such as the assignment and scheduling problems described above. Probabilistic approaches to the optimization problem have been proposed to avoid the faith of being trapped in a local minima solution of high cost value. A well known method based on statistical mechanics for solving combinatorial optimization problem is the Simulated Annealing procedure [36]. With this technique, slight modifications to thej configuration state are made which lower the cost function. These modifications are also, 117 ! made even if it increases the cost function with a probability proportional to e~(U T )A C , where T is analogous to the temperature of a physical system and AC is the change in the change in the cost function resulting from this new configuration. The optimization procedure begins at a high temperature where uphill jumps (system modifications causing j increase in the cost function) are tolerated with a relatively high probability. The system; temperature is then gradually lowered until a solution is found at T ~ 0. This process is I slow and inherently serial in nature. Several extensions and variants of this approach,; one of which is described later in this section, have been proposed to improve the! convergence rate and allow for parallel implementations [77, 81]. ! j Recently, several neural computation based techniques have been applied in solving! combinatorial optimization problems [2, 13, 28, 33]. A major advantage of these methods over previous techniques is associated with their inherently parallel structure which can lead to high throughput efficient implementations. The use of neural network, methods on solving complex optimization problems have been demonstrated by applying these methods to the classical Traveling Salesman Problem (TSP). Durbin and Willshaw; [13] have proposed a geometrical based method for solving the TSP problem. In this approach an elastic path is placed on the 2-D surface containing the cities to be traversed. 
Adjustments are made to the various points on this elastic path in order to have the path pass through all the cities. A similar approach using the self-organizing feature map neural network [37] has also been proposed in [2]. Hopfield and Tank [33] have also formulated a method for solving the TSP problem using neural computation techniques. Although each of these formulations seems to involve a different approach, they all share a common underlying principle.
It has been shown that statistical mechanics can be used as the basic underlying principle for these methods [77]. Viewing neural computation in this light, the difference between the Hopfield net approach [33] and the elastic net approach [13] can be seen to lie in the treatment of the constraint terms of the cost function. For example, the constraint of not allowing a salesman to be in two places at the same time was enforced softly, by adding a penalty term to the cost function, in the Hopfield and Tank formulation. This restriction was enforced strongly in the elastic net approach, where the physical analogy to the elastic band explicitly disallows such conditions from occurring. A general approach for enforcing certain constraints strongly has been introduced via Constraint nets [78]. Constraint net optimization is based on mean-field annealing, which is similar in concept to the Simulated Annealing procedure described earlier in that it allows for limited movement in the direction of increasing the cost function in order to escape local minima. On the other hand, mean-field annealing manipulates average probability values and is formulated in a deterministic form, alleviating the need for random number generators.
In general, the configuration state of the system can be specified by a state vector η of n dimensions. Assuming a binary valued state vector, the number of possible configurations of the system can thus be calculated to be 2^n. In many applications, such as the assignment and scheduling problems, a significant portion of this space corresponds to illegal configurations. Given a specific cost function defined by C[η], we can define a certain probability distribution function over the possible configuration space. The Boltzmann distribution of the form
P_\beta[\eta] = \frac{1}{Z_\beta} \exp(-\beta C[\eta])    (6-30)
has been widely used by various neural network models, including the Constraint net. In equation (6-30), the parameter β is the inverse of the thermodynamic temperature parameter, defined as
\beta = \frac{1}{T}.    (6-31)
A primary goal of the optimization process is to evaluate what is called the partition function, defined as
Z_\beta = \sum_{\{\eta^*\}} \exp(-\beta C[\eta^*]),    (6-32)
where {η*} refers to the set of all the 'legal' configurations of the state vector η. An important property of this partition function is that at zero temperature the partition function is dominated by the configuration with the lowest cost, that is
\lim_{\beta \to \infty} Z_\beta \approx \exp(-\beta C[\eta^0]),    (6-33)
where η⁰ refers to the configuration with the lowest cost.
As mentioned earlier, this formulation is termed mean-field annealing since the state vector variables used in the calculation refer to the average probability of events at a specific temperature, given by
\langle \eta \rangle_\beta = \frac{1}{Z_\beta} \sum_{\{\eta^*\}} \eta^* \exp(-\beta C[\eta^*]).    (6-34)
By using equation (6-34) in conjunction with equation (6-32), we can calculate the configuration with the lowest cost (η⁰). The major difficulty in implementing this procedure involves performing the sum operation of equation (6-32) only over the legal configurations specified by {η*}.
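The behavior described by equations (6-30) through (6-33) can be seen numerically in the small stand-alone sketch below (not part of the thesis software; the five configuration costs are arbitrary illustrative values): as the temperature is lowered, the Boltzmann probabilities concentrate on the configuration with the lowest cost.

/* Toy illustration of equations (6-30)-(6-33): as the temperature T drops,
 * the Boltzmann distribution concentrates on the lowest-cost configuration.
 * The costs below are arbitrary example values, not taken from the thesis. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double cost[5]  = { 3.0, 1.0, 4.0, 1.5, 2.0 };  /* C[eta*] for 5 "legal" configurations */
    double temps[3] = { 10.0, 1.0, 0.05 };

    for (int t = 0; t < 3; t++) {
        double beta = 1.0 / temps[t];                /* equation (6-31) */
        double Z = 0.0;
        for (int k = 0; k < 5; k++)
            Z += exp(-beta * cost[k]);               /* partition function, equation (6-32) */

        printf("T = %5.2f :", temps[t]);
        for (int k = 0; k < 5; k++)
            printf("  P[%d] = %.3f", k, exp(-beta * cost[k]) / Z);  /* equation (6-30) */
        printf("\n");
    }
    return 0;
}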
In the Constraint net formulation presented in [78], a ; ; method is demonstrated which uses an effective cost function to analytically approximate; I {ff)p- In constructing this effective cost function, we can choose to eliminate none, ‘ i some, or all of the illegal configuration in the summation operation of the partition i function. The constraints which are explicitly removed from this sum are said to be strongly enforced; those which are not, are said to be softly enforced. The convergence | properties of this method has been proven analytically in [78]. 6.6 Optimization Procedure In Section 6.4, we presented the cost function used for solving the assignment problem. We assumed at the time that all variables where binary valued. In Section 6.5, we saw how mean-field annealing can be used to solve optimization problems. Following the formalism of Section 6.5, we can treat each variable as being continuous valued representing an average probability of an event. In other words, the value of variable cb* indicates the probability that neuron n is mapped to processor A. Similarly, the variable f)A indicates the probability that processor A is assigned a neuron, 0‘ n(a) represents th e ; probability that path a is at a processor assigned neuron n at time i, and so on. In this section, we present the method used for solving the cost optimization problem based on the Constraint net technique. This method is applied to both the assignment and scheduling cost functions. The optimization process involves an iterative procedure for updating the values of the state variables co* and rjA of the assignment problem, and, 9 ‘ n(°o, T} a n d £l n of the scheduling problem. The formulation of the update equations for each of these variables, along with the specific procedure for their1 implementation, is given below. j 120 | 6.6.1 Update Equations for the Assignment Problem Jpdate equations for the assignment variable co* can be arrived at using the Constraint net approach described in Section 6.5. The effective cost function for the assignment problem is given by equation (6-13). In deriving the update equations, we can strongly enforce the constraint of equation (6-5) requiring that each process (neuron) be assigned to some processor of the architecture. The update equation for each of the a> * is written as 8d>* = - 8 t ~ A “> n - /} # ? ( < £ ) ) P - 1 B =0 (6-35) where (p*((o) is the first derivative of the cost function of equation (6-19) taken with respect to the variable C O * . The value of each < p * (to) is given by < b* = - ^ ^ ( - GABG*B B m,c a * + (l-GAB) a A B lI)'mC Q *) A,B m n m co* + nm m ( J V - 1 + Y a > ' n A Vm=0 (6-36) Due to the fact that there are no penalty terms associated with the variable r)A that can 3e enforced strongly, the update equation associated with this variable follows the basic form of the neurons in a Hopfield net [33]. The update equation for this variable is defined as SriA = -S t V a l + e - P + A a\) (6-37) which has the familiar form of the sigmoid activation function with its gain parameter being controlled by (5. At high temperatures, is small and the sigmoid has a smooth and rather flat shape, corresponding to similar r\A values for different processors. 
As the temperature is lowered, the gain of the sigmoid function for each variable i)A is increased further, making each variable move closer to one of the two binary points (0 or 1), depending on the value of < j > A(f\) defined by 121 K = 7v 1 < p -1 N \B=0 V N - l \ 2 l - ^ X + < ( l - ® 0 V « = o J 1 Y a > (6-38) quation (6-38) is obtained by differentiating the effective cost function of equation i (6-13) with respect to r\A. 6.6.2 Update Equations for the Scheduling Problem i Update equations for modifying the variables associated with the scheduling problem can be similarly derived using the Constraint net technique. We presented a general description of the scheduling cost function and how different constraints on the start and end location of each path effect the mapping method. The update equations associated with path variable 0‘ n(a) can be written to specifically enforce none, some, or all of these special cases. The general form of the update equation is where set (a) -St N - 1 m=0 (6-39) 122 1 p - 1 ^ m = 0 + i Z 'c , « » ) ( i - £ > “ )« rw ) * * m = 0 1 1 M --2 j + ^ T e . ' « » + ^ «.> £*" /=0 ^ y = 0 /=0 i— 1 _ i Af-2 - - ' £ ’ nJ n(a)DC V /'= 0 2 z /=« p-i /=0 \ ■ 4 * 4 E ( 1- 7'™<<°))^1 < “> + E ( 1- r «.(“ ) ) ^ r I< « > £ \ m = 0 m — 0 + £ < £ £ « ■ > + 4 7 ,(1 - 20> > )+ 7„ 1 ( 1 - s ^ l m m = 0 / 3 = 0 1 ( ( A _1 ^ ; l - 2 0 > ) + 2 £ 0 > > - l U=° yy (6-40) quation (6-40) was obtained by taking the first derivative of the scheduling problem's effective cost function given by equation (6-29). Equation (6-40) can be simplified by collecting and reordering some terms to be of the form &<« = \ 2X«>»((i - +(i - D m ) a ‘- \ » ) ’ m — 0 i i— 1 i M— 2 + i D » ( l - C ,(a>)Xe'(a, + 4 £ ( l - I)iW ,)D‘ “0r'<«&^> / = " ^ ;=0 P ~ 1 +iu ” 7 j ; - ‘ + I A . 2 (1 - r„ (a»)(§r « ,, + er < „ , ) ^ ^ m — 0 +d’ I ( i - s j e ‘ , U y+ i y,(i-20>,)+y„2(i-< 5 “ 'i) e> ’ m = 0 /J= 0 +l r»£- y 1 - 2 0 ' (a) + 2 V A — 1 X e > - i V /3 = 0 (6-41) yy where < 5 < ; s /) = [i if/<y 10 Otherwise (6-42) 123 | The starting time (/=0) arid the termination time (i-M -1) of the computational paths require slightly different equations. If we are to restrict the path a to return to the i processor holding its neuron output value, such as in the cases of iterative neural networks, constraint of equation (6-14) can be satisfied by having 0 = < 5 " “ for all a = 0,1, • • • , A - 1 . In this case, the value of 0“ _1(«) is not updated. Knowing that each path must end in the processor holding its neuron output value, we can further limit the partition function of the path variables associated with the next to the last time slot. We can strongly enforce that all paths must be in processors directly connected to the paths final destination at time M-2. This is accomplished by having the update equation for the M-2 time slot be equal to 30, M - 2 (a ) : - S t e M - 2 T a n e (* ) (a) 1 ( a ) m — 0 (6-43) Equation (6-41) can still be used to evaluate 2(«) while considering 6 ^ \a) = 8 na. j Two options are available for handling the starting condition of computational paths. First, we can strongly enforce that all paths start at a processor holding a neuron output value required for that paths computation. This is a valid constraint since each path must cross these processors at some point along its trajectory. 
This restriction can be enforced similar to the case for the M-2 time slot by having the update equation of variable 0° < « > be equal to < 50°(a) : -8t 0 „ ( a ) • D a n e N - 1 m—0 (6-44) By having such a formulation, we can speed up the convergence by leaving out all the unnecessary variables associated with the initial time slot. On the other hand, by restricting the possible solution space to only those solutions which start in a processor on the particular path, we might leave out certain globally minimum or close to globally minimum solutions. In case we would like to enforce the above condition softly, so that good solutions having some paths which start in processors outside their particular list of useful neurons can also be generated, the general form of the update equation (given by 124 equation (6-39)) can be used for 6r («). The associated 0° (a) equation for this case can be written as # ( « ) = * „ (i - D" ) + i 5 X ( a > ) ( i - D ^)e'm m + ^ m = 0 ^ i M-2 i P-l ^ j =1 ^ m=0 +d“ 1 (1 - + ^7,(1 - 20,V>) + 7 ,2 (1 - S '* ) ? .* ) I m = 0 2 j8 = 0 j Z' A-X ^ + 2 ^ (6-45) 2 £ 0 “w - 2 0 " (a ) - l k . P = ° where A0 is the parameter associated with the soft constraint of having a path start in a non-useful processor. In driving update equations for the 7 ]‘ n(a) variables, we can use some knowledge about the form of these variables to restrict the search space and increase the optimization performance. First we can notice that the variable 7 ]l nw participates only in those terms where D an = 1. Therefore, we can limit our computation of T jl n(a) only to those a 's and n's which have Dan - 1. Next, we can observe from Figure 6-6 that the T f'n(a) variables have a step function form where rfn(a) = Q and = 1. The object of the 7 ]‘ n< .a y update equations is thus finding the proper location where T }‘ n(a> transitions from a zero to a one value. We can limit our search space here as well by choosing a properly formed partition function. Since there are only M-2 different valid configurations for Tj^w, the and 7)^ 1 (a) are known and the rest of the configuration corresponding to different transition points along the time axis, the search space is reduced from 2M to M-2. j We define a constant binary valued set of vectors T]^) (a) of size M-2 corresponding to yalid T j‘ n(a) configurations as T lfl = («) Vn («). ’Vn . («))• (6-40) These vectors have the form shown below: i ‘ 1 2 3 4 (1, 1, 1, 1, rfn 2\ a) = (0, 1, 1, 1, rf2 \ a) = (0, 0, 1, 1, M-2 , 1) ■ 1) , 1) rfn M ~2 \a) S (0, 0, 0, 0, . . 1) Given the set of vectors ti^o*), we can define a partition function which is dominated by the rj‘ n(a) vector producing the minimum cost at zero temperature. This partition function has the form Z n < “ > - 2 M -2 d = 1 where - 2 7 j> ,) + ^ D - e j r ' J 1 ^ ^ V * = o The corresponding update equations for rj'n < « ) can now be derived to be (6-47) (6-48) 8fji n(a) = - S t I M-2 - ^ ^ ^ '( 0 ) ^ ( 0 . ) ^ i = 1 Z „ C « ) d = l v y (6-49) I Finally, the update equations for the e ‘ n variables can be derived. These variables, similar to r\A variables used in the assignment equations, do not have any strongly enforceable constraints. Therefore, the update equations associated with these variables are the simple Hopfield net equations (similar to equation (6-37)). 
The update equation is described by §£l = - S t -ME ) i ■ ) 1 + e , where (6-50) ^ = f Ye l - 2 e ‘ + 2 1 + — 7 a 2 / f -4-1 \ 2 V a=0 A— 1 l - £ e ;« « ) + £ -««■>) a= 0 (6-51) 126 6.6.3 Implementing the Optimization Procedure I ! In the previous section, we presented the update equations associated with the various | variables involved in the assignment and scheduling optimization procedures. The actual J processing of the optimization procedure is described here. Since the optimization' algorithm is based on mean-field annealing, where the values of each variable refer to the probability of an event, we start the system with all variables initialized to small random J values. The range of these random numbers are calculated differently for different! I variables, depending on their specific constraints. For example, the assignment variables (b* are assigned a random value in the range (0, 2/P) so that the sum over all I I processors (A's) of these variables will approach 1, as P approaches infinity, therefore} conforming to the constraint of equation (6-5). i Along with initializing the variables, the system temperature is set to an initial: temperature value. This initial temperature must be large enough (higher than the I system's critical temperature) so that the optimization procedure can move out of small local minima's of the cost function. After this initialization phase, the update equations are applied to all the variables Nsteps many times, after which the system temperature is i lowered by one or two percent. This processing continues until the system converges to 1 a stable state at low temperatures. At this point the solutions to the assignment and1 scheduling problems can be obtained from the close to binary valued matrices © and 0, respectively. As part of the optimization procedure, the scaling parameters associated with the! various penalty terms of the cost functions, such as X„ and are increased as a function of the temperature. Typically, the values of these penalty terms are increased | such that they are infinitely large at T=0. In our implementation, we divide the initial parameter value by the square-root of the temperature to achieve this effect. Chapter 7 Results of the Optimization Procedure In this Chapter we present empirical results of the optimization procedure applied to | several examples. These results where obtained through software simulation of th e ! Constraint net (Cnet) algorithm described in the previous chapter. We first present a ! description of the implementation code used to perform the optimization task, followed by results of executing the assignment and scheduling optimization routines on some benchmark examples. j 7.1 Implementation of the Optimization Algorithm | i Two separate routines have been developed to solve the assignment and scheduling optimization problems. The basic organization and computational flow of both th e ! I assignment and scheduling routines are similar. The first step of each routine involves j reading a control file, containing various file names and control parameter associated ; with the specific optimization routine, followed by the execution of the optimization algorithm, with appropriately set parameter values, for a specified number of sweeps. Each sweep consists of a number of steps the update equations are applied to each variable at a constant temperature value. 7.1.1 Control File Format I ! 
The control file associated with each routine contains the names of the various input and output files associated with the process graph file, interprocessor connection topology graph, etc. Additionally, the various parameters associated with each routine, such as d_{nm} and γ_{AB}, along with the system temperature and cooling parameters, are also given in the control file. The control files associated with the assignment and scheduling routines are shown in Tables 7-1 and 7-2. The control parameter Nruns denotes the number of times the optimization routine should be executed, each with a different initial random state.

Nruns    - Number of runs
N        - Number of neurons
A        - Number of paths
P        - Number of processors
D^{nm}   - Name of neural net structure file
G^{AB}   - Name of architecture topology file
ω_n^A    - Name of the assignment output file
δt       - Learning rate
T_0      - Initial temperature
ΔT       - Temperature mod. rate
Nsweeps  - Number of sweeps
Nsteps   - Number of steps/sweep
d_{nm}   - Penalty parameter of equation (6-6)
γ_{AB}   - Penalty parameter of equation (6-7)
γ_ω      - Penalty parameter of equation (6-11)
γ_η      - Penalty parameter of equation (6-12)

Table 7-1 - Control file format used for the assignment routine

Nruns    - Number of runs
N        - Number of neurons
A        - Number of paths
P        - Number of processors
D^{αn}   - Name of neural net structure file
G^{AB}   - Name of architecture topology file
ω        - Name of the assignment matrix file
M        - Maximum time value to complete all paths
δt       - Learning rate
T_0      - Initial temperature
ΔT       - Temperature mod. rate
Nsweeps  - Number of sweeps
Nsteps   - Number of steps/sweep
λ_0      - Penalty parameter of equation (6-18)
γ^{αβ}   - Penalty parameter of equation (6-23)
γ_ρ      - Penalty parameter of equation (6-28)
γ_θ      - Penalty parameter of equation (6-26)
γ_ε      - Penalty parameter of equation (6-27)
λ_Γ      - Penalty parameter of equation (6-22)
d_θ      - Penalty parameter of equation (6-25)
β_η      - Penalty parameter of equation (6-21)

Table 7-2 - Control file format used for the scheduling routine

7.1.2 The Assignment and Scheduling Optimization Routines

The basic processing associated with the optimization procedure was outlined previously in Section 6.6.3. Both the assignment and scheduling routines follow similar control flows. They differ in the computation of the update equations and the actual variables being optimized. The pseudocode representation of the assignment problem optimization routine is given as:

1. Read control file information.
2. Read process graph file D^{nm}.
3. Read architecture topology graph G^{AB}.
4. Calculate the distance matrix Ĝ^{AB} based on G^{AB}.
5. Initialize ω and η^A matrices to small random values.
6. Initialize system temperature (T ← T_0).
7. Repeat the following for Nsweeps times.
7.1 Repeat the following Nsteps times.
7.1.1 Update ω and η^A matrices according to equations (6-35) and (6-37).
7.2 Decrease the system temperature (T ← (ΔT)·T).
7.3 Adjust penalty term parameters.
8. Write final ω and η^A matrices to the output file.

The pseudocode of the scheduling routine is represented as:

1. Read control file information.
2. Read process graph file D^{αn}.
3. Read architecture topology graph G^{AB}.
4. Calculate the neuron connection graph Γ_{nm} according to equation (6-16).
5. Initialize θ, η, and ε matrices to small random values.
6. Initialize system temperature (T ← T_0).
7. Repeat the following for Nsweeps times.
7.1 Repeat the following Nsteps times.
7.1.1 Update θ, η, and ε matrices according to equations (6-39), (6-49), and (6-50).
7.2 Decrease the system temperature (T ← (ΔT)·T).
7.3 Adjust penalty term parameters.
8. Write final θ, η, and ε matrices to the output file.

The software used for the assignment optimization routine has been developed on a Sun workstation using the C programming language. This software takes as input three ASCII files, namely the control file, the process graph file, and the network topology file. It generates the assignment matrix ω and stores it in an ASCII file. During each sweep of processing, the various portions of the assignment cost value are calculated and displayed. In order to collect statistically valid information on the convergence properties of the algorithm as a function of the parameter values, a data-parallel implementation of this code was executed on the 512 node Ncube parallel machine at Caltech. In this implementation, each node executed the same algorithm with a different initial state. This type of parallelism was defined earlier as network level parallelism for neural network implementations; see Chapter 2.
The scheduling routine has been implemented both on the Sun workstation using the C programming language and on the AMT DAP-610C parallel processor using the DAP FORTRAN-Plus parallel programming language. The DAP architecture consists of a 2-D mesh-connected array of bit-serial processors. The DAP-610C model contains 4096 PEs, with each PE having an 8-bit arithmetic coprocessor. Due to the parallel nature of the matrix operations associated with the Cnet algorithm, close to linear speedup was achieved by implementing the code on the DAP.

7.2 Results of the Assignment Procedure

We now present some empirical results obtained from executing the assignment optimization routine on two examples. The first example problem has been proposed by Bokhari [7] as a benchmark for evaluating the performance of his heuristic algorithm for solving the assignment problem. The second problem is a larger benchmark proposed by Everstine [14] for comparing the performance of various sparse matrix bandwidth, profile, and wavefront reduction algorithms.

7.2.1 Results from Implementing the Bokhari Example

Two specific examples have been used by Bokhari to measure the performance of his algorithm for solving the assignment problem [7].
Both examples attempt to assign the nodes of a specific process graph, obtained from a finite element matrix, to the processors of a special purpose parallel architecture called the Finite Element Machine (FEM). The first example involves mapping a 33 node process graph onto a 36 processor FEM. The objective of Bokhari's algorithm is to perform the assignment operation such that the number of edges in the process graph which are assigned to the edges of the processor graph is maximized. Bokhari refers to this measure as the cardinality of the assignment. This objective function can be viewed as a subtask of our assignment objective function, since in addition to maximizing the amount of overlap between the two graphs (the assignment cardinality), we require specific treatment of nodes which are not assigned to their optimum location. Below, we present a detailed description of the 33 node example along with the performance results of our algorithm in solving it. The results associated with the 25 node example are briefly described at the end of this section.

7.2.1.1 Problem Description

The target machine architecture used for the Bokhari example is the FEM. The FEM architecture consists of a 2-D array of processors connected in an eight-nearest-neighbor configuration with wrap-around connections on the two boundaries (top-to-bottom and left-to-right). A 6x6 FEM is used as the target architecture for this example problem, see Figure 7-1. The interconnection topology of this architecture can be represented in the form of an interconnection matrix G^{AB} required by our optimization algorithm, see Figure 7-2.

Figure 7-1 - Interconnection topology of a 6x6 Finite Element Machine (FEM).
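For reference, the sketch below (illustrative only; it is not the dissertation's code, and the array names are assumptions) builds the kind of 6x6 FEM adjacency matrix shown in Figure 7-2 and derives hop-count distances of the kind shown in Figure 7-5 from it. The thesis does not name the shortest-path method it uses to obtain the distance matrix; Floyd-Warshall is one standard choice.

/* Sketch of the 6x6 FEM topology of Figure 7-2: an 8-nearest-neighbor mesh
 * with wrap-around in both directions, plus hop-count distances between all
 * processor pairs (as in Figure 7-5), computed here with Floyd-Warshall.    */
#define SIDE 6
#define P (SIDE * SIDE)
#define INF 1000

void build_fem_topology(int G[P][P], int dist[P][P])
{
    for (int A = 0; A < P; A++)
        for (int B = 0; B < P; B++)
            G[A][B] = 0;

    /* 8-nearest-neighbor connections with top-bottom and left-right wrap-around */
    for (int r = 0; r < SIDE; r++)
        for (int c = 0; c < SIDE; c++)
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++) {
                    if (dr == 0 && dc == 0)
                        continue;
                    int A = r * SIDE + c;
                    int B = ((r + dr + SIDE) % SIDE) * SIDE + (c + dc + SIDE) % SIDE;
                    G[A][B] = 1;
                }

    /* all-pairs shortest hop counts give the interprocessor distance matrix */
    for (int A = 0; A < P; A++)
        for (int B = 0; B < P; B++)
            dist[A][B] = (A == B) ? 0 : (G[A][B] ? 1 : INF);

    for (int k = 0; k < P; k++)
        for (int A = 0; A < P; A++)
            for (int B = 0; B < P; B++)
                if (dist[A][k] + dist[k][B] < dist[A][B])
                    dist[A][B] = dist[A][k] + dist[k][B];
}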
Figure 7-2 - The interconnection matrix G^{AB} describing the 6x6 FEM topology.

The process graph used for this example is a 33 node finite element structure shown in Figure 7-3, with the corresponding dependence graph representation D^{αn} shown in Figure 7-4. The G^{AB} and D^{αn} files are given as input to our optimization procedure. The assignment procedure generates an appropriate distance matrix Ĝ^{AB} based on the specified G^{AB} file, see Figure 7-5. Since the number of nodes in the process graph is smaller than the number of nodes in the processor graph, some processors are not assigned any process.

Figure 7-3 - A 33 node finite element structure used as the process graph.

The process graph of Figure 7-3 contains a total of 80 edges.
This means that the best possible assignment would have a cardinality of 80. Due to the inherent differences between the two graphs, such an assignment does not exist [7]. The best achievable cardinality for this problem is reported to be 78 [7]. A cardinality value of 32 is achieved with a simple linear assignment of process i to processor i.

Figure 7-4 - The process graph interconnection matrix D^{αn}.

The optimization procedure was executed many times to collect statistical information on the performance of our algorithm for different parameter values. Based on these experiments, we have determined a generally good range of parameters for the assignment optimization routine. Table 7-3 gives some statistical results of our simulations. The columns titled "Max_c" and "Min_c" indicate the maximum and minimum cardinality values found in all runs, respectively.
The column titled "% Opt" indicates the percentage of runs where the globally optimum solution was found. The columns titled "Mean" and "SD" indicate the mean and standard deviation of the cardinality values, respectively. The last row of this table indicates the results of a Monte-Carlo search performed for this assignment problem. This implementation involved generating a random assignment matrix ω and calculating the cardinality of this solution. Figure 7-6 shows a histogram of the Monte-Carlo experiment with 1000 points.

Figure 7-5 - Interprocessor communication distance matrix Ĝ^{AB} obtained from the architecture topology graph G^{AB}.
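A minimal sketch of that Monte-Carlo baseline is shown below (illustrative; not the thesis code, and the function name, row-major array layout, and the fixed buffer size are assumptions). It draws a random one-to-one placement of the N process-graph nodes onto the P processors and counts how many process-graph edges fall on processor-graph edges, i.e. the cardinality of the random assignment.

/* Cardinality of one random assignment: D is the NxN process graph and G the
 * PxP processor graph, both 0/1 and row-major; assumes P <= 256 and that the
 * caller has seeded rand().                                                  */
#include <stdlib.h>

int random_cardinality(int N, int P, const int *D, const int *G)
{
    int perm[256];                        /* processor assigned to each node  */
    for (int i = 0; i < P; i++)
        perm[i] = i;
    for (int i = P - 1; i > 0; i--) {     /* Fisher-Yates shuffle             */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    int card = 0;                         /* count each mapped edge once      */
    for (int n = 0; n < N; n++)
        for (int m = n + 1; m < N; m++)
            if (D[n * N + m] && G[perm[n] * P + perm[m]])
                card++;
    return card;
}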
Nruns   γ_AB    γ_ω     γ_η     Max_c   Min_c   % Opt   Mean    SD
1231    0.0     0.5     0.0     78      48      30.5    70.92   8.29
60      0.5     0.4     0.15    78      64      33.3    75.6    3.15
44      0.5     0.4     0.1     78      65      25.0    75.9    2.61
1000    —       —       —       30              0.0     18.1    3.65

Table 7-3 - Statistics collected from executing the assignment optimization routine on the 33 node Bokhari example. The last row gives the results of a Monte-Carlo search algorithm used for comparison.

Figure 7-6 - The histogram associated with 1000 randomly generated assignments for the 33 node Bokhari example.

In all the runs included in Table 7-3, the other parameter values, not shown in that table, had the following values: T_0 = 2.0, ΔT = 0.99, δt = 0.025, and d_{nm} = 2.0. In the first batch of runs (with 1231 runs), equation (6-7), used to discourage assigning the same process to two different processors, and equation (6-12), used to find processors participating in the mapping, were not implemented. Also, in these runs, the constraint for assigning a single process to each processor was enforced according to equation (6-9) rather than the more selective equation (6-11). It can be noticed that the other runs shown in Table 7-3 demonstrate solutions with much improved mean cardinality values and smaller variances in their cardinalities. From Table 7-3, we can notice that, in general, about 1/3 of the runs end in finding the globally optimum solution. Figure 7-7 depicts a histogram associated with the experiment of size 1231 runs. It is interesting to notice the presence of several local minima distributed across the cost function landscape. The convergence characteristics of the assignment optimization routine are presented graphically in Figures 7-8 through 7-10. From these figures, we can see that the network has converged after approximately 200 sweeps at a temperature around T = 0.4. It is interesting to note, from Figures 7-9 and 7-10, that all of the cost values begin a sharp decline at a critical temperature around 0.5. At this temperature, structure is being formed from the seemingly random distribution of probabilities in the assignment matrix ω.

Figure 7-7 - Histogram of solution cardinalities for the 1231 runs described in Table 7-3. The global optimum solution is at cardinality 78, which is found the largest percentage of times.

Figure 7-8 - Graph of system temperature and cardinality value vs. sweep number.

Figure 7-9 - Graph of system temperature, total cost (given by equation (6-13)), and mismatch cost (given by equation (6-1)) vs. sweep number.

Figure 7-10 - Graph of system temperature and the various penalty terms (given by equations (6-6), (6-7), (6-11), and (6-12)) vs. sweep number.

A limited number of experiments have also been performed on another example proposed in [7] involving the mapping of a 25 node process graph to a 5x5 FEM processing array. A histogram of our results with non-optimal parameter settings is given in Figure 7-11.
Figure 7-11 - Histogram of the solution cardinalities when mapping the 25 node process graph to the FEM of size 5x5, taken from a collection of 736 runs.

In general, our results compare very well to those obtained by Bokhari. Using the algorithm described in [7], the best assignment found by Bokhari had a cardinality of 74. Comparing the computational complexity of both algorithms, the Bokhari algorithm requires O(N²) operations. Our algorithm, on the other hand, requires Nsweeps·O(P²N) operations. Asymptotically, the computational complexity of the two algorithms is similar. Our Cnet approach seems to be superior to the Bokhari technique both in finding better solutions and in having a deterministic parallel structure.

7.2.2 Results from Implementing the Everstine Example

The assignment optimization routine has also been applied to a larger problem, proposed by Everstine [14]. The problem involves a sparsely connected finite element graph with 59 nodes and 104 edges, having an interconnection density of only 1.67%. This problem was originally proposed as a benchmark for evaluating the performance of various algorithms for reducing the matrix bandwidth and profile attributes. The matrix bandwidth problem is closely related to the assignment problem since both attempt to find an optimum relabeling [7]. For the assignment problem, this relabeling is performed on the nodes of the process graph, and for the matrix bandwidth reduction problem it is performed on the row and column indices of the matrix.

The target architecture used for this assignment is a 2-D array with 64 processors organized in an 8x8, 4-nearest-neighbor interconnected topology. Similar to the Bokhari example, we have performed a 1000-run Monte-Carlo experiment to evaluate the difficulty of this particular assignment problem. Due to the low density of the process graph (7.67%), random assignments give very poor cardinalities, having an average of 6.5 out of the theoretical maximum of 104. Figure 7-12 shows the distribution of the cardinality values obtained from our Monte-Carlo experiment as a histogram.

Figure 7-12 - The histogram associated with solution cardinalities of 1000 random assignments of the Everstine 59 node example mapped to a 4-nearest neighbor connected 8x8 parallel processing architecture.

Due to the excessively large search space (64!/5! ≈ 10^87), it is not possible to exhaustively search the solution space for the optimum assignment giving the largest possible cardinality. However, we can see from Figure 7-12 that finding good solutions is difficult. The original process graph matrix D_αn used in this problem is fairly banded, with a bandwidth of 26 and a profile of 464. Note that the bandwidth of a matrix is defined to be the maximum of all the individual row bandwidths of the matrix, where a row bandwidth is defined as the number of columns between the first non-zero element of the row up to and including the diagonal. The profile of a matrix refers to the sum of all the individual row bandwidths of the matrix.

Due to the existence of this initial structure in the D_αn matrix, a linear assignment of nodes from the process graph to nodes of the processor graph yields a fairly high cardinality value of 42. Statistical data on several runs performed using our assignment optimization algorithm and the Monte-Carlo algorithm are presented in Table 7-4.
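As a concrete rendering of the bandwidth and profile definitions given above, the following sketch computes both quantities for a square 0/1 connection matrix. It is an illustration only; in particular, the convention that an all-zero row contributes zero is our assumption.

    def row_bandwidth(matrix, i):
        # Number of columns from the first non-zero element of row i
        # up to and including the diagonal.
        for j in range(i + 1):
            if matrix[i][j] != 0:
                return i - j + 1
        return 0  # assumed convention for a row with no entries on or below the diagonal

    def bandwidth_and_profile(matrix):
        # Bandwidth = maximum row bandwidth; profile = sum of the row bandwidths.
        widths = [row_bandwidth(matrix, i) for i in range(len(matrix))]
        return max(widths), sum(widths)

Applied to the original node ordering of the Everstine graph, this is the computation behind the bandwidth of 26 and profile of 464 quoted above.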
Since no optimum solution is known for this problem, it is difficult to know how close these solutions are to the optimum. Of course, we can compare the results with randomly generated solutions and also with the fairly well structured initial solution.

Nruns   (run parameters)        Max_C   Min_C   Mean    SD
  25    0.1    0.5    0.15      72      63      68.56   2.24
  13    0.05   0.4    0.15      73      65      68.69   2.10
1000    (Monte-Carlo search)    16              6.50    2.45

Table 7-4 - Statistics collected from executing the assignment optimization routine on the 59 node Everstine example. The shaded row gives the results of a Monte-Carlo search algorithm used for comparison.

Note that in all the runs included in Table 7-4, the parameter values not shown in that table had the following values: T_0 = 1.5, ΔT = 0.98, δt = 0.025.

7.3 Results of the Scheduling Procedure

In order to evaluate and demonstrate the use of our scheduling optimization routine, we have chosen to apply the algorithm to two different problems. Similar to the problems selected for the assignment optimization task, one of the problems has a known globally optimum solution, and our task is to measure how far the solutions generated by our procedure fall from this optimum. The second problem involves a slightly more complicated scheduling with no known optimum solution. However, as mentioned in Section 6.4.3, there is a theoretical upper bound on the optimum solution, namely the number of neurons in the neural network, N.

7.3.1 Receptive-Field Example

The first example solved by our scheduling optimization procedure involves finding non-conflicting paths associated with the computation necessary to implement the neural network structure shown in Figure 7-13 on a 16 processor parallel architecture with a 4x4, 4-nearest-neighbor interconnection topology. This particular neural network is structured in a commonly used receptive field, or convolution type, form. The number of neurons used for the mapping is 16, referring to the number of neurons in the input layer. There are four separate computational paths, one associated with each receptive field, which pass through the various PEs concurrently. Each path must traverse 9 neurons assigned to 9 processors arranged in a 3x3 local neighborhood.

Figure 7-13 - Neural network with a receptive field type interconnection structure.

It is a well known technique to generate computational paths associated with this type of structure by having each path traverse in a whirlpool shape, with each path offset in space by one or more processors. With this type of scheduling, the maximum time required to complete all paths is equivalent to the number of neurons associated with each receptive field (in this case 9). Therefore the globally optimum scheduling will have all paths complete their traversal of the required neurons in 9 cycles. The upper bound on the number of cycles needed to complete all traversals is equal to the number of neurons in the first layer of the network, namely 16. Therefore, we began our optimization procedure with the time limit parameter M set to 11. This means that we require all paths to be of length 11, and all the neurons required for the computation of a particular path must be traversed within 11 cycles. Figure 7-14 illustrates the flow pattern of the four paths over the 4x4 processor array. In all of our runs we manually selected the center processor in each receptive field to be the terminating location of its associated computational path. For example, processor number 5 is selected as the terminating point of path α_1.
Figure 7-14 - Scheduling flow pattern for the 4 paths associated with Figure 7-13. The shaded boxes indicate the required neurons to be traversed by each path.

It is interesting to note that all the paths shown in Figure 7-14 start in processors not required for the computation of the associated path. Nevertheless, since we have required that each path be of length M = 11, the paths cover unnecessary neurons. It is easy to see from Figure 7-14 that path traversals are limited to communications between neighboring processors. It is assumed that interprocessor communication can be performed in full duplex mode, that is, two neighboring processors can exchange data concurrently. The DREAM Machine architecture allows for such communication through the use of two X-net switches. With a close examination of these paths, we can show that no two paths ever cross the same processor at the same time, therefore satisfying our scheduling constraints.

The parameter settings associated with the run shown in Figure 7-14 are as follows: T_0 = 0.25, ΔT = 0.992, δt = 0.03, λ_0 = 1.0, λ = 1.0, γ_a = 0.2. The parameter values associated with the remaining penalty terms of the scheduling cost function were set to zero, and those terms were eliminated from the optimization computation for all the examples in this section.

Figure 7-15 shows a different flow pattern generated with the same value of M = 11. In this figure, we can notice that the paths choose to stay in a processor during one communication cycle (cycle 8). The apparent symmetry in the flow patterns generated in both Figures 7-14 and 7-15 is due to the symmetrical nature of the problem and is not necessarily a consequence of our optimization algorithm. This symmetry disappears when scheduling an asymmetrical problem, such as the one presented in Section 7.3.2. The parameter values used for this run were: T_0 = 0.25, ΔT = 0.99, δt = 0.03, λ_0 = 1.0, λ = 0.5, γ_a = 0.2.

Figure 7-15 - A second scheduling flow pattern for the 4 paths of Figure 7-13, with path length M = 11. The shaded boxes indicate the required neurons to be traversed by each path. A loop in the path indicates that the path stayed in the processor for the corresponding cycle.

With good results obtained with path length M = 11, we further reduced M to the global optimum point of 9 cycles. We were able to consistently find one of the many whirlpool shaped flow patterns that can be constructed to cover all the required 9 neurons of each path. An example of one such solution is shown in Figure 7-16. Again, we can notice an interesting symmetry in these flow patterns resulting from the restrictions imposed on the path traversals.

Figure 7-15 - Scheduling flow pattern for the 4 paths of Figure 7-13, with optimum path length (M = 9).

From these examples we can see that although the task of finding solutions that satisfy the complex constraints of our scheduling problem is difficult, the enormous size of the solution space offers many solutions which might not be optimal but are close enough to achieve considerable speedup over simple regular flow patterns.
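The claim that no two paths ever occupy the same processor during the same cycle can be verified mechanically once a candidate schedule is written down. The sketch below assumes each path is represented as a list of processor indices, one entry per cycle (a repeated index models waiting in place, as in Figure 7-15); the function and argument names are illustrative and not taken from the dissertation.

    def schedule_is_valid(paths, required, neighbors):
        # paths:     {path id: [processor occupied at each cycle, length M]}
        # required:  {path id: set of processors the path must visit}
        # neighbors: {processor: set of processors reachable in one cycle, including itself}
        M = len(next(iter(paths.values())))
        for steps in paths.values():
            if len(steps) != M:
                return False          # every path must have the same length M
            if any(b not in neighbors[a] for a, b in zip(steps, steps[1:])):
                return False          # illegal hop between non-neighboring processors
        for t in range(M):
            at_t = [steps[t] for steps in paths.values()]
            if len(at_t) != len(set(at_t)):
                return False          # two paths meet in the same processor at cycle t
        return all(required[p] <= set(steps) for p, steps in paths.items())

A check of this kind is also a convenient sanity test when experimenting with different values of the path length parameter M.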
7.3.2 Randomly Sparse Matrix Example

In addition to solving the scheduling problem for regularly structured interconnection networks, such as the receptive field structure used in the previous section, we have analyzed the performance of our scheduling optimization algorithm on randomly interconnected structures. In this section we describe an example implementation involving a sparse iterative system with 16 neurons. The interconnection network structure is shown in Figure 7-16, where a filled circle at a specific row and column indicates a connection between the corresponding neurons. The target machine architecture used for this mapping is similar to the one used for the receptive field example, namely a 2-D, 4-nearest-neighbor connected 4x4 processing array, except that the target machine here is assumed to have wrap-around connections between the top and bottom and between the left and right side processors.

Figure 7-16 - Interconnection matrix associated with a sparse iterative system.

To simplify the scheduling problem, we would normally first assign neurons participating in the same computational path to processors close to one another using the assignment optimization procedure. We have chosen not to perform this assignment operation and instead perform a simple linear assignment of neurons to processors. In this way, we have increased the difficulty of finding good solutions and can better predict the performance of our algorithm on larger and more complicated interconnection structures.

Since the neural network being implemented is iterative, such as the Hopfield model [33], we strongly enforce constraint (6-14), requiring each path α to terminate in the processor holding its associated neuron output value. For example, the path associated with the computation of neuron 1 must terminate at the processor that has been assigned neuron 1.

Figure 7-17 - Four of the 16 paths (those for equations 2, 3, 4, and 5) associated with the sparse iterative system shown in Figure 7-16. The shaded boxes indicate neurons which must be traversed by the associated paths. The boxes with thick outlines indicate the start and end processors of each path.

The optimal path length M* for implementing the network structure shown in Figure 7-16 is bounded below by K* = 8, the maximum number of neurons to be traversed by a single path (including the path's own neuron), and bounded above by the total number of neurons in the network, N = 16. In Figure 7-17 we show the flow patterns for four paths, one associated with each of the boxed rows of Figure 7-16. In this run we have set the paths to be of length M = 9. We can now see that, due to the irregular interconnection pattern of the D_αn matrix (Figure 7-16), there is no regularity in the flow patterns of each path. We can see that all paths have traversed their required processors, and there seems to be room for further reduction in the path length parameter M.
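One simple way to probe for that reduction is to lower M and rerun the scheduling routine until no conflict-free solution can be found. A minimal driver of this kind is sketched below; run_scheduler stands in for the full scheduling optimization of Chapter 6 and is assumed to return a valid schedule or None.

    def shortest_feasible_M(run_scheduler, lower_bound, upper_bound):
        # Walk M down from the upper bound (the number of neurons N) toward the
        # lower bound (K*), keeping the last schedule that satisfied every constraint.
        best = None
        for M in range(upper_bound, lower_bound - 1, -1):
            schedule = run_scheduler(M)
            if schedule is None:
                break                 # no valid schedule found at this path length
            best = (M, schedule)
        return best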
This process can be repeated several times to find a satisfactory solution to the scheduling problem.

Chapter 8

Asymptotic Performance Analysis

In this chapter, we present the asymptotic performance of our mapping method and of the DREAM Machine architecture as applied to neural network processing. This analysis is beneficial from two aspects. First, asymptotic performance analysis can help to determine the performance of an implementation as the problem size and machine size increase. This is especially useful for neural network processing systems, which are likely to increase in size by several orders of magnitude in the near future. The second benefit of this analysis is that it enables us to effectively compare various implementation methods. Although this analysis does not generate exact execution rates, it does offer a means to compare different implementations based on their scaling and relative cost factors.

The analysis is divided into three sections. First we describe the implementation complexity of neural network processing on the DREAM Machine. This is followed by a description of the computational complexity of our optimization based mapping method. Finally, we compare the asymptotic performance of our method with other previously proposed methods.

8.1 Implementation Complexity of Neural Network Processing

In Chapter 2 of this dissertation, we described the different parallel structures available in neural networks that can be utilized to increase the implementation throughput. Moreover, we described the applicability of each type of parallelism to different neural network models. From these descriptions, we concluded that neuron level parallelism, assigning single neurons to individual processors, is a good compromise solution, offering both ample flexibility in the types of models which can be implemented and a large degree of parallelism for achieving high speed implementations.

8.1.1 Mapping Method Comparisons

We can simplify our analysis by considering neural computation as vector-matrix operations. In Chapter 2, we separated neural computation into local and distributed operations. An implementation based on neuron level parallelism is assumed to be able to perform local operations in constant time, O(1). The distributed operations, involving commutative operations distributed across the processors of the machine, can be performed efficiently in one of two ways. Let us consider the distributed computation of the weighted input sum associated with a single neuron α in a neural network receiving inputs from N_α different neurons. This computation involves performing N_α addition operations, assuming an adder with two inputs and one output. It is known that this operation can be performed optimally in logarithmic time, O(log N_α), using a binary tree topology. In this implementation, parallelism is attained by performing multiple portions of the summation operation, associated with a single neuron, in parallel. Considering the simple case of a fully connected neural network, the addition operation must be performed for all N neurons in the network. Sequential processing of these operations, using binary tree type addition, will lead to an effective execution rate of O(N log N), with processors sitting idle for a large fraction of the total time. We may use pipelining techniques to increase the system efficiency by executing log N (the depth of the binary tree) neurons concurrently.
With this method, the execution rate of the complete network will be O(N), not considering the pipeline filling and flushing time.

The second method that can be used to efficiently implement the distributed computation of neural networks is to perform the weighted input sum sequentially. In this method, each processor adds its contribution to a partial sum value and passes the updated partial sum value to the next processor. This implementation method was described in detail in Chapter 4 and was shown to require O(N) time to complete the calculation associated with a single neuron. In addition, we have demonstrated how the computation of N such neurons, in a fully connected neural network structure, can be performed in parallel using a ring connected processing array. Hence, this method, similar to the binary tree approach, has a time complexity of O(N).

A method similar to the ring-connected implementation involves broadcasting the neuron output values sequentially over a global communication bus [25]. This method has been employed in a commercially available neural network processing architecture. The performance characteristics of this method are inherently similar to those of the ring connected approach, except that the time required for the complete computation is O(N) regardless of the number of processors P. Therefore, in analyzing the performance of various implementations we only consider the ring connected approach, with the understanding that the results are directly applicable to implementations based on the broadcasting approach, since we assume P = N.

There are several reasons why the ring connected approach is more appropriate than the binary tree approach for implementing the distributed operations of a neural network. First, the number of interconnections required by the tree topology is 3 connections per processor, whereas the number of connections per processor of a ring topology is 2. Second, and more importantly, iterative neural computations, such as those used in a Hopfield net [33], require that the result of the computation associated with a specific neuron be stored in the local memory of the processor assigned to that neuron. This is simply accomplished in the mapping method of Chapter 4 using the ring structure; the same operation requires more data transfer cycles to implement on the tree architecture.

It should be noted here, however, that more complex interconnection topologies, such as the hypercube, can be utilized to implement the binary tree style addition operation when implementing sparse neural networks, and can also be used as a ring when implementing dense neural network models. Due to the poor scaling characteristics of such interconnection topologies, which require excessive amounts of area and associated hardware for large systems, we will not include such approaches in our comparison.

Furthermore, in light of the above discussion, we can regard the mapping method of Chapter 4 as optimum for implementing fully connected neural network structures. Therefore, we will now consider the time and area complexity of neural computation for this mapping method as applied to neural network structures with arbitrary interconnection structures. In particular, we consider here the specific mapping methods of variable-length ring and optimization based mappings, described in Chapters 5 and 6, respectively.
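Before turning to the area-time analysis, the ring-based evaluation of the weighted input sums described above can be made concrete with a small serial emulation. This is only a sketch of the data movement, assuming a dense N x N weight matrix and one neuron per processor; it is not DREAM Machine code, and the inner loop written sequentially here is what the hardware performs in parallel.

    def ring_weighted_sums(W, x):
        # W[i][j] is the weight of the connection from neuron j to neuron i.
        # Processor j holds x[j]; the partial sum for neuron i starts at processor i
        # and is passed around the ring, gathering one product per communication cycle.
        n = len(x)
        partial = [0.0] * n
        for step in range(n):          # n communication/accumulation cycles
            for i in range(n):         # one partial sum per processor; parallel in hardware
                pe = (i + step) % n    # processor visited by neuron i's sum this cycle
                partial[i] += W[i][pe] * x[pe]
        return partial

After N cycles each partial sum has visited every processor exactly once, which is the O(N) time behavior quoted above for the fully connected case.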
8.1.2 Area-Time Complexity of Neural Network Implementation Methods

The time complexity associated with implementing neural network computation on ring systolic processing arrays was presented in Chapter 4. In Chapter 5, we showed how, by varying the length of the processing ring, high throughput rates can be achieved for implementing sparse but regularly connected neural networks. In Chapter 6, we removed the regularity condition imposed on the neural network interconnection structure for efficient implementations. In this section we analyze the time and area complexity of these methods and compare them with other methods proposed in the literature.

The time complexity of mappings onto fixed-size ring architectures was shown in Chapter 4 to be O(P), where P is the number of processors in the ring. This time complexity arises because the computational path associated with each neuron traverses the entire ring in order to compute the weighted input sum of equation (2-2). On the other hand, we have shown that with block-connected neural network structures, computation associated with multiple blocks in the same layer can be performed in parallel on the DREAM Machine. Furthermore, the computation associated with each block has a time complexity of O(R_ℓ), where R_ℓ is the number of processors in the ring associated with layer ℓ. Therefore, the overall time complexity of the implementation utilizing variable-length rings is O(R), where R is the size of the largest block in the neural network structure, defined as the maximum ring length taken over all layers ℓ and blocks b ∈ B_ℓ.

8.1.2.1 Complexity of the Fixed-Size Ring Implementation

We can use the well known AT² efficiency metric [85] to compare the performance of variable-size ring and fixed-size ring architectures for neural network processing. The AT² measure is commonly used to compare various VLSI implementations. It is obtained by multiplying the area complexity of a design by the square of the time complexity associated with a particular problem implemented on that system. The area complexity of the fixed-size ring architecture is O(P), assuming that a constant area is required to implement each processor. The linearity of the area complexity of this architecture is due to its nearest neighbor interconnection topology, which can be implemented with constant length wires. If we consider the size of the memory area used for holding the synaptic weight values at each processor, the total area complexity of this implementation becomes O(P + N²). Assuming the processing array size to be of the same order as the neural network size, the area complexity can be written as A = O(N²). Earlier, in Chapter 4, the time complexity of the implementation on the fixed-size ring architecture was shown to be O(N). Therefore, the AT² measure for this implementation is O(N⁴).

8.1.2.2 Complexity of the Variable-Length Ring Implementation

A similar analysis can be performed on the variable-size ring architecture implemented on the DREAM Machine. Each of the N neurons must allow for a maximum of R synaptic weights to be stored in its local memory area. As such, we arrive at an area complexity A = O(NR), with a constant multiplicative factor several times larger than that of the fixed-size ring implementation. This extra area is required for the additional wires in the interconnection network and for the increased size of the memory holding the data routing information.
Nevertheless, both of these factors increase the area complexity only by a constant factor and can be dropped in an asymptotic performance evaluation.

The time complexity associated with implementing neural networks on the DREAM Machine, using the variable-length ring method, is more difficult to evaluate. We have shown the time complexity of this method to be O(R) in Chapter 5. The value of R is inherently related to the interconnection structure of the particular neural network being implemented. For fully connected neural networks R = N, thus the time complexity of this approach is less than or equivalent to that of the fixed-size ring approach, since N ≤ P. As mentioned earlier, a basic premise of this dissertation is that large scale neural network models are sparsely interconnected, having a relatively constant number of synaptic connections per neuron. With this assumption we can treat R as a constant. The AT² metric associated with this implementation, assuming a constant size block structure, is found to be AT² = O(NR³) = O(N).

8.1.2.3 Complexity of the Optimization Based Implementation

Implementing neural network models on the DREAM Machine, following the optimization based mapping method described in Chapter 6, has a performance similar to that of the variable-length ring approach. However, the optimization based mapping method does not rely on any specific regularities in the interconnection structure, such as block structures, to attain efficient mappings. As mentioned earlier, the optimization based mapping method attempts to find mappings that can complete all the necessary computation associated with a particular neural network structure in M cycles. In Chapter 6, we mentioned that the value of M is bounded above by the number of neurons in the network, N, and bounded below by the maximum number of connections to a neuron. The value of M can be represented as M = kS, where S is the maximum number of synaptic connections per neuron (a constant value) and k is a constant greater than one. It should be noted here that there is no guarantee that a constant valued k can be found in the general case. The value of k depends directly on the degree of the interprocessor communication topology and on the neural network structure. Even with a moderately low dimensionality in the interconnection topology of the architecture (e.g., eight nearest-neighbor connected), it is expected that a constant k value can be found for most neural network structures. Therefore, by following the argument of a constant number of synaptic inputs per neuron presented earlier, the time complexity of this implementation can be written as O(M) = O(1), assuming M remains constant with respect to N and P.

The area complexity of this approach is similar to that of the variable-length ring method, since both methods use the DREAM Machine as the target architecture and both follow the same basic execution principles. In the case of the optimization based mapping, each neuron must locally store M synaptic weight values, some of which might be zero. Therefore, the area complexity of this mapping is O(NM). This leads to an AT² measure of O(NM³).

8.2 Complexity Analysis of the Optimization Based Mapping Method

The complexity analysis of implementing neural networks on the DREAM Machine presented in the previous section was associated strictly with the implementation of the neural computation.
In general, the mapping algorithm associated with the fixed-size ring architecture requires a minimal amount of computation. The assignment of neurons to processors and the scheduling of the flow of data between processors are straightforward: neurons are assigned linearly to processors, and the data flows in a circle around the processing ring. The mapping for the variable-length ring is a bit more complicated. For complex block structured networks, it requires a rather heuristic approach in order to align the multiple processing rings such that the computation can flow efficiently between the various layers of the neural network. On the other hand, the complexity of the computation associated with the optimization based mapping method is rather significant and can be evaluated analytically. In this section we evaluate the computational complexity associated with the optimization procedures used for solving the assignment and scheduling problems.

8.2.1 Time Complexity of the Assignment Problem

The time required to implement the assignment optimization procedure can be evaluated by considering the pseudocode describing this algorithm given in Section 7.1.2. From this pseudocode, it can be seen that the compute intensive portion of the algorithm is associated with executing the update equations in step 7. This operation involves a doubly nested loop of size Nsweeps and Nsteps. Within these loops is the specific update equation used to update the assignment matrix ω. The time complexity associated with updating the entire ω matrix can be derived by analyzing the update equations (6-35) through (6-38). Performing the update of the neuron to processor selection matrix η, according to equations (6-37) and (6-38), requires only O(P) time. Implementing the update equations of the assignment matrix ω according to equations (6-35) and (6-36) requires a number of matrix-matrix-product operations. This leads to a time complexity of O(P²N), assuming P > N. Therefore, the total time complexity of updating the assignment matrix is O(P²N) + O(P) = O(P²N).

The complete algorithm performs this updating function Nsweeps·Nsteps times. In general, both the Nsweeps and Nsteps values can be considered to be constant. Therefore, the time complexity of the complete assignment procedure is O(P²N), although the constant multiplicative factor can be quite large. A similar time complexity has been reported by Bokhari [7] for his heuristic approach to solving the assignment problem.

8.2.2 Time Complexity of the Scheduling Problem

The time complexity of the scheduling optimization algorithm can also be evaluated based on the computational demands of the specific update equations associated with this problem. The major portion of the computation associated with this algorithm is attributed to the computation of the update values for the scheduling matrix θ and its related variables r_n^t(α) and e_n^t. The computation associated with updating the e_n^t values requires O(MN) operations, where M is the maximum path length value and N is the number of neurons in the network. The computation associated with updating the r_n^t(α) values can be implemented efficiently by considering only those variables which participate in the computation. Specifically, we only need to determine the r_n^t(α) values which have a corresponding D_αn = 1. This computation is further simplified by limiting the search space through the use of the matrix described in Chapter 6.
The overall computational complexity associated with updating the r_n^t(α) variables is O(SM²A), where S is the maximum number of synaptic connections per neuron, which has been assumed to be a constant. Therefore, we can write the time complexity of updating the r_n^t(α) values as O(M²A). The computational complexity of implementing the update equations for the θ matrix can be shown to be O(P²AM). Thus the total computational complexity associated with the scheduling problem is O(P²AM) + O(M²A) + O(MN). We have already presented an argument as to why the maximum path length parameter M can be considered to be a constant with respect to the processing array size and the neural network size. With this assumption, the asymptotic computational complexity of the scheduling procedure is O(P²A) + O(A) + O(N) = O(P²A), assuming N and P are of the same order.

8.2.3 Tradeoffs on the Use of the Optimization Procedure

The computational complexity of the optimization procedure can be rather significant, as described in the previous sections. The lower bound on the computational complexity of the combined assignment and scheduling procedures is O(P²A) + O(P²N) = O(P²N), assuming that the number of paths is of the same order of magnitude as the number of neurons in the network. However, the inherently parallel structure of the optimization procedures can effectively be used to considerably reduce the execution time of these algorithms. If one uses the target machine of the mapping, with P processors, to implement the optimization procedure, the required implementation time becomes O(PN). Even with this parallel implementation, the computational demands of the algorithm are considerable and must be compared to the expected gain to be attained from its use.

In general, the optimization procedure should be used to generate efficient mappings for neural network structures which have no apparent regularity and can be considered to be static for a large number of iterations. Otherwise, the mapping method based on variable-length processing rings might offer a less computationally demanding solution. We can analytically compare the use of these two approaches.

Consider a sparse and irregularly connected neural network structure, with N neurons and a maximum of S synaptic connections per neuron. The time required to implement this neural network on a P processor DREAM Machine, using the variable-length ring approach, is O(N). On the other hand, using the optimization based mapping method, we can expect an execution rate of O(M), where M = kS and k is a constant greater than one. The time attributed to the variable-length ring mapping procedure can be considered to be O(1), since it only requires the construction of a processing ring of length N. On the other hand, the time requirement of the optimization procedures, not considering the parallel implementation, is O(P²N) with a rather large constant multiplicative factor. The complete time to execute r iterations of the neural network, including the mapping, is O(rN) using the variable-length ring mapping and O(r) + O(P²N) using the optimization based mapping.

Using the above values, we can approximate the size of r for which the optimization procedure might be beneficial to use.
This is accomplished by setting

    rN = r + P²N,    (8-1)

and

    r = P²N / (N − 1) ≈ P².    (8-2)

Therefore, a good heuristic for determining when to use the optimization procedure is the criterion r > P². If the number of iterations for which the neural network will be used is greater than the square of the number of processors used for its implementation, it is beneficial to use the optimization procedure. Of course, this is based on the asymptotic assumption of very large N and P values.

8.3 Performance Comparison

In this section, we compare the implementation characteristics of our mapping method on the DREAM Machine architecture with those of others. This comparison is based on the performance of each implementation in processing neural networks with arbitrarily complex interconnection structures. The basis for this comparison is the AT² metric described in Section 8.1. In order to have a fair comparison, we examine only SIMD nearest-neighbor connected architectures used for neural network processing. These include the fixed-size systolic ring method proposed in [42], the linear array architecture with broadcast buses proposed in [25], and the algorithmic mappings onto mesh connected SIMD arrays described in [47, 61]. Table 8-1 shows the collected area, time, and AT² measures associated with each mapping method. In this table, N is the number of neurons in the network, E is the number of non-zero synaptic connections in the network, R is the length of the largest block of neurons in the network, and M is the maximum length of time required to complete all computational paths in the neural network, as found by the optimization procedure. As mentioned earlier, the performance characteristics of the fixed-size ring architecture and its associated mapping method [42] and those of the linear array with broadcasting of the neuron output values [25] are equivalent. However, the architecture proposed in [25] employs a mechanism which alleviates the need for storing zero valued weights. Thus, the area requirement associated with this architecture is O(N + E).

                                  A            T               AT²
Fixed-Size Ring [42]              O(N²)        O(N)            O(N⁴)
Broadcasting Method of [25]       O(N + E)     O(N)            O(N² + NE)
Algorithmic Method of [47]        O(N + E)     O(√(N + E))     O(N³ + N²E)
Variable-Length Ring              O(NR)        O(R)            O(NR³)
Optimization Based Method         O(NM)        O(M)            O(NM³)

Table 8-1 - Comparison of several implementation methods used for neural network processing.

With the assumption of a constant maximum limit on the number of interconnections to a neuron, we can modify the performance table and simplify some of the terms. The first simplification can be made by combining the variable-length ring method with the optimization based mapping method. In essence, the variable-length ring method is identical to generating an optimum assignment and an optimum scheduling of the information flow by using the regularities in the interconnection structure of the neural network. From this perspective, the maximum block size R is equivalent to the maximum path length M; thus, we can use either R or M in our analysis. Furthermore, we can calculate the total number of interconnections E in the neural network as

    E = Σ_{n=1}^{N} S_n,    (8-3)

where S_n is the number of connections associated with neuron n. Equation (8-3) can be rewritten as

    E = N S̄,    (8-4)

where S̄ is the average number of connections per neuron.
By assuming a constant number of connections per neuron, we can treat S̄ as a constant with respect to the neural network size. Therefore, the time complexity of the algorithmic mapping of [47] can be written as O(√N), and a corresponding area complexity of O(N) is attained. Similarly, we can again assume that M = kS, with k being a small constant. Incorporating this condition into our analysis, the time complexity of our method reduces to O(1) and the area complexity reduces to O(N). Since the fixed-size ring approach does not have any provisions to efficiently handle sparsely interconnected neural networks, its time and area complexity are not affected by our assumptions here. Table 8-2 comparatively shows the performance characteristics of each implementation incorporating our assumption of a constant number of connections per neuron.

                                                       A        T        AT²
Fixed-Size Ring [42]                                   O(N²)    O(N)     O(N⁴)
Broadcasting Method of [25]                            O(N)     O(N)     O(N³)
Algorithmic Method of [47]                             O(N)     O(√N)    O(N²)
Variable-Length Ring / Optimization Based Method       O(N)     O(k)     O(N)

Table 8-2 - Comparison of several implementation methods used for neural network processing, assuming a constant number of connections per neuron.

Chapter 9

Conclusion

9.1 Summary

One of the primary factors in the resurgence of neural networks has been the availability of cost effective parallel processing technology. This technology can enable neural networks to be applied to large and complex "real-world" applications. In Chapter 2, we described several different levels of parallelism, inherently available in neural network models, which can be used as a basis for mapping neural computation onto parallel processing architectures. Although a great deal of parallelism is available in neural networks, efficiently harnessing this parallelism has been difficult, primarily due to the vast variety of neural network models and the complexity of their interconnection structures. We presented several previously proposed methods for implementing neural network models on parallel architectures in Chapter 2, and showed that most implementation methods are applicable only to neural network models with a specific type of interconnection structure, such as being densely or sparsely interconnected.

In Chapter 3, we outlined the architectural design of the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) Machine. A general description of the architecture was presented along with a detailed description of various architectural features specifically designed to address neural network computational demands. The major aspect of the DREAM Machine architecture which allows it to be used efficiently for implementing a wide range of neural network algorithms with arbitrarily complex interconnection structures is its ability to have each processor communicate autonomously with one of its nearest neighbors. This autonomy has been achieved through the use of dynamically reconfigurable switches in the interprocessor communication network. Each switch has a local data routing memory area which is accessed during every communication cycle. The contents of this routing table determine which of the communication channels are to be used by the particular processor for send and receive operations. By allowing each processor to uniquely communicate with any of its neighbors, complex flow patterns can be constructed for propagating information through the processing array.
In Chapter 4, we presented the basic mapping methodology used for implementing neural network computation. This mapping concept is based on neuron level parallelism, where one neuron is mapped to each of the processors in the processing array. The distributed operations associated with neural computation are performed by constructing computational paths which traverse all the processors required for the implementation of each specific distributed operation. In Chapter 5, we demonstrated how such computational paths can be constructed by embedding a processing ring of arbitrary length on the 2-D topology of the DREAM Machine. We also showed how sparse but regularly connected neural network structures can be mapped onto the DREAM Machine by using several concurrently executing processing rings. The communication flexibility attained by the programmable switches in the DREAM Machine architecture was shown to be useful in constructing variable-length processing rings. Being able to construct several variable-length processing rings allows for efficient implementation of neural networks with a variety of interconnection structures. Even though this mapping approach is applicable to a wide range of structures, it can be inefficient in implementing sparse and irregularly connected neural network structures.

In Chapter 6, we presented the mapping problem as a conglomeration of three interdependent problems: the clustering problem, the assignment problem, and the scheduling problem. Although the primary objective of our work in Chapter 6 was to arrive at a method for efficient mapping of neural computation onto parallel processing architectures, the method and some of the cost functions associated with it were formalized such that they would be applicable to other problems in mapping parallel algorithms onto parallel processors. The major contribution of Chapter 6 was to derive effective cost functions which evaluate the goodness of a specific assignment and scheduling for a given neural network structure. We showed, using statistical physics techniques, how neural computation can be employed to find solutions to these problems by optimizing these cost functions. In Chapter 7, we presented the results of our optimization procedure in mapping several small example problems onto different parallel processing architectures. By comparing our results to randomly generated solutions, we presented empirical evidence indicating that our optimization routine generates efficient solutions. Furthermore, we demonstrated that our assignment optimization procedure has been able to consistently find better solutions than those found by a previously proposed optimization routine on a benchmark example [7].

In Chapter 8, we analyzed the asymptotic computational complexity of our implementation method and also that of the optimization procedures. We compared the results of our implementation method with those of others and showed that the results are equivalent in asymptotic terms. However, this equivalence holds only if no limits are placed on the interconnection network structure of the neural network models. Since realistic neural network models do not require full interconnection among all the neurons in the network, as is evident from examining biological [3] as well as artificial neural networks applied to real-world problems [45], it can be shown that the implementation method described in this dissertation offers the best overall performance.
As neural network models evolve to be more complex, and as their sizes grow to address larger, more sophisticated problems, the need for efficient parallel implementations becomes ever more pressing. The DREAM Machine architecture presented in this dissertation is intended to be a special medium to address the current demands of neural network computation and, at the same time, to offer sufficient flexibility for efficiently implementing future generations of neural network models.

9.2 Future Research Directions

The research presented in this dissertation encompasses a number of novel and promising areas in the design of parallel processing architectures and in mapping methods for their efficient utilization. However, due to the wide scope of the material described in this dissertation, and the novelty of the concepts, much work remains to be done. The future directions for this research can be divided into three separate areas, as described below.

• The DREAM Machine architecture: The architectural description of the DREAM Machine, presented in Chapter 3, is given at a high level in order to demonstrate the specific features of the architecture used for efficient processing of neural networks. A more rigorous and detailed treatment of the architectural design is required to determine such details as the specific functional units, the data width, the memory size, etc., to be used for an actual realization of this machine. Such detailed design specifications must consider the specific technologically imposed limitations of available hardware and their associated cost.

Another possible direction for further exploration involves determining other applications where the DREAM Machine can be used for efficient processing. The flexibility in interprocessor communication offered by this architecture, along with its indirect addressing capabilities, can be effectively utilized by a number of other applications in addition to neural networks.

• Mapping neural networks onto the DREAM Machine: Two methods for mapping neural network models onto the DREAM Machine were presented in this dissertation. The algorithmic mapping, described in Chapter 5, involves the construction of variable-length rings on the DREAM Machine architecture. As alluded to in Chapter 5, the determination of optimum ring lengths for sparse but regularly connected neural networks is complex and requires an optimization procedure. In addition to further research in developing such an optimization tool, an automatic method for physically embedding variable-length rings on the 2-D lattice of the DREAM Machine should be devised in order to simplify the mapping process.

In the area of optimization based mapping methods for implementing neural networks, the approach described in this dissertation should be extended to efficiently address multilayer neural networks. Also, concurrent implementation of the assignment and scheduling optimization routines should be studied in detail. This approach has the potential to arrive at good solutions for the scheduling routine by allowing for interaction between both optimization procedures. Implementation of such an approach requires analysis of the convergence characteristics of a system having interdependent variables.
• Automatic mapping of parallel algorithms onto parallel architectures using optimization techniques: As mentioned in Chapter 6, the optimization based mapping proposed in this dissertation can be applied to a variety of similar problems in the mapping of parallel algorithms onto parallel processing architectures. Further research is required, particularly in the scheduling optimization area, to demonstrate the effectiveness of this method for a wider range of problems in parallel processing. Currently, work is in progress on applying this technique to the sparse matrix bandwidth reduction problem and to the efficient implementation of matrix-vector operations on systolic arrays [74]. Research into efficient mappings of data-flow graphs onto parallel architectures is also a promising and natural direction to follow.

References

[1] DARPA Neural Network Study. AFCEA International Press, 1988.
[2] B. Angeniol, G. D. L. C. Vaubois and J.-Y. L. Texier, "Self-Organizing Feature Maps and the Traveling Salesman Problem." Neural Networks. 1: 289-293, 1988.
[3] M. A. Arbib, The Metaphorical Brain 2: Neural Networks and Beyond. John Wiley & Sons, 1989.
[4] B. A. Armstrong, "A Hybrid Algorithm for Reducing Matrix Bandwidth." Int. Jour. for Num. Meth. in Engr. 20: 1929-1940, 1984.
[5] G. Blelloch and C. R. Rosenberg, "Network Learning on the Connection Machine," Proceedings of the 10th Intern. Joint Conf. on Artificial Intelligence, Milan, Italy, pp. 323-326, 1987.
[6] D. W. Blevins, E. W. Davis, R. A. Heaton and J. H. Reif, "BLITZEN: A Highly Integrated Massively Parallel Machine." Journal of Parallel and Distributed Computing. 8: 150-160, 1990.
[7] S. H. Bokhari, "On the Mapping Problem." IEEE Transactions on Computers. C-30(3): 207-214, 1981.
[8] S. H. Bokhari, Assignment Problems in Parallel and Distributed Computing. Kluwer Academic Publishers, 1987.
[9] J. R. Brown, M. M. Garber and S. F. Venable, "Artificial Neural Network on a SIMD Architecture," Proceedings of the Symposium on the Frontiers of Massively Parallel Computations, pp. 43-47, 1988.
[10] G. A. Carpenter and S. Grossberg, "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine." Computer Vision, Graphics, and Image Processing. 37: 54-115, 1987.
[11] A. J. De Groot and S. R. Parker, "Systolic Implementation of Neural Networks," Proceedings of the SPIE Vol. 1058 High Speed Computing II, Los Angeles, Ed. K. Bromley, pp. 182-190, 1989.
[12] E. Deprit, "Implementing Recurrent Back-Propagation on the Connection Machine." Neural Networks. 2: 295-314, 1989.
[13] R. Durbin and D. Willshaw, "An Analogue Approach to the Travelling Salesman Problem Using an Elastic Net Method." Nature. 326: 689-691, 1987.
[14] G. C. Everstine, "A Comparison of Three Resequencing Algorithms for the Reduction of Matrix Profile and Wavefront." International Journal of Numerical Methods in Engineering. 14: 837-853, 1979.
[15] L. Fang and T. Li, "Design of Competition-Based Neural Networks for Combinatorial Optimization." International Journal of Neural Systems. 1(3): 221-235, 1990.
[16] C. T. Fike. Computer Evaluation of Mathematical Functions. Prentice-Hall, 1968.
[17] M. J. Flynn, "Some Computer Organizations and Their Effectiveness." IEEE Transactions on Computers. 21: 948-960, 1972.
[18] B. M. Forrest, D. Roweth, N. Stroud, D. J. Wallace and G. V. Wilson, "Implementing Neural Network Models on Parallel Computers." The Computer Journal. 30(5): 413-419, 1987.
[19] K. Fukushima, "Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition." Neural Networks. 1: 119-130, 1988.
[20] J.-L. Gaudiot, C. v. d. Malsburg and S. Shams, "A Data-Flow Implementation of a Neurocomputer for Pattern Recognition Applications," Proceedings of the Fourth Annual Aerospace Applications of Artificial Intelligence, Dayton, OH, Vol. 1, pp. 327-338, 1988.
[21] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Publishing Co., 1989.
[22] H. P. Graf, L. D. Jackel and W. E. Hubbard, "VLSI Implementation of a Neural Network Model." Computer. 21(3): 41-49, 1988.
[23] K. A. Grajski, "Neurocomputing Using the MasPar MP-1," Technical Report No. 90-010, Ford Aerospace Corp., Oct. 1, 1990.
[24] M. K. Habib and H. Akel, "A Digital Neuron-Type Processor and Its VLSI Design." IEEE Transactions on Circuits and Systems. 36(5): 739-746, 1989.
[25] D. Hammerstrom, "A VLSI Architecture for High-Performance, Low-Cost, On-chip Learning," Proceedings of the Inter. Joint Conf. on Neural Networks, San Diego, Vol. 2, pp. 537-544, 1990.
[26] R. Hecht-Nielsen, Neurocomputing. Addison-Wesley Publishing, 1990.
[27] W. D. Hillis, The Connection Machine. Cambridge, MA, MIT Press, 1985.
[28] G. Hinton, D. Ackley and T. Sejnowski, "Boltzmann Machines: Constraint Satisfaction Networks that Learn," Technical Report CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science.
[29] Y. Hirai, K. Kamada, M. Yamada and M. Ooyama, "A Digital Neuro-Chip with Unlimited Connectability for Large Scale Neural Networks," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 163-169, 1989.
[30] A. Hiraiwa, M. Fujita, S. Kurosu, S. Arisawa and M. Inoue, "Implementation of ANN on RISC Processor Array," Proceedings of the Inter. Conf. on Appl. Spec. Array Proc., Princeton, NJ, Ed. S. Y. Kung, E. E. Swartzlander, J. A. B. Fortes and K. W. Przytula, pp. 677-687, 1990.
[31] M. Holler, S. Tam, H. Castro and R. Benson, "An Electrically Trainable Artificial Neural Network (ETANN) with 10240 'Floating Gate' Synapses," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 191-196, 1989.
[32] J. J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective Computational Abilities." Proceedings of the National Academy of Science USA. 79: 2554-2558, 1982.
[33] J. J. Hopfield and D. W. Tank, "'Neural' Computation of Decisions in Optimization Problems." Biological Cybernetics. 52: 141-152, 1985.
[34] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato and N. Suzumura, "An Artificial Neural Network Accelerator Using General Purpose 24 bits Floating Point Digital Signal Processors," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 171-175, 1989.
[35] F. A. Kamangar, R. A. Duderstadt and J. O. Smith, "Efficient Implementation of Connectionist Models on MIMD Parallel Processors Using Chordal Ring Topologies," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 588, 1989.
[36] S. Kirkpatrick, C. D. Gelatt Jr. and M. P. Vecchi, "Optimization by Simulated Annealing." Science. 220: 671-680, 1983.
[37] T. Kohonen, Self-Organization and Associative Memory, second ed., Springer Series in Information Sciences. Springer-Verlag, 1987.
[38] B. Kosko, "Bidirectional Associative Memories." IEEE Transactions on Systems, Man, and Cybernetics. 18: 49-60, 1988.
[39] B. Kosko, "Unsupervised Learning in Noise." IEEE Transactions on Neural Networks. 1(1): 44-57, 1990.
[40] H. T. Kung and C. E. Leiserson. "Systolic Arrays for VLSI." In Introduction to VLSI Systems. Ed. C. Mead and L. Conway, Reading, MA, Addison-Wesley, 1980.
[41] S. Y. Kung and J. N. Hwang, "Systolic Architectures for Artificial Neural Nets," Proceedings of the IEEE Inter. Conf. on Neural Networks, San Diego, CA, Vol. 2, pp. 165-172, 1988.
[42] S. Y. Kung and J. N. Hwang, "A Unified Systolic Architecture for Artificial Neural Networks." Journal of Parallel and Distributed Computing. 6: 358-387, 1989.
[43] H. K. Kwan and P. C. Tsang, "Systolic Implementation of Multi-Layer Feed-Forward Neural Network with Back-Propagation Learning Scheme," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 155-158, 1990.
[44] E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan and D. B. Shmoys. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. Wiley, New York, 1985.
[45] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel. "Handwritten Digit Recognition with a Back-Propagation Network." In Advances in Neural Information Processing Systems 2. Ed. D. S. Touretzky, San Mateo, CA, Morgan Kaufmann, pp. 396-403, 1990.
[46] Y. Le Cun, J. S. Denker and S. A. Solla. "Optimal Brain Damage." In Advances in Neural Information Processing Systems 2. Ed. D. S. Touretzky, San Mateo, CA, Morgan Kaufmann, pp. 598-605, 1990.
[47] W.-M. Lin, V. K. Prasanna and K. W. Przytula, "Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines." IEEE Transactions on Computers. 40(12): 1390-1401, 1991.
[48] M. J. Little and J. Grinberg. "The 3-D Computer: An Integrated Stack of WSI Wafers." In Wafer Scale Integration. Ed. E. Swartzlander, Boston, Kluwer, Chapter 8, 1988.
[49] R. Mann and S. Haykin, "A Parallel Implementation of Kohonen Feature Maps on the Warp Systolic Computer," Proceedings of the Inter. Joint Conf. on Neural Networks, Washington D.C., Vol. 2, pp. 84-87, 1990.
[50] V. Milutinovic, "Mapping of Neural Networks on the Honeycomb Architecture." Proceedings of the IEEE. 77(12): 1875-1878, 1989.
[51] M. Misra and V. K. Prasanna Kumar, "Massive Memory Organization for Implementing Neural Networks," Proceedings of the Inter. Conf. on Pattern Recognition, Vol. 2, pp. 259-264, 1990.
[52] M. Misra and V. K. Prasanna Kumar, "Neural Network Simulation on a Reduced Mesh of Trees Organization," Proceedings of the SPIE/SPSE Symp. on Electronic Imaging, 1990.
[53] A. Moopenn, T. Duong and A. P. Thakoor, "Digital-Analog Hybrid Synapse Chips for Electronic Neural Networks," Proceedings of the Advances in Neural Information Processing Systems 2, Denver, CO, Ed. D. S. Touretzky, pp. 769-776, 1989.
[54] N. Morgan. Artificial Neural Networks: Electronic Implementations. IEEE Computer Society Press, Los Alamitos, CA, 1990.
[55] N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman and J. Beer, "The RAP: A Ring Array Processor for Layered Network Calculations," Proceedings of the Inter. Conf. on Appl. Spec. Array Proc., Princeton, NJ, Ed. S. Y. Kung, E. E. Swartzlander, J. A. B. Fortes and K. W. Przytula, pp. 296-308, 1990.
[56] A. Namphol, M. Arozullah and S. Chin, "Higher Order Data Compression With Neural Networks," Proceedings of the Inter. Joint Conf. on Neural Networks, Seattle, WA, Vol. I, pp. 55-59, 1991.
[57] D. Parkinson and J. Litt. Massively Parallel Computing with the DAP. MIT Press, Cambridge, Massachusetts, 1989.
[58] C. Peterson and B. Soderberg, "A New Method for Mapping Optimization Problems Onto Neural Networks." International Journal of Neural Systems. 1(1): 3-22, 1989.
[59] F. Piazza, M. Marchesi, G. Orlandi and A. Uncini, "Coarse-Grained Processor Array Implementing the Multilayer Neural Network Model," Proceedings of the Inter. Symp. on Cir. & Sys., New Orleans, LA, Vol. 4, pp. 2963-2966, 1990.
[60] D. A. Pomerleau, G. L. Gusciora, D. S. Touretzky and H. T. Kung, "Neural Network Simulation at Warp Speed: How We Got 17 Million Connections Per Second," Proceedings of the IEEE International Confer. on Neural Networks, San Diego, 1988.
[61] V. K. Prasanna Kumar and K. W. Przytula, "Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines," Proceedings of the Inter. Conf. on Appl. Spec. Array Proc., Princeton, NJ, Ed. S. Y. Kung, E. E. Swartzlander, J. A. B. Fortes and K. W. Przytula, 1990.
[62] K. W. Przytula, "Systolic/Cellular System," Internal Report, Hughes Research Labs., Aug. 1989.
[63] K. W. Przytula and J. G. Nash, "A Special Purpose Coprocessor for Signal Processing," Proceedings of the 21st Asilomar Conference on Signals, Systems and Computers, Monterey, CA, pp. 736-740, 1987.
[64] U. Ramacher and U. Ruckert. VLSI Design of Neural Networks. Kluwer Academic Publishers, 1991.
[65] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review. 65: 386-408, 1958.
[66] D. E. Rumelhart, G. E. Hinton and R. J. Williams. "Learning Internal Representations by Error Propagation." In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1, Ed. D. E. Rumelhart and J. McClelland, Cambridge, MIT Press, pp. 318-364, 1986.
[67] S. Satyanarayana, Y. Tsividis and H. P. Graf. "A Reconfigurable Analog VLSI Neural Network Chip." In Advances in Neural Information Processing Systems 2. Ed. D. Touretzky, San Mateo, CA, Morgan Kaufmann, pp. 758-768, 1990.
[68] D. B. Schwartz, R. E. Howard and W. E. Hubbard, "A Programmable Analog Neural Network Chip." IEEE Journal of Solid-State Circuits. 24(2): 313-319, 1989.
[69] T. J. Sejnowski and C. R. Rosenberg, "Parallel Networks that Learn to Pronounce English Text." Complex Systems. 1: 145-168, 1987.
[70] S. Shams, "Neural Networks for Passive Sonar Target Detection," Technical Report, Hughes Research Labs, Dec. 1991.
[71] S. Shams and J.-L. Gaudiot, "Efficient Implementation of Neural Networks on the DREAM Machine," Proceedings of the 11th Inter. Conf. on Pattern Recognition, The Hague, The Netherlands, 1992.
[72] S. Shams and K. W. Przytula, "Mapping of Neural Networks onto Programmable Parallel Machines," Proceedings of the Intern. Symp. on Circuits and Systems, New Orleans, LA, Vol. 4, pp. 2613-2617, 1990.
[73] S. Shams and K. W. Przytula. "Implementation of Multilayer Neural Networks on Parallel Programmable Digital Computers." In Parallel Algorithms and Architectures for DSP Applications. Ed. M. Bayoumi, Kluwer Academic Publishers, pp. 225-253, 1991.
[74] S. Shams and P. Simic, "Efficient Mapping of Sparse Iterative Systems onto Parallel Systolic Architectures Using Constrained Nets," submitted for publication in the Proceedings of Neural Information Processing Systems 92.
[75] D. B. Shu, J. G. Nash, M. M. Eshaghian and K. Kim. "Implementation and Application of a Gated-Connection Network in Image Understanding." In Reconfigurable Massively Parallel Computers. Ed. H. Li and Q. F. Stout, Prentice Hall, 1991.
[76] P. Simic and S. Shams, "Solving the Assignment and Scheduling Problems Using Cnet," Technical Report CALT-68-1892, California Institute of Technology, 1992.
[77] P. D. Simic, "Statistical Mechanics as the Underlying Theory of 'Elastic' and 'Neural' Optimisations." Network. 1: 89-103, 1990.
[78] P. D. Simic, "Constrained Nets for Graph Matching and Other Quadratic Assignment Problems." Neural Computation. 3: 268-281, 1991.
[79] A. Singer, "Implementations of Artificial Neural Networks on the Connection Machine." Parallel Computing. 14(3): 305-315, 1990.
[80] D. F. Specht, "Probabilistic Neural Network." Neural Networks. 3: 109-118, 1990.
[81] H. Szu, "Fast Simulated Annealing," Proceedings of the AIP Conference Proceedings 151: Neural Networks for Computing, New York, Ed. J. Denker, pp. 420-425, 1986.
[82] H. H. Thodberg, "Improving Generalization of Neural Networks Through Pruning." International Journal of Neural Systems. 1(4): 317-326, 1991.
[83] S. T. Toborg and K. Hwang, "Cooperative Vision Integration Through Data-Parallel Neural Computations." IEEE Transactions on Computers. 40(12): 1368-1379, 1991.
[84] S. Tomboulian. "Introduction to a System for Implementing Neural Net Connections on SIMD Architectures." In Neural Information Processing Systems. Ed. D. Z. Anderson, New York, American Institute of Physics, pp. 804-813, 1988.
[85] J. D. Ullman, Computational Aspects of VLSI. Computer Science Press, 1984.
[86] C. C. Weems. Image Processing on a Content Addressable Array Parallel Processor. PhD Dissertation, University of Massachusetts, Amherst, MA, 1984.
[87] S. S. Wilson, "Neural Computing on a One Dimensional SIMD Array," Proceedings of the Inter. Joint Conf. on Arti. Intel., pp. 206-211, 1989.
[88] M. Witbrock and M. Zagha, "An Implementation of Back-Propagation Learning on GF11, a Large SIMD Parallel Computer." Parallel Computing. 14(3): 329-346, 1990.
[89] X. Zhang, M. McKenna, J. P. Mesirov and D. L. Waltz. "An Efficient Implementation of the Back-Propagation Algorithm on the Connection Machine CM-2." In Advances in Neural Information Processing Systems 2. Ed. D. S. Touretzky, San Mateo, CA, Morgan Kaufmann, pp. 801-809, 1990.