ORTHOGONAL ARCHITECTURES FOR PARALLEL IMAGE PROCESSING

by
Dongseung Kim

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

June 1988
Copyright 1988 Dongseung Kim

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089

This dissertation, written by Dongseung Kim under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: July 26, 1988
DISSERTATION COMMITTEE

Acknowledgements

I would like to express my sincere appreciation to my advisor, Professor Kai Hwang, for his guidance, support and encouragement in the writing of this dissertation. It has been a pleasure and privilege to work with him. I would also like to thank Professor Alexander A. Sawchuk for his advice and comments. I would like to thank Professor Theodore E. Harris for serving on my dissertation committee and for taking time to read my thesis. I offer my highest gratitude to Professor Sukhan Lee, Professor Michel Dubois, and Professor Gary L. Miller for serving on my Ph.D. guidance committee.

I thank my wife, Jungsoon Suh Kim, for her constant support and encouragement in spite of her hardship in raising and educating our children Do-yang and Do-gyun, and for her patience while I was studying. My sincere thanks go to my mother and grandmother for raising and educating me. Without their support, I could never have had the opportunity to study for the degree.

I thank Dr. Kiseon Kim for his discussions about problems in my thesis research and for his support. I thank my colleagues, Joydeep Ghosh, Raymond M. Chowkwanyun, Hwang-Cheng Wang, Christoph Scheurich, Paraskevas Evripidou, Yi-Hsiu Wei, and Ahmed Louri for their valuable discussions and help. I also thank my friends Kyu Sik Chung, Yeong Gil Shin, Ho-in Jeon and Andrew Sohn for their encouragement. Special mention must be made of Marlena Libman, Shirin Mistry and Jenine Abarbanel for their kind assistance and support during my study.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Background and Motivation
    1.1.1 Advanced Parallel Computers
    1.1.2 Image Processing and Pattern Recognition
    1.1.3 Related Previous Research
  1.2 Research Objectives
  1.3 Organization of the Thesis

2 Orthogonal Multiprocessor Architecture
  2.1 The Orthogonal Multiprocessor
    2.1.1 The System Architecture
    2.1.2 Shared Memory Organization
    2.1.3 Comparison with Other Multiprocessors
  2.2 Principles of Orthogonal Memory Access
    2.2.1 Orthogonal Access Principle
    2.2.2 Interleaved Memory Design
  2.3 Memory Bandwidth Analysis
    2.3.1 Effective Memory Bandwidth
    2.3.2 Bandwidth of Crossbar Connected Multiprocessors
    2.3.3 Comparisons
  2.4 Simulation Capability
    2.4.1 Mesh-Connected Multiprocessors
    2.4.2 Hypercube Computers
  2.5 Conclusions

3 Hardware Design of the Orthogonal Multiprocessor
  3.1 Overall System Configuration
  3.2 OMP Processor Specification
    3.2.1 Candidate Microprocessors and Coprocessors
    3.2.2 Processor Module Design
    3.2.3 Cache Design and Coherence Control
    3.2.4 Direct Memory Access
  3.3 Orthogonal Memory Array
    3.3.1 Memory Modules and Interconnections
    3.3.2 Addressing Scheme and Access Control
    3.3.3 System and Memory Buses
  3.4 Host and Peripheral Units
    3.4.1 Host Computer
    3.4.2 Peripheral Devices
  3.5 VLSI and WSI Implementation
    3.5.1 VLSI and WSI Requirements
    3.5.2 RISC Processors
  3.6 Conclusions

4 Parallel Image Processing and Pattern Recognition
  4.1 Fundamental Operations
    4.1.1 Recursive Doubling
    4.1.2 Histogramming
    4.1.3 Two-dimensional Linear Transforms
  4.2 Image Enhancement Using Histogram Equalization
    4.2.1 Histogram Equalization
    4.2.2 Parallel Algorithm
  4.3 Line Detection Using Orthogonal Projection
    4.3.1 Image Projection
    4.3.2 Hough Transform
  4.4 Parallel Pattern Clustering
    4.4.1 Pattern Clustering
    4.4.2 Parallel Algorithm Implementation
  4.5 Scene Labeling by Discrete Relaxation
    4.5.1 Relaxation Labeling
    4.5.2 House Scene Labeling
  4.6 Conclusions

5 Multidimensional Orthogonal Multiprocessors
  5.1 Binary Orthogonal Multiprocessors
    5.1.1 The System Architecture
    5.1.2 Comparison with Hypercube Computers
  5.2 Generalized Orthogonal Multiprocessors
    5.2.1 A (k,n)-OMP Architecture
    5.2.2 Comparisons of Hardware Requirements
  5.3 Communication Strategy
    5.3.1 Memory Hypercube
    5.3.2 Processor Interconnection Model
    5.3.3 Routing Algorithms
    5.3.4 Average Internode Distance
  5.4 Conclusions

6 Other Variations of the Orthogonal Architecture
  6.1 Mesh with Bypass Connections
    6.1.1 The Architecture
    6.1.2 Binary-tree Operations
    6.1.3 Pyramid Embedding
  6.2 Mapping of Parallel Algorithms
    6.2.1 Matrix Multiplication
    6.2.2 Image Projection
    6.2.3 Finding a Row Containing a Median
    6.2.4 Difficult Mapping Problems on MBC
  6.3 Conclusions

7 Conclusions
  7.1 Summary of Research Contributions
  7.2 Suggestions for Further Research
Bibliography

List of Tables

2.1 Hardware Complexity of Various Multiprocessor Architectures
2.2 List of Notations Used
2.3 Candidate Algorithms for Parallelization on the Orthogonal Multiprocessor
3.1 Characteristics of the 32-bit Microprocessors
3.2 Characteristics of the 32-bit VMEbus and Multibus II
4.1 Histogram and its Reassignment Counts for Equalization
4.2 Comparison of Time Complexities of Parallel Algorithms on the OMP with Serial Algorithms
5.1 Binary-Reflected Gray Code for 16 Numbers
5.2 Characteristics of the OMP and Comparison with a Hypercube Computer
5.3 Characteristic Comparisons of the Generalized OMP with the Generalized Hypercube and the Spanning-Bus Hypercube
5.4 Various Implementations of the Multidimensional OMP
6.1 Performance Comparisons of MBC with n x n Processors for Selected Operations

List of Figures

1.1 Sequence of image processing and analysis
2.1 The orthogonal multiprocessor configuration of 4 processors and 16 memory modules interconnected by 4 memory buses
2.2 Orthogonal memory access modes in OMP
2.3 Legitimate and illegitimate memory access patterns (accessed memory modules are shaded)
2.4 Two-dimensional 4-way interleaving in the OMP memory array (the numbers inside each module are memory addresses, with stride distance 1 between modules in adjacent columns and 16 between modules in adjacent rows)
2.5 Comparison of effective memory bandwidths of a crossbar-connected multiprocessor and an orthogonal multiprocessor of equal size
3.1 The implementation configuration of the OMP architecture
3.2 A possible OMP processor configuration using the Intel 80386 chip set
3.3 A memory board configuration consisting of 4 x 4 memory modules
3.4 A board-level configuration of a 16-processor orthogonal multiprocessor
3.5 Address format
3.6 A memory access controller
4.1 Recursive doubling of n = 8 data on an 8-processor OMP
4.2 A kn x kn image mapped onto the n x n OMP memory array
4.3 Histogramming of L = 8 levels using an OMP with n = 4 processors
4.4 Image enhancement by histogram equalization [Pavlidis'82]
4.5 Histogram reassignment from old grey levels l, l-1, ... to a new level m
4.6 An image projected onto a line with an angle theta
4.7 A projection of a binary image onto a line with theta = 45 degrees
4.8 16 sample data in 2-D feature space representing 16 rectangles of various shapes, to be classified into 3 clusters
4.9 Snapshots of parallel pattern clustering on the OMP
4.10 Scene labeling problem with four objects and four labels
4.11 Compatibility matrix for the house scene labeling problem stored in the memory of the OMP with four processors
5.1 An architecture of a binary three-dimensional OMP
5.2 An architecture of a binary four-dimensional OMP
5.3 Locations of processors in a binary four-dimensional OMP
5.4 A logical diagram of a memory module consisting of an n-to-1 switch and a memory
5.5 The architectures of the Generalized Hypercube and the Spanning Bus Hypercube
5.6 Memory modules accessible by a processor in a (3,3) OMP
5.7 (2,3)- and (3,2)-processor hypercubes
6.1 Mesh with bypass connections
6.2 Mesh with multiple broadcasting
6.3 Procedure for finding a sum of n = 8 data
6.4 Binary-tree operation on 8 x 8 data
6.5 Snapshots of a matrix multiplication with 4 x 4 PEs
6.6 Median row finding (8 x 8 PEs)
6.7 Routing vectors for a matrix transpose (8 x 8 elements)
6.8 Relative time requirements for computing a sum on the mesh with broadcast, the mesh with multiple broadcasting, and the mesh with bypass

Abstract

In this thesis, the orthogonal multiprocessor (OMP) architecture is explored and generalized for parallel image processing and pattern recognition applications. The OMP architecture was independently conceived by two research groups, at the University of Southern California [HT85a,HTK89] and at Princeton University [SM87]. This architecture compares very favorably with other fully shared-memory multiprocessors. Due to its regular structure, fast communication capability and efficient parallel operation, the OMP supports many tasks in image processing and pattern recognition. It is shown that image enhancement with histogram equalization, line detection using orthogonal projection, parallel pattern clustering and scene labeling by discrete relaxation can be efficiently mapped onto the OMP. The architecture provides linearly scalable performance for these algorithms. The OMP is extended to higher dimensions based on the hypercube topology, which reduces the heavy memory requirement of the original OMP. Its characteristic features include a homogeneous structure, a conflict-free memory access strategy, and a partially shared-memory organization.
The extended OMP is well suited for building fine-grain massively parallel computers for signal/image processing and other numerical applications that emphasize large-scale data parallelism.

The thesis also presents a new SIMD array processor architecture, called the mesh with bypass connections (MBC). The architecture provides a fast global routing capability by reducing the mesh diameter from O(N) to O(1) for an N x N mesh. The MBC outperforms the original mesh, the mesh with broadcast, and the mesh with multiple broadcasting for many algorithms in image processing. Due to its planar structure, the MBC permits constructing highly parallel systems based on VLSI technology.

Chapter 1

Introduction

1.1 Background and Motivation

Many high-performance computers have been built with recent technological advances and software development. Nevertheless, there is always demand for higher computing capability. The need for high-performance computers comes from various applications such as weather prediction, remote sensing, nuclear reactor simulation, computational aerodynamics, artificial intelligence, and military applications. For the past three decades, tremendous progress has been made in electronic technology to improve device switching speed and miniaturization in order to increase processing speed, but this improvement may soon reach its saturation point. Parallel processing has inspired many computer researchers and architects as a promising approach to significantly boost computation speed, and new machines designed on parallel processing principles are emerging. However, many research problems remain to be solved in this approach. Exploitation of potential concurrency is limited by the lack of software support for automatic program partitioning, or it relies on primitive, often machine-dependent, parallel programming techniques. The problem of efficient load balancing among the processors has not been solved satisfactorily. Communication and synchronization between multiple processors cause significant overhead that can eclipse the predicted performance. Nevertheless, parallel architecture design will remain one of the most fundamental issues in developing high-performance computers for the time being, and it is the core research subject of this thesis.

1.1.1 Advanced Parallel Computers

Supercomputers (the Cray series, IBM 3090/400, ETA-10) in general use a small number of very fast and powerful processors built with ECL logic devices and high-speed memories, supported by concurrent scalar processing and vector processing with multiple functional units. The Cray uses sequential programming and lets an intelligent compiler detect parallelism while converting to machine code. The peak performance of these supercomputers reaches several hundred MFLOPS (million floating-point operations per second) when a problem matches the system. However, the effective performance is 5 to 20 percent of the peak performance [Don87].

Hypercube computers (Intel iPSC, FPS T-series, NCUBE, and Ametek S-14), called distributed-memory multicomputers, are built as alternative systems that challenge the performance of supercomputers at low cost. They are constructed by replicating a number of identical processor-memory nodes (from 4 to 1024) using rather slow microprocessors and standard memories. Their interconnection is based on the hypercube topology.
Each node of the hypercube corresponds to a processor with its dedicated memory, and each edge forms a link between adjacent processors. Due to the lack of common shared memory, interprocessor communication is performed by message passing. In an N-processor hypercube computer, each node is directly connected to only log N other nodes. A message must travel through log N nodes in the worst case to reach its destination, which may also cause heavy traffic on the internode links. Reducing the communication delays between remote nodes and balancing the load of the overall system are current research issues [Cho88]. Automatic decomposition of programs for parallel computers is also one of the key problems to be solved for general applications.

Massively parallel architectures, for example the Connection Machine (with 64K processors) and the Massively Parallel Processor (with 16K processors), employ a huge number of fine-grain processors which execute the same program on all data. They are parallel array processors which operate in lock-step fashion under a single controller (SIMD operation). Since they execute the same instruction on all data, these machines are only suitable for data-parallel problems [Gho88]. The implementation cost of such a large system is another constraint of this approach.

In shared-memory multiprocessors, all or part of the memory is shared by multiple processors. The sharing is achieved by interconnections between processors and memories. A common bus is an effective connection when there are not too many functional devices: it can be built at low cost, and expansion is easy and simple. However, when many resources share the bus, it becomes a major bottleneck for the performance of the whole system. A crossbar switch gives the most flexible interconnection between processors and memory modules, and the control of the crosspoints is straightforward; however, its O(N^2) complexity prohibits real implementations beyond some moderate size. Recently, optical implementations of the crossbar have been actively studied to overcome the implementation constraints and to obtain a better control scheme using optical technology [SJRV87]. A multiple-bus network is devised to enhance the capability and reliability of a common bus. Studies show that the multiple bus can deliver bandwidth close to that of a crossbar if the number of buses is over half the number of crossbar input ports (B >= N/2, where B is the number of buses) [LR82]. This is somewhat economical, but the hardware complexity is still O(NB) = O(N^2). Multistage interconnections are usually constructed log N stages long and N/2 switches wide with small switch blocks (such as 2x2 crossbars), which results in lower hardware complexity (O(N log N)) than the crossbar network. Some problems here are routing overhead, path contention (for blocking interconnections), and hot spots [PN85].

1.1.2 Image Processing and Pattern Recognition

Image processing demands high computational throughput due to the vast amount of picture data to be processed within a short time [Lev87]. It starts from a two-dimensional picture whose data size ranges from several hundred to a few million pixels. Conventional serial computers are too slow to cope with the vast processing requirements of these applications. The computational characteristics are somewhat different from those of number-crunching in scientific applications.
Hence many special-purpose research machines have been designed and built independently of the scientific computer industry. This research has been boosted in recent years by technological advances in VLSI implementation of highly parallel computers with low-cost, low-power, and fault-tolerant designs. Those parallel systems are either massively parallel fine-grain processors, pipelined processors, or rather general-purpose parallel computers. These machines also inspire wide research in parallel algorithm development. In this thesis, the parallel architecture design focuses on applications in image processing and pattern recognition.

Image understanding tasks consist of a sequence of hierarchical processes [MC85] (refer to Fig. 1.1). The computational characteristics differ considerably among the processes, from low-level image processing to high-level image analysis. A short sequence of simple logical or arithmetic operations (window operations) characterizes image processing (preprocessing and feature extraction). The same operations are performed on each pixel of the whole image, and the output is also image data (pixel-to-pixel transformations) [Ros83]. This requires a huge amount of processing power; however, it also creates great parallelism which can be exploited by SIMD (Single Instruction Multiple Data) array processors, pipeline processors, or systolic arrays [Hwa83].

Figure 1.1: Sequence of image processing and analysis (raw image data, preprocessing, feature extraction, pattern classification, structural analysis, description and representation).

The high-level tasks (pattern classification and structural analysis), on the other hand, use symbolic data to obtain a description, recognition or classification of the objects in the image. Artificial intelligence operations such as search, matching, and consistency checking are performed in these stages. The communication pattern is not as regular as that of local operations. Thus, the computing system should support versatile interconnections and flexible programming. MIMD (Multiple Instruction Multiple Data) multiprocessors or distributed-memory multicomputers are suggested for high-level operations.

The image processing/analysis systems that have been developed based on the above operational principles can be roughly classified into one of the following groups according to their major architectural characteristics: (1) mesh-connected systems (MPP [Pot83,Pot85], DAP [Hun81], CAP [Mor86], CLIP [Fou85]); (2) hypercube SIMD processors (Connection Machine [Wal87,Hil85]); (3) bus-based systems (OMP [HTK89], HBA [WH87], TOSPIX-II [KS86]); (4) pipeline processors (PIPE [GLSW87], PIFEX [GW85], PPPE [HSJP87], SLAP [Fis86], Cytocomputer [Ste81]); (5) linear systolic array computers (Warp [AAG*86,Kun84]); (6) pyramid computers (PCLA [Tan84], PAPIA [Can86], GAM-Pyramid [SWH85]); (7) multiprocessors or multicomputers (Hypercube [SF88], Butterfly [CGG*85], Zmob [WKM*85]); (8) integrated hierarchical systems (IUA [WLHR87]).

Major choices in designing parallel computers for image processing and analysis include the following aspects:

• Fully/partially parallel - The size of a system is chosen to exploit the parallelism either fully (one PE per pixel) or partially, according to the cost of implementation. The Connection Machine and the MPP are fully parallel processors.
• SIMD/MIMD - The same computations can be carried out on all data in lock-step manner for data-parallel tasks (SIMD), or tasks can be decomposed and executed by several processors asynchronously (MIMD).

• Real-time processing - The applications may require real-time processing or have loose time constraints. For a raster-scan image, each frame is generated every 1/30 sec. Many pipeline processors like PIPE and TOSPIX are designed for real-time applications.

• Bit-serial/bit-parallel - Depending on the size of the system and the degree of parallelism, the computation can be done in either a bit-serial manner or a bit-parallel one.

• Interconnection topology - In distributed-memory systems, only close-neighborhood processors are directly linked. Global communications can also be supported at the expense of interconnection complexity. Widely used interconnections are the mesh, hypercube, linear array, and common buses.

• Early-vision/integrated system - The system is either intended only for early-vision processing of data or capable of further processing of high-level symbolic image data. CLIP and DAP are early-vision processors, whereas IUA and HBA are integrated systems.

1.1.3 Related Previous Research

The orthogonal multiprocessor (shortened to OMP) was recently proposed [HT85b,THK85,HTK89,SM87]. The architecture was designed to exploit coarse-grain parallelism in scientific and numerical computation using its new memory accessing method. The original papers [HT85b,THK85] introduce the architectural concept and its operational strategies. They demonstrate the potential of the architecture by mapping several algorithms such as 2-D unitary transformation, FFT, sorting, and recurrence evaluation onto it. They also present systematic methods for mapping parallel algorithms designed for mesh and hypercube multiprocessors onto the OMP. The integrated paper [HTK89] covers all the above topics together with new work by the author: a performance analysis (Section 2.3), matrix algebraic computation, and enhanced usage of the orthogonal memory arrays (Section 2.2.2). The same architecture was independently developed, with more emphasis on numerical applications and communication analysis, in [SM87]. That paper covers matrix computation, linear system solving, partial differential equations, and polynomial evaluation.

1.2 Research Objectives

The thesis research begins from orthogonal architecture design. The term orthogonal architecture is defined as a computer organization which employs disjoint sets of links (buses, point-to-point links, etc.) in memory access or interprocessor communication. The disjoint sets occupy separate dimensions when mapped onto a multidimensional space; this is why they are called orthogonal. An orthogonal multiprocessor has two orthogonal sets of buses: column buses and row buses. A d-dimensional OMP, which is an extended architecture of the original OMP, has d sets of orthogonal buses. Hypercube computers may be organized to have log N disjoint sets of access links, so they also belong to the class of orthogonal architectures; however, that architecture is not within the scope of this thesis research.
The objectives and original contributions of the thesis research are to build a theoretical framework for the orthogonal architectures, to develop and evaluate parallel algorithms on these architectures, especially for image processing and pattern analysis (Chapter 4), and to extend the architecture for massively parallel computation (Chapter 5). The research also assesses a detailed implementation of the OMP using current off-the-shelf hardware components and VLSI technology (Chapter 3). Because we have fixed the applications and set up the basic organization principle for the machine design, the approach closely follows the bottom-up method. The justification for the architectures and associated algorithms is given by performance comparisons with other equally complex systems, using analytical methods and comparative studies. The thesis research starts from conventional architectures, such as orthogonal multiprocessors for image processing and pattern classification and mesh-connected computers, and enhances their performance by extending the interconnection (Chapter 6). The applications of these computer systems are not necessarily limited to image data processing; other numerical applications, scientific computations, and so forth can also be included.

1.3 Organization of the Thesis

The thesis consists of seven chapters. Chapter 1 provides the background of the research and the research objectives. Chapter 2 contains the conceptual description of the orthogonal multiprocessor: the OMP architecture and its principles of operation are specified, architectural comparisons are provided to contrast the differences and to highlight advantages and drawbacks, memory bandwidths for the OMP are derived and compared with similar shared-memory multiprocessors, and it is shown that the regular structure of the OMP eases the simulation of other architectures such as mesh-connected computers and hypercube computers. In Chapter 3, a design of a prototype OMP is considered at the printed-circuit-board level, covering the overall system configuration, individual processor specification, the orthogonal memory array, supporting hardware design, and a feasibility study for VLSI and WSI implementation. Detailed parallel algorithms for image processing and pattern recognition are developed in Chapter 4: histogramming, linear transformation, image enhancement, line detection, pattern clustering, and scene labeling. Their performance figures are computed in terms of speedup. In Chapter 5, the OMP is generalized to k-ary n-dimensional systems which demand less hardware than the original ones; the generalized architecture supports massively parallel computation using fine-grain SIMD processors. The mesh-connected array architecture is extended in Chapter 6 to the mesh with bypass connections, which enhances the communication capability, especially for global data movements, over the mesh with nearest-neighborhood interconnections. Finally, Chapter 7 summarizes the main contributions and suggests further research issues.

Chapter 2

Orthogonal Multiprocessor Architecture

2.1 The Orthogonal Multiprocessor

The orthogonal multiprocessor is characterized by its regular structure and shared memory organization. This section introduces the architecture of the multiprocessor and contrasts its similarities and differences with other multiprocessor systems.

2.1.1 The System Architecture

The logical architecture of an OMP system is depicted in Fig. 2.1.
The system is constructed with n processors P_i, for i = 0, 1, ..., n-1, and n^2 memory modules M_ij, for i, j = 0, 1, ..., n-1. These resources are interconnected with n dedicated buses B_i, for i = 0, 1, ..., n-1. Each bus is dedicated to one processor only; there is no time sharing of the buses by multiple processors. This greatly reduces memory access conflicts due to the absence of bus contention. The n buses can be functionally divided into two operational sets: n row buses B_i^r and n column buses B_i^c, for i = 0, 1, ..., n-1. Physically, B_i^r and B_i^c are the same bus B_i. When B_i is used in row access mode, B_i^r is enabled and B_i^c is disconnected. Similarly, B_i^c is enabled and B_i^r is disconnected when B_i operates in column access mode. The two access modes are mutually exclusive. This constraint simplifies the memory access control significantly. All arbiters in the memory modules use only two-way switches. The arbiters are coordinated by a memory access controller, which has two control states, one for column access and the other for row access. If we removed the orthogonality constraint, we would need n^2 independent arbiters, which demand much more hardware control complexity.

Figure 2.1: The orthogonal multiprocessor configuration of 4 processors and 16 memory modules interconnected by 4 memory buses.

2.1.2 Shared Memory Organization

We regard the OMP as a partially shared-memory system, because each memory module is shared by at most two processors. Each memory module M_ij has two access ports, one for B_i^r and the other for B_j^c. This implies that M_ij is accessible only by processor P_i and processor P_j. With the orthogonality and partial sharing of memory, possible memory access conflicts are significantly reduced; hence a higher processor-memory bandwidth can be established. The diagonal memory modules M_ii, for i = 0, 1, ..., n-1, are accessed by processor P_i exclusively. In other words, the diagonal modules are local memories to each processor. The off-diagonal modules are shared memories, each shared by two processors only.
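As a concrete illustration of this sharing pattern, the following minimal sketch (not from the thesis; the function names and the size N are assumptions) models which modules each processor can reach over its single bus, and which modules any two processors share for two-cycle communication:

```python
# Illustrative sketch (not from the thesis): a minimal model of OMP memory
# sharing.  Module M[i][j] is reachable by P_i (row mode) and P_j (column mode),
# so P_i and P_j share exactly the pair M_ij / M_ji, and the diagonal module
# M_ii is private to P_i.

N = 4  # number of processors; the memory array has N x N modules

def modules_of(p):
    """All modules on processor p's bus: row p (row-access mode) plus
    column p (column-access mode)."""
    return {(p, j) for j in range(N)} | {(i, p) for i in range(N)}

def shared_modules(p, q):
    """Modules accessible by both P_p and P_q; P_p can deposit data there in
    one memory cycle and P_q can read it in the next."""
    return modules_of(p) & modules_of(q)

print(shared_modules(0, 2))   # -> {(0, 2), (2, 0)}
print(shared_modules(1, 1))   # -> every module in row 1 and column 1
```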
2.1.3 Comparison with Other Multiprocessors

The hardware demands of the OMP architecture are compared with several known multiprocessor architectures in Table 2.1. The comparison is based on an equal number of processors (n) and the same main memory capacity (C words). A fully shared-memory multiprocessor has n memory modules, each with a modular capacity of C/n words accessible by n processors. In the case of the OMP, there are n^2 memory modules, each having C/n^2 words accessible by at most two processors. The multi-bus architecture uses multiple time-sharing buses and multiported memory modules [MHW87]. The major differences between a fully shared-memory system using multiple buses and the OMP architecture lie in the number of memory ports per module (n ports versus 2 ports), the modular capacity (C/n words versus C/n^2 words), and the effective memory bandwidth (n words versus n^2 words per major cycle). The crossbar architecture requires n^2 cross-point switches, and the multistage network demands n log n 2-by-2 switches. The control complexity of these fully shared-memory architectures is much higher than that of an OMP, as shown in Table 2.1, especially in switching complexity and the degree of memory sharing.

Table 2.1: Hardware Complexity of Various Multiprocessor Architectures

                             Fully shared-memory multiprocessors                          OMP with partially
Architectural feature        Crossbar switch    Multiple buses    Multistage network      shared memory
Number of processors         n                  n                 n                       n
Number of memory modules     n                  n                 n                       n^2
Module capacity              C/n                C/n               C/n                     C/n^2
Degree of memory sharing     n processors       n processors      n processors            2 processors
Switching complexity         n^2 cross points   n shared buses    n log n switch boxes    n nonshared buses
Maximum memory bandwidth     n                  n                 n                       n^2

The OMP requires an interconnection complexity of n buses with only 2 control states, which is much simpler than the crossbar switch and the multistage network. With the use of n buses and n^2 dual-ported memory modules, the OMP memory array should be able to operate with a faster memory cycle and simpler control than any of the fully shared-memory configurations. The bandwidth of the OMP memory varies between n and n^2 words depending on how frequently the interleaved mode is used on the various buses. Of course, fully shared-memory multiprocessors have higher flexibility in supporting general-purpose applications. The OMP, with restricted memory access, is really meant for special-purpose applications in science, engineering, and high technology.

2.2 Principles of Orthogonal Memory Access

2.2.1 Orthogonal Access Principle

The orthogonally accessed memory offers a number of attractive application potentials. Given any pair of processors (P_i, P_j), they can communicate with each other through the shared memory modules M_ij or M_ji in one or at most two memory cycles. Data or instruction broadcasting can be done in exactly two memory cycles. With n buses active at a time, the maximum memory bandwidth is n words per major memory cycle, which equals that of systems using a crossbar switch or a multistage network. This implies that n parallel reads/writes can be carried out per major cycle.

To characterize the orthogonal memory access patterns, let M_ij[k] denote the k-th word in memory module M_ij. The i-th row memory, M_i^r, consists of the n memory modules M_ij for j = 0, 1, ..., n-1; similarly, the j-th column memory, M_j^c, consists of M_ij for i = 0, 1, ..., n-1. The parallel accesses of M_i^r and of M_j^c are depicted in Fig. 2.2a and 2.2b, respectively. Note that B_i^r is used to access M_i^r and B_j^c is used for M_j^c. Table 2.2 summarizes the notations used in this chapter.

Figure 2.2: Orthogonal memory access modes in OMP ((a) row access mode, (b) column access mode).

In fact, the above memory access allows various row and column permutations, as shown in Fig. 2.3. The following access rules must be observed: when row buses are used, only modules from distinct rows can form a legitimate access pattern, as shown in Fig. 2.3a. Similarly, only modules from distinct columns can be accessed in parallel using the column buses, as shown in Fig. 2.3b. Mixed access patterns are forbidden, as shown in Fig. 2.3c.
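The access rules above amount to a simple conflict check on the requested module coordinates. The following sketch (not from the thesis; function and variable names are my own) makes the rule concrete for one memory cycle:

```python
# Illustrative sketch (not from the thesis): checking whether a set of module
# requests is a legitimate OMP access pattern.  In row-access mode the selected
# modules must come from distinct rows (one module per row bus); in
# column-access mode they must come from distinct columns.  Mixing the two
# modes in the same cycle is never allowed.

def is_legitimate(requests, mode):
    """requests: list of (row, col) module indices, one per requesting processor.
    mode: 'row' or 'column'."""
    if mode == "row":
        rows = [r for r, _ in requests]
        return len(rows) == len(set(rows))      # distinct rows -> no bus conflict
    if mode == "column":
        cols = [c for _, c in requests]
        return len(cols) == len(set(cols))      # distinct columns -> no bus conflict
    return False

print(is_legitimate([(0, 2), (1, 2), (3, 0)], "row"))     # True  (cf. Fig. 2.3a)
print(is_legitimate([(0, 2), (1, 2), (3, 0)], "column"))  # False (two modules in column 2)
```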
2.2.2 Interleaved Memory Design

To improve the memory bandwidth, one can consider the use of a two-dimensional interleaving scheme on the OMP memory array. All modules belonging to the same row or the same column can be n-way interleaved to allow pipelined accesses with a minor cycle, which is 1/n of the major memory cycle. These row and column interleavings form the two-dimensional addressing illustrated in Fig. 2.4 for the case of n = 4 and k = 4 words per module. The row interleaving assumes a stride distance of one, while the column interleaving assumes a stride distance of nk = 16. With n-way interleaving per row and per column, the effective memory bandwidth can be enhanced from n words to n^2 words per major memory cycle. Usually, the interleaved memory access is used for the execution of vector instructions or for access to regular data structures [HX88]. The random access mode is for randomly accessing scalar operands stored irregularly in the memory array. The random access mode (Fig. 2.3) and the interleaved mode (Fig. 2.4) must not be mixed on the same memory bus at the same time. When an interleaved mode is used, only modules attached to the same row bus (or the same column bus) are accessed in an overlapped fashion using minor cycles. Theoretically, the two access modes could be applied on different row buses (or different column buses) at the same time, but this would require very complex addressing control. For our purposes, we restrict memory access to be either in random-access mode or in interleaved mode (but not both) for any given time period. Of course, this does not preclude the use of the two modes alternately in different time periods. Concurrent vector processing is done using multiple buses, each in an interleaved mode.

Figure 2.3: Legitimate and illegitimate memory access patterns (accessed memory modules are shaded): (a) a legitimate row access, (b) a legitimate column access, (c) an illegitimate access using mixed modes.

Figure 2.4: Two-dimensional 4-way interleaving in the OMP memory array. The numbers inside each module are memory addresses, with stride distance 1 between modules in adjacent columns and 16 between modules in adjacent rows.
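The address map of Fig. 2.4 can be written down directly. The sketch below (not from the thesis) reproduces it for n = 4 and k = 4; the exact bit-level layout is my assumption, chosen to match the stated strides of 1 across adjacent columns and n*k = 16 across adjacent rows:

```python
# Illustrative sketch (not from the thesis): the 2-D interleaved address map
# of Fig. 2.4 for n = 4 modules per row/column and k = 4 words per module.

n, k = 4, 4                # modules per dimension, words per module

def locate(addr):
    """Map a linear address to (module row, module column, word within module)."""
    row = addr // (n * k)            # consecutive blocks of n*k = 16 addresses per row
    col = addr % n                   # stride 1 interleaves across the n columns
    word = (addr % (n * k)) // n     # position inside the selected module
    return row, col, word

# A unit-stride vector sweeps the 4 modules of one row (pipelined row access),
# while a stride-16 vector sweeps the 4 modules of one column.
print([locate(a) for a in range(0, 4)])        # row 0, columns 0..3
print([locate(a) for a in range(0, 64, 16)])   # column 0, rows 0..3
```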
Table 2.2: List of Notations Used

Notation   Meaning
P_i        Processor i, 0 <= i <= n-1
B_j        Bus j, 0 <= j <= n-1
B_j^c      Bus j used in column access mode
B_j^r      Bus j used in row access mode
M_ij       The (i,j)-th memory module, 0 <= i, j <= n-1
M_j^c      All memory modules in the j-th column
M_i^r      All memory modules in the i-th row

The OMP architecture is aimed at solving large-grain problems, where the data and programs are distributed over a large array of memory modules. The MIMD mode is adopted for flexibility in applications. The OMP is compared with other shared-memory multiprocessor architectures in the next section, and the mapping of various parallel algorithms onto the OMP system is presented in Chapter 4, followed by an analysis of performance and design tradeoffs and a discussion of programming and application issues.

2.3 Memory Bandwidth Analysis

2.3.1 Effective Memory Bandwidth

We analyze below the effective memory bandwidth of the OMP architecture. The effective memory bandwidth E is the average number of memory accesses per memory cycle, defined as

    E = \sum_{k=0}^{N} k \cdot Prob(N, k)                                                    (2.3.1)

where Prob(N, k) is the probability that k processors out of N request memory accesses in a memory cycle, and we assume there are N processors and N shared memory modules. Let p be the probability that a processor requests a memory access (the memory access rate). In the OMP, a memory access is performed in either column access mode or row access mode. Let q_c be the probability that a processor requests a column access; then q_r = 1 - q_c is the probability of requesting a row access. The memory bandwidth E is derived below as a function of n, p, and q_c.

Theorem 2.1  The orthogonal multiprocessor with n processors has the following effective memory bandwidth:

    E = \sum_{j=1}^{n} \binom{n}{j} p^j (1-p)^{n-j}
        \left[ \sum_{k=0}^{\lceil j/2 \rceil - 1} (j-k) \binom{j}{k} q_c^k (1-q_c)^{j-k}
             + \sum_{k=\lceil j/2 \rceil}^{j} k \binom{j}{k} q_c^k (1-q_c)^{j-k} \right]      (2.3.2)

Proof  The probability Prob(k) that k processors request column accesses in the same memory cycle is

    Prob(k) = \sum_{j=k}^{n} P_m(j) \cdot P_c(k|j)                                            (2.3.3)

where P_m(j) is the probability that exactly j out of n processors request memory accesses, and P_c(k|j) is the conditional probability that k processors request column accesses and j-k processors request row accesses, given that there are j memory requests in total. Using Bernoulli trials, P_m(j) is obtained as

    P_m(j) = \binom{n}{j} p^j (1-p)^{n-j}                                                     (2.3.4)

Similarly,

    P_c(k|j) = \binom{j}{k} q_c^k (1-q_c)^{j-k}                                               (2.3.5)

Suppose j processors request memory accesses. If k of them want column access and j-k want row access, the access is carried out in the majority mode among the j requests, due to the orthogonality. Hence the number of successful memory accesses in each memory cycle is max(k, j-k). The bandwidth is thus obtained by summing over all possible values of j:

    E = \sum_{j=1}^{n} P_m(j) \left[ \sum_{k=0}^{\lceil j/2 \rceil - 1} (j-k) P_c(k|j)
                                   + \sum_{k=\lceil j/2 \rceil}^{j} k P_c(k|j) \right]        (2.3.6)

where the first term in the brackets corresponds to row accesses and the second to column accesses. The proof is complete when Eq. 2.3.4 and Eq. 2.3.5 are substituted into Eq. 2.3.6.  Q.E.D.
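For a numerical feel of Theorem 2.1, the short sketch below (not from the thesis; parameter values are examples only) evaluates Eq. 2.3.2 directly. Instead of splitting the inner sum at ceil(j/2), it sums max(k, j-k) terms, which is the same quantity since the majority access mode wins:

```python
# Illustrative sketch (not from the thesis): numerical evaluation of the OMP
# effective memory bandwidth of Theorem 2.1.

from math import comb

def omp_bandwidth(n, p, qc):
    """Expected successful accesses per major cycle for an n-processor OMP,
    with access rate p and column-access probability qc."""
    total = 0.0
    for j in range(1, n + 1):                          # j processors make requests
        pm = comb(n, j) * p**j * (1 - p)**(n - j)      # Eq. 2.3.4
        inner = sum(max(k, j - k) * comb(j, k) * qc**k * (1 - qc)**(j - k)
                    for k in range(j + 1))             # Eqs. 2.3.5-2.3.6 combined
        total += pm * inner
    return total

print(round(omp_bandwidth(32, 0.5, 0.5), 2))           # expected accesses per cycle
```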
2.3.2 Bandwidth of Crossbar-Connected Multiprocessors

For a multiprocessor with n processors and n memory modules interconnected by an n x n crossbar switch network, the effective memory bandwidth has been formulated by Bhuyan [Bhu84] as

    E = n \left\{ 1 - (1 - pq) \left( 1 - \frac{p(1-q)}{n-1} \right)^{n-1} \right\}           (2.3.7)

where q represents the probability of a request to access a favorite memory. A favorite memory is a memory module that a processor accesses more often than the rest. The probabilities of accessing the remaining modules are assumed equal in the above formulation. A detailed derivation of the bandwidth is given below.

Consider a multiprocessor with N processors interconnected with M shared memory modules. All memory accesses are assumed to be synchronized; each processor submits a memory request, if any, at the beginning of a memory cycle. We assume that the requests are random and independent of the other processors (spatial independence), and requests which are not accepted are rejected (random choice of acceptance). Each memory module and processor has the same pattern of requests; i.e., if memory module 1 (M_1) has a set of access probabilities {q_1, q_2, ..., q_N} from processors 1, 2, ..., N, respectively, all the other modules see the same set of probabilities from the processors, though the corresponding mapping is different. From the point of view of a processor, this set represents the probabilities of access to all memory modules. We call this condition homogeneous. If N != M, the homogeneous condition does not exist; hence we concentrate the analysis on the case N = M. Since a processor generates an access request to some memory module with a certain probability, the total over all possible cases must equal one:

    \sum_{k=1}^{N} q_k = 1,   with 0 <= q_i <= 1 for all i = 1, 2, ..., N.

Let X denote the probability that there is at least one access request to a certain memory module, for example M_m. Then the probability that exactly k of the N memory modules receive at least one request is given by the binomial (Bernoulli trial) distribution:

    Prob(N, k) = \binom{N}{k} X^k (1 - X)^{N-k}.

The effective bandwidth defined above becomes a linear function of X, by the mean of the binomial distribution:

    E = \sum_{k=0}^{N} k \cdot Prob(N, k) = N X.                                              (2.3.8)

The assumption here is that up to N requests can be served at a time, which is possible with an N x N crossbar network.

Now the probability X is computed. The probability that at least one access request exists for a memory module is 1 - Prob(no requests to the module at all). Since there are N processors, the event of no request to the module happens only when none of the processors requests a reference to the module. This is the joint event that P_1 does not access M_m, and P_2 does not access M_m, ..., and P_N does not access M_m. Since we assumed spatial independence, the probability of this joint event is the product of the probabilities of the individual events. Let q_i denote the probability that processor P_i requests an access to a certain module, say M_m:

    Prob(P_i -> M_m) = q_i,   0 <= q_i <= 1.

Assume that each processor generates memory access requests at rate p (0 <= p <= 1). Given p, the (conditional) probability that processor P_i requests an access to memory M_m is then

    Prob(P_i -> M_m | p) = p q_i,

and the probability that P_i does not access M_m is 1 - p q_i.
Using the joint event, the probability X becomes a function of p and q_1, ..., q_N:

    X = 1 - Prob(no request at all)
      = 1 - (1 - p q_1)(1 - p q_2) \cdots (1 - p q_N)

    X = 1 - \prod_{k=1}^{N} (1 - p q_k),   0 <= X <= 1.                                       (2.3.9)

Since the favorite module is accessed with probability q, and the remaining N-1 modules are accessed with uniform probability q_k = (1-q)/(N-1), k = 2, ..., N, the bandwidth is obtained by substituting these q_k into equations (2.3.8) and (2.3.9):

    E = N \left\{ 1 - (1 - pq) \left( 1 - \frac{p(1-q)}{N-1} \right)^{N-1} \right\}.

2.3.3 Comparisons

For the sake of comparison, we consider the diagonal local memories as the favorite ones in the OMP architecture, because processors access their local memories more frequently than the shared modules. Equal access probabilities are assumed for all off-diagonal memory modules. We divide the access probability q_c in Eq. 2.3.5 into two terms as follows:

    q_c = q + \frac{1-q}{2} = \frac{1+q}{2}                                                   (2.3.10)

The term q accounts for accessing the favorite memory, and the term (1-q)/2 accounts for accessing the remaining modules in the same column or in the same row.

Figure 2.5 shows the effective bandwidths of the two multiprocessor architectures, the crossbar system versus the OMP. The curves are plotted as a function of the access probability p for two cases: Fig. 2.5a corresponds to a higher probability of accessing the favorite memory (q = 0.75) and Fig. 2.5b to a lower value (q = 0.5). The plots demonstrate that the effective memory bandwidth of the OMP is comparable to that of the crossbar network.

Figure 2.5: Comparison of effective memory bandwidths of a crossbar-connected multiprocessor and an orthogonal multiprocessor of equal size (bandwidth in words/memory cycle versus access probability p, plotted for 16 and 32 processors): (a) favorite memory access rate q = 0.75, (b) favorite memory access rate q = 0.50.

In summary, we find that the OMP architecture requires much less control hardware by using dedicated memory buses (rather than buses time-shared across many processors). The increased memory bandwidth is due to conflict-free access, a lower degree of memory sharing, and two-dimensional memory interleaving. In the OMP, resource conflict problems such as hot spots and bus contention are avoided by the synchronized orthogonality among multiple processors. All of these features are crucial in making the OMP an attractive parallel-processing architecture for scientific computations.
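The crossbar side of the comparison can be tabulated just as directly. The sketch below (not from the thesis) evaluates Eq. 2.3.7; the chosen n, q, and p values are examples in the spirit of Fig. 2.5, and the corresponding OMP curve can be reproduced with the omp_bandwidth() sketch given after Theorem 2.1, using q_c = (1+q)/2 from Eq. 2.3.10:

```python
# Illustrative sketch (not from the thesis): the crossbar bandwidth of
# Eq. 2.3.7, i.e. Eqs. 2.3.8-2.3.9 specialized to one favorite module.

def crossbar_bandwidth(n, p, q):
    """Expected number of busy memory modules per cycle, E = n * X."""
    x = 1 - (1 - p * q) * (1 - p * (1 - q) / (n - 1)) ** (n - 1)   # Eq. 2.3.9
    return n * x                                                   # Eq. 2.3.8

for p in (0.25, 0.5, 0.75, 1.0):
    print(p, round(crossbar_bandwidth(32, p, 0.75), 1))
```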
2.4 Simulation Capability

In this section, we analyze the performance and capability of the OMP as compared with other parallel computers such as the MCC and hypercube systems. Many matrix and graph algorithms follow regularly structured data flow patterns, and most of these algorithms can be efficiently mapped onto the MCC [AK84,DNS81,Joh87,MS85]. Other algorithms which demand global data transfers, such as sorting, FFT, and other arbitrary data movement operations, can be mapped onto the hypercube system efficiently. The OMP can simulate these architectures with linearly scalable performance. We compare below an OMP of n processors with an MCC containing kn x kn processors. Note that k^2 memory modules of the MCC are mapped into a single memory module of the OMP. Assume that each memory module in the MCC has a capacity of r words. Then each memory module of the OMP must hold at least rk^2 words, i.e., rk^2 <= m, where m is the capacity of each OMP memory module. When r and k are large, rk^2 n^2 data points imply truly large-grain computations. The following two theorems show that the OMP can effectively simulate either the MCC or the hypercube computers using only a small number of fast processors and memory modules.

2.4.1 Mesh-Connected Multiprocessors

Theorem 2.2  Any algorithm demanding T steps on an MCC of kn x kn processors can be executed in at most k^2 n T steps on an OMP of n processors.

Proof  We map the kn x kn blocks of data in the MCC onto the n x n memory modules of an OMP. Each k x k block is mapped onto k^2 r consecutive words of an OMP memory module; i.e., the memory locations between M_ij[(pk+q)r] and M_ij[(pk+q+1)r - 1] hold the (I, J)-th memory module of the MCC, where (I, J) = (ik+p, jk+q) and 0 <= i, j < n, 0 <= p, q < k. To simulate the operations of the MCC, processor P_i of the OMP performs the local computations of the MCC processors in columns ik to (i+1)k - 1. The OMP processors simulate the north-south data movements through column accesses and, similarly, the east-west data transfers through row accesses. Each execution step of the MCC can be performed on an OMP in k^2 n steps. Thus an algorithm demanding T steps on an MCC requires at most k^2 n T steps on an OMP.  Q.E.D.

2.4.2 Hypercube Computers

Next, we consider the simulation of a hypercube computer with kn^2 processors using an OMP of n processors, where the kr data words of each group of k hypercube nodes are handled by one memory module of the OMP. Note that the values of n and k must be chosen such that kn^2 = 2^N for some integer N.

Theorem 2.3  An algorithm demanding T steps on a hypercube computer of kn^2 processors can be executed in at most knT steps on an OMP of n processors.

Proof  The hypercube memories are mapped into a rectangular array of kn x n blocks of r words each; i.e., M_ij[lr] to M_ij[(l+1)r - 1] correspond to the local memory of node K, where K = (jn + i)k + l, 0 <= i, j <= n-1, 0 <= l <= k-1, and kr <= m. Each OMP processor P_i simulates the local computations of nodes ikn to (i+1)kn - 1 of the hypercube computer. Processors rearrange data stored in their column memory modules to simulate data transfers of distances 1, 2, 4, ..., kn/2 in the hypercube computer; similarly, row memory modules are used for distances kn, 2kn, 4kn, ..., kn^2/2. Each computing step and each routing step of the hypercube computer can be performed in kn steps on an OMP.  Q.E.D.

The above methods provide systematic ways of mapping onto the OMP architecture any parallel algorithm originally designed for the MCC or for hypercube systems. The time complexities obtained by simulating an MCC or a hypercube computer on an OMP (Theorems 2.2 and 2.3) give upper bounds for the OMP architecture. The mesh and hypercube multiprocessors are both popularly used in many scientific applications, which indicates that the OMP is indeed a powerful architecture for scientific and engineering computations.
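The data layout in the proof of Theorem 2.2 reduces to a little index arithmetic. The sketch below (not from the thesis; the helper function is my own illustration) locates an MCC node's r-word block inside the OMP memory array:

```python
# Illustrative sketch (not from the thesis): the data mapping of Theorem 2.2.
# MCC node (I, J) of a kn x kn mesh is stored as an r-word block inside OMP
# memory module M_ij, with (I, J) = (i*k + p, j*k + q).

def mcc_to_omp(I, J, n, k, r):
    """Return (module row i, module column j, first word offset) for MCC
    node (I, J)."""
    i, p = divmod(I, k)
    j, q = divmod(J, k)
    assert 0 <= i < n and 0 <= j < n, "node lies outside the kn x kn mesh"
    first_word = (p * k + q) * r       # block M_ij[(pk+q)r .. (pk+q+1)r - 1]
    return i, j, first_word

# Example: a 16 x 16 mesh (k = 4) folded onto a 4 x 4 OMP memory array, r = 8 words.
print(mcc_to_omp(5, 10, n=4, k=4, r=8))   # -> (1, 2, 48)
```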
2.5 Conclusions

We conclude by summarizing the advantages and shortcomings of the OMP architecture. Based on the results presented, the OMP architecture has the following distinct advantages:

(1) The control complexity for parallel memory accesses in the OMP is very low compared with fully shared-memory systems, because of the synchronized orthogonality, the smaller modular capacity, and the lower memory access time.

(2) With the orthogonal memory access and the 2-D interleaved memory organization, the effective memory bandwidth is potentially n times higher than in fully shared-memory multiprocessors using a crossbar switch, multiple buses, or a multistage network.

(3) The OMP architecture has been demonstrated to be very powerful for implementing a large class of scientific algorithms. Some of these results have also been independently confirmed by Scherson and Ma [SM87], who prove that the performance of the OMP architecture compares favorably, within a factor of three, with the best known multiprocessor architectures.

(4) Many parallel algorithms are attractive candidates for efficient mapping onto the OMP architecture. Essentially, these are the algorithms which can exploit the orthogonal memory access property. We list the candidate algorithms in Table 2.3 [DNS81,Hwa87,HD88,MS85,OV85,Qui87]. A few of them have been thoroughly mapped onto the OMP [HTK89]. Finding systematic methods for automatic problem decomposition and mapping of those algorithms is an open problem for further research.

Table 2.3: Candidate Algorithms for Parallelization on the Orthogonal Multiprocessor

Category                                 Algorithms and Computations
Vector/Matrix Arithmetic                 Matrix Multiplication; Sparse Matrix Operations; Linear System Solution; Eigenvalue Computations; Least Squares Problems
Signal/Image Processing                  Convolution and Correlation; Digital Filtering; Fast Fourier Transforms; Feature Extraction; Pattern Recognition; Scene Labeling by Relaxation
Optimization, Sorting and Searching      Linear Programming; Integer Programming; Branch and Bound Algorithms; Constrained Optimization
Differential Equations                   Ordinary Differential Equations; Partial Differential Equations; Finite-Element Analysis; Domain Decomposition; Numerical Integration
Special Functions and Graph Algorithms   Power Series and Functions; Graph Matching; Interpolation and Approximation

Two obvious shortcomings of the OMP architecture are identified below. We point out these drawbacks in order to inspire continued research efforts to overcome the stated difficulties.

(1) The orthogonal memory access principle prohibits memory accesses with mixed modes. This may prolong the communication between processors and reduce the flexibility in mapping those algorithms which lack the property of access orthogonality.

(2) The number of memory modules increases as the square of the number of processors, which makes the system hard to use for large-scale parallel computation.

Despite these shortcomings, the advantages of the OMP architecture are sufficiently strong to establish its efficiency when the OMP is applied to matrix algebra, signal/image processing, sorting, linear programming, FFT, and parallel PDE solutions. The OMP architecture is well suited for modular construction using commercially available microprocessors, random-access memory chips, and off-the-shelf high-bandwidth buses.
Because of the partial memory sharing, the orthogonal access of data, and the regular interconnection structure, the OMP architecture is most suitable for large-grain scientific and engineering computations, as often demanded in mainframe systems or supercomputers. Based on current technology, the OMP does not support fine-grain, massive parallelism.

Chapter 3
Hardware Design of the Orthogonal Multiprocessor

This chapter deals with the detailed hardware implementation of the orthogonal multiprocessor. A functional design is provided using off-the-shelf IC components, and alternative components are suggested for possible design changes. A modular design is pursued in order to integrate the system components with VLSI or Wafer Scale Integration technology in the future.

3.1 Overall System Configuration

A complete OMP system consists of the multiprocessor, a host computer, and some attached peripherals. A possible implementation configuration is shown in Fig. 3.1. The host handles input/output, software development, and resource management. A SUN 4/160 workstation can be used to provide facilities such as the console, disk drive, tape unit, and hardcopy terminals. It also helps diagnose system faults and monitor system performance. An interface between the host and the OMP mediates data exchange, program loading/unloading, and other input/output activities. The interface facilitates high-speed data transfer using Direct Memory Access.

Figure 3.1: The implementation configuration of the OMP architecture.

A prototype OMP is designed using IC chips and components commercially available by 1988. 32-bit microprocessors are chosen, including the Intel 80386 family and the Motorola 68020 and 88000 series. The prototype OMP consists of 16 processors and 16 × 16 memory modules. The expected peak performance is 80 MIPS (or 24 MFLOPS for 64-bit precision) with the 16 processors. Each memory module contains 128 Kbytes; the total memory is thus 128K × 16 × 16 = 32 Mbytes, fitting on 16 memory boards. The memory can be easily expanded by adding more storage space to the local memories. The processors are also grouped into boards of 4 processors each; hence a total of four processor boards are required. Fault tolerance is provided by adding spare processors and extra column or row memory modules. The fault monitor constantly watches the system to detect any failure and to suppress its propagation. If there is a failure, the entire column or row memory containing the failed module is logically disconnected by the Bus Interconnect, and the spare hardware is then used. A detailed system description of the processors, memory system, and peripherals is given in the subsequent sections.

3.2 OMP Processor Specification

3.2.1 Candidate Microprocessors and Coprocessors

The 32-bit microprocessors we have considered include the Intel 80386, the Motorola MC68020, the National Semiconductor NS32332, and the Inmos T-800 Transputer. The choice is based on the computational requirements and on performance expandability. The microprocessors are assessed below for our purposes.
The Inmos Transputer T-800 is a single-chip VLSI device combining a processor, memory, and communications. Each chip contains multiple communication facilities, including an autonomous I/O processor, a DMA interface, and four serial processor-to-processor interfaces. It has an on-chip coprocessor which performs 1 MFLOPS with 64-bit precision. For interprocessor communications, the Transputer uses four serial point-to-point links that are completely independent of the memory and I/O buses.

The National Semiconductor NS32332 is a 32-bit CPU targeted at applications in engineering and CAD workstations [Mat85]. The microprocessor is designed to reduce the bus interference between direct memory access, multiple CPUs, and graphics by reducing memory bus traffic. The interference is reduced in two ways: it maximizes the information in each transfer, and it eliminates transfers altogether by keeping information where it is needed. The reduction in memory traffic is achieved at the expense of increased on-chip complexity.

The Motorola MC68020 is a 32-bit microprocessor having non-multiplexed address and data buses. The outstanding features of the MC68020 are the on-chip cache, the execution pipeline, a coprocessor interface, and several additional addressing modes useful for high-level languages and data structures [MMM84]. The instruction cache consists of 64 32-bit entries for a total of 256 bytes. Under normal operation, the user has no knowledge of the presence or absence of the cache. The CPU has a three-stage pipeline to improve the execution speed by allowing overlapped prefetch, immediate-operand extraction, and decode of up to three successive instructions. Further performance improvement can be achieved by attaching a special-purpose coprocessor. The MC68020 incorporates a hardware communications protocol for synchronous coprocessor operation, such that the main processor is idle while the coprocessor is running.

The characteristics of these microprocessors are summarized in Table 3.1. The Intel 80386 microprocessor is provided with various functional chip support for connecting peripherals to enhance the performance. The following prototype is based on the 80386 augmented with a coprocessor, cache, and DMA.
Thus a total of 80 MIPS peak performance is expected for a 16-processor OM P sys tem . The 80386 processor will employ the Intel 80387 chip among the following coprocessors for fast numerical d ata processing. There are two floating-point coprocessors which can be m atched w ith the 80386 microprocessor: Intel 80387 and Weitek 3364 FPU . The 80387 supports single-, double-, and extended-precision operations. The 80387 is compatible with IEEE 754 floating point standard. The 80387 delivers minicomputer-like perform ance of 1.5 M W hetstone/s for 16 MHz 80386/87-based system. The W hetstone Code is a synthetic imm ediate code for benchm arking which covers various operations like arithm etic, branch, procedure calls, standard functions, etc [Ben75]. The 80387 has a tightly-coupled interface w ith 80386 and sup ports for trigonom etric, exponential and logarithmic functions. O ther support for 80387 includes the Unix System V/386 Release 3.0 operating system and optimizing compilers. 37 T ab le 3.1: C h a ra c te ris tic s o f th e 3 2 -b it M ic ro p ro c e s so rs M anufacturer Intel M otorola National Inmos Semicon. T ransputer Product iA P X 80386 M C 6 8 0 2 0 N S 3 2 3 5 2 T -8 0 0 1.5/irai 1 /L tm 1.5/zm 1.5/jm Technology CMOS HMOS CMOS CMOS M aximum Clock Frequency 2 0 MHz 25 MHz 30 MHz 30 MHz Word Length 32 32 32 32 E xternal A ddress/ D ata Bus 32/32 32/32 32/32 32/32 Multiplexed? No No No Yes On-chip 32-page 1Kbyte D at a/In stru ct ion T L B / 128x16/ D ata Cache 1KX32 / Memory 16 byte 128x16 512 byte 1K x 32 prefetch queue Inst. Cache Time to Load 1 Word - 150 ns 150 ns 2 0 0 ns Add 2 Words - 97.5 ns 133 ns 50 ns Peak/Sustained - - 15 MIPS 10 MIPS Performance 4 MIPS - 8-10 MIPS - I 38 The Weitek W TL 3364 integrates on one chip a 64-bit floating point m ulti plier, a 64-bit floating point ALU, a divide/square-root unit, and a 32-wordx64- bit register file. This chip set does not fully implement the IEEE 754 standard b ut it is compatible with the standard. Performance up to 2 0 Mflops can be realized with the coprocessor unit. If the OM P is applied for intensive floating point operation, the Weitek coprocessor is quite suitable. The Unix System V/386 Release 3.0 also provides for optimizing compiler support for the 3364. 3.2.3 Cache D esign and Coherence Control W ith its two-clock bus, the 80386 can put out an address in one clock and get d ata back in the next clock. At 16 MHz, each clock cycle is 62.5 ns. For zero- w ait-state operation, the memory access tim e should be faster than 50 ns. Static RAMs alone have similar fast access time, but they are too expensive to use for a large array. A small but fast SRAM cache memory (like 1DT71024 SRAM from Integrated Device Technology) with large but slower DRAM memory system will be economic and cost effective. The m ost often used code and d ata can reside in the fast-access cache while less often used code and d ata rem ain in the slower m ain memory. In designing a cache memory system, price/perform ance and cache co herency implications are the factors considered. In a common-bus architecture, the cache system needs to m onitor the bus via bus watching so th a t the processor can update the cache when d ata are modified. In the m ultiprocessor environ m ent, however, we have several caches which may contain some shared data between processors. Hence each processor should broadcast its w rite operations to all caches. Every cache then examines its assigned addresses to see if the broadcast item is already loaded in its storage. 
If it is, the corresponding cache page should be invalidated or updated. This scheme lets each cache continuously m onitor all activities on the memory buses to detect references to its stored data by other processors. W hen such a reference occurs, each cache then directly up- I I I ; dates the status of its own contents. These activities are all perform ed by the aid of added hardw are not to slow down the memory accesses. In the OM P sys- ■ tem , the Memory Access Controller can supply the memory w rite information to specific caches whenever a new memory write occurs. Thus no broadcast is needed. The controller should be fast enough to give out all write information 1 in a memory cycle to related caches. Another im portant consideration is the I i software support required for the cache. The cache system m ust be software- • transparent for com patibility w ith existing software and ease of upgradability in the future. The Intel 82385 is a cache controller which is specifically designed to in- t 1 terface w ith the 80386 and th at can address the entire 4-Gbyte m ain memory i , address range. Because the tag-memory and control logic are on-chip, design of j a cache memory subsystem is greatly simplified. The chip supports a 32-Kbyte 1 two-way set-associative cache th a t has bus-watching capability. The 82385 pro- ■ hibits stale d ata and guarantees cache coherency of d ata and code in systems in which other bus m asters have access to the same memory space. It requires no software initialization and is software-transparent. Refer to Fig. 3.2 for the configuration of cache and its controller. : 3.2.4 D irect M em ory A ccess i i The Direct Memory Access (DMA) control is very critical for the system I i performance among other peripherals. D ata movement between m ain memory i | and secondary storage, between I/O devices can be perform ed w ithout proces- | sors’ direct supervision. Processors are relieved from the I/O transfer duties, hence the system performance increases since the CPU doesn’t have to switch context for every transfer. The DMA controller should be able to facilitate a fast transfer th a t fully utilze the bus bandw idth. It should also support various ! the d ata form at for m any peripherals (8-bit through 32-bit form at). The 82380 is the newest generation DMA controller among the m atching controllers Intel 40 80387 80386 Coprocessor Processor a d d r e ss data c o n tr o l 1DT71024 Cache 82385 Cache Controller bus To OMP Memory Array watching ROM L 82380 DM A Controller a d d r e s s d a ta c o n tr o l Memory Access Controller Bus Interface To Host & I/O M emory Bus (Functional units in the shared area are dedicated to one processor.) S y s te m Bus (M ultibus II) F igu re 3.2: A p o ssib le O M P p rocessor con figu ration u sin g th e In tel 80380 chip set. 41 supports for 80386 processor. The 82380 integrates a num ber of other com mon system peripheral functions such as DRAM refresh control, program m able tim ers and interrupt control and w ait-state generation. The 82380 can transfer data through eight individual program m able channels at the full bus bandw idth of the 80386-40 M B ytes/s at 20 MHz. The OM P system contains m ultiple pro cessors, thus DMA controllers should support independent data transfer requests between peripherals, I/O devices, and memory modules. 3.3 Orthogonal Memory Array Conceptually the memory for OM P is simple and very regular if the entire memory can be accommodated in a single plane. 
3.3 Orthogonal Memory Array

Conceptually, the memory for the OMP is simple and very regular if the entire memory can be accommodated in a single plane. For even a moderate-sized system, however, the planar implementation is not possible, due to the limited area and pin-out constraints of a circuit board. To support the orthogonal access and to facilitate all data-movement operations such as broadcasting and parallel shifting, a proper address format and hardware support must be provided. This section deals with the design issues of the OMP memory array.

3.3.1 Memory Modules and Interconnections

A two-level implementation is sought for the memory system: individual board design and system-level inter-board connections. Each board is built identically for easy manufacturing and maintenance. One memory board implements a portion of the 2-D memory array. A memory board consists of 4 × 4 memory modules with the column and row buses multiplexed and demultiplexed on board, as shown in Fig. 3.3. The multiplexing is necessary since the number of pins required would otherwise exceed 500 per board: we have 4 buses running in the X direction (row buses) and another 4 buses running in the Y direction (column buses), each consisting of 32 address lines and 32 data lines, hence 8 × (32 + 32) = 512 pins per board without multiplexing.

Figure 3.3: A memory board configuration consisting of 4 × 4 memory modules.

Interboard connection is done in two ways, as illustrated in Fig. 3.4. A backplane common-bus connection provides one set of bus interconnections; another set of bus connections is provided using flexible cables running along the side of the boards. Orthogonal bussing is implemented with two-layer routing: for a printed circuit board, we lay out the front and rear surfaces of the board for the row buses and the column buses, respectively. The processors are also grouped into several boards; a processor board consists of 4 processors and a set of multiplexed buses. A System/Memory Bus Interconnect is constructed to control the connections between the memory modules and the OMP processors, the I/O processors, and the host. It also helps switch out faulty modules in case of hardware failures.

Figure 3.4: A board-level configuration of a 16-processor orthogonal multiprocessor.

3.3.2 Addressing Scheme and Access Control

Each processor is responsible for generating its memory access requests and for queueing unsuccessful requests so that repeated requests need not be regenerated. The address is examined by the Memory Access Controller to determine the current memory access mode. The memory address consists of three fields: the mode selector, the module index, and the inner-module address (see Fig. 3.5). The mode selector contains one of the following items: row access, column access, row broadcast, column broadcast, or no access. Three bits are assigned to distinguish the access mode. The most significant bit of the access mode represents memory access (1) or non-memory access (0); the remaining bits represent either column or row access, and normal access or broadcast. The module index indicates the location of the memory module in the row or column memory. The inner-module address is the internal address within the selected memory module.

Figure 3.5: Address format. Access modes: 111 column access, 110 row access, 101 column broadcast, 100 row broadcast, 0xx no access (x: don't care).
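A minimal C sketch of the 32-bit address format just described, assuming the field widths of the 16 × 16 prototype (3-bit mode, 4-bit module index, 25-bit inner-module address, as derived later in this subsection) and assuming the mode selector occupies the most significant bits; the function and constant names are illustrative:

#include <stdio.h>
#include <stdint.h>

/* Mode codes follow Fig. 3.5: 111 column access, 110 row access,
   101 column broadcast, 100 row broadcast, 0xx no access.          */
enum { MODE_COL = 07, MODE_ROW = 06, MODE_COL_BC = 05, MODE_ROW_BC = 04 };

static uint32_t make_addr(unsigned mode, unsigned module, uint32_t inner)
{
    return ((uint32_t)(mode & 0x7) << 29) |     /* bits 31..29: mode selector      */
           ((uint32_t)(module & 0xF) << 25) |   /* bits 28..25: module index       */
           (inner & 0x1FFFFFF);                 /* bits 24..0 : inner-module addr  */
}

int main(void)
{
    uint32_t a = make_addr(MODE_ROW, 9, 0x1234);
    printf("addr = 0x%08X  mode=%u  module=%u  inner=0x%X\n",
           a, (unsigned)(a >> 29), (unsigned)((a >> 25) & 0xF),
           (unsigned)(a & 0x1FFFFFF));
    return 0;
}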
Each memory module of the OMP memory array consists of two parts: semiconductor memory chips and a bus selector. The memory chips provide the physical space for data reads and writes. The bus selector consists of 2-to-1 switches; it electrically switches the memory to one of the two processors sharing the same memory according to the selection signal (refer to Fig. 5.4). The selection signal is provided by the Memory Access Controller at every memory cycle. By observing the address sent on the address buses, we can tell exactly what location a processor wants to reference.

The orthogonal memory access is centrally supervised by the Memory Access Controller. The controller collects the access requests and decodes their addresses to count the column access requests and the row access requests. It then grants the majority access request as the actual memory access. The selection signal is distributed to all memory modules to choose either the row bus or the column bus for the current memory access. The unsuccessful requests are queued for the next available memory access. To give rejected accesses a chance, priority is assigned to a rejected access such that, if the number of rejections (the rejection count) reaches a predefined threshold, the next memory access is guaranteed to the rejected mode. The rejection count is reset to zero right after a successful access (see Fig. 3.6). The threshold could be programmed statically or adjusted dynamically depending on the programs or applications; the memory-reference patterns could be optimized by choosing a proper threshold.

Figure 3.6: A memory access controller.

The width of the memory address is determined by covering all necessary fields. Four bits are sufficient for indexing 16 memory modules, and three bits are needed for mode selection. If we allow a 32-Mbyte main memory distributed over the 16 × 16 memory array, 25 bits are sufficient for inner-module addressing. Hence the memory address is 32 (= 4 + 3 + 25) bits wide.
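A sequential C sketch of the access-control policy described above: each cycle the controller counts the row and column requests, grants the majority, and keeps per-mode rejection counters so that a mode rejected THRESHOLD times in a row is guaranteed the next cycle. The per-mode counters, the threshold value, and the request pattern are simplifying assumptions for illustration.

#include <stdio.h>

#define N 16
#define THRESHOLD 3

enum mode { NONE = 0, ROW, COL };

static enum mode select_mode(const enum mode req[N], int *rej_row, int *rej_col)
{
    int rows = 0, cols = 0;
    for (int i = 0; i < N; i++) {
        if (req[i] == ROW) rows++;
        else if (req[i] == COL) cols++;
    }
    enum mode grant = (rows >= cols) ? ROW : COL;        /* majority decision */
    if (*rej_row >= THRESHOLD) grant = ROW;              /* starvation guard  */
    if (*rej_col >= THRESHOLD) grant = COL;
    if (grant == ROW) { *rej_row = 0; if (cols > 0) (*rej_col)++; }
    else              { *rej_col = 0; if (rows > 0) (*rej_row)++; }
    return grant;
}

int main(void)
{
    enum mode req[N] = { COL, COL, ROW, COL, COL, COL, ROW, COL,
                         COL, COL, COL, ROW, COL, COL, COL, COL };
    int rej_row = 0, rej_col = 0;
    for (int cycle = 0; cycle < 5; cycle++)
        printf("cycle %d: grant %s (rej_row=%d, rej_col=%d)\n", cycle,
               select_mode(req, &rej_row, &rej_col) == ROW ? "row" : "column",
               rej_row, rej_col);
    return 0;
}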
3.3.3 System and Memory Buses

The OMP system uses three different types of buses: memory buses, a system bus, and a processor bus. The memory buses support the read/write activities of the processors to their main memory. The system bus interconnects the OMP processors to shared resources such as I/O processors and other peripherals. A processor bus is common to all OMP processors and is used for interrupts or for short interprocessor communications.

The system bus is selected from the standardized 32-bit system buses, including the Motorola VMEbus, the Intel Multibus II, the Texas Instruments NuBus, the IEEE Futurebus, and the US NIM Committee Fastbus [Bor85]. VMEbus and Multibus systems are available as commercial products. The VMEbus is an asynchronous, non-multiplexed bus with a maximum bandwidth of 57 Mbytes/s. The Multibus II is a synchronous bus with a 10 MHz clock; address and data are multiplexed on the same lines, and it has a maximum 40 Mbytes/s bandwidth. Both buses are capable of addressing a 2^32-byte memory space, and the Multibus supports bus repeaters for bus extension while the VMEbus does not. Various characteristics are compared in Table 3.2. We use the Multibus II, which matches the Intel 80386 family. The system buses provide data paths between the processors of the OMP, the peripherals, and other system-supporting hardware. Input/output, user-program loading/unloading, and data exchange between subsystems are achieved via the common system bus. The OMP and the I/O processors are interconnected by the system bus. The VME bus of the host and the Multibus in the OMP are linked through the Parallel I/O Interface.

Interprocessor communications can be performed in two ways. For many parallel algorithms, the communication between processors is performed via the shared memory. The processor bus provides for passing short data, such as synchronization variables and interrupt signals, to other processors to get immediate attention.

The memory bus is a direct connection between a processor and its row and column memory modules. Physically, the memory buses run within boards and between memory boards. Since the function of the memory buses is only to provide memory references (i.e., read, write, and broadcast), they are designed separately from the other buses. A number of memory boards may be interconnected on the same memory bus. The larger the system, the longer the bus becomes, and a long bus is susceptible to degradation in its performance, especially its bandwidth. Hence the memory-bus design is very critical to the overall performance of the OMP system.

Table 3.2: Characteristics of the 32-bit VMEbus and Multibus II

Aspect                          VMEbus                        Multibus II
Standardization status          IEEE P1014 Draft 1.2          IEEE Pxxx, Intel Rev. C
Primary sponsor                 Motorola                      Intel
Bus bandwidth                   20 to 57 Mbytes/s             40 Mbytes/s
Bus protocol                    Asynchronous                  Synchronous
Data path                       Non-multiplexed               Multiplexed
Address space                   2^32                          2^32
Bus repeater                    Not supported                 Supported
Provision for future expansion  1 reserved line,              2 reserved lines,
                                spare address                 1 reserved status
Number of active signal lines   107                           67
Bus interface                   TTL mixture, 48/64 mA         TTL mixture, 48/64 mA
                                tristate and open collector   tristate and open collector
Parity                          None                          Mandatory
Arbitration                     4-level daisy chain           Distributed
Board size (in.)                9.2 x 6.3                     9.2 x 8.7
Number of pins                  2 x 96                        96
Physical slots                  1 dedicated, 20 nondedicated  1 dedicated, 19 nondedicated

3.4 Host and Peripheral Units

The processor I/O operations are performed in a memory-mapped manner, as used in the PDP-11. The address space is divided between memory addressing and I/O addressing. The memory address space lets the CPU access the PROM and RAM. The I/O address space provides access to programmable peripherals such as interrupt controllers and other devices.

3.4.1 Host Computer

The SUN 4/160 consists of a console with 4 Mbytes (up to 16 Mbytes) of main memory and a hard disk. A Multibus adapter card is used to interface the SUN VME backplane and the Multibus board for possible interconnection to other subsystems. The system can communicate with remote terminals and hosts through an Ethernet link. The operating system is SUN UNIX, which is compatible with most UNIX and BSD UNIX software. The host computer is connected via the parallel I/O interface of the SUN, Direct Memory Access, and the I/O processors. The System/Memory Interconnect mediates data transfers to and from the OMP memory array. The host should be able to serve simultaneous requests from multiple processors; it should have a policy and queues to resolve conflicts in the access of the shared resources.
The tasks to be supported by the host computer are software development, O/S support, permanent data/program storage, and fast data transfer through DMA.

3.4.2 Peripheral Devices

The secondary storage provides permanent data memory with hard disks and magnetic tapes. The contents of these mass-storage devices are swapped to and from the main memory, or up to the cache, under the memory management unit. The SUN system is supplied with a Fujitsu M2351 474-Mbyte disk; the system also contains a SCSI 1/4-inch tape drive and a controller. Since the orthogonal multiprocessor is responsible only for computation, other activities such as data/program loading and unloading, storage, and input/output are performed through the peripherals of the host computer. Fast data transfer between the peripheral devices and memory is performed by Direct Memory Access. To support the multiple-processor architecture, the DMA facilitates independent services to individual processors whenever they request the operation.

User terminals are linked via an intelligent serial-port expansion which is tied to the system bus. One of the designated applications of the OMP is image processing and pattern recognition. Output devices such as monochrome monitors and high-resolution color monitors support these applications. For picture-data conversion, the system is equipped with a TV camera, a digitizer, and A/D and D/A converters.

3.5 VLSI and WSI Implementation

3.5.1 VLSI and WSI Requirements

Technological advances in circuit integration with VLSI or ULSI (Ultra-Large-Scale Integration) have resulted in reduced manufacturing cost (high packaging density, low chip-level component count, and simplified assembly), improved system performance (lower power and higher switching speed), and improved reliability [ST86,KKP87]. However, some unavoidable design constraints are the limited number of pins, the gate count, the maximum speed of a gate, design for testability, and design verification.

The increasing complexity of integrated circuits has recently reached the level of Wafer Scale Integration (WSI). A system-level implementation becomes possible on a single silicon wafer, including all interconnections between functional units. From WSI we expect high performance, because of faster communication and low power consumption, and low cost, due to fewer packaging steps and less severe pin-out limitations. Array processors, PLAs, and memories are the initial candidates for WSI implementation. Critical problems in the scheme are yield, packaging, reconfiguration algorithms, test strategy, etc. Since defective elements cannot be discarded or fixed, the design should include redundancies and proper algorithms to reconfigure the components and interconnections in spite of device failures or defects during the manufacturing process (defect tolerance is needed).

The eventual implementation of the OMP system is to be pursued with either VLSI or WSI technology. The factors of the OMP that support VLSI and WSI integration are:
• Planar construction of processors and memory modules (in a strict sense it is semi-planar, since row buses and column buses cross, which requires two-layer interconnections).
• Easy layout, and easy design for adding redundancy, due to the homogeneous and regular interconnections.
• Easy reconfiguration by isolating a set of bad processors and the associated memory modules (defect tolerance).
• Modular construction of functional units.
• A small number of ports per functional element (e.g., two ports per memory module).

3.5.2 RISC Processors

The size of the OMP (N) can be increased to some extent. The complexity of the individual processor and the heavy hardware requirement of the OMP memory are the major problems to be solved in building a highly parallel system. To reduce the processor complexity and silicon area, a RISC processor architecture is preferred. RISC processors are characterized by a small set of instructions, a large number of general-purpose registers, single-cycle instruction execution, no microprogrammed control, and an emphasis on optimizing the instruction pipeline [Sta86]. A simplified instruction set reduces or eliminates the need for microcode and enables more efficient instruction pipelining. A large register file also makes it possible to reduce the rate of memory access, thus increasing speed. One instruction per cycle is an especially appealing factor for the OMP, since it can provide high memory bandwidth through synchronized parallel memory access. In fact, image processing and pattern recognition do not need a sophisticated complex instruction set, and a RISC processor can exploit fine-grain parallelism efficiently in those applications. RISC-like processors are also better than CISC-like ones for building a large system, since a small silicon area per processor supports denser integration and higher yield with VLSI or WSI technology.

The R3000 from MIPS Computer Systems, Inc., the Am29000 from Advanced Micro Devices, the 80960 from Intel, and the Clipper C300 from Intergraph Corp. are microprocessors built on the RISC architecture; their performance ranges up to 20 MIPS with a 25 MHz clock. Recently, the Motorola MC88000 family of RISC microprocessors has become available, delivering 17 MIPS with 32-bit data. The MC88100 processor exploits fine-grain parallelism with only 51 instructions. It is based on a register-to-register architecture with 32 general-purpose registers, and it contains four concurrent execution pipelines and separate data and instruction memory ports. Two cache/memory-management units (CMMUs), MC88200's, accompany the MC88100, one each for data and instructions. The CMMU, containing a 16-Kbyte cache, provides zero-wait-state memory management and data caching, cache coherency by a write-through policy, and a 4-Gbyte memory address space.

3.6 Conclusions

A prototype OMP system has been designed here at the functional level. The design is expandable to larger systems by simply adding more processor and memory boards, until the memory buses can no longer support the increased demand. The technical difficulties identified here will be helpful in guiding future implementations. The design reveals that the O(N²) complexity of the N-processor system is a major limiting factor to the implementation of large orthogonal multiprocessor systems, and that more elaboration is needed for the orthogonal bus interconnections than for a single bus connection. But the author believes that the complexity and technical difficulty are lower than those of many known systems such as hypercube computers and other shared-memory multiprocessors. The design utilizes currently available components, but it remains open to VLSI or WSI implementation, since the OMP supports planar construction, very regular and homogeneous interconnections, and easy addition of redundancy for defect tolerance.
Such integration will give higher performance, lower cost, and better reliability with a compact design.

Chapter 4
Parallel Image Processing and Pattern Recognition

The important strategy of parallel computation on the OMP lies in simple task decomposition and easy coordination of processor communication. We can divide the multiprocessor into n independent processors with dedicated memory (say, each composed of a processor and its entire column memory), each of which computes a partitioned subtask asynchronously. Their results are combined through the orthogonal direction (using row accesses) whenever necessary. This is called the orthogonal processing principle. All processors can simultaneously communicate with any other processors, without conflict, within a constant time. Many problems having inherent orthogonality, such as two-dimensional separable linear transformations, match the OMP perfectly. Three typical patterns of orthogonal processing are identified below. Here the input data are evenly distributed among the memory modules. By "merge" below we mean that a given function is applied to the entire input data to obtain final results (for example, add, max, etc.).

• n to 1 merge: n data stored in the diagonal memories can be merged in O(log n) time steps using recursive doubling. A detailed implementation is given in Section 4.1.1.

• k²n² to 1 merge: A two-step computation is performed. Step 1: Each processor sequentially computes a sum (or other merging function) of the k²n items in its column memory in O(k²n) time steps; the results are moved to the diagonal memory modules. Step 2: The n items are added by recursive doubling in O(log n) time. The total time complexity is O(k²n + log n) = O(k²n).

• k²n² to L merge: This is done by histogramming, with an O(k²n + L) time complexity, to be described in Section 4.1.2.

4.1 Fundamental Operations

The recursive doubling and histogramming algorithms presented below follow the orthogonal processing principle perfectly. They will often be used for other algorithm development in later sections.

4.1.1 Recursive Doubling

Suppose we would like to find n partial sums (or an overall sum) for n input data D(i), i = 0, ..., n−1, defined as

S(j) = Σ_{i=0}^{j} D(i),   j = 0, 1, ..., n−1.   (4.1.1)

The sum can be replaced by other consensus functions such as MIN, MAX, AND, etc. A fast parallel algorithm to solve the problem is called recursive doubling [KS73]. It needs only log n steps when properly realized on parallel machines such as hypercube computers and the OMP. The algorithm on the OMP requires all data to be stored in the diagonal memories at the beginning. Each step consists of one parallel routing operation (i.e., a parallel shift by row access) and one parallel computation (by column access). At Step k, each processor receives a partial sum from the processor 2^(k−1) behind it and adds it to its current sum. The new results replace the previously stored values. Snapshots of the computation are shown in Fig. 4.1; thin lines indicate data-routing operations, and thick lines represent computational activities. The detailed algorithm is specified in Algorithm 1 using a PASCAL-like language with a parallel construct forall P_k, k = 0, 1, ..., n−1 doparallel. Each statement enclosed by the construct is executed by the n processors simultaneously. Memory access modes are specified as either row-acc for row access or col-acc for column access.
But the specification of col-acc is usually omitted for simplicity.

Algorithm 1: Consensus Function
procedure RECURSIVE-DOUBLING (D(0 : n−1), fcn)
{ INPUT: n data D(0), ..., D(n−1), distributed in the diagonal memories M_pp.
  OUTPUT: n results, one in each diagonal memory. F(a, b) is a binary associative
  function chosen by fcn among ADD, MIN, AND, etc. }
forall P_p, p = 0, 1, ..., n−1 doparallel
  for m = 1 to log n do
  begin
    if (p + 2^(m−1) ≤ n − 1) then   { Routing outside the array is prohibited. }
      X(p + 2^(m−1)) = D(p)   (row-acc)   { parallel shift from M_pp to M_{p, p+2^(m−1)} }
    if (p − 2^(m−1) ≥ 0) then
      D(p) = F(D(p), X(p))   (col-acc)   { computation }
  end(m)
endforall

Figure 4.1: Recursive doubling of n = 8 data on an 8-processor OMP. (a) Initial data distribution; (b)-(d) Steps 1-3 (thin lines: data routing, thick lines: computation); (e) final results S(k) = Σ_{i=0}^{k} D(i).

Lemma 4.1  A consensus function, such as finding the maximum, minimum, or summation, or Boolean AND or OR, can be computed on an n-processor OMP in O(log n) time steps, where one step corresponds to one arithmetic/logic operation and one parallel data shift.

Proof  It is obvious that the for-loop in the algorithm iterates log n times, and each iteration consists of one data shift and one binary function computation. Q.E.D.

MPP-type array processors solve the same problem in O(n) time, in spite of having n times more processors, because of the communication delay during data routing.

Corollary 4.1.1  A consensus function over k²n² data can be computed in O(k²n) time using an n-processor OMP.

Proof  Assume the data are evenly distributed over the n² memory modules. A two-step computation is performed. Step 1: Each processor sequentially computes a sum (or other merging function) of the k²n items in its column memory in O(k²n) time steps; the results are moved to the diagonal memories. Step 2: The n items are added by recursive doubling in O(log n) steps. The overall time complexity is thus O(k²n). Q.E.D.
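A sequential C sketch of the recursive-doubling schedule of Algorithm 1, with F taken to be addition; the array D plays the role of the diagonal memories, and X holds the row-shifted copies. The input values are arbitrary.

#include <stdio.h>

#define N 8

int main(void)
{
    int D[N] = {3, 1, 4, 1, 5, 9, 2, 6};       /* D(0..7), initially in M_pp      */
    int X[N];

    for (int d = 1; d < N; d *= 2) {                         /* d = 2^(m-1)        */
        for (int p = 0; p + d < N; p++) X[p + d] = D[p];     /* shift (row access) */
        for (int p = d; p < N; p++)     D[p] += X[p];        /* combine (col access) */
    }
    for (int p = 0; p < N; p++)
        printf("S(%d) = %d\n", p, D[p]);       /* prefix sums of the input values */
    return 0;
}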
4.1.2 Histogramming

The histogram reveals statistical properties of a picture. Suppose there are L different grey levels in a picture. Histogramming is the task of finding H(l), l = 0, 1, ..., L−1, where H(l) is the number of pixels whose grey level is l. The determination of the histogram level can be expressed as a function f of a pixel x, l = f(x). The computation of a histogram on an OMP is quite similar to the method used on the MPP array processor [KWR82]. In the first stage, each processor independently computes a local histogram of the data in its own column memory (this process is called a column merge). The local histogram value for the l-th level is stored in the l'-th memory module of its column memory, where l' = l mod n. The column merge produces a set of n partial counts H^[p](l), p = 0, 1, ..., n−1, for each level l; there are L such sets in total. The superscript [p] indicates data stored in the column memory of processor P_p. In the second stage, using row accesses, each processor computes the global values H(l) by adding the local histograms (row merge):

H(l) = Σ_{q=0}^{n−1} H^[q](l),   l = 0, ..., L−1.   (4.1.2)

Notice that the histograms of n different levels can be computed by the n processors simultaneously; hence ⌈L/n⌉ repetitions are needed to complete the histogramming. Here the input data are evenly distributed. Suppose we have a kn × kn image G(0 : kn−1, 0 : kn−1) mapped onto the n × n OMP memory array in row-major order. Each memory module M_ij contains k × k pixels. The pixel stored at location l of memory module M_ij corresponds to the image datum G(u, v), where

u = ik + a,  v = jk + b,  l = ak + b   (0 ≤ a, b < k, and 0 ≤ i, j < n).   (4.1.3)

(See Fig. 4.2 for details.) Unless specified otherwise, all of the following image data are stored using this mapping. The detailed procedure of histogramming is given in Algorithm 2. Snapshots of the histogram computation for an 8 × 8 synthetic image with 8 grey levels are given in Fig. 4.3 for a 4-processor OMP.

Lemma 4.2  The histogram of a kn × kn image consisting of L grey levels can be computed in O(k²n + L) steps using an n-processor OMP.

Proof  According to the procedure HISTOGRAM, the column merge needs 2k²n steps, and the row merge demands ⌈L/n⌉ iterations, each of which needs n additions. Hence the time complexity of the parallel histogramming on an OMP is O(2k²n + ⌈L/n⌉n) ≈ O(k²n + L). Q.E.D.

Note that this time complexity is the same as that of MPP-type array processors, which use n times more processors [KWR82]. This is due to the high efficiency of the OMP in global data routing: sending data from one place to another takes a constant time, whereas it takes O(n) steps on an n × n mesh-connected array processor.

Figure 4.2: A kn × kn image mapped onto the n × n OMP memory array.

Figure 4.3: Histogramming of L = 8 levels using an OMP with n = 4 processors. (a) Input image data with 8 grey levels; (b) column merge; (c) row merge, H(l) = Σ_q H^[q](l).

Algorithm 2: Histogramming
procedure HISTOGRAM (G(0 : kn−1, 0 : kn−1), H(0 : L−1))
{ INPUT: k²n² data G stored evenly in the memory array; memory module M_ij contains
  the k² data D_ij[0 : k²−1].
  OUTPUT: L histogram values H in the diagonal memories. }
forall P_p, p = 0, 1, ..., n−1 doparallel
  { column merge }
  for i = 0 to n−1 do
    for j = 0 to k²−1 do
    begin
      l(p) = f(D_ip[j])   (col-acc)   { l(p): grey level of a pixel; D_ip[j]: the j-th datum in M_ip;
                                        f(x) returns the grey level of pixel x }
      H^[p](l(p)) = H^[p](l(p)) + 1   (col-acc)   { H^[p](l(p)): partial count of H(l(p)),
                                                    stored in M_mp where m = l(p) mod n }
    end(i)
  { row merge }
  for i = 0 to L−1 step n do
    for j = 0 to n−1 do
      H^[p](p + in) = H^[p](p + in) + H^[j](p + in)   (row-acc)   { H^[p](p + in) is stored in the
                                                                    diagonal memory M_pp }
endforall
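A sequential C sketch of the two-stage histogramming above for a small case (n = 4 simulated processors, k = 2, L = 8 grey levels); the synthetic image and the array names are illustrative, and the column memory of processor p is taken to be the image columns covered by modules M_{*,p} under the mapping of Eq. (4.1.3).

#include <stdio.h>

#define N 4            /* simulated processors            */
#define K 2            /* k: block edge, image is (K*N)^2 */
#define L 8            /* grey levels                     */

int main(void)
{
    int G[K * N][K * N], Hpart[N][L] = {{0}}, H[L] = {0};

    for (int u = 0; u < K * N; u++)            /* synthetic test image */
        for (int v = 0; v < K * N; v++)
            G[u][v] = (u + v) % L;

    /* Column merge: processor p counts the pixels held in its column memory. */
    for (int p = 0; p < N; p++)
        for (int u = 0; u < K * N; u++)
            for (int v = p * K; v < (p + 1) * K; v++)
                Hpart[p][G[u][v]]++;

    /* Row merge: H(l) = sum over p of Hpart[p][l]; processor (l mod N) does it. */
    for (int l = 0; l < L; l++)
        for (int p = 0; p < N; p++)
            H[l] += Hpart[p][l];

    for (int l = 0; l < L; l++) printf("H(%d) = %d\n", l, H[l]);
    return 0;
}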
4.1.3 Two-Dimensional Linear Transforms

Two-dimensional linear filterings are frequently used for image preprocessing; for example, correlation and convolution for matching and edge detection, and image transformations such as the Discrete Fourier Transform, the sine transform, etc. Their computation principles are the same, but they have different kernels. The basic parallelization method is introduced below. Consider a two-dimensional unitary transform of an image matrix [Pra78]:

F(u, v) = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} f(i, j) H(i, j, u, v),   (N = kn),   (4.1.4)

where F and f are N × N matrices and H is a transform kernel. Most of the useful transforms have separable transform kernels, H(i, j, u, v) = Hc(i, u) Hr(j, v). In this case, the 2-D unitary transform can be computed in two passes of 1-D transforms. First, a one-dimensional transform is performed along each column of the image matrix f, yielding a matrix G, where

G(u, j) = Σ_{i=0}^{N−1} f(i, j) Hc(i, u).

Next, the same one-dimensional transform is performed along each row of the intermediate matrix G, giving the final result F(u, v), where

F(u, v) = Σ_{j=0}^{N−1} G(u, j) Hr(j, v).

This scheme is well known, but only a few architectures match the actual computation and communication patterns well (such as shuffle-exchange and hypercube systems). The OMP matches this computation perfectly. The sample data f are evenly distributed among the memory modules in the way stated in Section 4.1.2, and the transform kernel is copied into each diagonal memory. Each processor simply performs a column transform on its column memory and then a row transform on its row memory. Since the OMP can follow the exact procedure without data relocation or communication overhead between processors, its performance is the highest compared with hypercube or mesh-connected array processors.

We select the two-dimensional FFT as a typical example of the linear transformations for a detailed implementation. The two-dimensional Discrete Fourier Transform (DFT) of size N × N is defined as follows:

X(i, j) = Σ_{p=0}^{N−1} Σ_{q=0}^{N−1} x(p, q) W_N^{ip+jq}
        = Σ_{p=0}^{N−1} [ Σ_{q=0}^{N−1} x(p, q) W_N^{jq} ] W_N^{ip}
        = Σ_{p=0}^{N−1} X^{1D}(p, j) W_N^{ip},   (0 ≤ i, j < N).   (4.1.5)

Here X^{1D}(p, j) is a one-dimensional DFT over the index q for a fixed value of p. Using a sequential FFT algorithm, we can compute it in O(N log N) time.

Suppose N = kn, k = 2^p, n = 2^q, where p and q are some integers. The image is stored evenly in the n × n memory array in row-major order; each column memory contains kn × k data. The 2-D DFT can be done in two passes of one-dimensional FFTs using the separability of the transform: first finding the row transforms, then the column transforms [Chu88,Nus82,Pea68]. Each processor P_i computes k 1-D FFTs of kn samples each (all k · kn initial data are stored in its column) using the sequential algorithm. The n processors do their own transforms without communicating with other processors, since all the required data are stored in the same column memory. On the OMP, there is no sample-data movement for rearranging data during the FFT computation. Once the 1-D procedures are done, the row computations are performed to complete the 2-D FFT in a similar way.

The estimation of the time complexity is straightforward. Since we have n processors, n columns of data are computed in parallel, which yields n sets of 1-D FFTs in O(N log N) time. Completing kn sets of 1-D FFTs thus demands O(kN log N) time. Based on these 1-D FFTs, we again use the same sequential 1-D FFT algorithm in the row direction to find the 2-D FFT; this procedure also takes O(kN log N) = O(k²n log kn) time. A serial FFT algorithm requires O(k²n² log kn) time.
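A sequential C sketch of the two-pass separable transform described above. A 4 × 4 Walsh-Hadamard kernel is used here purely as an illustrative separable kernel (the text's examples are the DFT and sine transforms); on the OMP, the first pass would use column accesses and the second pass row accesses, with no data relocation in between.

#include <stdio.h>

#define N 4

static double kernel(int i, int u)             /* Hc = Hr = Walsh-Hadamard */
{
    int bits = i & u, parity = 0;
    while (bits) { parity ^= bits & 1; bits >>= 1; }
    return parity ? -1.0 : 1.0;
}

int main(void)
{
    double f[N][N], G[N][N] = {{0}}, F[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f[i][j] = i * N + j;               /* sample image block */

    for (int u = 0; u < N; u++)                /* pass 1: G(u,j) = sum_i f(i,j) Hc(i,u) */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                G[u][j] += f[i][j] * kernel(i, u);

    for (int u = 0; u < N; u++)                /* pass 2: F(u,v) = sum_j G(u,j) Hr(j,v) */
        for (int v = 0; v < N; v++)
            for (int j = 0; j < N; j++)
                F[u][v] += G[u][j] * kernel(j, v);

    for (int u = 0; u < N; u++, printf("\n"))
        for (int v = 0; v < N; v++)
            printf("%8.1f", F[u][v]);
    return 0;
}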
4.2 Image Enhancement Using Histogram Equalization

It is possible that the histogram H(l) is zero for many levels l of some image, especially for a low-contrast picture. This means that the available quantization levels of the brightness data are not used efficiently. It is desirable to modify them to increase the dynamic range of the picture, which enhances the picture contrast. This can be done by reassigning the histogram levels of the pixels so that the resulting histogram is as flat as possible, as demonstrated in Fig. 4.4 [Pav82]. The scheme is called histogram flattening (equalization) [Pav82,RK76].

4.2.1 Histogram Equalization

Suppose the desired histogram distribution H'(m) is given for all levels m = 0, ..., L−1. Histogram flattening should produce the same number of pixels, H_ave, in all levels, i.e.,

H'(0) = H'(1) = ... = H'(L−1) = H_ave = (1/L) Σ_{l=0}^{L−1} H(l).   (4.2.6)

Consider the histogram data in Table 4.1 for example. Since H_ave = 16, all pixels of levels 0 and 1, and 8 of the 21 level-2 elements, should be assigned to the new level 0. The new level 1 will be formed by the remaining 13 pixels of the old level-2 data plus more from the subsequent level(s). The remaining assignments can be done in a similar way. This is a sequential procedure.

4.2.2 Parallel Algorithm

For a parallel assignment, consider the following case:

Σ_{i=0}^{l−1} H(i) ≤ Σ_{i=0}^{m} H'(i) ≤ Σ_{i=0}^{l} H(i).   (4.2.7)

Figure 4.4: Image enhancement by histogram equalization [Pavlidis'82]. (a) Before equalization; (b) after equalization; (c) original histogram distribution.
The partial counts are chosen in proportion to the partial histogram s H M (explained in Section 4.1.2) in the column memory: Cumulative sum of histograms 2 2 k n 1 -1 H(j) j=0 H(0) H (I -1) - -H(l) ^_H(l+1) H(L-1) — " I I - --------------- n lm I, ■ J H 'fO ) m S(m) = ] > H'(j) i=o OLD histogram distribution NEW 1 histogram distribution 1 -1 n,m = S(m) I m H (j) J=o F igu re 4.5: H isto g ra m reassign m en t from old grey levels 1,1 — 1, ... to a n ew level m T able 4.1: H isto g ra m an d its R eassign m en t C ou nts for E q u alization H ave = 128/8 = 16. Level (m) 0 1 2 3 4 5 6 7 H{m) 1 7 21 35 35 2 1 7 1 S(m ) 1 8 29 64 99 1 2 0 127 128 Tlmj ^00 = 1 wio = 7 W -20 = 8 ri2i = 13 n si = 3 W 32 = 16 W 33 = 16 fl44 = 16 W 45 = 16 W 46 = 3 W 56 = 13 W 57 = 8 n 6 7 = 7 W 77 = 1 n g1 = riij HW(i) H(i) (0 < i , j < L —l) Hence overall reassignm ent will be achieved as desired because n-1 E « 0 = p= 0 n I^o‘irW (0 r H (i) n. \j- (4.2.11) (4.2.12) The procedure of histogram equalization on an OM P is sum m arized below: 1) Find a histogram of the given image. 2) Find L histogram sums S' (*), * = 0,1, • • •, L — 1. 3) Broadcast all S (i)’s. 4) Find reassignm ent counts n ,/’s for all 0 < i , j < L —1. 5) Broadcast n,y’s to every processor Pp to com pute partial counts n ^ ’s. 6) All processors do histogram reassignm ent. If there is an old level * which contains excessive elements (e.g. H(2) in Table 4.1), the corresponding level will be m apped to m ultiple levels: n,y, I n i,j+ 2 , • ■ *. Note th a t there are several ways to choose the new levels such as ran- 1 dom selection, middle level selection, or selection according to those of neigh borhood pixels [Pav82]. Detailed algorithm is given in Algorithm 3 using a random selection pol icy in choosing new levels. Histogram and cum ulative sum s S (0 : L — 1) are f com puted by four lines of statem ent after forall. Assume the OM P has four processors. The d ata given in Table 4.1 are again used. Consider the case of j = 0 for j-loop. The first line obtains the highest level l(p) from which pix- ' els are assigned to the new level j + p, where p is an index for the processor | Pp. Hence /(0) = 2, 1(1) = 3, 1(2) = 3, and 1(3) = 4, •••. Binary search is i ; used in this process. Initially, the values n 20, nai, • • •, are concurrently com puted by processors Po, Pi, • • • respectively. Then each processor Pp will obtain the next reassignm ent counts (^i(p)-«(p),y+p) sequentially, where *(p)’s are incre m ented from zero until the condition of while holds. For example, processor Po will obtain noo = 1 > ^io = 7, and n 20 = 8 in a sequence. If the iteration for j = 0 ; completes, all reassignm ent counts for the new levels 0, • • •, 3 have been deter mined. Once the j'-loop term inates, a complete table of reassignm ent counts is generated. Then partial reassignm ent counts are com puted by each processor after n y ’s are broadcast. Finally, histogram levels are reassigned by individual processors based on the partial counts (g-loop). Pixel d ata are scanned one by one, then a new level is chosen arbitrary among the candidate levels which still need to be filled. T h e o re m 4 .1 The histogram flattening can be accomplished in 0{(fc2 + L)n} ' • on an n-processor OM P for an image w ith kn x k n pixels having L grey levels and a reassignm ent spread P . P r o o f I) The histogram com putation needs (2k2n + L) steps. 
5'(t)’s are found in (L /n lo g L ) steps using recursive doubling, since the recursive doubling of L I j element requires log L iterations and we can deal w ith only n items at a time. | To broadcast 5 ( i) ’s takes (L/n) steps because n items are broadcast at a time. 2) The j>-loop produces the table of reassignm ent counts. The binary search 1 in finding l(p) needs (log L) steps, and while loop iterates a t m ost P times. The rem aining statem ents need constant tim e, thus L /n iteration of the j-loop A lg o rith m 3 : H istogram E q u alization { INPUT: a kn x kn image G with L grey levels and Have = k 2n 2/L . OUTPUT: an image with even histogram distribution } fo ra ll Pp, p = 0,1, • • •, n — 1 d o p a ra lle l { Finding a reassignment count table} H ISTO G R A M (G , H ). {Find a histogram of a picture. } Broadcast H(0 : L — l ) . { Broadcast L histograms, L /n ones at a time. } RECURSIVE-DOUBLING {H{0 : L - l ) , “add” ). {C om pute S{i),i = o, • -,2,-1} Broadcast S (0 : L —T). {Broadcast S {i)% L /n of them at a time. } fo r j = 0 to L —1 s te p n d o b e g in l(p) = {min q\S(q) > (j -f- p + T)Have} { This is done by binary search of 5 (i) } {/(p): start of new level j + p, computed by the processor Pp. } n l(p),j+p = { j + p + 1 ) H ave — S(l(p) — 1) {A portion from an old level Z(p).} left = H ave — {portions not filled to level j + p} i(p) = 0 w h ile left > 0 d o {iVeeds at m ost P iterations } b e g in i(p) = HP) + 1 n *(p)-<(p)d+r = H (lW ~ left = left - ni(p)_,-(p)tj+p e n d (while) if left ^ 0 th e n ?lj^pj— «(p)'i+P — (p)_» '(p ),j-j-p I left {C^orrectjon for the last} end(j) {Computing partial counts } Broadcast all nonzero njm’s (0 < / , m < L —1). (row-acc) n\m = nim H ^(l)/H (l), for all nonzero nim, (0, < /, m < L —T). {Histogram reassignment} fo r q = 0 to n —1 d o fo r * = 0 to k 2 — 1 d o b e g in { Pick the ith pixel in M qp and do reassignment.} I = f[DqP[i]) { T ie function f(x) returns a histogram level of a pixel x. } / (Dqp [/]) = m {Assign it to an arbitrary level m which satisfies nj^[ > 0. } n = n )^ — 1. { Update the corresponding partial count } end(q) e n d fo ra ll 74 ; i i needs at m ost {(log L + 3P + 4)L /n } steps. 3) Since the table contains at m ost I L P item s, broadcasting dem ands (L P /n ) steps, and finding local reassignm ent counts n\^ requires (LP) tim e steps. 4) Finally, reassigning takes (3k 2n) steps. | Therefore the overall tim e requirem ent of the histogram equalization algo- I ' rithm is (2k 2n + L) + L /n + (L /n log L) + L /n + L /n(log L + 3P 4- 4) + L P ( 1 -f- l /n ) + 3k2n = 5k2n + L P ( 1 + 4 /n ) + L /n (2 lo g L + 6), which is bounded by ; 0 ( k 2n) + 0 ( L P ), if n » 1. Since P < n, the overall algorithm complexity is j 0 { ( k 2 + L)n}. Q.E.D. [ I ; | Serial algorithm needs k2n 2 steps for histogram generation, L P for obtain- ■ ; ing a reassignm ent table, and k 2n2 for actual reassignm ent, hence it would take : 0 ( k 2n 2 + LP ) for the histogram flattening. It is not clear w hether the scheme I used on the OM P is suitable also for M PP-type array processors, however, it : should pay a lot of overhead in partitioning and distributing tasks because of ' the network diam eter of O (n). 1 ► 4.3 Line Detection Using Orthogonal Projection i 4.3.1 Im age Projection i , J An image projection of a two-dimensional picture onto one-dimensional i space is an im portant aid in feature extraction, representation for image un derstanding, and pattern recognition. 
For example, geometrical features can be extracted using projection in industrial part inspection [SH87]. Suppose a set of equally spaced parallel beams is incident on the objects; all pixels lying on the same beam are projected onto one point. Let G(i, j) be the grey value of the pixel at (i, j) of an N x N image. The vector of column sums

{ sum_{j=1..N} G(1, j), sum_{j=1..N} G(2, j), ..., sum_{j=1..N} G(N, j) }     (4.3.13)

is called the x-projection of G, which gives a projection onto the x axis [RK76]. The y-projection of G is defined similarly. The computations of the x and y projections on an OMP are straightforward; each processor sums up the grey values along a column (row) using column (row) accesses.

4.3.2 Hough Transform

In general, a projection can be defined onto a line having an angle θ with respect to the positive x axis. A pixel at (x, y) projected onto the line has a (signed) distance ρ from the origin, where

ρ = x cos θ + y sin θ     (4.3.14)

(ρ would be rounded in practice). See Fig. 4.6 for details. Image pixels having the same ρ are projected onto the same point for a given angle θ. The projection can therefore be computed by collecting and summing the grey values of the pixels that have the same ρ. Thus the histogramming technique can be applied to compute the projection concurrently, with ρ regarded as a label (like a grey level) of each pixel; histogramming here accumulates the grey values of the pixels. Suppose a binary image (i.e. grey values are either 0 or 1) with kn x kn pixels contains thin edges or line segments. If a line segment is aligned with the current projection beam, the corresponding projection value will be very large. The projection value indicates the number of pixels on that projection beam, since we consider only pixels with grey level 1. By observing the projection data for various angles, lines and their angles in the image can be detected. This technique is called the Hough transform [DH72]. Figure 4.7 illustrates the projection with θ = 45°; it shows the labels (√2·ρ is used for simplicity) and the corresponding projection vector for an image.

Figure 4.6: An image projected onto a line with an angle θ.

Figure 4.7: A projection of a binary image onto a line with θ = 45°. (The pixels of a 5 x 4 binary image are labeled by √2·ρ = x + y, which ranges from 2 to 9; the resulting projection vector over these labels is (0 1 0 4 0 1 0 0).)

A two-dimensional histogram H(θ, ρ) is constructed covering all angles (0 to 180 degrees) with a constant increment. The histogram is obtained by computing a one-dimensional histogram M times, where M = 180/Δ and Δ is the angle increment (resolution). High projection points in the two-dimensional histogram represent lines in the image. Since we have a kn x kn image, the span of ρ is up to √2·kn; therefore we generate √2·kn·M projection data. The Hough transform is implemented in Algorithm 4. The image data are stored in the OMP memory array in the same way as stated in Section 4.1.2. Each processor computes in parallel the parameter ρ of the data in its column memory for a given angle θ. When this computation completes, the projections for the angle are computed by column merge and row merge using the procedure HISTOGRAM. Hence, it is called orthogonal projection.
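Before turning to the overall parallel algorithm, a minimal sequential sketch of this per-pixel projection-and-histogramming step may be helpful. It is a plain Python illustration of Eq. (4.3.14) and the accumulation of H(θ, ρ) only, not of the OMP data placement; the function name and arguments are illustrative.

    import math

    def hough_projection(image, angles_deg):
        # image: 2-D list of 0/1 grey values, indexed image[y][x]
        # angles_deg: projection angles in degrees (0 <= theta < 180)
        # returns: dict mapping theta -> {rho: count}
        H = {}
        for theta in angles_deg:
            t = math.radians(theta)
            counts = {}
            for y, row in enumerate(image):
                for x, g in enumerate(row):
                    if g:                                                 # only pixels with grey level 1
                        rho = round(x * math.cos(t) + y * math.sin(t))    # Eq. (4.3.14), rounded
                        counts[rho] = counts.get(rho, 0) + 1
            H[theta] = counts
        return H

    # A line segment aligned with a beam shows up as one large projection value:
    img = [[1 if x == y else 0 for x in range(5)] for y in range(5)]      # a diagonal line
    print(hough_projection(img, [45, 90, 135])[135])                      # {0: 5}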
The whole procedure will be repeated until all the projection angles are covered. The overall algorithm is sum m arized below: (1) Com pute a current projection angle $ (0 < 6 < 180) by incrementing A from the previous 0. (2) Each processor scans next pixel at (x , y) in the column memory, and finds its p by the relationship p = xcos 0 + y sin 0 . (3) Each processor increases the partial count of H(0,p) (= H ^ (0 , p)) by one. It is to be stored in the memory m odule M qp, where q = p m od n. (4) If all pixel d ata are scanned, go to Step 5. Otherwise go to Step 2. (5) By row merge, accumulate the partial counts for the same p to obtain a histogram H(0,p). (6) If 0 < 180°, go to Step 1, else go to Step 7. (7) Find those U (0 ,p )’s whose values are larger th an a threshold. They correspond to detected line segments. A lg o rith m 4 : Line D etectio n by O rth ogon al P ro jectio n p roced u re HOUGH-TRANSFORM {INPUT: kn x kn binary image data and an angle resolution A OUTPUT: line segments in the image } fo ra ll processors Pp,p — 0,1, • • *, n — 1 d o p a ra lle l for * = 0 to M — 1 do {M = 180/A , where A is an angle increment } b egin 0 = iA for J = 1 to k 2n do {A processor Pp scans all pixels in a column memory.} if (Cr(x^,yjf^) 7^ 0) { i f the jth pixel in the column m em ory is not zero } th en p(xfKy^) = [x^cos 0 + y ^ sin 0} H IST O G R A M {p { 0 : k n - 1,0 : kn - 1),H{0,O : y/2 kn - 1)} * end(t) {There are at m ost y/2kn levels in histogram for a fixed angle S. } Select those H(0,pys which are greater th an a given threshold. en d forall {T h e y correspond to lines in the image.) * T h e o re m 4.2 The tim e complexity of the above line detection for an image ' with kn x kn pixels is 0 { k 2M n) using an OM P w ith n processors, where pro jections are com puted for M different angles. ! I 1 { P r o o f The scanning of all column d ata takes k2n tim e for a given angle 6 . His- i togram calculation takes 0 ( k 2n + L ) = 0 ( k 2n + >/2kn) = 0 { k 2n). Since we need [ M repetitions of the both procedures, 0 ( k 2M n) is needed. The thresholding | on the two dimensional histogram of y/2 k M n d ata (for each projection angle, at m ost V 2 kn levels of histogram is generated) dem ands O (kM ) tim e. Hence, total time complexity of line detection algorithm is 0 ( k 2M n) for a kn x kn image. Note th at this is the maxim um achievable perform ance using the n- . processor OM P because the total com puting tim e am ounts to 0 { k 2M n 2) (each : pixel should contribute to M different projections, generating k 2n 2M d ata points ■ in total). Q.E.D. 0 ( M + N ) tim e algorithm s have been proposed using N x N mesh-connected SIMD array processors for N x N image [Dav87,GH87]. Subsequent angle data are fed from the left column of the array one by one like pipeline computing, reaching the right column with projection d ata one by one after O(N) steps . later. j I i 4.4 Parallel Pattern Clustering I i 4.4.1 P attern C lustering P attern clustering is a classification technique where a priori inform ation on classification is not available [DJ80,DH73,NJ83]. The classification boundary is \ chosen dynamically by input p attern data. Suppose we have N p attern vectors Xi, X2, ... , Xjv, each of which has m features: x,- = (xiu Xi2, ... ,x im)T, or x = (x u x 2, ..., x m)T. 81 (4.4.15) The Euclidean distance D between two p attern x and z is defined as a sim ilarity m easure, where D 2 = |x —z |2 = Y%L\{xi ~ zi)2- A performance index is used to sum of all squared errors: •/ = £ ! 
> - m ;|2 (4.4.16) 3= 1 xeSy where Sy (j = 1,2, • • •, K ) corresponds to j t h cluster. The m ean value my above is defined as my = — £ ] x. N i x€s; In practice, choosing a proper K is also an im portant issue since it is very difficult to determ ine if a K-cluster structure makes sense for a given d ata set [JM87]. One way is to apply the cluster algorithm repeatedly for different K y s. The resulting partitions are then evaluated by performance m easure to select an optim um value K . Our objective is to classify the patterns into a set of clusters which minimize the index J for a given K . The K-means algorithm [Har79] is used for parallel clustering on the OM P as described below: (1) Initial Assignment Choose K initial cluster centers z ^ j Z ^ , These are selected as the first K samples in the sam ple set. (2) Label Assignment Process At k th iteration, make x G Sy (i.e., assign j as a label of x) if |x — z ^ | < |x — *j*>| for all i = 1,2, • • •, i f , i j , where S y*^ denotes the set of samples whose cluster center is z ^ . (3) Cluster Updating Process U pdate the cluster centers z ^ by finding the new sample means: 1 (*+i) j M. ^ c (t) xesr Feature 2 (Length) 8 6 4 2 Feature 1 (Width) 6 8 0 2 4 F igu re 4.8: 16 sam p le d a ta in 2-D featu re sp ace rep resen tin g 16 rect an gles o f various sh ap e. T h ey w ill b e classified in to 3 clu sters (4) Convergence Checking If Z y*+1^ = z f for every j — 1,2,•••, i f , the algorithm has converged and j the procedure term inates. The index J is com puted from the converged I set of K clusters. Otherwise go to Step 2. i | 4.4.2 Parallel A lgorithm Im plem entation ! A parallel algorithm (orthogonal p attern clustering) is implem ented below. Orthogonal processing is again the basic principle of the algorithm . N pattern vectors are stored evenly in the memory array. Each memory m odule holds N / n 2 vectors. The m apping is outlined below in four m ajor steps. Details can also be found in [HK87]. I Step 1 : Broadcast a cluster center Z y at j th iteration (j = 1,2, •• •, K ). Com pute the Euclidean distance between a p attern and the cluster center. If the distance is smaller than dmtn, replace dmi„ by the new distance and change its label accordingly; if not, keep dm ,„ and the label unchanged. Repeat this operation until there is no more p attern vector x left in the column memory. Then repeat the Step 1 w ith the next cluster center. The algorithm is stated in Algorithm 5. i j Step 2 : Employing the scheme used in histogram m ing, find K counts , of patterns th a t belong to each cluster; each processor Pp com putes K partial counts , • " ? (thus overall K n counts) by scanning the patterns in the j column memory one by one. Then using row merge, the individual counts are sum m ed in parallel. Step S : In exactly the same way as Step 2, find vector sums of patterns ! which belong to the same cluster. Find new cluster centers by dividing the sums | by the num ber of patterns in the cluster. Step 2 and 3 are implem ented in 1 Algorithm 6. i _ _ _ _ _ _ _ _ _ _ 84 A lg o rith m 5 : P a ttern C lustering: S tep 1 { INPUT: N pattern data OUTPUT: Find a current label of each pattern } forall Pp, p = 0,1 ,•••,» — 1 doparallel for m = 1 to N /n do ^mm(x !n ) = 0 0 {initiallization, x „ : m th pattern in a column m em ory of Pp. } for j — 1 to K do {Superscripts represent the processors handling the data. } b egin { column operations } B r o a d c a st Zj. 
   {z_j: the current cluster center}
   for q = 1 to N/n do
   begin
      d(x_q^[p], z_j) = 0
      for i = 1 to m do
         d(x_q^[p], z_j) = d(x_q^[p], z_j) + (x_qi - z_ji)^2    {computing a Euclidean distance}
      if d(x_q^[p], z_j) < d_min(x_q^[p]) then
      begin
         d_min(x_q^[p]) = d(x_q^[p], z_j)
         G(x_q^[p]) = j    {G(x_q^[p]): the label of pattern x_q^[p]; the initial label is that of the first pattern vector}
      end(if)
   end(q)
end(j)
endforall

Algorithm 6: Pattern Clustering: Steps 2 and 3

forall P_p, p = 0, 1, ..., n-1 doparallel
   {The next operations are done by column accesses.}
   for j = 1 to K do
   begin    {initialization}
      N_j^[p] = 0
      s_j^[p] = 0    {N_j^[p] and s_j^[p] are stored in the (j-1)st module of column p}
   end(j)
   for q = 1 to N/n do
   begin    {column merging}
      l = G(x_q^[p])    {G(x) returns the label of x}
      N_l^[p] = N_l^[p] + 1
      s_l^[p] = s_l^[p] + x_q^[p]
   end(q)
   {The following operations are done by row accesses.}
   for j = 1 to K step n do    {processor P_p works on the data with subscript j+p}
   begin    {row merging}
      N_{j+p} = 0
      s_{j+p} = 0    {N_{j+p} and s_{j+p} are computed by processor P_p}
      for q = 1 to n do
      begin
         N_{j+p} = N_{j+p} + N_{j+p}^[q]
         s_{j+p} = s_{j+p} + s_{j+p}^[q]
      end(q)
   end(j)
   for j = 1 to K step n do
      z_{j+p} = s_{j+p} / N_{j+p}    {a new cluster center}
endforall

Step 4: Broadcast the new cluster centers to all memory modules. Compute the current performance measure J using column accesses and row accesses.

As an illustration, refer to Fig. 4.8, which shows 16 patterns representing rectangles of various shapes. The length and width form a two-dimensional feature space, and the problem is to classify the rectangles into three groups according to their shape. Suppose a 4-processor OMP is used; the related parameters are N = 16, m = 2, K = 3 and n = 4. Snapshots of the execution are shown in Fig. 4.9. Figure 4.9a is the initial partitioning of the pattern data and the selection of the initial cluster centers A, B and C. Figure 4.9b shows Step 1, computing the Euclidean distances to each cluster center; the three values in each memory module are the distances, from top to bottom, to the cluster centers A, B, and C, respectively. The partial counts and partial vector sums for each cluster are shown in Fig. 4.9c: the top row of memories keeps those values for cluster A, the second row for cluster B, and so on. The partial counts and partial vector sums of the individual clusters are computed locally using column accesses and then summed up using row accesses to obtain the global values. After Iteration 1, new cluster labels are determined according to the cluster center with the smallest distance; the resulting population counts and new cluster centers are shown in the diagonal memories in Fig. 4.9d. Figures 4.9e-f similarly show the intermediate results. After three iterations, the clustering converges as shown in Fig. 4.9g.

Figure 4.9: Snapshots of parallel pattern clustering on the OMP. (a) Initial data distribution and initial cluster centers. (b) Euclidean distance computation from patterns to each cluster center (distances are listed in order with respect to the centers A, B, and C). (c) Partial cluster counts and partial vector sums computed using column access. (d) New labels and new cluster centers after Iteration 1. (e) Euclidean distance computation (Iteration 2). (f) New labels and new cluster centers after Iteration 2. (g) Labels and cluster centers after Iteration 3 (no changes).

Theorem 4.3 Using an n-processor OMP, the pattern clustering of N patterns with m features into K classes can be achieved in O[{(3m + 4)K + 2(m + 1)}N/n] steps.

Proof Suppose N >> n >> 1. Algorithm 5 consists of four for loops. The first loop, for initialization, requires mN/n steps. The innermost i-loop takes 3m operations, since each iteration needs one multiplication and two additions; hence the q-loop demands (3m + 4)N/n operations. Hence Algorithm 5 needs mN/n + K{(3m + 4)N/n + 1} ≈ {(3m + 4)K + m}N/n steps. In Algorithm 6, K(m + 1) and (m + 2)N/n operations are required for the first and second loops, respectively (again counting one vector operation as m arithmetic operations). Since the innermost q-loop takes n(m + 1) steps, the j-loop requires (K/n)(m + 1 + n(m + 1)) operations, and the last for loop (the final j-loop) needs Km/n operations. In total, Algorithm 6 needs K(m + 1) + (m + 2)N/n + K(m + 1)(n + 1)/n + Km/n ≈ 2K(m + 1) + (m + 2)N/n. Thus the overall time complexity is {(3m + 4)K + m}N/n + 2K(m + 1) + (m + 2)N/n = {(3m + 4)K + 2(m + 1)}N/n + 2K(m + 1) ≈ {(3m + 4)K + 2(m + 1)}N/n, since N/n >> 1 and both K and m are finite. Q.E.D.

Sequential algorithms also follow the basic sequence of Algorithms 5 and 6, except that some parallel loops requiring N/n iterations become N iterations, and some overhead for parallel operations is removed.
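As an aside before completing this serial comparison, the clustering iteration itself (Steps 1-4 of Section 4.4.1) can be summarized in a few lines of plain Python. The sketch below illustrates only the K-means computation, not the OMP data placement; the function and array names are illustrative, and it assumes no cluster ever becomes empty.

    import numpy as np

    def kmeans(patterns, K, max_iter=100):
        # patterns: (N, m) array of N pattern vectors with m features
        centers = patterns[:K].astype(float).copy()       # Step 1: first K samples as initial centers
        for _ in range(max_iter):
            # Step 2: label each pattern with the nearest center (Euclidean distance)
            d = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 3: new centers are the means of the members of each cluster
            new_centers = np.array([patterns[labels == j].mean(axis=0) for j in range(K)])
            if np.allclose(new_centers, centers):          # Step 4: convergence check
                break
            centers = new_centers
        J = sum(((patterns[labels == j] - centers[j]) ** 2).sum() for j in range(K))
        return labels, centers, J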
In Algorithm 5, only the broadcast is not needed in sequential case, resulting in m N + K (3 m + 4 )N operations. For Algorithm 6, the third and fourth loops (j\q loops) can be avoided in the sequential im plem entation. Following the result (with minor modifications) for parallel algorithm , the sequential algorithm needs K (m + l) + N (m + 2) + K m operation for Algorithm 6. Hence the overall tim e complexity is m N + (3m + 4)K N + {m + l)K + {m + 2)N + K m » {(3m + 4 )K + 2(m+l)}AT for the uniprocessor case. 4.5 Scene Labeling by Discrete Relaxation 4.5.1 R elaxation Labeling To describe a scene w ithout ambiguity, we have to identify each object in the image w ith a proper label. There are interactions in assigning a label to a particular object w ith respect to the other objects w ith their own labels. Given a set of objects and labels together w ith a set of constraints, labeling i is a problem to assign a label to each object w ithout violating any constraint. t , We develop a parallel labeling algorithm by a relaxation m ethod [RHZ76]. The sam e task has been implemented using special hardw are [GWH87], and using SIMD m ultiple-broadcast array processor w ith associative memory [RK85]. ! Let A = {an, a2» • * • j <*n} be a set of objects to be labeled, and let A = {Ai, A 2, • • •, Am} be a set of possible labels. For a given object on, not every label in A may be appropriate. We denote A,-, (A,- C A) as a set of labels th at are com patible w ith object a , , 1 < « < n. Let A y , called object compatibility matrix, be the set of com patible pairs of labels for a,- and a y ; thus A y (p, q) = 1 m eans th at it is possible th a t on has a label A p and ay has label a Xq. Note that A y C A, x Ay. By a labeling T = {L\, • * •, L n) of A we m ean an assignment of set of labels L, C A to each on E A. The labeling T is called consistent if, for all : i ,j , we have {{A} x Ly} D A y ^ < j> , for any A E Li. The consistent labeling is derived from a initial labeling T0 = {Ai, ■ • •, An}. Let T* be the labeling at the Alh application of the following algorithm . To obtain the labeling at the k + 1st step, we discard from each L* any label A such th a t {{A} x Ly} P I Ay = < f > for some j . In other words we keep the label A at on if, for every ay,} = 1,2, •*•,», there is a label A' € L* of ay which is com patible w ith A,; otherwise we remove A. Suppose we use boolean vector for Li as Li = 2, • • • ,/*n ]> where /y is 1 if A y is a possible label for the object on. Then the relaxation procedure can be formalized as [GWH87]: Uk < — Uk * n 3 = 1 j The boolean operations “and’ and “or” are denoted by * and + , respectively, j The notation a * — b represents a replacement of a by b. A labeling is converged I when the above relation (or labeling) does not change for all i and k. The relation can be rew ritten for better understanding and parallel implem entation: P= 1 (1 < • < n, l < k < m ) (4.5.17) p=i m * ]0*2P * A»-2(fc>P)) p = i (4.5.18) * XX*nP * A<n(fc,p)) P = 1 Here the com patibility of one label of an object is checked w ith all other objects simultaneously. The first line (with the first £ ) in Eq. 4.5.18 seeks com patibility of At of object ai w ith labels of an object ax, the second line w ith labels of ct2, etc. Hence parallel execution is possible since n processors can com pute n £ ’s of the Eq. 4.5.18 simultaneously. The processor Pj checks the com patibility of object on to the objects ocj, 1 < j < n. 
The AND for the n boolean results can be obtained using the procedure RECURSIVE-DOUBLING in logn steps. For parallel execution on an OMP, L,- is stored in Mu, and each memory module Mij contains an m x m array A,j. An entry (p, g) in m odule Ai,, corresponds to the object com patibility Afy(p, q). The parallel algorithm is sum m arized below: (1) For j = 1 to n do Steps 1 to 5 (for an object a3). (2) For * = 1 to M do the Steps 3,* • *,5 (M = |Ly|) { Select zth label in Lj, say A*, then all processor examine the com patibility of A * of object otj w ith other objects in parallel.} (3) Each processor Pp in parallel scans each entry of row k of M jp and find a “product” (lpq * Ayp(fc,p)), and performs OR on the product. (4) All results from n processors are ANDed by recursive doubling. (5) If the result of Step 4 is false, remove A, in Lj (i.e. set ljq to zero). (6) If there has been any change in labeling, repeat Steps 1 to 6. Otherwise stop. 92 T h e o re m 4.4 The tim e complexity of the relaxation algorithm for labeling n objects w ith m labels is roughly 0 ( m 2n2(m + logn)) using an OM P w ith n processors. P r o o f The task of ORing (Step 3) takes 0 (m ) steps, and ANDing (Step 4) re quires log n by recursive doubling, and updating the label (Step 5) needs constant tim e. Hence the tim e requirem ent for one iteration of the i-loop is 0 {m + logn). The num ber of elements of a set \Lj\ is at m ost m. The j-loop needs n itera tions. Hence it takes 0 (m n (m + logn)) to complete one global iteration (Step 1 to 5). The overall relaxation converges in at m ost ran iterations, since at least one label among ran initial candidate ones is removed after each global iteration. Thus, the total tim e complexity of the algorithm is 0 (ra 2n 2(ra+ log n)). Q.E.D. Com pared w ith the serial algorithm w ith 0 (ra 3n 3) complexity [RHZ76], the parallel labeling results an 0 (ran /(ra + logn)) speedup. 4.5.2 H ouse Scene Labeling A simplified house scene [HR78] is used here to dem onstrate the consistent labeling. Suppose we have a scene with four objects to be labeled house, tree, sky and cloud with the following constraints (Fig. 4.10): (1) Object 1 shoud be either the sky or cloud. (2) Object 2 should be either a house or a tree. (3) Object 3 is neither a tree nor cloud. (4) Object 4 is not a house. (5) The tree, cloud, and the house should be adjacent to the sky. (6 ) Tree and cloud may not be adjacent. (7) No two objects should have the same label. Thus let oti— Object i, and let Ai, A 2, A 3 and A 4 be sky, tree, house and cloud respectively. Since Object 1 is either the sky or cloud, we have 93 Ai = [1 0 0 1]* and since Object 2 is either a house or a tree, possible labels for the Object is A2 = [0 1 1 0]* By constraint (3) the labels of the Object 3 can be As = [1 0 1 0]* Finally, since the Object 4 can be anything except the house, candidate labels for the Object are obtained as A4 = [1 1 0 1]* A label com patibility m atrix Cy for objects i and j is an m x m array, where Cij (p, g) = 1 if A p for object i is com patible w ith A, for object j ; 0 otherwise. From the constraint 3 and 4, the label com patibility m atrix becomes 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 CW = C{c) = 0 0 1 0 \ J 1 1 0 1 V / 0 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 for (a) the same object, (b) adjacent objects, (c) non-adjacent objects, respec tively. The constraints (5), (6), and (7) determ ine the above m atrices. All diagonal elements in C^) and are zeros due to the constraint (7). 
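Before the object compatibility matrices Λ_ij for this example are assembled (next), a short sequential sketch of the relaxation update of Eq. (4.5.18) may make the label-discarding rule concrete. It is a plain Python illustration of the serial iteration, not of the OMP mapping; it assumes the Λ_ij (including Λ_ii) are supplied as 0/1 matrices, and the names are illustrative.

    import numpy as np

    def discrete_relaxation(labels, Lam):
        # labels: (n, m) 0/1 array; labels[i, k] = 1 if label k is still possible for object i
        # Lam: dict mapping (i, j) -> (m, m) 0/1 object-compatibility matrix Lambda_ij
        n, m = labels.shape
        labels = labels.copy()
        changed = True
        while changed:                      # converges in at most m*n sweeps
            changed = False
            for i in range(n):
                for k in range(m):
                    if not labels[i, k]:
                        continue
                    # Eq. (4.5.18): keep label k of object i only if every object j
                    # still has some label p with l_jp * Lambda_ij(k, p) = 1
                    if not all((labels[j] * Lam[(i, j)][k]).any() for j in range(n)):
                        labels[i, k] = 0
                        changed = True
        return labels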
The object com patibility m atrix is defined as [GWH87] Ay = (Ai x A}) * Cij (4.5.19) The m atrix m ultiplication is denoted by x in the equation. We show below the com putations of a few Ay’s using Eq. 4.5.19: AX 1 = [1 0 0 1]* x [1 0 0 1] * C(a) 94 F igu re 4.10: Scene la b elin g p rob lem w ith four ob jects an d fou r lab els 95 : 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 * = 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 A1 2 = [1 0 0 1]* x [0 1 1 0] * C(b) i t 0 1 1 0 0 1 1 1 0 1 1 0 1 i 0 0 0 0 1 0 1 0 0 0 0 0 1 “ * — 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 i | A1 3 = [1 0 0 1]* x [1 0 1 0] * (7 (4 ) 1 0 1 0 0 1 1 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 * 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 A1 4 = [1 0 0 1]* x [1 1 0 1 ] * C( 6 ) 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 * = 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 l - A2 3 = [0 1 1 0]f x [1 0 1 0] * C(e ) 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 * = 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 96 In selecting a com patibility m atrix C , we consider th a t Objects 1&2, 1&3 and 1&4 are adjacent, but Objects 2&3, 2&4 and 3&4 are separated. Complete object com patibility m atrices stored in OM P memory modules (A ,-y at M tJ) are shown in Fig. 4.11. Initial labeling T0 = [Ai A2 A3 A4]* is shown below: r° = 1 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1 i * Note th a t 4 i^ = 1 22 = I33 = 4 * P = 1- The first iteration computes T1 = [4y^]> (1 < i , j < 4 ). The superscripts represent the iteration num ber. Below illustrates a portion of com putations of the labeling T1: In and /31. 4 ? <- 4 ? * [4? * A n ( l , 1) + 4 ? * A n ( l ,2 ) + 4 ? * A n ( l ,3 ) + 4 ? * A U ( M ) ] 4 ? * A w ( l , l ) + 4 ? * A 12( l , 2 ) + 4 ? * A12( l ,3 ) + 4 ? * A 12( l ,4 )] 4 ? * A13(l, 1) + 4 ? * A13(l, 2) + 4 ? * A13( l , 3) + 4 ? * A13( l , 4)' ’4 ? * A 14( l , 1) + 4 ? * A 14( l , 2 ) + 4 ? * A14( l ,3 ) + 4 ? * A14( l ,4 )' = 1 * [1 * l + 0 * 0 + 0 * 0 + l * 0 ] * [ 0 * 0 + l * l H - l * l + 0*0] *[ l* 0 + 0 * 0 + l * l + 0 * 0 ] * [ l * 0 + l * l + 0 * 0 + l * l ] = l 4 ? < “ 4°i} * [4? * A31( l , 1) + 4 ? * A31( l , 2 ) + 4°> * A31( l , 3 ) + 4 ? * A31( l , 4 )] 4 ? * A32( l , 1) + 4 ? * A32( l ,2 ) + 4 ? * A32( l ,3 ) + 4 J> * A32( l ,4 ) 4 ? * A33( l , 1) + 4 ? * A33( l , 2 ) + 4 ? * A33( l , 3) + 4 ? * A33( l , 4 )' 4 ? * A34( l , 1) + 4° 2) * A34( l , 2 ) + 4 ? * A34( l , 3 ) + 4 P * A 34(l, 4) = l * [ l *0 + 0 * 0 + 0 * 0 + l * l ] * [ 0 * 0 + l *0 + l *0 + 0 * 0] * [ l * l + 0 * 0 + l *0 + 0 * 0] * [ l *0 + l *0 + 0 * 0 + l * 0] = 0 Complete labeling m atrices after each iteration are shown below. After two global iterations, the labeling converges (r2). It is obvious th a t the Objects 1,2,3, and 4 are labeled respectively sky, tree, house, and cloud. 97 r 3 10 0 0 0 10 0 0 0 10 0 0 0 1 Notice th a t the labeling m ay give an ambiguous result. In this case, an unam biguous labeling is found by searching among the possible sets of solution [RHZ76]. r° r 1 1 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 r 2 l 0 0 0 0 1 0 0 0 0 1 0 0 0 0 i 4.6 Conclusions We have developed parallel algorithm s for p attern classification and im age processing on the OMP. Image enhancem ent using histogram flattening, line detection by projection, p attern clustering, and scene labeling using dis crete relaxation are included. The parallelizations are based on the orthogonal processing, which is potentially im portant for other algorithm ic development on the OMP. 
Due to proper partitioning and the almost independent execution of each task, most of them achieve linear speedups over the uniprocessor performance, as shown in Table 4.2.

Figure 4.11: Compatibility matrices for the house scene labeling problem stored in the memory of the OMP with four processors. (Module M_ij holds the 4 x 4 object compatibility matrix Λ_ij.)

Table 4.2: Comparison of Time Complexities of Parallel Algorithms on the OMP with Serial Algorithms

Algorithm                                           | n-processor OMP           | n^2-PE MPP        | Serial algorithm
Recursive Doubling (n data)                         | O(log n)                  | O(n)              | O(n)
Histogramming (kn x kn data, L levels)              | O(k^2 n + L)              | O(k^2 n + L)      | O(k^2 n^2)
2-D FFT (kn x kn data)                              | O(k^2 n log kn)           | O(k^2 n log kn)   | O(k^2 n^2 log kn)
Histogram Equalization (kn x kn data, L levels)     | O{(k^2 + L)n}             | N/A               | O(k^2 n^2)
Hough Transform (kn x kn data, M angles)            | O(k^2 M n)                | O(k^2 (M + n))    | O(k^2 M n^2)
Pattern Clustering (N data, m features, K clusters) | O[{(3m+4)K + 2(m+1)}N/n]  | N/A               | O[{(3m+4)K + 2(m+1)}N]
Relaxation Labeling (n data, M labels)              | O{M^2 n^2 (M + log n)}    | N/A               | O(M^3 n^3)

The OMP was initially applied to coarse-grain computation [HTK89]. This chapter, however, presents effective uses of the OMP in solving problems of both low-level and high-level image data processing; the OMP supports both local and global operations. For several of the algorithms above, the n-processor OMP achieves performance equal to that of an n^2-PE MPP-type array processor, which demonstrates the efficiency of the architecture for image processing and pattern recognition. By employing the orthogonal processing principle, many other algorithms, such as convolution, correlation, thresholding, 2-D linear filtering, image segmentation, the connected component problem, motion detection, template matching, contour following, and convex hull detection, can easily be computed to exploit their parallelism. They can take advantage of simple data partitioning, parallel execution, and simultaneous communication.

In conclusion, both fine-grain image processing and coarse-grain pattern analysis can be performed efficiently using orthogonal processing, with great flexibility in programming and data manipulation.

Chapter 5

Multidimensional Orthogonal Multiprocessors

An extension of the orthogonal multiprocessor to higher dimensions is considered in this chapter. The extension removes the O(N^2) memory requirement of the N-processor OMP. The orthogonal access principle applied to the original OMP is again enforced, and the orthogonal memory references result in a high memory bandwidth due to conflict-free accesses. The regular structure, synchronized control, and various system configurations contribute to implementing massively parallel systems.

5.1 Binary Orthogonal Multiprocessors

The original OMP (Fig. 2.1) can be viewed as a 2-dimensional architecture since the memories are organized into a 2-dimensional array.
Before we jum p into a very general extension of the OM P, binary orthogonal m ultiprocessors are studied first to clarify the expanded architecture. I 5.1.1 The System A rchitecture A binary 3-dimensional OM P is shown in Fig. 5.1. This OM P consists of 1 22 = 4 processors and 23 = 8 memory modules which are interconnected by 4 1 dedicated buses. Each processor can be switched to access the memories in 3 orthogonal directions as m arked x, y, and z in Fig. 5.1. Each bus consists of i three branches in x , y, and z directions. Two memory modules are tied to each branch. One of them can be accessed in any direction; hence it is dedicated to the processor th at owns the bus. We call it a private memory. The private memory is located at the center where the three branch joins. Figure 5.1b-d shows three different ways of memory access by the orthogonal access principle. For the x-access only the branches in the x direction are active. The access path is established by selecting the bus in x direction by the 3-to-l switch in each memory module. Refer to Fig. 5.4 for the general switch configuration. The other accesses are performed in a similar fashion. A binary 4-dimensional OMP and 2 memory accesses out of 4 possible directions are illustrated in Fig. 5.2. Here eight processors can sim ultaneously access 16 memory modules in each direction w ithout any conflict. A binary n-dimensional OMP is constructed by 2n_1 processors and 2n ; memory modules based on a binary n-cube. All nodes are occupied by memory ! modules, and all edges form 2n_1 buses interconnecting processors and memory modules. Logically, we may consider th a t the 2n~1 processors are located at the position of their own private memories. Then processors are located every two j links away. This corresponds to two-bit difference in the node index when an I ra-bit binary representation is used. The processors (or private memories) can 1 be found in the even-numbered positions of the binary-reflected grey code. For example, they are 0, 3, 6, • • •, 9 for 4-dimensional OM P in Fig. 5.3 (see Table 5.1 for a grey code). 103 i P ro c e s s o rs m ip, 1 0 11 00 01 I B u s 0 B u s 1 B u s 2 B u s 3 M e m o ry f (a) Overall configuration (b) x-access (c) y-access (d) z-access j F igu re 5.1: A n arch itectu re o f a b in ary th ree-d im en sion al O M P I I i 104 Processors P P 1 00 (SiN 11 0 ill Memory ; - - f j r 000 T r 010 I ^^p 101 I ^ P1 -J., 001 r 011 101 000 00 001 o' (a) Overall configuration w-access w (c) z-access F igu re 5.2: A n arch itectu re o f a b in ary fou r-d im en sion al O M P 105 1 | I 5 F igu re 5.3: L ocation s o f p rocessors in b in ary fou r-d im en sion al O M P T able 5.1: B in ary-R eflected G ray C od e for 16 N u m b ers 0 0000 (0) 1 0001 (1) 2 0011 (3) 3 0010 (2) 4 0110 (6) 5 0111 (7) 6 0101 (5) 7 0100 (4) 8 1100 (12) 9 1101 (13) 10 1111 (15) 11 1110 (14) 12 1010 (10) 13 1011 (11) 14 1001 (9) 15 1000 (8) 5.1.2 Com parison w ith H ypercube C om puters The binary OM Ps are very similar to binary hypercnbe com puters. Buses supported by shared memories are equivalent to the point-to-point links in the hypercube com puter. There axe two disjoint communication paths between two adjacent processors in the OMP. This is because the num ber of memory modules is twice th at of the processors. A network diameter is defined as the m axim um num ber of links of the shortest path between any two processors to communi- | cate with. 
In the hypercube topology, the network diam eter is obtained by the I Hamming distance, i.e. the num ber of digits which are different from each other in their indices between the source and destination. The comm unication cost in the OM P is m easured by the num ber of memory reads or memory writes for a message to reach the destination via the shortest path. It corresponds to the ! num ber of edges of the shortest path between the source and the destination. However, the edges in the OM P are interpreted differently from those in hy 107 1 percube com puters. Communication between processors m edges apart in the hypercube needs m memory reads and m memory writes. Thus, two memory references in the OM P correspond to the routing cost for one edge in the hy- j percube. We will call them a routing unit. According to the above definition of 1 comm unication cost, the network diam eter of a binary n-dim ensional OM P is | re/2 ({re — l} /2 for odd re) routing units. H ard ware-wise, the OM P needs twice I i as m any memories th an the hypercube com puter w ith an equal num ber of pro- | cessors. The OM P needs 2” buses, each of which runs in re dimensions. While the hypercube requires re • 2n point-to-point links. Comparisons of hardware requirem ents and network characteristics are sum m arized in Table 5.2. T a b le 5.2: C h a ra c te ris tic s o f th e O M P a n d C o m p a ris o n w ith a H y - | p e rc u b e C o m p u te r binary re-d OMP binary (re — l)-d hypercube No. of processors N = 2n~1 No. of memory modules 2n 2 n -l No. of buses or links N W log A T Links (ports) per node re re — 1 Nodes per bus re - f 1 - Network diam eter re/2 re — 1 5.2 Generalized Orthogonal Multiprocessors i | 5.2.1 A (fc,n)-O M P A rchitecture In general, a fc-ary re-dimensional OM P consists of N = kn~* processors, M = k N — kn memory modules, and N dedicated memory buses. For a binary case k = 2, thus N — 2n_1, and M — 2". From now on, we use the notation (k, re) | |_OM P for a k-ary re-dimensional OMP. If we ignore the presence of processors, 108 I i i the memories and buses form an »-dimensional hypercube. Each node of (A ;, n) hypercube is occupied by a memory module. Collinear edges belong to one bus. The architecture may be called a shared-memory hypercube because the topology is hypercube and memories are shared. But the connection between two nodes is established by a bus instead of a point-to-point link. Each of the N buses consists of n branches spanning to n orthogonal di rections (dimensions), k memory modules are interconnected to each branch. The private memory is common to every branch of the same bus. Again the processors are at the nodes where their private memories are located. Thus a total of n(k — 1) + 1 memory modules are accessible by each individual processor. A memory module is shared by n different buses (processors) except the private memories. The actual access is controlled by an n-to-1 switch in each memory m odule th at allows only one bus at a time (Fig. 5.4). The orthogonal access rule is again observed to simplify the d ata routing and control of buses: all processors should access the memory in the same direction in a memory cycle. Processors which do not need memory accesses will not participate the memory access, n different patterns of memory access exist according to the rule. Considering the synchronism in memory access, the SIMD operation should be quite suitable for controlling high-dimensional OMPs. 
The d ata movement from one memory module to another w ithin the same branch takes two memory operations (one read and one write) for non-binary OM Ps (A ; > 2). Suppose we represent the processors in n — 1 dimensional space. A message passing takes one memory read and one memory w rite between two i processors w ith one digit difference in their indices. Thus the routing cost for l l processors w ith i different digits in their indices is 2{i — 1) memory references. The cost is equivalent to th a t of routing a path having * — 1 links for hypercube ; com puters. The m axim um difference in the indices of two processors is n — 1, j thus the network diam eter is n — 1. Section 5.3 deals w ith the issue in greater . detail. 109 n-to-1 switch Bus 1 Bus 2 Memoryf Bus 3 Bus n F igu re 5.4: A logical diagram o f a m em ory m od u le c o n sistin g o f n-to-1 sw itch an d a m em ory 5.2.2 Com parisons of H ardware R equirem ents The OM P architecture is very sim ilar to spanning bus hypercube (SBH) [Wit84,AJP86] and generalized hypercube structure (GHC) [AJP86,BA84] as shown in Fig. 5.5. SBH and GHC are distributed-m em ory multiprocessors while the OM P is a shaxed-memory architecture. A bus connects m ultiple processor- memory nodes in SBH. There could be bus contentions between processors. The OMP, however, does not have the contention problem since the buses are not shared. The GHC contains point-to-point links for internode communication. It is an extension from binary hypercubes. Assuming the three systems have equal num ber of processors, the OM P needs more memory components than the two hypercubes. B ut the num ber of different links (either buses or direct links) is lower th an the two architectures. Table 5.3 summarizes the comparisons of hardw are complexity of OMP w ith GHC and SBH. Table 5.3: C h aracteristic C om p arison s o f G en eralized O M P w ith G en eralized H y p ercu b e an d S p an n in g-B u s H yp ercu b e Architecture (k,n + 1) OMP (k,n) hypercube {k,n) Spanning-bus hypercube No. of processors N = k n No. of memory modules k N N N No. of buses or links N N n (k - l) /2 N n /k Links (ports) per node n n(k — 1) n Nodes per bus (n + l){k - 1) + 1 - k Network diam eter n n n Recall the relationship N — kn. For the OM Ps w ith the same num ber of processors, the larger the dimensionality, the smaller the num ber of the mem ory modules required. However, the communication delay becomes longer, since the network diam eter is n — 1 for a (k, n) OMP. For example, a (2,7)-OMP of I l l 321 021 020 311 010 301 001 200 1 0 0 000 300 (a) A 4x3x2 generalized hypercube network (b) A (3,3) spanning bus hypercube network spanning F igu re 5.5: T h e arch itectu res o f G eneralized H yp ercu b e an d S p an n in g B u s H y p ercu b e 112 64 processors requires only 128 memory modules w ith a diam eter of 6. While a (8,3)-OMP needs 512 modules w ith a diam eter 2. Table 5.4 lists character istics of different implementations for 64-processor and 64K-processor OM Ps. The tradeoff between hardw are requirements and comm unication speeds can be ! im portant for building large systems. Table 5.4: V arious Im p lem en tation o f M u ltid im en sio n a l O M P No. of processors N = 64 N = 65,536 (= 64K) Organization: (k-ary, n-d) (2,7) (4,4) (8,3) (2,17) (4,9) (16,5) No. of m emory modules 128 256 512 128K 256K 1000K Network diam eter 6 3 2 16 8 4 No. 
of buses 64 64 64 64K 64K 64K Memory modules per bus 8 13 22 18 28 76 5.3 Communication Strategy 5.3.1 M em ory H ypercube A (A;,n)-OMP is constructed by kn~l processors and k n memory modules located at all nodes of a (k,n) hypercube. Collinear edges of the hypercube belong to a branch. The memory organization including the bus connections is called a (A;,») memory-hypercube. The processors are considered to be located a t their own private memory nodes of the (k, n) memory-hypercube. Thus a private memory node in the memory hypercube m ay be regarded as combina tion of a processor and its local memory. The node may be called a processor node. Nonprocessor nodes are called memory nodes. The processor node is a center of the n branches of the associated memory bus. The processor nodes are not regularly distributed in the (k,n) memory-hypercube. The distribution is not unique either. It is not easy to figure out where a certain processor is lo- 113 cated in the memory-hypercube. The following sections deal w ith interprocessor communication w ith a simplified interconnection model. 5.3.2 P rocessor Interconnection M odel Processors in the (k , n) OM P can be represented by (n — l)-tuple: P (a n_2,a n- 3,•••Jai,ao)j (0 < a, < k for all * = 0 , 1 , ...,n — 2 ) (5.3.1) The distance between processors will be defined later. M emory modules are indexed similarly by n-tuple as Af(6n_i, 6n_2, &i , bo) for 0 < b j < k, j — 0,1, •••, n — 1. A processor P (a n_2, • • •, flo) whose processor node is located at (&„_!, • • •, bo) can access n{k — l) -f-1 memory modules: A f {bn — i , bn — 2 j 11 ■ >^2)^1) 3?) Af(6„ - i j 6n -2) • • • ,bz,x,bo) : (x = 0 , 1 ,- •• ,Jc — 1 ) A f(6n_ i , x , 6 n_3, • • • , b i , b o ) h d (x, bn — 2, 6n-3» ’ * ‘ j ^1 > ^o) Note th a t the memory modules have the same index as th a t of the processor node except in one coordinate. For illustration, consider a (3,3) OM P whose memory modules are M ( x ,y ,z ), where 0 < x, y, z < 3 (Fig. 5.6 The processor whose processor node is (0 ,1 ,1) can access 7 memory modules: M ( 0 , 1 , l) in all ; 3 three directions; Af(1,1,1), A f(2 ,l,l) in x-direction; A f(0,0, l), Af(0 , 2 , 1) in the y-direction; Af(0,1,0), Af (0,1,2) in ^-direction. From the above relationship of one processor w ith its accessible mem ory modules, it is possible to find processors sharing a specific memory mod- j ule. For a non-private memory module Af(6„_i, 6n_2, • • • ,&i, &o), the n proces- ; sors sharing the memory are at (6n_ i , 6n_2, • • • ,&i, zo), (6n_ i , 6n_2,- •• ,zi,&o), •••, ' (& n-i> zn- 2 , " , bu b0), and (z„_i, 6n_2, • • •, &i, b0), where z0, zn- 1 are fixed , constants. F igu re 5.6: M em ory m od u les accessib le by a p rocessor in (3,3) O M P 115 i The OM P architecture can be conceptually simplified to retain the proces sors and their interconnections for d ata routing purpose. Two processors will be thought connected via a logical link in the conceptual architecture if they ■ share a common memory module. The original (k, n) OM P can be reduced to j {k,n -1) hypercube by removing all memory nodes. The (k,n -1) hypercube will consist of processors at each node and links between nodes. Collinear edges form a logical bus in the new hypercube. Although the processors don’t have direct interconnections w ith one another, their communications via shared memories are considered via the hypothetical buses. Such a structure may be called a processor-hypercube. 
The (k,n - 1) processor-hypercube is created by projecting the (fc, n) mem- ory-hypercube. For example, this projection m ay be created along the nth ; dimension keeping only the processor nodes. Because no two processors exist on the same branch in the original (k , n) m em ory-hypercube, there is no over- , w rap of processors at the same node after the projection. A processor node at (bn- i,b n- 2, • • - ,b i,b 0) is m apped to jP(an_2, • • • ,a i,a o ) in the processor hyper cube, where a,- = 6, for all i = 0,1, • ■ •, n — 2. The (k,n - 1) processor-hypercube preserves the interprocessor connection relationship except the n th dimension. In other words, the information for the n th dimension is lost in the processor- ; hypercube. This results in reducing the physical proximity (distance) of those { projected processors by one. Since the fcn_1 processors are evenly distributed j in the n-dimensional memory-hypercube, there are always k processors on any I plane of k x k memory modules. The even distribution also implies th at one i sub hypercube, which consists of all processor nodes w ith 6n_i equal to zero, con- i ; tains N /k processors. Thus N jk processors have no change in their positions j even after the projection; however the rem aining N (k — l ) / k processors do have ^ changes. Refer to Fig. 5.7 for the illustration of (2,3) and (3,2) OM Ps. Thus j the routing distance based on (A ;,n - 1) processor-hypercube may not be precise. It may give either a lower bound or an upper bound depending on the weight set for each link of the processor-hypercube. If we set one memory operation to traverse a logical bus (this corresponds to 1/2 of hypercube com puter), it will result in lower bound due to ignoring the cost hidden along the projection direction. If we set two (i.e. one routing unit), it will result in upper bound since it overestimates each link for N /k processors. For simplicity, it is best to select one routing unit for one link (bus) traversal which would result in the upper bound of routing costs. 5.3.3 R outing A lgorithm s The routing strategy used in hypercube m ulticom puters can be employed for d ata routing between processors in the OMP. Two processors having only one-digit difference can communicate in one routing unit (distance one). Mes sages are sent via the bus in the dimension w ith unequal indices. Processors with two digit difference have a distance of two. For example, in the (4,4)-OM P, the processor P (0 ,0 ,0 ) can send a message to the processor P ( 0 ,1,3) in two routing units. There are two disjoint paths: P ath 1 : P ( 0 ,0,0) -> P ( 0 ,1,0) -> P ( 0 ,1,3) P ath 2 : P (0 ,0 ,0 ) -* P (0 ,0 ,3 ) -> ■ P (0 ,l,3 ) D ata routing between two processors w ith m ultiple digit differences can be done in a sequence of routings between two processors having one-digit difference. The d ata routing algorithm can be simplified by following the order of the index. If there are d-digit difference in the indices, arbitrarily choose the first dimension in which the processors have different indices. The d ata are sent along the direction, then along the next highest dimension w ith unequal indices, and so on. Once the lowest dimension is reached, the next dimension is determ ined in w rap-around fashion. We can find d disjoint paths to route d ata to the same destination due to d different choices for the first routing dimension. 
The (a) A (2,3) processor-hypercube from the (2,4) OMP (b) A (3,2) processor-hypercube from the (3,3) OMP F ig u re 5.7: (2 ,3 )- a n d (3 ,2 )-p ro c e sso r h y p e rc u b e s 118 Hamming distance is the routing distance required to reach the destination. The m axim um difference in the indices between processors is n — 1 for the (A :, n — l) processor hypercube. Thus the network diam eter is n — 1 for n-dimensional OM P (n = log* TV). Simultaneous data movement by m ultiple processors via j the same logical bus causes no bus contention. It is because the bus represents n separate memory buses associated w ith those processors in the direction of d ata movement. The orthogonal access guarantees parallel d ata movement for all processors w ithout conflict. Broadcasting d ata from a processor to all N other processors takes n — log*. N steps. The d ata propagate along j t h dimension at j t h step (j = 0 ,1 ,•• •, n — 2). At one memory cycle, each processor sim ultaneously writes the data received onto all k memory modules in the j th dimension. Each tim e the num ber of processors becomes k times previous one th a t have already received the data. , Thus, after n — 1 steps of propagation all processors will get the data. In another routing example, we show below th a t com puting a consensus function takes log2 N memory accesses. The consensus functions include the m axim um /m inim um , sum, an d /o r of N num bers. By recursive doubling intro duced in Section 4.1.1, k num bers stored in k processors of (k, 2) OM P can be sum m ed up in 0 ( log2 k) memory accesses. This corresponds to the sum m ing of k num bers stored in k processors tied to a straight line in a (A ;, n — 1) processor- hypercube. Notice th a t there are m any parallel lines in the direction. Thus, for i , each of those parallel lines, a sum m ation of k num bers is com puted simultane- ; ously. After one step of the concurrent sum, the num ber of interm ediate sums to be further added is reduced to 1/A: of previous one. By alternating the direction among the n — 1 dimension, the total is found after n -1 sum m ing operations ! since there are A ;"-1 num bers. The algorithm is sketched in Algorithm 7. I The result (SUM of the N num bers) is obtained in processor P (0 ,0 , • • •, 0). ' Step (l) can be done in 0(logA;) time. Step (2) takes constant tim e. Hence the t A lg o rith m 7 : S U M fo r j = 0 to n — 2 d o (1) and (2): (1) Make groups of k processors having same coordinate (using the processor-hypercube model) except ij {ij = 0,1, • • •, k — 1) w ith ij~i, ij- 2 , — , *0 all zeroes. Each group of k processors simultaneously computes a sum of k d ata by recursive doubling using k x k memory array associated w ith the group. (2) The results axe moved to the processors whose nodes have the same indices except the zero in jr-th dimension. overall complexity of the SUM is 0 { (n — l) log k} = 0 (lo g k n *) = 0(log N ), which is similar performance to th a t of hypercube computers. 5.3.4 A verage Internode D istance In the (k, n — 1) processor-hypercube, distance between two processors hav- ing the same index except in only one digit is unity. Here the unit of cost is m easured in a routing unit. The average internode distance plays a key role ! in determ ining the queueing delay in the communication network. The average i | distance d is defined as ' n— 1 I 3 = YL dNd/ ( N - 1) (5.3.2) j fei where is the num ber of processors in the distance d from a source node. 
L et’s j consider the P (0, • • •, 0) as the source processor. There are k-1 nodes which differ ! from the source node only in the *th dimension (* = 0,1, • • •, n — 2). Hence the total num ber N \ of nodes differing by distance 1 is | 120 l j n-l Ni = ^2 {k - 1) = (n - l)(k - 1) (5.3.3) i » ' = 1 The nodes which have a distance of 2 from the source node m ust differ in I their addresses by two coordinates * and j . In these two dimensions, (k — l ) 2 j different combinations can occur. Again, these two dimensions are selected out I of n — 1 such dimensions existing in the address space. Hence, the total num ber I of nodes differing by the shortest distance 2 (iV2) is i 1*2 = | " 2 1 | (fc - 1) = The same idea can be applied to calculate the num ber of nodes differing by a \ Hamming distance d: The average message distance is com puted from the above results: d = J 2 d N d/ { N - 1) d=l = (n - 1 )(k - l)k n~2/{ N - 1) It is approxim ated as (n — 1 )(k — 1 )kn~2/(k n~1 — 1) « n — 1, if A : > 1. In the binary case d « (n — l)/2 . Note th a t the average internode distance of generalized hypercube and spanning hypercube is n. 121 5.4 Conclusions T he multidimensional orthogonal multiprocessor construction is based on hypercube topology and shared memories. The OM P requires fc-times more memory modules than hypercube computers. However, only N independent buses are needed which contrasts to N logiV point-to-point links for an JV-node hypercube com puter. Also the shared-m em ory organization is more advanta geous in interprocessor communication than distributed one for fine-grain SIMD com putation. It has less overhead in specifying the message routing informa tion and does not need complicated routing algorithm s. Hence it can achieve faster comm unication than the general message-passing scheme used in hyper cube m ulticom puters. The m ajor characteristics are sum m arized below: • By introducing high-dimensional configurations, the num ber of memory modules required is reduced significantly from the original 0 { N 2) for N - processor OMP. The reduction in hardw are requirem ent aids in building SIMD fine-grain massively parallel multiprocessors. • The orthogonal memory access rule may be rigid for general-purpose appli cations. However, it contributes to the delivery of high memory bandw idth due to conflict-free memory accesses. • The performance of the OMP in interprocessor comm unication is compa rable to other similar architectures like GHC and SBH in term s of network diam eter and average internode distance. The OM P requires fewer bus in terconnections while it needs more memory modules th an the GHC and SBH. Conflict in memory access is removed by observing the orthogonal access rule. • There are various combinations of selecting radix k and dimension n for (&, n) OM P w ith the same num ber of processors. There is a tradeoff in the 122 , selection which affects comm unication delay and hardw are requirements. i To achieve faster communication, more memory modules are needed. M any regularly structured algorithm s, particularly m ost data-parallel al- [ gorithm s, can be m atched nicely w ith the OMP operations. O ther suitable algo rithm s include vector/m atrix arithm etic, signal/im age processing, optim ization process, PD E and graph algorithms. The generalized OM P architecture sup- | ports SIMD as well as MIMD com putations. The orthogonal architecture offers j a viable alternative to the conventional SIMD array architecture which uses distributed local memories. 
The significance of the high dimensional OMP is th a t it can implement massively-parallel shared-m em ory com puters based on hypercube topology with relatively small num ber of buses. 123 i Chapter 6 ! | Other Variation of the I j Orthogonal Architecture I I I I An SIMD array processor augmented w ith bypass connections, called mesh with bypass connections (MBC) is presented. This architecture reduces the di am eter of the n x n array from 0(n ) to O (l). MBC significantly improves the performance of the array processor, especially in global operations; finding the m inim um /m axim um , average, or sum takes 0(log2 n) tim e, which require 0{n) ; w ith ordinary mesh connections, 0{n^) in mesh w ith m ultiple broadcast. Ma- | trix m ultiplication can be done in 0(n) time. Image processing operations such ! as histogram ing, m edian row finding, convolution, and image projection can be efficiently m apped onto the architecture. MBC using circuit switching instead of j store-and-forward outperform s the mesh, mesh w ith broadcast, and mesh w ith j j multiple broadcast. ; 6.1 Mesh with Bypass Connections 1 Mesh connected computers are well recognized for handling a large num ber ! of d ata w ith relatively simple com putation requirem ents such as in image pro- j cessing [MS85,Hwa83,Ros83], m atrix m ultiplication, and sorting[TK77,NS79]. The homogeneous structure of each processor and its interconnection make the j mesh suitable for VLSI implementation. In general the mesh connected comput- | ers form a massively parallel structure with a huge num ber of simple (bit-serial) i | processors and operate in lock-step fashion (Single Instruction M ultiple D ata j operation). The mesh architecture nicely m atches to m any local operations I where the com puted result depends on the values of its neighbors. However, operations requiring global data or d ata from rem ote processing elements suffer j greatly. The weakness in this architecture is due to the lack of global inter- ' connections. Problems requiring d ata to travel all the way to the row/colum n I need 0(n) tim e for d ata routing for n x n mesh, even though the com putation I itself is quite simple and short. In this case the communication (data routing) j tim e is a bottleneck of the entire operation. Typical examples of this kind are com puting average, m atrix transposition, and finding a maximum, etc. There have been several attem pts to reduce the comm unication using proper m apping strategies [NS80,Orc75,Bok81]. But the m ethods are not usually simple, often I , involve some heuristics, and m apping overhead is large. I A hardw are broadcast has been introduced to send global d ata to all pro cessing elements (PEs) in a constant time. The broadcast greatly improves the performance. For example, finding a sum is done from 0 { n ) to O (n l) [Sto82,Sto83], Multiple broadcast further upgrades the power of the broadcast by adding n row broadcast buses and n column broadcast buses[KR85]. Broad casting is done sim ultaneously either for all the rows or for all the columns through row/colum n buses. This architecture improves the performance of the j mesh connected com puter up to O (ns) for the summing operation. The im- 125 plem entation of the buses in VLSI chip, however, is not simple since to route i I a p ath th at runs a long distance on the silicon is not easy. Also the control i | of each bus is a heavy burden to a very simple, single controller in the SIMD j architecture. 
Some type of interconnection scheme on the mesh is required that supports global communications and yet still permits the massive parallelism. The mesh with bypass is designed to fulfill this requirement. In the next section the architecture of the mesh with bypass connections is described. The essential feature is demonstrated in binary-tree operations. Histogramming, projection, and matrix multiplication are mapped onto MBC. Some algorithms that cannot benefit from the bypass are identified. The performance of the MBC is summarized at the end.

6.1.1 The Architecture

In a conventional mesh such as the MPP, data travel only a unit distance per clock cycle. For long-distance routing an incoming datum is stored (latched) in a register of a processor, then sent to the next processor in the next clock cycle (this is like store-and-forward). A few bypass connections are added in each PE of the conventional mesh so that a datum can pass through the PE, when necessary, without the latching delay. The traveling distance in a unit clock cycle, called the routing distance (d), can be made arbitrarily long. It depends on how many consecutive bypass connections are established at the moment. This reduces the network diameter (the maximum time to communicate between any two points through the shortest path in the network) of the n x n mesh array from O(n) to O(1). Fig. 6.1 shows the MBC for the one-dimensional and two-dimensional cases. Note that the diagram is only a logical view. More hardware elaborations are needed: to disconnect the output of the PE performing the bypass to prevent feedback, to control the bypass switch, and so on. There is some overhead in establishing and controlling the bypass circuits, and delays are incurred in the switch. These effects may change the practical network diameter to between O(1) and O(n). However, they are relatively small or constant compared with program control operations and the global system clock. They are thus ignored or absorbed in the program control. A very similar architecture with bypass circuits has been independently developed by Li et al. [LM87]. Further generalization is being considered by Miller et al. [MKR*88] using reconfigurable buses on a mesh architecture. Ametek, Inc. has developed a message-passing network with mesh topology using circuit switching for multiprocessor systems [Sei88].

A more systematic use of the bypass feature is described below. The processors are partitioned into groups of d = 2^m (m: integer) consecutive PEs, which make routes whose distance is d. If the number of processors is n, it is possible to make up to n/d separate data paths simultaneously. They are called multiple point-to-point connections. This feature gives more parallelism than a bus, which only one communication pair (the source and destination PEs of the data exchange) can use at a time. The PEs enclosed by the communication pairs in the paths may not send data. However, they can still receive data traveling on the bypass circuits, since the input of the processors is never disconnected. Whether the data is accepted or not is determined by a mask flag of each processor. Various routing schemes can be observed in Fig. 6.4. If d = n for a one-dimensional mesh, the routing path becomes a broadcast bus. All the bypass circuits in the mesh are activated and all the processors can latch the data traveling through the bus.
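To make the effect of the bypass concrete, the following toy Python model is offered only as an illustration written for this discussion, not as the thesis hardware design; the function names, PE indices, and the 16-PE example are invented. It contrasts a store-and-forward transfer on the plain one-dimensional mesh with a single bypassed transfer on the MBC, counting only routing steps and ignoring switch set-up overhead, as the discussion above does.

    def store_and_forward_cycles(src, dst):
        """Plain mesh: the datum is latched in every intermediate PE, one hop per cycle."""
        return abs(dst - src)

    def bypassed_transfer(src, dst):
        """MBC: every PE strictly between src and dst closes its bypass switch, so the
        datum reaches dst in a single cycle regardless of the distance."""
        bypassed = list(range(min(src, dst) + 1, max(src, dst)))
        return 1, bypassed

    print(store_and_forward_cycles(0, 15))   # 15 cycles across a 16-PE row
    cycles, closed = bypassed_transfer(0, 15)
    print(cycles, closed)                    # 1 cycle; PEs 1..14 are bypassed (they may still listen)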
In a two-dimensional mesh of n x n processors, n simultaneous column broadcast buses or n row broadcast buses can be established. They are called multiple broadcast buses [KR85], shown in Fig. 6.2.

Theoretically, every bypass circuit can be switched independently. However, considering the implementation cost and efficiency, the bypass circuits of all processors in the same row (or column) of the array are controlled by one signal, while each row (column) is controlled independently. Therefore, for the four-neighborhood mesh, n binary control signals are needed for the switching of the n rows, and n for the columns. They can be further reduced to n, since only one of the horizontal and vertical bypasses is allowed at a time for the entire array. The bypass scheme can be extended to arbitrarily higher dimensions and larger node degrees (more than four neighborhood connections). Only the one- and two-dimensional cases are considered in this thesis.

The bypass connections are confined to each PE, and all processors still have the same structure after the addition of the bypass facility. This preserves the homogeneity of the mesh organization. Hence we can implement the structure in VLSI to exploit massive parallelism.

Figure 6.1: Mesh with bypass connections. (a) One-dimensional mesh with bypass; (b) two-dimensional mesh with bypass (4-neighbor case); c_i is the control signal for column/row i.

Figure 6.2: Mesh with multiple broadcasting.

6.1.2 Binary-tree Operations

Consider a one-dimensional mesh with n processors. Suppose each PE(i) has one datum D(i), and the sum of all n items is required. By propagating the data to the right, the sum will be produced in the rightmost processor. Using the recursive doubling technique [KS73], each active processor transmits a datum to the processor 2^{i-1} PEs to its right at step i. The received value is added to the sum currently stored in the processor. Note that the PEs that do not perform the addition are disabled (using a mask flag) until the completion of the entire task. Usually the mask flag is set as a function of the position of the processor and the execution step. After log_2 n steps, the rightmost processor obtains the desired sum. Throughout the paper we use log n for the logarithm of n with base 2. A detailed scheme is given below, where S_k(i) denotes the sum computed by processor i at step k.

Algorithm 8: SUM
1) Initialization
   S_0(j) = D(j), (1 ≤ j ≤ n)
2) for k = 1 to log n do
   1. Send S_{k-1}(i - 2^{k-1}) to processor i, which lies 2^{k-1} positions ahead.
   2. Processor i performs the addition and stores the result S_k(i) in its routing register,
      where i = 2^k m, for all m = 1, 2, ..., n/2^k
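Algorithm 8 can be checked with a short sequential simulation. The Python sketch below is an illustration written for this discussion, not thesis code; the list-based representation and the eight-element example are my own choices. Each list element stands for one PE's routing register; at step k only the PEs with index i = 2^k m remain unmasked and add the value routed from 2^{k-1} positions behind, which the bypass circuits deliver in one cycle.

    import math, random

    def mbc_sum(data):
        n = len(data)                      # n is assumed to be a power of two
        s = list(data)                     # S_0(j) = D(j)
        for k in range(1, int(math.log2(n)) + 1):
            d = 2 ** (k - 1)               # routing distance at step k
            for i in range(2 ** k, n + 1, 2 ** k):   # active (unmasked) PEs, 1-based index
                s[i - 1] += s[i - 1 - d]   # add the value sent from PE i - 2^(k-1)
        return s[n - 1]                    # the rightmost PE holds the total

    data = [random.randint(0, 9) for _ in range(8)]
    assert mbc_sum(data) == sum(data)      # log2(8) = 3 routing-and-add steps suffice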
The individual steps are illustrated in Fig. 6.3 for n = 8 processors. Here "4-8" represents the sum of D(4) through D(8).

The same scheme can be extended to the two-dimensional mesh array. For all rows in parallel, the summing operation is first completed in the horizontal direction. This parallel computation is called a row condensation. Each rightmost processor holds its row sum after the row condensation. Then the same operation is performed in the vertical direction (column condensation). Only the processors in the rightmost column are needed in this step. The final result is obtained in the bottom right corner processor.

The above operations correspond to a computation represented by a binary tree with n (or n^2) leaf nodes. They may be called binary-tree operations. Suppose the processor array is now ordered in row-major indexing. Each PE acts initially as a leaf node of the binary tree. At first, every two adjacent processors perform a binary operation (for example, an addition): each odd-indexed processor sends its data to the next processor (routing distance = 1 = 2^0), then every even-numbered processor performs the computation (refer to Fig. 6.4). This completes the first-level computation of the binary tree. Then each pair of the two closest active (i.e., unmasked) PEs does the same routing and addition in a similar manner. This time the routing distance is 2 (= 2^1). Only half of the processors that worked in the previous step are active, representing nodes in the next level of the tree. The rest of the procedure is done similarly. In log n^2 = 2 log n steps, the result of the binary-tree operation arrives at the lower right corner PE (the root). Figure 6.4 illustrates an example of summing up all 8^2 data. The routing distance in the ith step is 2^{i-1}. (With row-major indexing, a routing distance of 1 at the beginning of the column condensation is counted as n.)

The above result can be summarized in the following theorem:

Theorem 6.1 If a problem can be mapped onto an MBC of n^2 processors as a binary-tree operation, it can be solved in O(log n) time.

Figure 6.3: Procedure of finding the sum of n = 8 data (routing distances d = 1, 2, 4).

Figure 6.4: Binary-tree operation on 8 x 8 data (routing distances d = 1, 2, 4, 8, 16, 32).

There are many problems that can be mapped onto MBC as binary-tree operations. The requirement is that each node in the tree represent a binary operation, and that the operation be associative. Hence, array reduction problems [Ree84] like finding a minimum/maximum, add/multiply, and AND/OR of the data stored one per PE in the n x n array can be solved in O(log n) steps on MBC.
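Theorem 6.1 can likewise be illustrated with a small sequential simulation. The sketch below is my own illustrative model, not thesis code, and the 8 x 8 example data are arbitrary. It performs the row condensation on every row and then the column condensation on the rightmost column, so any associative binary operation over the n x n data is reduced to the bottom-right PE in 2 log n routing-and-add steps.

    import math

    def binary_tree_reduce(grid, op):
        n = len(grid)                              # n x n array, n a power of two
        g = [row[:] for row in grid]
        for k in range(1, int(math.log2(n)) + 1):  # row condensation (log n steps)
            d = 2 ** (k - 1)
            for row in g:                          # all rows work in parallel on the MBC
                for i in range(2 ** k, n + 1, 2 ** k):
                    row[i - 1] = op(row[i - 1], row[i - 1 - d])
        for k in range(1, int(math.log2(n)) + 1):  # column condensation on the last column
            d = 2 ** (k - 1)
            for i in range(2 ** k, n + 1, 2 ** k):
                g[i - 1][n - 1] = op(g[i - 1][n - 1], g[i - 1 - d][n - 1])
        return g[n - 1][n - 1]                     # the root of the binary tree

    data = [[(3 * i + j) % 7 for j in range(8)] for i in range(8)]
    assert binary_tree_reduce(data, lambda a, b: a + b) == sum(sum(r) for r in data)
    assert binary_tree_reduce(data, max) == max(max(r) for r in data)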
6.1.3 Pyramid Embedding

A pyramid computer has multiple layers with inter-layer connections. A full implementation of a pyramid with N processors at the base requires log_4 N layers and about 4N/3 processors. A planar (one-layer) pyramid can be realized economically using the bypass mesh connections. The N-processor mesh array is regarded as the base of the pyramid. The next layer (second layer) is formed by N/4 processors interconnected with the mesh connection itself. The upper left corner processor of every 2 x 2 processor block in the base layer becomes a node of the second layer. The 2 x 2 block is a set of four processors in the base layer, and its corner processor simultaneously serves as a parent in the second layer. The original neighborhood interconnections in the block now represent the connections between a parent and its children. The internode connections within the second level can be emulated by bypassed links with routing distance 2. In a similar manner, the third layer can be formed with 1/4 of the processors in the second layer, with d = 2 for the interlayer connections and d = 4 for the lateral connections. The remaining layers are simulated in the same manner. The top left corner processor is the processor at the apex of the full implementation. The one-layer pyramid is as powerful as the fully implemented one when the image data are processed level by level in a sequential manner. The capacity of memory in this case should be log N times that of the full pyramid. This architecture is inferior only when all layers are needed simultaneously for pipeline-like processing.

6.2 Mapping of Parallel Algorithms

A few algorithms are implemented on MBC to demonstrate its performance. Matrix multiplication, histogramming, image projection, and a median row finding problem are presented.

6.2.1 Matrix Multiplication

Consider an n x n matrix multiplication C = AB. a_ij and b_ij (and eventually c_ij) are stored in PE(i,j). The matrix multiplication produces n^2 values c_ij, where each c_ij is an inner product defined as:

    c_ij = a_i^T b_j = a_i1 b_1j + a_i2 b_2j + ... + a_in b_nj     (6.2.1)

Here a_i and b_j are column vectors with n components, holding the ith row of A and the jth column of B, respectively. We can perform the matrix multiplication in O(n) time as follows. Each processor PE(i,j) can receive the complete vector a_i after n row broadcasts through row i, and b_j after n column broadcasts through column j. The inner product is then computed, requiring n multiply-and-add operations. However, each processor would need 2n storage locations for memorizing the two vectors a_i and b_j. Instead of starting the multiply-and-add after all data are ready, the procedure begins immediately after two matching operands are available. The multiply-and-add can be interleaved after one row broadcast and one column broadcast. Hence, a sequence of a row broadcast (supplying a_ik), a column broadcast (supplying b_kj), and a multiply-and-add is repeated n times. The procedure for computing the first two terms of c_ij in the above equation is illustrated in Figure 6.5 for a 4 x 4 matrix multiplication.

From Algorithm 9 given below, it is apparent that the computational time is O(n).

Figure 6.5: Snapshots of a matrix multiplication with 4 x 4 PEs (A: initial data; 1: row broadcast; 2: column broadcast; 3: multiply-and-add; 4: another multiply-and-add; B: final result).

Algorithm 9: Matrix Multiplication
   c_ij = 0, for all i, j = 1, 2, ..., n
   for k = 1 to n do
   1) Row broadcast the value a_ik in each row i:
      M_ij[α] <- a_ik, for all i, j = 1, 2, ..., n
      (M_ij[α]: the memory location with index α in PE(i,j))
   2) Column broadcast the value b_kj in each column j:
      M_ij[β] <- b_kj, for all i, j = 1, 2, ..., n
   3) Compute the inner product:
      c_ij <- c_ij + M_ij[α] M_ij[β], for all i, j = 1, 2, ..., n
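A sequential rendering of Algorithm 9 helps verify the interleaved broadcast scheme. The Python sketch below is my own illustration, not thesis code; the array names and the random 4 x 4 example are assumptions made for the demonstration. The two latched operands M_ij[α] and M_ij[β] are refreshed by one row broadcast and one column broadcast per iteration k, and each PE performs a single multiply-and-add, so the outer loop runs exactly n times.

    import random

    def mbc_matmul(a, b):
        n = len(a)
        c = [[0] * n for _ in range(n)]
        alpha = [[0] * n for _ in range(n)]   # M_ij[alpha]: latched row-broadcast value
        beta = [[0] * n for _ in range(n)]    # M_ij[beta]: latched column-broadcast value
        for k in range(n):
            for i in range(n):                # row broadcast: the PE holding a[i][k] drives row i
                for j in range(n):
                    alpha[i][j] = a[i][k]
            for j in range(n):                # column broadcast: the PE holding b[k][j] drives column j
                for i in range(n):
                    beta[i][j] = b[k][j]
            for i in range(n):                # one multiply-and-add per PE; these run in
                for j in range(n):            # parallel on the real array, so O(n) iterations total
                    c[i][j] += alpha[i][j] * beta[i][j]
        return c

    n = 4
    a = [[random.randint(0, 5) for _ in range(n)] for _ in range(n)]
    b = [[random.randint(0, 5) for _ in range(n)] for _ in range(n)]
    ref = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    assert mbc_matmul(a, b) == ref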
6.2.2 Image Projection

Let D(i,j) be a picture gray value stored in PE(i,j). The vector of column sums

    ( \sum_{j=1}^{n} D(1,j), \sum_{j=1}^{n} D(2,j), ..., \sum_{j=1}^{n} D(n,j) )     (6.2.2)

is called the x-projection of D, and the y-projection is similarly defined. The computation of the x and y projections on the MBC is straightforward: each row/column of processing elements participates in the row/column condensation along the row/column in log n steps. The PEs in the rightmost column or in the bottom row then contain the vector components.

We can define a projection onto any line having slope θ with respect to the positive x axis by summing the gray levels of D along the family of lines perpendicular to θ. Projection of an image onto an arbitrary line requires O(n) operations, because all pixel data to be placed on the same point of the projection line must be merged into one processor, which is like a summing operation.

Here we develop an O(log n) algorithm using the bypass features. It is a modification of the SUM operation. The data routing consists of a pair of movements in the horizontal and vertical directions. The routing distance is doubled in each step, similar to recursive doubling. To avoid conflict among the data paths, an auxiliary data movement is interleaved between consecutive main routing operations: one half of the data are shifted backward. This requires some autonomous activity in the individual processors, using the processor index to generate masks under the SIMD control. If the projection is performed for P different angles, it can be completed in O(P log n) time. If the number of projection directions is relatively small (for example, P < n/log n), this algorithm outperforms the O(n + P) algorithm which Cypher et al. developed using a pipeline-like operation [Dav87].

It is assumed that each line contour of the projection is approximated by a band one pixel wide. All pixels centered in a given band contribute to the value of that band. The routing is based on the unit routing vector r(1) = (v_x, v_y). The unit routing vector is the minimum-length vector with angle θ that passes exactly through two processors in its direction. The routing distance is doubled each time. At step k, the routing is determined by r(k) = 2^{k-1}(v_x, v_y) = (2^{k-1} v_x, 2^{k-1} v_y). Since the band is one pixel wide, all the pixels enclosed by two adjacent unit routing vectors should be merged into one processor before the algorithm starts. The projection for a given direction is computed in O(log n) time. Note that only the processors that have an intermediate result to be sent out are active in the computation. Three operations are performed at the kth step:

(1) Each active processor routes its previous result to the processor pointed to by the vector r(k).

(2) Add the routed value to the stored data.

(3) The even-numbered processors among the active ones in the same columns transfer their data to the processor pointed to by -0.5 r(k).

The storage requirement is constant, since the current data overwrite the previous result, and only the routing register and one extra register are needed. For a given angle, the projection takes O(log n) time. It becomes O(P log n) for P angles.
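For the axis-aligned case, the projection is just the row condensation applied to every row in parallel. The sketch below is illustrative only and reuses the recursive-doubling pattern of Algorithm 8; the angled projection with its auxiliary half-distance shifts is not modeled, and the small example image is an invented one. After log n routing-and-add steps the rightmost column holds the x-projection vector.

    import math

    def x_projection(image):
        n = len(image)                          # n x n image, n a power of two
        s = [row[:] for row in image]           # each PE's routing register
        for k in range(1, int(math.log2(n)) + 1):
            d = 2 ** (k - 1)                    # routing distance, doubled each step
            for row in s:                       # all rows condense in parallel on the MBC
                for i in range(2 ** k, n + 1, 2 ** k):
                    row[i - 1] += row[i - 1 - d]
        return [row[n - 1] for row in s]        # values held by the rightmost column

    image = [[1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
    assert x_projection(image) == [sum(r) for r in image]   # (3, 1, 4, 1)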
6.2.3 Finding a Row Containing a Median

A median row is to be found in an n x n binary image (0,1 image). A median row is a row such that about half of the 1's reside above it and half of the 1's remain below it. The problem is to find x such that

    min { x : \sum_{i=1}^{x} \sum_{j=1}^{n} D(i,j) ≥ \sum_{i=x+1}^{n} \sum_{j=1}^{n} D(i,j) }     (6.2.3)

If there exist log n storage spaces in each processing element, the problem can be solved in O(log n) time. The algorithm is described in Algorithm 10. Using the binary-tree operation, the total sum of 1's (S_t) in the picture is computed, storing all intermediate sums in the associated processors, in 2 log n steps. The half value S_h = (1/2) S_t is then computed. Each row generates one sum of the D's in the row (row sum); the n row sums are stored in order into the n leaf nodes of the binary tree (see Fig. 6.6). Now search backward (backtracking) from the root. Each node keeps the count of 1's in all leaf nodes below it, i.e., the root of a subtree has the sum of all leaf values of the subtree (e.g., 17 = 2 + 3 + 5 + 7 in the left child of the root in Fig. 6.6). For a given node, a variable Accum is defined as the sum of 1's from the leftmost leaf of the overall tree to the rightmost leaf node of the subtree whose root is the left child of the given node (for example, Accum of N(11/q8) is 21). We search until Accum becomes S_h or until we reach a leaf node. Since the height of the search tree is log n, O(log n) time is sufficient to find the median row. The algorithm below gives the detailed procedure, where S_l denotes the value stored in the left child of the current node.

Algorithm 10: Median-Row Finding

function Search-down (Node, Accum, S_h)
begin
  if (Node is a leaf node) then
    return Node
  else if (Accum = S_h) then
    return the leftmost leaf of the subtree whose root is Node
  else if ((Accum + S_l) > S_h) then
    Search-down (Left-Child, Accum, S_h)
  else begin
    Accum = Accum + S_l
    Search-down (Right-Child, Accum, S_h)
  end
end

(* Main Program *)
Accum = 0
Find S_t and S_h.
Search-down (Root, Accum, S_h)

Figure 6.6 illustrates the median row searching procedure for an 8 x 8 array. The processor q_i corresponds to the ith PE in the rightmost column. A leaf node with label q_i contains the sum of row i in the n x n mesh. The function Search-down is recursively called until the median row index node is found. The broken arrows indicate the sequence of the search.
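Algorithm 10 can be traced with a compact sequential model of the search tree. In the Python sketch below, which is my own rendering and not thesis code, the tree is stored as an implicit array whose leaves are the row sums and whose internal nodes hold the subtree sums left behind by the binary-tree operation. The left half of the example row sums matches the values quoted for Fig. 6.6 (2, 3, 5, 7); the right half is invented but chosen to be consistent with the quoted totals, and the descent follows the same path as the trace given with the figure.

    def median_row(row_sums):
        n = len(row_sums)                      # n must be a power of two
        tree = [0] * (n - 1) + list(row_sums)  # node i has children 2i+1 and 2i+2
        for i in range(n - 2, -1, -1):
            tree[i] = tree[2 * i + 1] + tree[2 * i + 2]   # subtree sums (S_t at the root)
        s_h = tree[0] / 2                      # S_h = S_t / 2
        node, accum = 0, 0                     # backtracking search starts at the root
        while node < n - 1:                    # descend until a leaf is reached
            if accum == s_h:                   # exactly half the 1's lie left of this subtree
                while node < n - 1:
                    node = 2 * node + 1        # leftmost leaf of the subtree
                break
            left = 2 * node + 1
            if accum + tree[left] > s_h:
                node = left                    # Search-down(Left-Child, Accum, S_h)
            else:
                accum += tree[left]            # Accum = Accum + S_l
                node = 2 * node + 2            # Search-down(Right-Child, Accum, S_h)
        return node - (n - 1)                  # 0-based leaf index of the median row

    # S_t = 28, S_h = 14; the search returns index 3, i.e. the fourth row, as in Fig. 6.6.
    assert median_row([2, 3, 5, 7, 2, 2, 3, 4]) == 3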
Figure 6.6: Median row finding (8 x 8 PEs). The label q_i indicates the processor in which the node resides. Trace of execution: Search-down(N(28/q8), 0, 14); Search-down(N(17/q4), 0, 14); Search-down(N(12/q4), 5, 14); Search-down(N(7/q4), 10, 14); return N(7/q4). Here N(17/q4) denotes the node with value 17 and processor label q4 (the left child of the root).

6.2.4 Difficult Mapping Problems on MBC

The new architecture cannot enhance the performance of all problems. Hard problems are those which need information to flow among most of the data in the array. The conventional mesh operation is used for those problems. For a simple illustration, let us consider the permutation for a matrix transpose. Each entry in Fig. 6.7 contains a routing vector; the first component represents the routing distance in the horizontal direction (left to right for positive values), the second in the vertical direction (bottom to top). If we want to route a long distance d = 2^k (for example, 4) in a cycle, where k > 0, we first split each column into groups of d elements. Then, for all the groups in parallel, the value in the first element of each group travels a distance d. Since the remaining d - 1 processors (here, 3 PEs) in the same group were disabled during the data transmission, they have to travel the same distance later. Therefore the procedure must be repeated d - 1 (here, 3) times. But the same result can be obtained by simply routing all data d times in parallel with a routing distance of 1, without using the bypass connections. All processors need to communicate with the others in this example. The intensive communication requirements prohibit establishing the disjoint connections between communication pairs that are needed to exploit the merit of the bypass. Problems with such communication requirements can hardly obtain a performance enhancement even with the bypass facility. The Discrete Fourier Transform, FFT, and 2-D sorting belong to this category.

Figure 6.7: Routing vectors for a matrix transpose (8 x 8 elements).

6.3 Conclusions

MBC supports both local operations and global operations due to its neighborhood interconnections and bypass facilities. However, only the problems that are hard on ordinary mesh-connected SIMD computers have been considered here, because of their global data routing requirements. The sources of the performance improvement of MBC can be divided into three groups: the bypass connection (binary-tree operation), the emulation of multiple broadcasting, and the original mesh connection. Table 6.1 summarizes the performance of the mesh-connected computer with and without additional interconnections. In min/max, histogram, and projection, the O(log n) time performance results from the fast summing scheme of row condensation and column condensation. Matrix multiplication exploits multiple broadcasts to efficiently convey all necessary vector elements to each processor. Problems like sorting and FFT cannot obtain a noticeable gain from the bypass connections, due to their intensive data routing requirements over the whole data set.

To get some quantitative idea, a summing operation is chosen to compare the relative time performance (in operation steps) for various problem sizes in Fig. 6.8. The exact time requirements for the computations are 2n, 5n^{1/3}, and 2 log n for the mesh with a broadcast, the mesh with multiple broadcasting, and the mesh with bypass, respectively [KR86]. Although the operational steps may differ from each other in a strict sense, they can be used as a rough estimate of the execution time. The table shows that the mesh-with-bypass outperforms the mesh-with-multiple-broadcast in the actual computation.

We have proposed a mesh with bypass connections and demonstrated its efficiency in many algorithms. The mesh with bypass can perform multiple broadcast as well as the ordinary mesh operations. MBC provides fast communications and more parallelism than the other modified architectures of mesh array processors. The architecture reduces the network diameter to O(1) for the n x n processor array.
The major performance improvement is up to O(log n), from O(n) with a conventional mesh. Similar performance can be achieved using pyramid [UML81,Tan83] and hypercube connections [Sei85]. But the MBC is simple and efficient with respect to the total number of links and the routing control complexity.

Table 6.1: Performance Comparisons of MBC with n x n Processors for Selected Operations

    Algorithm               Mesh without        Mesh with a         Mesh with Multiple    Mesh with
                            Broadcast [Sto83]   Broadcast [Sto83]   Broadcast [KR85]      Bypass
    Sum, Min., Max.         O(n)                O(n^{2/3})          O(n^{1/3})            O(log n)
    Matrix Multiplication   O(n^2)              O(n^2)              O(n)                  O(n)
    Histogram               O(n)                O(n^{2/3})          O(n^{1/3})            O(log n)
    Projection              O(n)                O(n)                O(n)                  O(log n)
    Median Row Finding      O(n)                O(n^{2/3})          O(n^{1/3})            O(log n)
    Sorting                 O(n)                O(n)                O(n)                  O(n)

Figure 6.8: Relative time requirements for computing a sum on the mesh with broadcast, mesh with multiple broadcasting, and mesh with bypass (computation time in steps versus array size).

Chapter 7

Conclusions

7.1 Summary of Research Contributions

The thesis contains the architectural design of orthogonal multiprocessors, operation/control principles, application algorithm development for image processing and pattern recognition, and extensions/variations of the architecture. The major contributions of this thesis work are summarized below:

• The architectural characteristics of the OMP are analyzed in terms of hardware complexity, interconnection control, simulation capability, and effective memory bandwidth. The operation principle is based on the orthogonal access rule, which reduces the control complexity from both a hardware and a software point of view. The performance enhancement comes from the complete connection between processors, parallel memory access, parallel communication via shared memory, simplified memory access arbitration control, high interconnection bandwidth, and so on. The regularity of the architecture and easy program partitioning are very powerful for covering a large class of scientific algorithms. Matching applications include matrix computation, signal/image processing, sorting, and PDE problems. A few shortcomings of the OMP are also identified. The orthogonal memory access principle prohibits memory accesses with mixed modes. It prolongs the communication between processors and reduces the flexibility of the processors in general computations. Another disadvantage is that the number of memory modules required is larger than the number of processors; in the worst case, it is the square of the number of processors. (Chapter 2)

• The orthogonal multiprocessor is extended to higher dimensions so as to reduce the memory requirement and to achieve massively parallel computation. The architecture is based on hypercube topology for the memory organization, with bus interconnections. The orthogonal access rule is observed to allow simultaneous memory accesses without contention. Distinct characteristics and hardware requirements are identified. The communication performance is assessed and routing algorithms are developed. The OMP architecture offers a viable alternative to the conventional SIMD array processors which have distributed memories. (Chapter 5)

• Parallel algorithms for image processing and pattern recognition on the OMP are developed.
Many problems in these areas are found to match the OMP while retaining the orthogonal processing property. Due to proper partitioning and the nearly independent execution of each task, most of the algorithms developed achieve linear speedups over the uniprocessor performance. The OMP can play an important role in both fine-grain image-processing and coarse-grain pattern analysis applications, with great flexibility in programming and parallel execution. (Chapter 4)

• The mesh-with-bypass architecture is devised to improve the communication speed in mesh-connected array processors. The architecture supports both local operations and global computations, which is often required in applications like scientific computation and signal/image processing.

• A prototype orthogonal multiprocessor is functionally designed for 16 processors and 256 memory modules, using printed circuit boards and multiplexed bus connections. A modular design is pursued for possible integration with VLSI or other advanced technologies. (Chapter 3)

7.2 Suggestions for Further Research

Numerous problems arose during the research. Some of them are addressed here as subjects for future research.

• General applications: The performance enhancement of the OMP is due to the match of algorithms with the orthogonal processing strategy. If a problem mismatches the operational principle, a severe performance degradation is expected. This issue should be analyzed mathematically using statistical models or experimentally using simulation. We also have to set a policy for modifying the orthogonal access rule adaptively to problems. We may even remove the orthogonal access scheme to allow arbitrarily mixed accesses; this scheme was discouraged due to the high cost of implementation.

• Bandwidth simulation study: The performance of the OMP is analyzed in terms of effective memory bandwidth. We employ a few statistical parameters characterizing the memory accesses. However, the analysis could be more complete if the memory bandwidth were measured using simulation. The simulation would reflect the policy change for unsuccessful requests, the actual temporal dependency in the access requests, and the sensitivity of the bandwidth to the access probabilities.

• System software issues: Software implementation is hardly dealt with during the design of the OMP. The resource partitioning on the high-dimensional OMP should be done in some predefined (or automatic) manner to tackle large-scale problems in the case of massively parallel computation. This issue is closely related to the design of an optimizing compiler for automatic parallelism detection and restructuring of programs, for utilizing resources and achieving the maximum performance. The OMP also needs an operating system for host-supported or stand-alone operation.

• Performance of the multidimensional OMPs: The performance of the multidimensional OMPs is assessed only with respect to the network diameter and average internode distance. The effective memory bandwidth analysis can be performed using a statistical model. Besides, we need to explore suitable applications for the massively parallel system. The work includes showing the effectiveness of mapping algorithms and analyzing their performance.

Bibliography

[AAG*86] M. Annaratone, E. Arnould, T. Gross, H.T. Kung, M.S. Lam, O. Menzilcioglu, K. Sarocky, and J.A. Webb, "Warp Architecture and Implementation," Proc.
of the 18th I n t’ l Symp. on Computer A r chitecture, pp. 346-356, 1986. D.P. Agrawal, V.K. Janakiram , and G.C. Pathak, “Evaluating the Performance of M ulticom puter Configurations,” IE E E Computer, vol. 19(5), pp. 23-37, May 1986. M .J. Atallah and S.R. Kosaraju, “Graph Problems on a Mesh- Connected Processor Array,” JACM , vol. 31(3), pp. 649-667, 1984. L.N. Bhuyan and D.P. Agrawal, “Generalized Hypercube and Hy perbus Structures for a Com puter Network,” IE E E Trans. Comput ers, vol. C-33(4), pp. 323-333, April 1984. N. Benwell, editor, Benchmarking Computer Evaluation and Mea surement, John Wiley and Sons, 1975. L. Bhuyan, “An Analysis of Processor-Memory Interconnection Networks,” IE E E Trans, on Computers, pp. 279-283, M arch 1984. Shahid H. Bokhari, “On the M apping Problem ,” IE E E Trans. Computers, vol. C-30(3), pp. 207-214, M ar. 1981. P.L. Borril, “M icroStandards Special Feature: A Comparison of 32-bit Buses,” IE E E Micro, pp. 71-79, Dec. 1985. R.E. Buehrer et al., “The ETH-M ultiprocessor EM PRESS: A Dy namically Configurab le MIMD System,” IE E E Trans. Computers, pp. 1035-1044, Nov. 1982. I ■ [Can.86] ; [CGG*85] ! [Cho88] i [Chu88] i [Dav87] [DH72] [DH73] [DJ80] i ) [DNS81] | [Don87] 152 V. Cantoni, “I.P. Hierarchical Systems: A rchitectural Features,” In Cantoni and Levialdi, editors, Pyramidal Systems for Computer Vision, Spring-Verlag, Berlin, pp. 21-39, 1986. W. Crowther, J. Goodhue, R. Gurwitz, R. Rettberg, and R. Thom as, “The Butterfly Parallel Processor,” Newsletter, Com puter Architecture Technical Committee, pp. 18-45, Sept./D ec. 1985. R.M. Chowkwanyun, “Dynamic Load Balancing for Concurrent Lisp Execution on a M ulticom puter System,” Ph.D . Dissertation, Dept, of Electrical Engineering, Univ. of Southern California, May 1988. C.Y. Chu, “Comparison of Two-Dimensional F F T M ethods on the Hypercube,” Proe. the Third Conf. on Hypercube Concurrent Com puters and Algorithms, Jan. 1988. D.B. Davis, “Parallel Computers Diverge,” High Technology, pp. 16-22, Feb. 1987. R.O. Duda and P.E. H art, “Use of the Hough Transform ation to Detect Lines and Curves in Pictures,” Comm. of the A C M 15, pp. 11-15, 1972. R.O. D uda and P.E. H art, Pattern Classification and Scene Analy sis, John Wiley & Sons, 1973. R.C. Dubes and A.K. Jain, “Clustering Methodologies in Ex ploratory D ata Analysis,” In M. Yovits, editor, Advances in Com puters, pp. 113-228., Academic Press, 1980. E. Dekel, D. Nassimi, and S. Sahni, “Parallel M atrix and G raph Algorithms,” S IA M Journal on Computing, 10, pp. 657-675, 1981. J.J. Dongarra, “Performance of Various Com puters Using Standard Linear Equations Software in a Fortran Environm ent,” In Multipro cessors and Array Processors, Simulation Councils Inc., pp. 15-33, Jan. 1987. [EA85] [Fis86] f [Fou85] [GH87] [Gho88] - [GLSW87] [GW85] [GWH87] t I ; [Har79] I [HD88] 153 K.A. El-Ayat and R.K. Agarwal, “The Intel 80386-Architecture and Im plem entation,” IE E E Micro, pp. 4-22, Dec. 1985. A.L. Fisher, “Scan Line Array Processors for Image Com putation,” 18th Annual In t’ l Symposium on Computer Architecture, pp. 338- 345, 1986. T .J. Fountain, “Plans for the CLIP7 Chip,” In Integrated Technol ogy for Parallel Image Processing, S.L. Levialdi, editor, pp. 199-214, 1985. C. G uerra and S. Hambrusch, “Parallel Algorithms for Line De tection on a Mesh,” Proc. Workshop on Comp. Arch, for Pattern Analysis and Machine Intelligence, pp. 99-106, Oct. 1987. Joydeep Ghosh, “Communication-Efficient Architecture for Mas sively Parallel Processing,” Ph.D . 
Dissertation, Dept, of Electrical Engineering, Univ. of Southern California, May 1988. R. Goldenberg, W.C. Lau, A. She, and A.M. W axman, “Progress on the Prototype P IP E ,” Proc. IE E E Workshop on Computer A r chitecture for Pattern Analysis and Machine Intelligence, pp. 67-74, Oct. 1987. D.B. Gennery and B. Wilcox, “A Pipelined Processor for Low-level Vision,” Proc. of Conf. on Computer Vision and Pattern Recogni tion, pp. 608-613, 1985. J. Gu, W. Wang, and T.C. Henderson, “A Parallel Architecture for Discrete Relaxation Algorithm,” IE EE Trans, on Pattern Analysis and Machine Intelligence, vol. PAMI-9(6), pp. 816-831, Nov. 1987. J.A. H art, “A if-M eans Clustering Algorithm,” Applied Statistics, , vol. 28, pp. 100-108, 1979. K. Hwang and D. DeGroot, editors, Parallel Processing for Super computing and Artificial Intelligence, McGraw-Hill, N.Y., (in press) 1988. [Hil85] W.D. Hillis, The Connection Machine, The M IT Press, 1985. I [HK87] i : [HR78] i | ; [HSJP87] i I 1 [HT85a] [HT85b] f \ [HTK89] [Hun81] I I I i [Hwa83] I [Hwa87] [HX88] 1 I | [JM87] K. Hwang and D. Kim, “Parallel P attern Clustering on a M ultipro cessor w ith Orthogonally Shared Memory,” In t’ l Conf. on Parallel Processing, pp. 913-916, Aug. 1987. A.R. Hanson and E.M. Riseman, “VISIONS: A Com puter System for Interpreting Scenes,” In Hanson and Riseman, editors, Com puter Vision Systems, pp. 303-333, 1978. E.B. Hinkle, J.L. Sanz, A.K. Jain, and D. Petkovic, “P 3E : New Life for Projection-Based Image Processing,” J. of Parallel and Distributed Computing, vol. 4(1), 1987. K. Hwang and P. S. Tseng, “An Efficient VLSI M ultiprocessor for Signal/Im age Processing,” Proc. I n t’ l Conf. Computer Design, pp. 172-176, Oct. 1985. K. Hwang and P.S. Tseng, “A VLSI-Based M ultiprocessor Archi tecture for Real-Time Image Processing,” Proc. 1985 IE E E Comp. Soci. Workshop on Computer Architecture for Pattern Analysis and Image Database Management, pp. 9-17, Nov. 1985. K. Hwang, P.-S. Tseng, and D. Kim, “An Orthogonal Multiproces sor for Lager-Grain Scientific Com putations,” IE E E Trans. Com puters, accepted to appear Jan. 1989. D .J. H unt, “The ICL DAP and Its Application to Image Process ing,” In Languages and Architectures for Image Processing, Duff and Levialdi, editors, pp. 275-282, 1981. K. Hwang, “Com puter Architectures for Image Processing,” IE E E j Computer, vol. 16, pp. 10-12, Jan. 1983. i K. Hwang, “Advanced Parallel Processing w ith Supercom puter Ar chitectures,” Proceedings of the IEEE, pp. 1348-1379, Oct. 1987. K. Hwang and Z. Xu, “M ultipipeline Networking for Compound Vector Processing,” IE E E Trans. Computers, pp. 33-47, Jan. 1988. ; A.K. Jain and J.V. M oreau, “Bootstrap Technique in Cluster Anal- ; ysis,” Pattern Recognition, vol. 20(5), pp. 547-568, 1987. , 155 I : [Joh87] S.L. Johnsson, “Communication Efficient Basic Linear Algebra Com putations on Hypercube Architectures,” J. of Parallel and Dis tributed Computing, vol. 4(2), pp. 133-172, 1987. [KKP87] I. Koren, Z. Koren, and D.K. Pradhan, “Wafer-Scale Integration of M ulti-Processor Systems,” Proc. 20th Hawaii I n t’ l Conf. on System Sciences, pp. 13-20, 1987. [KR85] V.K.P. K um ar and C.S. Raghavendra, “Array Processor w ith Mul tiple Broadcasting,” Proc. of the 12th Annual In t’ l Symposium on Computer Architecture, pp. 2-10, June 1985. [KR86] V.K.P. Kum ar and D. Reisis, “Pyram ids versus Enhanced Arrays for Parallel Image Processing” , Technical Report CRI-86-16, Dept, of Electrical Engineering-Systems, Univ. of Southern California, 1986. 
I [KS73] P.M. Kogge and H.S. Stone, “A Parallel Algorithm for Efficient Solution of a General Class of Recurrence Equations,” IE E E Trans. Computers, vol. C-22(8), pp. 786-793, Aug. 1973. [KS86] M. Kidode and Y. Shiraogawa, “High-speed Image Processor,” In L. Uhr, K. Prestone, S. Levialdi, and M .J.B. Duff, editors, Evaluation of Multicomputers for Image Processing, pp. 319-335, 1986. [Kun84] H.T. Kung, “Systolic Algorithms for the CMU Warp Processor,” In t’ l Conf. on Pattern Recognition, 570-577, 1984. [KWR82] T. Kushner, A.Y. Wu, and A. Rosenfeld, “Image Processing on M PP: 1,” Pattern Recognition, vol. 15(3), pp. 121-130, 1982. . [Lev87] S. Levialdi, “Issues on Parallel Algorithms for Image Processing,” j In L.H. Jamieson et al., editor, The Characteristics of Parallel Al gorithms, pp. 191-207, 1987. I [LM87] H. Li and M. Maresca, “Polymorphic-torus Architecture for Com- J puter Vision,” Proc. 1987 Workshop on Comp. Arch, for Pattern ; Analysis and Machine Intelligence, 1987. [LR82] [Mat85] [MC85] [MHW87] [MKR*88] [MMM84] [Mor86] [MS85] [NJ83] [NS79] [NS80] T. Lang and S.M. Reddy, “Bandwidth of Crossbar and Multiple- Bus Connections for M ultiprocessors,” IE E E Trans, on Computer, C-31(12):1227-1234, Dec 1982. R. Mateosian, “National’s NS32332 CPU: A Graceful Extension of the Series 32000™ ,” Proc. WESCON, Session 1, 1985. M.M. McCabe and P.V. Collins, “Image Processing Algorithms,” In R .J. Offen, editor, VLSI Image Processing, McGraw-Hill, 1985. T.N. Mudge, J.P. Hayes, and D.C. Winsor, “M ultiple Bus Archi tectures,” IE E E Computer, vol. 20(6), pp. 42-48, June 1987. R. Miller, V.K. Prasanna Kum ar, D. Reisis, and Q.F. Stout, “Meshes with Reconfigurable Buses,” Proc. of the M IT Conf. on Advanced Research in VLSI, 1988. D. MacGregor, D. Mothersole, and B. Moyer, “The M otorola MC68020,” IE E E Micro, pp. 101-118, Aug. 1984. S.G. M orton, “A Fault Tolerant, Bit-Parallel, Cellular Array Pro cessor,” Proc. Fall Joint Comp. Conf., pp. 277-286, 1986. R. Miller and Q. Stout, “Geometric Algorithms for Digitized Pic tures on a Mesh-Connected Com puter,” IE E E Trans, on Pattern Analysis and Machine Intelligence, vol. PAMI-7(2), pp. 216-228, M arch 1985. L.M. Ni and A.K. Jain, “A VLSI Systolic Architecture for P attern Clustering,” Proc. IE E E Comp. Soci. Workshop, on Computer A r chitecture for Pattern Analysis and Image Database Management, pp. 110-117, November 1983. D. Nassimi and S. Sahni, “Bitonic Sort on a Mesh-Connected Par allel Com puter,” IE E E Trans. Computers, vol. C-27(l), pp. 2-7, 1979. D. Nassimi and S. Sahni, “An Optim al Algorithms for Mesh- Connected Parallel Com puters,” J. of the Association for Com puting Machinery, vol. 27(1), pp. 6-29, 1980. ! [Nus82] . [Orc75] ! [OV85] I [Pav82] [Pea68] [PN85] [Pot83] [Pot85] ’ [Pra78] [Qui87] [Ree84] s I j [RHZ76] 157 H.J. Nussbaumer, Fast Fourier Transform and Convolution Algo rithms, Spring-Verlag, Berlin, 1982. S.E. O rcutt, “Implementation of Perm utation Function in Illiac IV Type Com puters,” IE E E Trans. Computers, vol. C-25(9), pp. 929- 935, 1975. J. M. Ortega and R. G. Voigt, “Solution of Partial Differential Equations on Vector and Parallel Com puters,” S IA M Review, pp. 149-240, June 1985. Theo Pavlidis, Algorithms for Graphics and Image Processing, Com puter Science Press, 1982. M.C. Pease, “An A daptation of the Fast Fourier Transform for Parallel Processing,” J. ACM , vol. 15(2), pp. 252-264, April 1968. G.F. Pfister and V.A. 
Norton, “Hot Spot Contention and Com bining in M ultistage Interconnection Networks,” Proc. I n t’ l Conf. Parallel Processing, pp. 790-797, 1985. J.L. Potter, “Image Processing on the Massively Parallel Proces sor,” IE E E Computer, pp. 62-67, Jan. 1983. J.L. Potter, The Massively Parallel Processor, M IT Press, Cam bridge, Mass., 1985. W.K. P ra tt, Digital Image Processing, John Wiley & Sons, 1978. M. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, New York, 1987. > A. P. Reeves, “Parallel Pascal: An Extended Pascal for Parallel Com puters,” J. of Parallel and Distributed Computing, vol. 1(1), pp. 64-80, 1984. A. Rosenfeld, R.A. Hummel, and S.W. Zucker, “Scene Labeling by Relaxation Operations,” IE E E Trans, on System s, Man, and Cybernetics, SMC-6(6):420-433, June 1976. 1 [RK76] I i < [RK85] i I I [Ros83] I i [Sei85] ; [Sei88] [SF88] i [SH87] [SJRV87] , [SM87] I ) [ST86] [Sta86] A. Rosenfeld and A. Kak, Digital Picture Processing, Academic Press, 1976. i D. Reisis and V.K. Prasanna K um ar, “Parallel Processing of the La beling Problem ,” Proc. IE E E Comp. Soc. Workshop on Computer ■ Architecture for Pattern Analysis and Image Database Management, I pp. 381-385, Nov. 1985. A. Rosenfeld, “Parallel Image Processing Using Cellular Arrays,” IE E E Computer, vol. 16(1), pp. 14-20, Jan. 1983. C. L. Seitz, “The Cosmic Cube,” Comm, of the ACM , vol. 28(1), pp. 22-33, Jan 1985. C. Seitz et al, “The Architecture and Program ming of the AME- TEK 2010,” The Third Conf. on Hypercube Concurrent Computers ' and Applications, Jan. 1988. Y. Shih and J. Fier, “Hypercube Systems and Key Applications,” In K. Hwang and D. DeGroot, editors, Parallel Processing for Su percomputing and Artificial Intelligence, 1988. J.L.C. Sanz and E.B. Hinkle, “Computing Projections of Digital Images in Image Processing Pipeline Architectures,” IE E E Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-35(2), pp. 198- 207, February 1987. A.A. Sawchuk, B.K. Jenkins, C.S. Raghavendra, and A. Varma, “Optical Crossbar Networks,” IE E E Computer, vol. 20(6), pp. 50- 60, June 1987. ! I.D. Scherson and Y. Ma, “Vector Com putations on an Orthogonal Memory Access Multiprocessing System,” Proc. 8th Symposium on Computer Arithmetic, pp. 28-37, May 1987. G. Saucier and J. Trilhe, editors, Wafer Scale Integration, North- Holland, Amsterdam , 1986. W. Stallings, editor, Reduced Instruction Computers, IEEE Com- i puter Society Press, 1986. 159 [Ste81] [Sto82] [Sto83] [SWH85] [Tan83] [Tan84] [THK85] [TK77] [UML81] [Wal87] S.R. Sternberg, “Parallel Architectures for Image Processing,” In Real-TimeParallel Computers: Image Processing, Onoe, Preston, and Rosenfeld, editors, Plenum Press, New York, 1981. Q.F. Stout, “Broadcasting in Mesh-Connected Com puters,” Proc. of 1982 Princeton Conference on Information Science and Systems, pp. 85-90, 1982. Q.F. Stout, “Mesh-Connected Com puter with Broadcasting,” IE E E Trans. Computers, vol. C-32(9), pp. 826-830, Sept. 1983. D.H. Schaefer, G.C. Wilcox, and V .J. Harris, “A Pyram id of M PP Processing Elements-Experiences and Plans,” Proc. Eigh teenth Hawaii I n ti Conf. on System Sciences, 1985. S.L. Tanimoto, “A Pyram idal Approach to Parallel Processing,” Proc. of 10th Annual I n ti Symp. on Computer Architecture, pp. 372-378, June 1983. S.L. Tanimoto, “A Hierarchical Cellular Logic for Pyram id Com puters,” J. of Parallel and Distributed Computing, vol. 1(2), pp. 105-132, Nov.-1984. P.S. Tseng, K. Hwang, and V.K. 
Prasanna Kumar, "A VLSI-Based Multiprocessor Architecture for Implementing Parallel Algorithms," Proc. Int'l Conf. Parallel Processing, pp. 657-665, August 1985.
[TK77] C.D. Thompson and H.T. Kung, "Sorting on a Mesh-Connected Parallel Computer," Comm. of ACM, vol. 20(4), pp. 263-271, 1977.
[UML81] L. Uhr, M. Thomson, and J. Lockey, "A 2-Layered SIMD/MIMD Parallel Pyramidal Array/Net," Proc. of Workshop on Computer Architecture for Pattern Analysis and Image Database Management, pp. 209-216, 1981.
[Wal87] David L. Waltz, "Applications of the Connection Machine," IEEE Computer, vol. 20(1), pp. 85-97, Jan. 1987.
[WH87] R.S. Wallace and M. Howard, "HBA Vision Architecture: Built and Benchmarked," Proc. IEEE Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pp. 209-216, Oct. 1987.
[Wit84] L.D. Wittie, "Communication Structures for Large Networks of Microcomputers," IEEE Trans. Computers, vol. C-30(12), pp. 273-284, April 1984.
[WKM*85] M. Weiser, S. Kogge, M. McElvany, R. Pierson, R. Post, and A. Thareja, "Status and Performance of ZMOB Parallel Processing System," Proc. COMPCON, pp. 71-73, 1985.
[WLHR87] C.C. Weems, S.P. Levitan, A.R. Hanson, and E.M. Riseman, "The Image Understanding Architecture," Proc. Image Understanding Workshop, pp. 483-496, Feb. 1987.