SPECIFYING PARALLEL PROCESSOR ARCHITECTURES FOR HIGH-LEVEL COMPUTER VISION ALGORITHMS

by

Craig Charles Reinhart

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

December 1991

Copyright 1991 Craig Charles Reinhart

UMI Number: DP22830

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI DP22830
Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Dissertation Publishing
Microform Edition © ProQuest LLC. All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Craig Charles Reinhart under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Graduate Studies

September 10, 1991

DISSERTATION COMMITTEE

Chairperson

Dedication

to LJR, with love.

She gave so much to be so little,
But angels always do.

Acknowledgments

Over the time period spanned by my Ph.D. studies, the list of people who have provided encouragement and/or support has grown well beyond my ability to recall. Therefore, to avoid overlooking anyone, I would like to say one blanketing "thank you" to everyone and provide specific mention of only those that were directly involved (or affected.)
To Drs. Medioni and Gaudiot for serving on my guidance committee and providing invaluable suggestions during my qualifying exam. To Dr. Rajamoney for taking the time to serve as the outside member of my dissertation committee. To Dr. Prasanna-Kumar for his technical guidance throughout the research phase of my studies. To Dr. Nevatia for his counsel, both technical and political, as well as for providing a comfortable environment in which to perform research, that of the Institute for Robotics and Intelligent Systems. To all of the members of IRIS for making my experience at USC a pleasant one. To Hughes Aircraft Company for their support via the Doctoral Fellowship Program. To Ms. Margaret Robson for starting me along this long and winding road all those years ago. To Dave Winder for his generous contribution in the midst of "crunch time". And finally, to my wife Sherri and our boys for their support, tolerance, and patience throughout the duration of my studies.

In response to the question asked all too often,

How pleasant it is, at the end of the day,
No follies to have to repent;
But reflect on the past, and be able to say,
That my time has been properly spent.

Jane Taylor

Contents

Dedication  ii
Acknowledgments  iii
List Of Tables  viii
List Of Figures  ix
Abstract  xii

1 Introduction  1
1.1 Overview  1
1.2 Contributions of the Research  3
1.3 Outline of the Dissertation  5

2 Computer Vision and Parallel Processing  6
2.1 Computer Vision  6
2.1.1 Introduction  6
2.1.2 Low-Level Vision  6
2.1.3 Mid-Level Vision  7
2.1.4 High-Level Vision  10
2.1.5 Vision Systems as Compared to Other Systems  11
2.2 Parallel Processor Architectures  13
2.2.1 Overview  13
2.2.2 Organizations  13
2.2.3 Architecture Characteristics  14
2.2.4 Other Architectures  17
2.3 Designing Parallel Processor Systems  18
2.3.1 Algorithm Speedup  18
2.3.2 Processor Efficiency  19
2.3.3 System Complexity  19
2.3.4 Programmer Burden  20
2.4 Summary

3 Previous Work
3.1 Overview
3.2 Algorithm/Software Development
3.2.1 Reisis and Prasanna-Kumar
3.2.2 Rice and Jamieson
3.2.3 Little et al.
3.2.4 DARPA Benchmarks
3.3 Architecture/Hardware Development
3.3.1 Weems et al.
3.3.2 Kuehn et al.
3.4 Theory/Tool Development
3.4.1 Stout
3.4.2 Levitan
3.5 Summary

4 The Methodology
4.1 Technical Problem
4.1.1 The Mapping Problem
4.1.1.1 Description
4.1.1.2 Computer Vision and the Mapping Problem
4.1.2 Software Development
4.1.3 Summary
4.2 An Algorithm Driven Approach
4.2.1 Overview
4.2.2 The Mapping Problem - Reformulated
4.2.3 The Methodology
4.2.4 Summary

5 Image Matching By Relaxation Labelling
5.1 Introduction
5.2 Algorithm Description
5.3 Coarse Grain Analysis
5.3.1 Overview
5.3.2 Control Structure Analysis
5.3.3 Data Structure Analysis
5.3.4 Communication Analysis
5.3.5 Architecture Specification
5.4 Performance Analysis
5.4.1 Complexity Analysis
5.4.2 Observed Performance
5.4.3 System Development and Maintenance  73
5.5 Fine Grain Analysis  74
5.5.1 Overview  74
5.5.2 Control Structure Analysis  74
5.5.3 Data Structure Analysis  78
5.5.4 Communication Analysis  79
5.5.5 Architecture Specification  79
5.6 A Heterogeneous Architecture  81
5.7 Summary  83

6 Object Recognition By Tree Searching  85
6.1 Introduction  85
6.2 Algorithm Description  88
6.3 Coarse Grain Analysis  95
6.3.1 Overview  95
6.3.2 Control Structure Analysis  95
6.3.3 Data Structure Analysis  98
6.3.4 Communication Analysis  98
6.3.5 Architecture Specification  99
6.4 Performance Analysis  102
6.4.1 Complexity Analysis  102
6.4.2 Observed Performance  104
6.4.3 System Development and Maintenance  107
6.5 Fine Grain Analysis  109
6.5.1 Overview  109
6.5.2 Control Structure Analysis  109
6.5.3 Data Structure Analysis  110
6.5.4 Communication Analysis  111
6.5.5 Architecture Specification  111
6.6 A Heterogeneous Architecture  113
6.7 Summary  115

7 Linear Feature Extraction  118
7.1 Introduction  118
7.2 Process Description  120
7.3 Coarse Grain Analysis  123
7.3.1 Overview  123
7.3.2 Control Structure Analysis  123
7.3.3 Data Structure Analysis  125
7.3.4 Communication Analysis  125
7.3.5 Architecture Specification  126
7.4 Performance Analysis  129
7.4.1 Complexity Analysis  129
7.4.2 Observed Performance  131
7.4.3 System Development and Maintenance  136
7.5 Summary  138

8 Perceptual Organization System  139
8.1 Introduction  139
8.2 System Description  141
8.2.1 Overview  141
8.2.2 System Components  142
8.2.3 Summary  144
8.3 Methodology Application  144
8.3.1 Overview  144
8.3.2 Control Structure Analysis  144
8.3.3 Data Structure Analysis  148
8.3.4 Communication Analysis  149
8.3.5 Architecture Specification  150
8.4 Putting It All Together  157
8.4.1 Linear Feature Extraction  157
8.4.2 Linear Feature Extraction to Line Collation Formation  157
8.4.3 Line Collation Formation to Parallel Collation Formation  159
8.4.4 Parallel Collation Formation to U Collation Formation  159
8.4.5 U Collation Formation to Rectangle Collation Formation  160
8.4.6 Rectangle Collation Formation to Network Formation  161
8.4.7 Network Formation to Constraint Satisfaction Network  162
8.5 The Heterogeneous Architecture  162
8.5.1 Algorithm Speedup and Processor Efficiency  163
8.5.1.1 The Individual Algorithms  164
8.5.1.2 The Heterogeneous Algorithm Suite  166
8.5.2 System Complexity and Programmer Burden  166
8.6 Summary  167

9 Conclusions and Future Research  169

Reference List  173

List Of Tables

6.1 Timing statistics from static partitioning of the tree search  106
6.2 Timing statistics from dynamic partitioning of the tree search  108
7.1 Time complexities of the steps that constitute the linear feature extraction process  124
8.1 Time complexities of the processes that constitute the perceptual organization system  145

List Of Figures

2.1 Low-level vision processing  8
2.2 Mid-level vision processing  9
2.3 High-level vision processing  10
3.1 The Image Understanding Architecture  31
4.1 The classical mapping problem  41
4.2 The reformulated mapping problem  48
4.3 Our mapping methodology  50
4.4 Methodology steps  51
5.1 Window construction  57
5.2 Image matching algorithm flow  58
5.3 Image matching control loops  59
5.4 Client/Server partitioning  60
5.5 Image matching data structures  61
5.6 Image matching data partitions  62
5.7 Logarithmic diameter architectures  66
5.8 Performance of hypercube implementation - test 1  71
5.9 Performance of hypercube implementation - test 2  71
5.10 Performance of hypercube implementation - test 3  72
5.11 Serial code for the image matching algorithm  75
5.12 Client code for the parallel image matching algorithm  76
5.13 Server code for the parallel image matching algorithm  77
5.14 A heterogeneous processing element  82
6.1 Screener block diagram  90
6.2 Graph matcher block diagram  91
6.3 Analyzer block diagram  91
6.4 Pipelined architecture for the tree search algorithm  102
6.5 Performance of static hypercube implementation  105
6.6 Performance of dynamic hypercube implementation  107
6.7 A heterogeneous processing element  115
7.1 Linear feature extraction algorithm/data flow  121
7.2 Airfield image, contours, and linear features  122
7.3 Numeric and symbolic data representations  126
7.4 Linear feature extraction communication topologies  127
7.5 Linear feature extraction architecture  129
7.6 Estimated performance of the linear feature extraction process  132
7.7 Data partitioning for the airfield image  133
7.8 Freeway image, contours, and linear features  134
7.9 Data partitioning for the freeway image  135
7.10 Performance of simulation - airfield  136
7.11 Performance of simulation - freeway  137
8.1 Collation detection communication topologies  151
8.2 Constraint satisfaction network processing elements  152
8.3 Fully connected topology  153
8.4 Two alternative communication topologies  153
8.5 Systolic ring architecture  155
8.6 Architecture for the linear feature extraction algorithm  157
8.7 LSE to LCF architecture interface  158
8.8 LCF to PCF architecture interface  159
8.9 PCF to UCF architecture interface  160
8.10 UCF to RCF architecture interface  161
8.11 RCF to NF architecture interface  162
8.12 NF to CSN architecture interface  163

Abstract

Computer vision systems incorporate a wide variety of algorithmic techniques ranging in complexity from simple repetitive processing to elaborate rule-based control structures. Serious computational bottlenecks occur at each level of processing, though perhaps for different reasons. At low levels, the processing steps are simple, numerical, and local, but the amount of data to be processed is huge. At higher levels, the quantity of data is reduced (though still large), but the operations to be performed are more complex, symbolic, and not entirely local. Also, the execution characteristics of computer vision algorithms are less predictable as we move up the hierarchy from low to high-level vision. The deterministic processing that exists in low-level vision tasks turns into data dependent, non-deterministic processing as we move into high-level tasks.
In the past, much of the research in parallel processing for computer vision has focussed on the implementation of low-level computer vision algorithms. This may have been due to the simplicity of this task as well as the suitability of the available hardware, which has consisted mostly of a mesh of SIMD processors. In our research, we have focussed on parallel processing for mid and high-level algorithms as well as suites of such algorithms. There is relatively little work on this topic in the field. Such algorithms are usually symbolic in nature and the processing is not entirely local. Furthermore, vision problems require suites (sequences) of algorithms to achieve their objectives. Such sequences often require shuffling of data when transitioning from one algorithm to the next. Again, little work has been done on this topic. We believe that when dealing with such complex algorithms (or suites of algorithms), the parallel implementation must be concerned with the following four characteristics:

• Algorithm speedup - defined as the ratio of elapsed time when executing a program on a single processor to the elapsed time when N processors are available.

• Processor efficiency - defined as the average utilization of the available processing elements.

• System complexity - defined as the amount of algorithm restructuring required to delineate the inherent parallelism of the algorithm.

• Programmer burden - defined as the degree of difficulty in developing and maintaining the parallel implementation.

Most research in parallel implementation of computer vision algorithms has been concerned solely with the first two characteristics. This may have been acceptable for low-level algorithms, due to their regularity and simplicity. However, this is not sufficient for high-level algorithms or implementation of algorithm suites in light of their complex and evolutionary nature.
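The first two characteristics are simple ratios over measured run times. As a minimal illustration of the definitions above (the timing values here are hypothetical, not measurements from this work):

```python
def speedup(t_serial, t_parallel):
    """Algorithm speedup: elapsed time on a single processor divided by
    elapsed time when N processors are available."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_processors):
    """Processor efficiency: average utilization of the available
    processing elements, i.e. speedup divided by the processor count."""
    return speedup(t_serial, t_parallel) / n_processors

# Hypothetical example: a task taking 120 s serially runs in 20 s on 8 processors.
s = speedup(120.0, 20.0)        # 6.0
e = efficiency(120.0, 20.0, 8)  # 0.75, i.e. 75% average utilization
```

By these definitions efficiency is just speedup divided by N, so an ideal linear speedup corresponds to an efficiency of 1.0.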
Classically, one specifies a parallel processor architecture and then "maps" an algorithm onto it. Such mappings may dictate complex structural alterations to the algorithms, and the implementations are difficult to develop, maintain, and modify. Also, they do not address the issue of data shuffling between algorithms. Our approach to the development of parallel implementations is algorithm driven. Through analysis of the algorithm and algorithm transitions we determine a parallel processor architecture that is well suited to the inherent parallelism of the algorithm. As a result, our implementations possess all of the characteristics described above.

In this dissertation we present the characteristics of mid and high-level computer vision algorithms and suites of such algorithms that make them a challenge to parallelize. We present our approach to parallelizing such algorithms. And, we present parallel implementations of some characteristic computer vision algorithms as derived via our approach.

Chapter 1
Introduction

1.1 Overview

Of the senses utilized by humans, vision is by far the most powerful. It provides a remarkable amount of information about our environment while requiring no direct physical contact. The desire to understand the source of this power has caused the field of computer, or machine, vision to flourish. Through intense research, computer vision has made its way into various application areas. Among those areas are industrial inspection, vehicle navigation, medical diagnosis, surveillance, and aerial mapping, to name just a few. Although research in the field of computer vision is abundant, understanding of how the human visual system works remains elusive. Researchers in the fields of psychology, biology, and engineering have provided various theories with regard to the mechanics of the system. One common theme among all theories is that it is complex.
Therefore, one can justifiably conclude that any computer based system designed to perform visual tasks will be necessarily complex. Such has been shown to be the case time and time again [34] [8] [39]. Various subtasks necessary for achieving results similar to those of the human visual system have been identified. They include, but are certainly not limited to, edge detection, motion analysis, stereo image analysis, shape description, object recognition, ... Techniques for implementing these subtasks are drawn from the fields of psychology, biology, engineering, computer science, and mathematics. Any system design with hopes of performing some useful task must encompass some combination of these subtasks. Therein lies the complexity of a computer vision system.

The computer vision research community generally recognizes the existence of three classes in which these subtasks fall. They are referred to as low, mid, and high-level vision. Each level possesses characteristics, relative to its processing requirements, data requirements, and objectives, that distinguish it from the others. These characteristics are described later in this dissertation. Of importance here is the fact that the characteristics of each level produce a computational burden on a classical von Neumann (serial or sequential) computer in their own way. In order to cope with this computational burden, researchers have turned to parallel processor architectures for the execution of computer vision algorithms [86].

In the study of parallel implementation of computer vision systems, emphasis has been placed primarily on low-level vision algorithms with some work on mid-level algorithms [84] [59] [81] [48] [98] [56]. This is due to the fact that these algorithms must process a large amount of data from the imaging sensor (e.g. a 512x512 matrix of 8-bit intensity data), often at video rates (60 Hz).
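The figures quoted above already imply a substantial raw input rate. A quick back-of-the-envelope check (assuming one byte per 8-bit pixel, as the stated depth suggests):

```python
# Sensor data rate implied by the figures in the text:
# a 512x512 frame of 8-bit (1-byte) pixels arriving at video rate (60 Hz).
pixels_per_frame = 512 * 512              # 262,144 pixels per frame
bytes_per_second = pixels_per_frame * 60  # one byte per pixel, 60 frames/s
print(bytes_per_second)                   # 15728640, roughly 15 MB/s
```

Sustaining this rate while performing even a simple per-pixel operation is what made parallel hardware attractive for low-level vision.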
Furthermore, these algorithms are well defined and generally application independent. For instance, tasks which utilize 3-D range data and those which utilize 2-D intensity data often require similar low-level processing, such as noise removal and edge detection. This similarity exists because at this level the data is merely a matrix of values with little or no semantic interpretation. As a result of this emphasis, parallelization of low-level vision algorithms is well understood, from algorithm design to the prediction of process throughput.

The algorithms that constitute mid and high-level vision present a different situation. Although they utilize some very general techniques, such as searching methods, algorithm details are application dependent. Detection and recognition of objects from 3-D range data requires different procedures than does recognition of objects from 2-D intensity data. Even in the context of a single application there is no generally accepted approach among researchers. This is due to the variety of semantic interpretations and representations available, such as 3-D surfaces or 2-D regions of homogeneous intensity, as well as our limited understanding of how mid and high-level processes are performed in the human mind. Furthermore, accomplishment of mid and high-level vision objectives often requires chaining multiple algorithms with complex interfaces (data exchanges) between algorithms.

Research on the parallel implementation of suites of mid and high-level vision algorithms is scarce. Perhaps this is due to the specific nature of the algorithms. A parallel implementation of a particular algorithm may not be of interest to anyone but its developer. Perhaps it is due to the evolutionary nature of these algorithms. As there is no generally accepted approach to mid and high-level vision problems, algorithms are in a continual state of change.
Investment in a parallel implementation may not be cost-effective. Regardless of the cause, parallel implementations of mid and high-level vision algorithms are scarce.

Thus, parallel implementation of mid and high-level vision algorithms is the topic of this thesis. We investigate the process of developing parallel algorithm implementations and how this process applies to mid and high-level vision algorithms and suites of vision algorithms. In doing so we define a methodology for the development of parallel algorithms that promotes implementations that fare well with regard to the issues of algorithm speedup and processor efficiency as well as system complexity and maintenance. We then apply the methodology to two specific instances of high-level vision algorithms, a heterogeneous suite of low/mid-level vision algorithms, and an algorithm suite consisting of low, mid, and high-level vision algorithms. Although the details of the various algorithms are application specific, they are characteristic of the fundamental categories of computer vision.

1.2 Contributions of the Research

The main contributions of this thesis are:

• A reformulation of the mapping problem (and methodology that implements the reformulation) that promotes parallel implementations of high-level vision algorithms characterized by

1. Significant algorithm speedup,
2. Significant processor efficiency,
3. Minimal system complexity,
4. Software that
   (a) is of comparable complexity to the serial implementation of the same algorithm,
   (b) can be developed in a short amount of time,
   (c) can be maintained with an effort comparable to that of the serial implementation of the same algorithm,
   (d) is portable with minimal effort.

• An understanding of the characteristics of high-level vision algorithms and how they affect the parallel implementation of such algorithms. These characteristics include

1. The use of abstract data structures,
2. The interleaving of numeric and symbolic processing,
3. The extensive use of complex program logic,
4. The existence of data dependent processing,
5. Use of local as well as global information (relative to the imaged scene) in decision making processes.

• Analysis, in terms of parallel characteristics, of two stand-alone high-level vision algorithms, a heterogeneous low/mid-level vision algorithm, and a heterogeneous vision system that encompass a variety of widely used techniques in computer vision:

1. Relaxation labelling (stand-alone algorithm),
2. Tree search (stand-alone algorithm),
3. Linear Feature Extraction (heterogeneous algorithm),
4. Perceptual organization (vision system).

Each algorithm (system) possesses some or all of the characteristics listed above.

1.3 Outline of the Dissertation

The remainder of this dissertation is organized as follows:

• Chapter 2 provides a description of the components of a computer vision system and of a parallel processor architecture, and presents a discussion of why combining the two technologies, parallel processing and computer vision, is both important and non-trivial.

• Chapter 3 provides a survey of current research in the field of parallel processing of computer vision algorithms.

• Chapter 4 presents the "classical" approach to combining parallel processing and computer vision, followed by our proposed methodology for doing so.

• Chapter 5 presents a parallel implementation of an image matching algorithm that utilizes relaxation labelling with symbolic and geometric constraints.

• Chapter 6 presents a parallel implementation of an object recognition algorithm that utilizes tree search with symbolic and geometric constraints.

• Chapter 7 presents a parallel implementation of a linear segment extraction algorithm consisting of edge detection, edge linking, and linear segment approximation.
• Chapter 8 presents a parallel implementation of a perceptual organization system that comprises various algorithms (representative of a typical computer vision system) and then discusses the system issues that arise in parallelizing such computer vision systems.

• Chapter 9 offers discussions, conclusions, and avenues for future research.

Chapter 2

Computer Vision and Parallel Processing

2.1 Computer Vision

2.1.1 Introduction

"...we arrived at the idea of a sequence of representations, starting with descriptions that could be obtained straight from an image but that are carefully designed to facilitate the subsequent recovery of gradually more objective, physical properties..." [61]. To derive the sequence of representations spoken of, a computer vision system must utilize algorithms that operate at various levels of abstraction, that is, algorithms that take, as input, one representation and provide, as output, another representation containing more objective properties. The levels are distinguishable, albeit with fuzzy boundaries, by their inputs and outputs (representations) and by their processing characteristics. The computer vision research community generally recognizes the existence of three levels. We refer to them as low, mid, and high. In the following sections the levels are described in terms of their data structures, processing requirements, and time complexities. Examples of processes from each level are given.

2.1.2 Low-Level Vision

In low-level vision, data is received directly from the imaging sensor in a 2-dimensional array of picture elements or pixels. Each pixel represents an area of the scene being imaged and has a property determined by the sensor type.
Typical sensor types include Visible, which represents objects in terms of light reflectance, Infrared, which represents objects in terms of thermal emittance, and Range Finder, which represents objects in terms of distance from the sensor. The choice of sensor type is application dependent. Also, the processes that constitute a particular low-level vision task may be dependent on the sensor type. That is, although general techniques exist, they may be fine tuned to the type of sensor data.

Operations in low-level vision generalize to small 2-dimensional neighborhood arithmetic operations such as convolution with n x n kernels, where n is typically in the range 3 < n < 11 pixels. Every pixel in the image array is processed in the same manner regardless of its value or location in the array, with the exception of pixels residing near the borders; these are ignored. The output from low-level vision is a 2-dimensional array of pixels, each a function of a small neighborhood of input pixels (figure 2.1). The goal is to transform the sensor data into a representation that will facilitate the extraction of semantic content from the scene by the mid and high-level vision processes. For example, the detection of edges in the input image facilitates the detection of linear features which, in turn, facilitates the detection of runways in an airport scene. Given a square image array of N x N pixels and a fixed-size convolution kernel, the time complexity of low-level vision processes is typically O(N^2).

Low-level vision operations include edge detection, noise filtering, and thresholding. Detailed descriptions of these and other low-level vision processes can be found in image processing texts [29] [78].

2.1.3 Mid-Level Vision

In mid-level vision, data is received from the low-level processes in a 2-dimensional array comparable in size to the sensor image array.
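The n x n neighborhood convolution of low-level vision (section 2.1.2) can be sketched in a few lines. This is our own minimal illustration, not code from the dissertation; the function name and the pure-Python list representation are assumptions, and border pixels are simply left at zero, mirroring the text's note that they are ignored.

```python
def convolve(image, kernel):
    """Naive 2-D convolution; pixels within m = n//2 of the border are ignored."""
    n = len(kernel)                      # kernel is n x n, n odd (e.g. 3..11)
    m = n // 2
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(m, rows - m):         # every interior pixel gets the
        for c in range(m, cols - m):     # same value-independent treatment
            acc = 0
            for i in range(n):
                for j in range(n):
                    acc += kernel[i][j] * image[r - m + i][c - m + j]
            out[r][c] = acc
    return out
```

For an N x N image and a fixed kernel size, the two outer loops give the O(N^2) complexity quoted above; the identical treatment of every interior pixel, independent of its value, is what makes this level so amenable to SIMD parallelization.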
Processing is characterized by the detection of local and global relationships among the array elements. Operations at the mid-level are both numerical and symbolic and utilize a wide variety of data structures including arrays, lists, and graphs.

Figure 2.1: Example low-level vision processing.

The goal of the mid-level processes is to group pixels into extended features and formulate symbolic descriptions of these features in order to build up a description of the entire scene (figure 2.2). The output is the set of symbolic descriptions of the features detected in the scene. The representation is an abstract data structure such as an attributed graph or semantic network. From these detected features and their descriptions, along with descriptions of the relationships between the features, hypotheses can be made with regard to their identities. Detected features include linear and curved segments, regions of homogeneous intensity, texture, or depth, and perceptual line groupings such as anti-parallel pairs or rectangles, to name a few. They are formed based on local relationships between pixels, such as grouping neighboring edgels* to form a line, and global relationships, such as grouping lines to form a surface. Once formed, attributes can be computed for each feature. Attribute types are dependent on feature types; for instance, surfaces may have an area, a perimeter length, and an orientation whereas lines may have a length and an orientation.

*An edgel is a line segment having an extent of one pixel.

Figure 2.2: Example mid-level vision processing.

Processing of data items by mid-level vision processes is conditional, in contrast to the repetitive, unconditional processing of low-level vision. All data items are not processed in the same manner.
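As a sketch of this kind of conditional, data-dependent processing, consider the edgel-linking test discussed next: two edgels join a common chain only when their gradient magnitudes and orientations agree. Everything here (the field names, the tolerance values, the greedy chaining) is our own illustration under assumed conventions, not the dissertation's algorithm.

```python
import math

def similar(e1, e2, mag_tol=0.25, ang_tol=math.radians(15)):
    """Decide whether two edgels are alike enough to link.

    Each edgel is a dict with 'mag' (gradient magnitude) and 'theta'
    (orientation in radians); the tolerances are hypothetical.
    """
    mag_ok = abs(e1["mag"] - e2["mag"]) <= mag_tol * max(e1["mag"], e2["mag"])
    # Compare orientations modulo pi, since a line has no direction.
    d = abs(e1["theta"] - e2["theta"]) % math.pi
    ang_ok = min(d, math.pi - d) <= ang_tol
    return mag_ok and ang_ok

def link_edgels(edgels):
    """Greedily chain consecutive similar edgels into line fragments."""
    lines, current = [], []
    for e in edgels:
        if current and not similar(current[-1], e):
            lines.append(current)   # dissimilar edgel starts a new fragment
            current = []
        current.append(e)
    if current:
        lines.append(current)
    return lines
```

Unlike the convolution of low-level vision, the work done per item here depends on the item's value, which is exactly what complicates the parallelization of mid-level processes.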
For example, two edgels are linked together to form a line if and only if their magnitudes and orientations are similar. This, together with the wide variety of processes, makes it impossible to specify a single time complexity for mid-level vision. Typical values are O(N^2) and O(N^4) for an image of N x N pixels.

Mid-level vision operations include line finding, region labelling, ribbon detection, and surface description from stereo images, to name just a few. These and other mid-level vision processes are described in computer vision texts [71] [61] [37].

2.1.4 High-Level Vision

Due to the symbolic nature of the data, high-level vision receives input from the mid-level processes in an abstract data structure. The data structure stores the symbolic descriptions of the detected objects along with descriptions of the relationships between the objects. Thus, the representation of the entire scene is distributed throughout the structure. Data may also be received from other sources as a priori knowledge required to complete the vision task. Such is the case in model-based vision. This a priori knowledge is similar in form to the detected objects provided by the mid-level processes but includes additional information such as object identities (semantics) and viewing parameters. The output of high-level vision may be as simple as a "yes" or "no" recognition decision or as complex as a fully labelled scene including inferences of occluded or omitted objects and 3D coordinate transformations between the model view and the scene view (figure 2.3).

Figure 2.3: Example high-level vision processing.

Typical high-level processing proceeds by comparing the known model objects to the detected objects. Due to the distributed nature of the representations it is necessary to find both local and global correspondences between the objects in the data structures.
Local correspondences identify similarities between scene and model objects and global correspondences verify that the objects are located in corresponding places within their respective scenes. By detecting both local and global correspondences a confident recognition decision can be made and an accurate view-point transformation can be computed.

High-level vision processing is characterized by the generation of, either implicitly or explicitly, and a search through, a solution space for the desired result. Techniques include graph matching, relaxation labelling, and geometric correlation. Because of the nature of such problems, time complexity is a concern in high-level vision algorithms. One widely used technique, the search for an isomorphism between two attributed graphs, is known to be a member of the NP-complete class of problems [24]. Such results lead researchers to the use of heuristics and specific knowledge domains to help relieve the computational burden of high-level vision processes. Detailed descriptions of high-level vision processes can be found in computer vision texts [71] [61] [37].

2.1.5 Vision Systems as Compared to Other Systems

Computer vision systems possess unique properties that differentiate them from other systems. The properties relate to the processing and data structures used throughout the system and therefore must be thoroughly considered if an efficient system implementation is to be achieved. These properties are described in the following paragraphs.

First and foremost is the heterogeneous nature of the processes as described above in terms of levels. Computer vision systems utilize algorithmic techniques developed in diverse fields such as digital signal processing and artificial intelligence. Providing for these varied techniques in a single system requires careful planning at the highest level of system design.
Decisions made regarding the design of processes at one level may dramatically influence designs at the other levels.

Diversity in processing gives rise to diversity in data structures. Data structures used in a computer vision system include 2-dimensional arrays used by low-level processes, linear and linked-lists used by mid-level processes, and n-trees and graphs used by high-level processes. Data types also vary among the levels. Input images typically consist of 8-bit unsigned integer values received from the imaging sensor. Then, as processing progresses up the levels, the use of long word integers and floating point values is common. Enumerated symbolic types are also readily used at the higher levels.

The relationships among data items are also important in computer vision systems. Data items are ordered in a very specific way. For example, adjacent pixels within the 2-dimensional image array represent areas in the real world that are adjacent, that is, the adjacency relationship is implicit in the data ordering. Maintenance of such relationships is critical throughout all processing. Adjacency and geometrical relationships between objects of mid and high-level processes are maintained via links and pointers within abstract data structures.

All of these properties make it difficult to implement computer vision systems. When considering a serial implementation the designers must carefully design the interfaces between the various processes to assure the maintenance of critical data item relationships. Furthermore, efficient use of memory is difficult since some data items must be maintained throughout all processing stages of the system while others are required only at particular processing stages. This can require complex bookkeeping strategies. Issues such as these are compounded when considered in the realm of a parallel processor architecture.
Maintenance of data item relationships often requires shuffling of items between processing elements, which further complicates the implementation of an already inherently complex system. As stated above, all of these properties of processes and data structures must be considered together if an efficient design and implementation of a computer vision system is to be achieved in any type of processor, serial or parallel. Furthermore, if a parallel processor architecture is specified, various complications are certain to arise due to the natural complexity of the algorithms and due to the rigidity and natural programming difficulties associated with the parallel processor. Algorithms that are not well suited to the characteristics of the parallel processor architecture will force the development of obscure code that is difficult to develop as well as maintain. These difficulties are of primary concern to us as little research has been done towards their solution, yet they are often the main source of system costs and the main reason for avoiding the use of parallel processor architectures.

2.2 Parallel Processor Architectures

2.2.1 Overview

Many applications exist which reach the performance limits of the classical stored-program computers, the so-called serial or von Neumann machines. Computer vision is one such application. Others include weather forecasting, industrial applications, medical diagnosis, and energy research, to name a few. As the demands of these applications surpass the capabilities of serial machines, computer architecture designers have stepped up to the challenge of developing higher performance machines with designs that utilize multiple processing elements (PEs), giving rise to multiprocessor or parallel processor computers. Since their conception in the mid 1960s parallel processors have taken on a variety of shapes and sizes.
In this section we describe the organizational parameters (characteristics) of parallel processor architectures.

2.2.2 Organizations

The most widely cited taxonomy for computer architectures, due to Flynn [21], is based on the multiplicity of instruction and data streams within the architecture. Four classes are identified: 1) Single-Instruction stream Single-Data stream (SISD), which corresponds to sequential machines where one instruction operates on one data item at any given time; 2) Single-Instruction stream Multiple-Data stream (SIMD), where a central controller provides identical instructions to a set of processing elements via a broadcast mechanism; upon receipt of the instruction each processing element applies it to its own data item within its own local memory; 3) Multiple-Instruction stream Single-Data stream (MISD), which is provided for completeness only, as it is unclear exactly what processing architectures fit this class; pipeline architectures, where a data item is operated on by one processing element then passed on to another for a different operation, are the closest to fitting the description of this class; 4) Multiple-Instruction stream Multiple-Data stream (MIMD), architectures which consist of multiple sequential machines, each executing its own instructions on its own data.

This taxonomy provides basic divisions of the space of computer architectures. Among those divisions are three which describe parallel processor architectures, namely SIMD, MISD, and MIMD. Other classes are possible. For example, Multiple Single-Instruction stream Multiple-Data stream (MSIMD) describes an architecture which utilizes multiple SIMD machines, each independently executing their own instructions on their own data sets. These hybrid classes are generally permutations on the SIMD and MIMD classes and, in practice, are somewhat rare.
2.2.3 Architecture Characteristics

The multiprocessor classes SIMD, MISD, and MIMD can be further subdivided based on the characteristics of their memory accessing protocols, the characteristics of the processing elements, the techniques by which they synchronize operations, and the way in which they communicate with one another. All of these items are parameters that must be considered when designing (or selecting for use) a parallel processor architecture. In this section we provide a brief description of each of these design parameters. More complete descriptions can be found in [40].

• Coupling - This refers to the degree of interaction between individual processing elements (PEs) of the architecture. When the degree of interaction is low the system is called loosely coupled and communication proceeds via message passing between PEs. The Cosmic Cube [92] is one example of a loosely coupled system. When the degree of interaction is high the system is called tightly coupled and communication is accomplished via a shared memory. The C.mmp [40] is a tightly coupled MIMD architecture.

• Homogeneity - This refers to the type of processing elements that constitute the system. An architecture consisting of a set of identical PEs is called homogeneous. The Connection Machine [35] is one such machine. If the architecture comprises two or more different types of processing elements it is called heterogeneous. The Image Understanding Architecture [106] is a heterogeneous architecture that consists of a set of PEs that operate in SIMD mode and a set that operate in MIMD mode.

• Synchronicity - This refers to the source of the control timing provided to each of the processing elements within the system. If all PEs operate under control of a single, global clock the system is called synchronous. This is the mode of operation for all SIMD architectures.
When each PE executes instructions under control of its own clock the system is called asynchronous. This is the normal mode of operation for MIMD machines. A popular hybrid technique is a loosely synchronous system. In this type each processing element executes identical instructions, as in a SIMD system, but under the control of its own clock, as in a MIMD system. A primary issue pertaining to the synchronicity of an architecture is communication between PEs. In a synchronous system all PEs communicate at the same time, synchronously. In an asynchronous system communication between PEs occurs arbitrarily and synchronization is left to the programmer. In a loosely synchronous system, synchronization is achieved when all PEs reach a point of communication within the program, although some may reach that point sooner than others and be forced to wait.

• Processing Elements - This refers to the complexity of the processing elements used in the architecture. A wide spectrum of designs is available, bounded by simple, bit-serial PEs on one end and by complex, general purpose microprocessors on the other. Simple PEs lend themselves to dense VLSI packaging of multiple elements on a single integrated circuit. What is given up in instruction set support is gained back in processing speed and minimized communication distances between processing elements. The 3-D Computer [60], the MPP [96], and the Connection Machine [35] are some examples of architectures that incorporate simple processing elements. Systems that utilize complex processing elements in their designs use fewer PEs and require fewer communication paths between PEs. The power of these architectures is a function of the available complex instruction sets, floating point coprocessors, and available "support" chips for each processing element. Examples of such machines are the Cosmic Cube [92] and the iPSC Hypercube [44].
• Communication Network Topology - This refers to the pattern of communication paths among processing elements. The goal of the architecture designer is to minimize the diameter (the worst-case time required for a message to travel from one PE to another, where a direct connection between two PEs is defined as one unit of time) while maximizing the bandwidth (the number of messages that can be sent/received by all PEs in one unit of time). Both of these goals can be met by connecting every processing element to every other processing element but, when a system consists of a large number of PEs, this solution is not feasible due to hardware complexity. Therefore, researchers have proposed a variety of communication topologies which attempt to achieve the versatility of a fully connected topology but require less hardware and have a lower design complexity. Among these are Linear [51], Ring [53], 2-Dimensional Mesh [96], Cube [92], Tree [94], Pyramid [1], Mesh of Trees [77], Mesh of Meshes [77], Star [56], and Common Bus [101], as well as others that are specialized for particular algorithms. As one can see, the limit of such topologies lies solely in the imagination of the designer.

An alternative to the static topologies described above are switching networks [40]. In these systems, rather than have PEs communicate with one another via direct paths between them, messages are passed via a network of switches. Two fundamental switching methodologies exist: 1) circuit switching, where a direct physical path is established (via switches) between two PEs that need to communicate; and 2) packet switching, where data is routed from one PE to another without establishing a direct physical path. In general, circuit switching is best suited to bulk message transfers whereas packet switching is best suited to many short message transfers.
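The diameter of a candidate topology can be checked mechanically from its connection pattern by breadth-first search. The sketch below is our own illustration (hop counts stand in for the unit communication time, and links are assumed bidirectional): with 16 PEs each, a ring has diameter 8 while a 4-dimensional cube has diameter only 4.

```python
from collections import deque

def diameter(adj):
    """Diameter in hops of an undirected graph given as {node: set(neighbors)}."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                         # breadth-first search from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

def ring(n):
    """Ring of n PEs: each connected to its two neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def hypercube(d):
    """d-dimensional cube: nodes are d-bit labels, links flip one bit."""
    return {i: {i ^ (1 << b) for b in range(d)} for i in range(2 ** d)}
```

Bandwidth comparisons can be made in a similar spirit by counting the links each topology provides per unit of time.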
2.2.4 Other Architectures

We have presented the most widely used parameters for describing parallel processor architectures. Most architectures fall into categories defined by these parameters. But there are some that defy classification yet still deserve mention. These are Data Flow, Systolic Array, and Artificial Neural Network.

Unlike control flow computers that execute instructions sequentially under the control of a program counter, Data Flow computers execute instructions when all required operands are available. By doing so, such architectures are capable of achieving maximal parallelism in a program. Various computers have been designed around the data flow concept [26] [25].

Whereas most parallel processor architectures incorporate spatial parallelism by distributing data items among processing elements and allowing each PE to perform a complete operation on its assigned data, Systolic Arrays incorporate temporal parallelism. That is, the computation, as opposed to the data, is partitioned into stages and distributed among processing elements. A computation is not complete until a single data item has passed through each processing element. The Warp processor [51] and the PIPE processor [46] utilize this style of parallelism.

The philosophy adhered to in the design of the parallel processor architectures discussed thus far is to provide a set of programmable processing elements and a set of interprocessor communication paths patterned in a way that minimizes the number of paths and the distance between PEs. Use of the architecture proceeds by distributing data over the processing elements and issuing instructions by which the data is to be manipulated.

An alternative philosophy, derived from the study of biological neural systems, is that of Artificial Neural Networks (ANNs).
The basic idea is to provide tens of thousands of extremely simple processing elements that are densely connected by a dynamic set of communication paths [90]. Such architectures are "programmed" by the adaptation of the connections between processing elements, a process known as learning. The amount of literature on ANNs is continually expanding. For an introduction, a survey of ANN architectures, and references see [58].

2.3 Designing Parallel Processor Systems

Various issues come into play in the decision to utilize parallel processing. They are divided into two major categories. First are issues pertaining to algorithm performance. These are algorithm speedup and processor efficiency. Second are issues pertaining to system life-cycle. These are system complexity and programmer burden. In the following sections we describe the main points of these issues and summarize by discussing how they pertain to computer vision algorithms.

2.3.1 Algorithm Speedup

The first question that arises when considering a parallel implementation of any algorithm is usually "how much of a performance improvement can be expected?" To determine the performance improvement of a parallel algorithm over its sequential counterpart we use a measure called algorithm speedup. Algorithm speedup is defined as the ratio of elapsed time when executing a program on a single processor to the elapsed time when N processors are available. That is, for N processing elements, algorithm speedup is defined as

    S_N = T_1 / T_N

where T_1 and T_N are the elapsed times for 1 and N processing elements, respectively. Amdahl's law [2] states that the degree of speedup obtainable is dependent on the inherently serial portions of an algorithm, those which cannot be performed in parallel, and therefore optimal speedup (linear with the number of processing elements) is rarely achieved.
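Amdahl's law can be made concrete with a short sketch of our own. Splitting T_1 into a serial fraction s and a perfectly parallelizable remainder 1 - s gives T_N = s * T_1 + (1 - s) * T_1 / N, and hence the closed form used below:

```python
def amdahl_speedup(serial_fraction, n_pes):
    """Upper bound on S_N = T_1 / T_N when a fraction `serial_fraction`
    of the work cannot be parallelized."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_pes)

# Even a 5% serial portion caps speedup at 1/s = 20, however many PEs are used:
for n in (1, 16, 256, 4096):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

The loop prints speedups of 1.0, 9.14, 18.62, and 19.91, visibly saturating far below the PE count, which is exactly why the serial portions of a vision algorithm dominate its parallel performance.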
2.3.2 Processor Efficiency

The second question that arises when considering a parallel implementation of an algorithm is "how many processing elements can be effectively utilized?" Processor utilization in a parallel processor architecture is determined by a measure called processor efficiency. Processor efficiency is defined as the average utilization of the available processing elements and can be specified in terms of algorithm speedup, S_N. For N processing elements, processor efficiency is defined as

    E_N = S_N / N

If the efficiency, E_N, of a parallel implementation remains constant (ideally 1) as the number of processing elements, N, is increased, the parallel implementation of the algorithm is said to have achieved linear speedup.

2.3.3 System Complexity

System complexity is an important issue in the implementation of an algorithm on a parallel processor in that it affects all aspects of the design: its efficiency, flexibility, expansibility, cost, etc. Characteristics of parallel processor architectures were described previously. As one might expect, different algorithms favor different characteristics. Furthermore, different operations within a single algorithm may favor different characteristics.

Such a situation brings one to the conclusion that the parallel processor architecture should be heterogeneous to match the requirements of all aspects of an algorithm. Such a system should comprise off-the-shelf microprocessors and custom built special purpose processors along with a variety of communication topologies and programming models in order to achieve generality. Unfortunately, this generality comes at the cost of custom hardware required to provide communication protocols between processing elements of different types and to control the entire system as a single unit.
Furthermore, separate programming models, and probably separate programming languages, must be supplied for each type of processing element. All of these factors add to the cost of a system and place a burden on its user.

Alternatively, a homogeneous parallel processor architecture is less costly to design, build, and program but may introduce processing bottlenecks that lead to inefficiencies in algorithm execution performance. A single, static (non-reconfigurable), homogeneous parallel processor architecture will not satisfy all of the processing requirements of all algorithms.

The task of designing a parallel processor architecture for a specific algorithm reduces to one of optimization, give and take. The most important issue is the analysis of all processing and data requirements prior to the selection of architectural components and to the making of any design decisions. The design must also be flexible in order to meet the needs of an evolving algorithm. As algorithm specifications change, an architecture that cannot also change will be obsolete soon after its development.

2.3.4 Programmer Burden

Software development for a sequential computer is well understood. The task is of complexity comparable to that of the algorithm being coded. On a parallel processor, all aspects of the software task are exaggerated. We refer to this exaggeration as programmer burden.

In general, it is not a single step from algorithm conceptual design (flowchart) to a parallel program. An intermediate step, solution of the mapping problem [7], is required. This is the step which defines how the algorithm will "fit" onto the parallel processor architecture.

Testing and debugging a parallel algorithm are also more complex than for a serial algorithm. Process synchronization is perhaps the biggest obstacle to be overcome.
It can cause an algorithm that executes deterministically on a serial machine to run nondeterministically in parallel. Bugs are difficult to locate as they are often not deterministically reproducible. Parallel software debugging tools have not yet matured and offer little assistance. Programmers must often resort to monitoring all of the PEs with a front end host computer's debugging environment or using extremely slow simulators to assist them in locating program bugs.

Software maintenance is perhaps the most complex aspect of parallel processor software. Whereas programmers of serial machines need only concern themselves with the characteristics of the algorithm, programmers of parallel machines must possess knowledge of both the algorithm and the machine architecture. This includes the capability and number of processing elements, the interactions between processing elements, and the programming model (SIMD, MIMD, etc.). If the parallel processor is heterogeneous the situation is compounded, as the programmers must understand multiple environments. Also, original program designers often resort to obscure (unconventional or unintuitive) techniques to map an algorithm onto a parallel processor. One widely used technique is an unnatural partitioning of data structures to match the PE communication network topology of the machine. Such obscurities are often difficult to comprehend even by the original program designer, much less by the unwary programmer charged with software maintenance.

Parallel programming languages and development tools are still in their infancy. With one or two exceptions, such as OCCAM [43], they are merely serial programming tools amended with parallel constructs and communication primitives via libraries. True parallel debuggers are rare.
When they are offered, they typically provide capabilities such as viewing memory across all processing elements simultaneously and single-stepping all PEs. These are useful for SIMD machines but offer little help in debugging MIMD programs. In general, full-capability debuggers, such as those offered with serial computers, are machine dependent and difficult to develop. Therefore, they are typically not pursued until well after the successful introduction of a machine into the community. Quite often, the parallel processor environment inherits the debugging tools of the serial host computer with some slight modifications to facilitate parallel debugging. Such is the case with the Connection Machine, which uses the Symbolics Lisp Machine as a host along with its Common Lisp environment.

In general, software tools for programming in parallel are far less mature than parallel processing hardware. This makes it very difficult and costly to implement all but the most mature of algorithms on parallel processor architectures. Parallelism compounds the algorithm designer's burden. To achieve a successful implementation a programmer must be cognizant of all aspects of the algorithm, the parallel processor architecture, and the available tools.

2.4 Summary

Computer vision systems incorporate a wide variety of algorithmic techniques ranging in complexity from simple repetitive processing to elaborate rule-based control structures. The amount of active data at any given point in the system execution can range from tens of thousands of individual scalar values to a few multi-field record structures. Each diverse algorithm utilized in a computer vision system taxes a classical von Neumann machine in one way or another.
The multiplications and additions required by low-level convolutional processing are easy for a sequential central processing unit (CPU) to execute, but the number of times they must be executed, the required index computations, and the large number of operand fetches overwhelm it. Conversely, the small number of abstract data structures utilized in high-level vision are easy for a sequential machine to maintain, but the amount of processing and the complex control structures required to search through a solution space containing permutations of the data soon exceed the limits of the machine.

Also, the execution characteristics of computer vision algorithms become less predictable as we move up the hierarchy from low-level to high-level vision. The processing regularity that exists in low-level vision processes, such as convolution, is essentially nonexistent in high-level vision processes, such as object recognition, where data dependencies are widespread. This, too, has an effect on the ultimate performance of the system implementation.

To summarize, computer vision systems challenge serial machines through both data-intensive and compute-intensive approaches. For these reasons researchers have recently begun investigating the implementation of computer vision systems on parallel processor architectures. Due to the diversity of the algorithms, all of the issues described above are especially pertinent to the design of parallel implementations of computer vision systems. Much success has been achieved in implementing low-level vision algorithms on parallel processor architectures [11] [59] [81] [54], due primarily to their simplicity and inherent parallelism. The system designs are relatively simple (2-D mesh connected SIMD machines), giving way to intuitive software designs, and the achieved performance is high.

The situation is somewhat different for mid and high-level vision algorithms.
Of the implementations that have been reported [59] [83] [104], the performance (observed or projected) is high, but the system designs are quite complex and difficult to program. As the computer vision problem is far from being solved, algorithms from mid and high-level vision, especially high, continue to evolve. Such complexity of system design is exceptionally costly in light of algorithm evolution. It is this situation that we wish to avoid.

In the following chapters we review some of the current research in the field of parallel implementation of computer vision algorithms, present our approach to developing such implementations, and provide detailed application and analysis of some implementations derived via our approach. Our concentration is on high-level stand-alone vision algorithms as well as heterogeneous vision algorithm suites.

Chapter 3

Previous Work

3.1 Overview

Current research in the area of mapping computer vision systems onto parallel processor architectures can be placed into three categories: 1) Algorithm/Software development; 2) Architecture/Hardware development; and 3) Theory/Tool development. Of the three categories, Algorithm/Software development encompasses most of the research.

In the Algorithm/Software development category researchers are interested in developing parallel implementations of specific algorithms on specific parallel processor architectures for the sole purpose of achieving short execution times. Emphasis has been placed on low and mid-level vision algorithms, with little research being done on high-level vision algorithms. The parallel processor architectures utilized are thoroughly understood and readily available.

In the Architecture/Hardware development category researchers are interested in developing innovative parallel processor architectures for efficient execution of computer vision systems.
Again, emphasis has been placed on low and mid-level vision algorithms, with preliminary designs specified for implementation of high-level vision algorithms. Researchers proceed by developing theoretical models which are simulated and then built in custom VLSI hardware components.

In the Theory/Tool development category researchers are interested in the characteristics of computer vision algorithms with respect to parallel implementation and in the design of tools to recognize those characteristics autonomously. This category also includes research into methodologies for mapping computer vision systems onto parallel processor architectures and tools to assist in this task.

In the following sections we present summaries of various research projects in the three categories. Also, throughout the remainder of this dissertation, other projects are discussed as they pertain to our research.

3.2 Algorithm/Software Development

3.2.1 Reisis and Prasanna-Kumar

Reisis and Prasanna-Kumar present parallel implementations of a high-level vision algorithm, relaxation labelling for the purposes of scene recognition and stereo matching [83]. The targeted parallel processor architectures are a 2-dimensional mesh connected SIMD machine and a systolic array. In both architectures the processing elements consist of an arithmetic/logic unit and a small number of registers.

The algorithm [64] takes as input two sets of linear segments, one derived from a sensor image and one from a map (or some other a priori knowledge source), or one each from a pair of left-right stereo images. The goal is to find correspondences between the linear segments of each set via relaxation labelling. The algorithm searches through a solution space until a globally consistent set of correspondences is found. The serial algorithm is of complexity O(m^6), where m is the number of linear segments in each set.
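The relaxation-labelling core of such an algorithm can be sketched as follows. This is a generic, serial, textbook-style update, not the authors' formulation; the compatibility values r are assumed given and nonnegative.

```python
def relaxation_step(p, r):
    """One iteration of a generic relaxation-labelling update.

    p[i][l]       -- current confidence that segment i takes label l
    r[i][j][l][m] -- compatibility (assumed in [0, 1]) of segment i
                     taking label l while segment j takes label m
    """
    n, labels = len(p), len(p[0])
    new_p = []
    for i in range(n):
        # Support for each candidate label of segment i, accumulated
        # from the current label confidences of all other segments.
        q = [sum(r[i][j][l][m] * p[j][m]
                 for j in range(n) if j != i
                 for m in range(labels))
             for l in range(labels)]
        # Multiplicative update, renormalised so confidences sum to 1.
        raw = [p[i][l] * (1.0 + q[l]) for l in range(labels)]
        total = sum(raw) or 1.0
        new_p.append([v / total for v in raw])
    return new_p
```

Iterating this step until the confidences stabilize yields a consistent labelling. Note that a single sweep already costs O(n^2 L^2) operations for n segments and L candidate labels, which illustrates how quickly the serial cost of such schemes grows.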
The interesting aspect of this research is the method by which the authors partition the task to fit onto the mesh connected architecture. They develop a preprocessing algorithm that attempts to subdivide the sensor image into rectangular regions each containing k linear segments, where k is the number of processing elements in the architecture. In doing so they claim that all k PEs will be kept busy during each iteration of the process and that communication between processing elements will be local due to the natural data dependencies; that is, linear segments that are adjacent in the image will be processed by adjacent processing elements. One drawback of this approach is that linear segments that cross partition boundaries must be treated as special cases. The authors provide a method for handling this.

For the systolic array implementation, the authors note an advantage over the mesh implementation in that the data space need not be partitioned at all. Instead, the processing steps are distributed over the stages of the systolic array and the data is allowed to flow through the architecture. Details can be found in the reference.

This work presents an excellent example of the mapping problem. The authors were cognizant of both the algorithmic requirements and the parallel processor architectures. In turn, they were able to design an implementation that they claim achieves O(N) speedup and maximal efficiency given N processing elements connected by a 2-dimensional mesh topology, for any value of N. This is linear speedup. The claimed speedup for the systolic array implementation was not as good but still significant. Neither implementation was actually coded and run on the target architectures.

Another interesting aspect of this work is that it is one of only a few studies that investigates the mapping of a high-level vision algorithm onto a parallel processor architecture.
Most other studies either concentrate on low and mid-level vision, postulate how an architecture may execute high-level vision processes, or postulate how a low or mid-level vision process can eliminate the need for high-level vision processes.

3.2.2 Rice and Jamieson

Rice and Jamieson present a simulation environment in which parallel implementations of computer vision algorithms can be studied [84]. The simulation allows the user to program in a high-level language and provides a set of macros and support subroutines to facilitate the development of parallel programs. Given a serial program, the user need not know its intricate detail in order to convert it to a parallel program.

The parallel processor architecture simulated consists of any number of processing elements, specified by the user, that are connected via a cube communication network topology. The PEs operate in loosely synchronous MIMD mode. The data mapping scheme is fixed, with each processing element receiving a strip, either horizontal or vertical, of the input image. All strips are of equal dimension.

The authors present analysis and simulation results for three algorithms drawn from low and mid-level vision. They are: 1) internal/external point classification, where the input image consists of a line drawing of one or more nonoverlapping objects and the goal is to segment the objects from the background based on closed contours; 2) locating the center of mass of each object; and 3) computation of various perimeter statistics for each object.

They found that, given an input image of size 64x64 pixels, significant speedup is achieved through the use of up to eight processing elements. Beyond that, communication overhead begins to dominate over computation and further speedup is insignificant. This is due to the fixed data partitioning scheme. As the number of PEs increases, the width of the strips of image data decreases.
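This effect can be made concrete with a short calculation. The figures below are illustrative only (they assume the 64x64 image and horizontal strips described above, with one row of border pixels on each side of a strip), not numbers from the paper.

```python
def border_fraction(image_rows, num_pes):
    """Fraction of a horizontal strip's rows that are border rows.

    Each PE holds image_rows // num_pes rows; the top and bottom row
    of a strip must be exchanged with the neighbouring PEs.
    """
    strip_rows = image_rows // num_pes
    border_rows = min(2, strip_rows)  # one-row strips are all border
    return border_rows / strip_rows

# Communication cost relative to computation grows with the PE count:
for pes in (2, 4, 8, 16, 32):
    print(pes, border_fraction(64, pes))
```

At eight PEs a quarter of every strip is border data, and at 32 PEs every pixel is a border pixel, which is consistent with the authors' observation that speedup flattens beyond roughly eight processing elements.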
Therefore, the ratio of border pixels to non-border pixels in each strip increases. Since these border pixels must be communicated to neighboring processing elements, the amount of communication between PEs increases.

The conclusion drawn by the authors is that their system can provide the analytic and simulation capabilities necessary to estimate the number of processing elements required to satisfy specific task and speed requirements for a given computer vision system.

3.2.3 Little et al.

Little et al. present mappings of various computer vision algorithms onto the Connection Machine (CM) [59]. Detailed implementations are presented for some low and mid-level vision algorithms, as well as outlines for implementation of some high-level vision and geometrical construction algorithms.

The low and mid-level vision algorithms presented include Marr-Hildreth edge detection [62], Canny edge detection [9], connected component labelling, and Hough Transform computation [38]. The high-level vision algorithms outlined are triangle visibility, where the problem is to determine the visible vertices from a set of opaque triangles placed at random in 3-dimensional space, and graph matching, where all subgraphs of a larger graph G that are isomorphic to a given graph H are sought. The geometrical construction algorithms outlined are computation of the convex hull and Voronoi diagrams for a set of n points [79].

The Connection Machine consists of up to 64K simple bit-serial processing elements, each with 4K bits of memory. Communication among PEs takes place over a 2-dimensional mesh network, the NEWS network, for neighbor communication and over a cube network, the Router, for distant communication. These, together with three powerful primitive operations, provide the CM with its uniqueness and versatility. The three primitive operations are routing, scanning, and distance doubling.
Routing allows any processing element to communicate with any other via the cube interconnection network. This facilitates the diverse communication patterns often required by computer vision algorithms. Concurrent reads and writes are allowed, but at a significant cost in time; therefore the authors adhere to the exclusive-read exclusive-write (EREW) programming model for their algorithms. Finally, combiners are available which combine multiple messages sent to a common destination by operations such as logical AND and OR, summation, and extremum.

Scanning is a set of primitives that utilize the cube connections to distribute and aggregate (reduce) values among a set of processing elements using binary associative operators. The operators include maximum and minimum value computation and summation. Scanning also allows for distributing a single data item over a set of PEs without modifying its value. All scan operations can also operate over the NEWS network, allowing summation, extrema computation, and copying to occur rapidly over rows and columns of processing elements. These operations, called grid-scans, make implementation of various low-level vision algorithms very efficient.

Distance doubling is similar to scanning in that it computes associative operations, but it is applicable to a linked list or ring of processing elements. PEs grouped in this fashion can propagate values, for instance extrema, in logarithmic time.

With these features in mind, the authors present their algorithm implementations, pointing out the usage of specific features to provide easy implementations and significant algorithm speedup. In a later paper [76], the authors, with others, describe the MIT Vision Machine and the role of the Connection Machine and these algorithms in its development.
Their conclusions include the usefulness of the Connection Machine for experimentation and development, but they point out the need for its replacement by VLSI circuitry upon "perfection" of the algorithms in order to achieve the required processing speeds.

3.2.4 DARPA Benchmarks

Rosenfeld provides a summary of the results of a workshop on architectures for Image Understanding (computer vision) [85]. For the workshop, researchers were asked to report on the ability of their architectures to perform seven well-defined computer vision tasks. The tasks, or benchmarks, included edge detection; connected component labelling; Hough transform computation; computation of the convex hull, the Voronoi diagram, and the minimal spanning tree of a set of planar points; visibility computation for a set of opaque triangles in 3-space; finding subgraphs of a given graph that are isomorphic to another given graph; and finding the minimum-cost path between two vertices of an edge-weighted graph. Eight parallel processor architectures were reported on by various research organizations. They are given in the reference.

The goal of the workshop was to gather information about the performance of these architectures on a set of image understanding tasks as well as to define a method for comparing architectures. Various issues were raised with regard to the exercise, including: 1) the requirements of research differ drastically from those of applications; 2) specific algorithms should be used as benchmarks, but not specific implementations; 3) benchmarks should be representative of all levels of computer vision; and 4) a standard technique for reporting results should be specified. Of the benchmarks themselves, concern was voiced in that they were structured as stand-alone tasks, and therefore the operation sequencing of an entire computer vision system was absent.
Furthermore, techniques requiring high-level reasoning about data were not represented at all. The reference includes results reported by the participating organizations as well as details of the benchmark tasks. Details of each participant's work can be found in [88]. The author concludes that the exercise was enlightening and should continue as an ongoing project. In [104], the authors present a second set of benchmarks which addresses some of the issues raised regarding the first.

3.3 Architecture/Hardware Development

3.3.1 Weems et al.

Weems et al. describe a heterogeneous pyramid architecture designed specifically for computer vision applications, called the Image Understanding Architecture (IUA) [106]. The design of the IUA was driven by the processing requirements of the three levels of abstraction of a computer vision system. Each abstract level has been mapped onto a pyramid level that comprises processing elements well suited to performing the appropriate operations (figure 3.1).

The lowest level of the pyramid is called the Content Addressable Array Parallel Processor (CAAPP). It consists of a 512 x 512 square grid of 1-bit serial processing elements that operate in SIMD or multiple-SIMD modes. The PEs are linked via a 2-dimensional mesh topology for local communications as well as a coterie network which allows independent groups of processing elements to be formed. The intent of the CAAPP level is to perform the pixel operations of low-level vision. Results from the CAAPP are reported to the intermediate level via associative processing and global feedback mechanisms. These aid in achieving the stated project goal of building an automated real-time computer vision system as opposed to just a fast, simple image processing system.
Figure 3.1: The Image Understanding Architecture (Symbolic Processing Array, Intermediate and Communications Associative Processor, Content Addressable Array Parallel Processor).

The intermediate level of the pyramid, called the Intermediate Communications Associative Processor (ICAP), is responsible for mid-level vision processes. It consists of a 64 x 64 square grid of digital signal processors. Each ICAP processing element is associated with an 8 x 8 grid of CAAPP PEs. It has direct access to data stored in any one of its 64 CAAPP processing elements as well as global summary information for those PEs. Communication between the ICAP processing elements is via a 2-dimensional mesh topology for local communication and a cube topology for long-distance communication. The ICAP processing elements are capable of operating in synchronous-MIMD and pure MIMD modes. Control is provided by PEs at the topmost level of the pyramid.

The topmost level of the IUA is called the Symbolic Processing Array (SPA). Although the design has not yet been completed, initially it calls for an 8 x 8 square grid of Motorola M68020 microprocessors. An alternative design utilizes a shared memory multiprocessor. In either case, the mode of operation will be MIMD and programming will be done in LISP [97].

The primary processing framework of the SPA is a blackboard model [16]. This provides the knowledge-based processing utilized in high-level vision processing. Various schemas will, when activated, provide focus-of-attention control over PEs at the ICAP and, ultimately, the CAAPP levels. The system will operate in both top-down and bottom-up modes.

Prototypes of the IUA exist as well as software simulations. It is currently being built in VLSI by the University of Massachusetts in collaboration with the Hughes Aircraft Company.
The authors present various computer vision scenarios for implementation on the IUA as well as estimated execution times for various computer vision algorithms.

3.3.2 Kuehn et al.

The research done by Kuehn et al. [48] reflects the authors' goal of designing a flexible parallel processing system that can be dynamically reconfigured to meet the processing needs of image and speech analysis tasks. Their proposed system is called PASM, a PArtitionable SIMD/MIMD machine. To achieve the desired flexibility, the system design was driven by the characteristics of algorithms in the specified domains. These algorithms influenced decisions ranging from high-level system layout to low-level VLSI chip selections and designs.

They present an algorithm for extracting contours from gray-level images. It is divided into two phases. The first phase selects a threshold, based on intensity gradients, for binarizing the image. The second phase binarizes the image and then traces the boundary contours of the objects in the scene. This particular algorithm was selected because the processing requirements of its two phases are quite different from one another. Given that the input image is partitioned so that each processing element receives a square patch of equal size, the first phase dictates a SIMD machine with regular, synchronous, local communication patterns. Each image patch is processed in the same way regardless of content. In the second phase of the algorithm, processing is data dependent and communication is irregular and asynchronous, although still local, thus dictating a MIMD machine. Details of the algorithm are given in the reference.

Given an algorithm that comprises such drastically different processing requirements, the authors set out to design a parallel processor architecture suited to the task.
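The two-phase structure of the contour algorithm can be sketched serially. This is an illustrative stand-in, not the authors' implementation: the gradient-weighted threshold rule and the 4-neighbour contour test below are assumptions made for the sketch.

```python
def extract_contours(image):
    """Two-phase contour extraction in the spirit of the PASM example.

    Phase 1 is regular and content independent (SIMD-like); phase 2's
    work depends on what the image contains (MIMD-like).
    """
    rows, cols = len(image), len(image[0])

    # Phase 1: pick a binarisation threshold from intensity gradients.
    # Here: mean intensity weighted by local gradient magnitude (an
    # assumed stand-in rule; the paper's criterion differs).
    wsum = gsum = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            g = (abs(image[r][c + 1] - image[r][c])
                 + abs(image[r + 1][c] - image[r][c]))
            wsum += g * image[r][c]
            gsum += g
    threshold = wsum / gsum if gsum else 0.0

    # Phase 2: binarise, then keep object pixels that touch background
    # (or the image edge) -- these form the boundary contours.
    binary = [[1 if v > threshold else 0 for v in row] for row in image]
    contour = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and any(
                    not (0 <= i < rows and 0 <= j < cols) or not binary[i][j]
                    for i, j in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))):
                contour.append((r, c))
    return threshold, contour
```

The point of the sketch is the contrast: phase 1 visits every pixel identically, while the amount of work in phase 2 depends entirely on the objects present in the scene.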
At the highest level of design, the machine must be able to operate in both SIMD and MIMD modes. Furthermore, it should be able to switch between modes dynamically and under its own control. Next, although it is not required by these algorithms, the machine should be able to partition itself into independent sections, each capable of operating in its own required mode, SIMD or MIMD, effectively becoming multiple machines. Finally, the processing elements should be powerful enough to support the computational needs of these and other computer vision tasks without having to resort to software libraries; that is, they should have hardware support for floating point operations as well as special functions such as square root, sine, cosine, etc. As an additional feature, SIMD-mode communication should be performed in direct memory access (DMA) fashion to allow overlapping of computation and communication.

With these criteria, the authors set out to design a parallel processor architecture, the details of which are outlined in the reference. The primary functional sections are: 1) the Parallel Computation Unit, made up of the processing elements and the communication network; and 2) the Micro Controller, which houses a set of controllers each responsible for controlling a set of PEs in SIMD mode or for coordinating a set of PEs in MIMD mode. It is the Micro Controller that performs dynamic reconfiguration and partitioning. A machine consisting of 16 processing elements and four controllers, all Motorola MC68000s, is currently being built. Custom VLSI is being built to handle communications, DMA operations, hardware semaphores (test-and-set), and other functions.

The authors believe that, because the design is based on real algorithmic requirements, PASM satisfies the requirements of a complete computer vision system.
They have shown evidence toward this end by simulation of an algorithm that they feel is representative of these requirements.

3.4 Theory/Tool Development

3.4.1 Stout

Stout investigates the mapping of three low and mid-level vision algorithms onto five parallel processor architectures [98]. The algorithms are: 1) reduction, in which each processing element holds a value and all values must be combined to form a single result; 2) extreme point location, in which the convex hull, the smallest convex polygon containing a planar set of pixels, is determined; and 3) labelling, in which all dark pixels are assigned a label and two pixels have the same label if and only if they are connected by a path of adjacent dark pixels.

The architectures studied all consist of a large number of simple processing elements, each with a small amount of memory, operating in SIMD mode. The communication networks differentiate the architectures. Those studied include: 1) a 2-dimensional mesh; 2) a cube; 3) a homogeneous pyramid; 4) a mesh of trees; and 5) a parallel random access machine, or shared memory machine.

The author's primary emphasis is on transferring programs between architectures, thus relieving the programmer of the burden of understanding the architecture and allowing sole concentration on the algorithm. To this end he postulates three approaches. The first is to develop simulations of each machine on each other machine. Having done so, every algorithm need only be developed once for any one machine; it can then be executed via simulation on all other machines. Although this technique makes all algorithms available on all machines, the author concludes that they are most likely to be inefficient on any architecture other than the original target machine (as opposed to the simulated machines).
The second approach involves defining an ideal architecture for every algorithm and then simulating this architecture on each actual architecture. The author points out that this approach entails a major effort, as each algorithm favors a different ideal architecture. Furthermore, some algorithms favor multiple ideal architectures, and the effort in simulating each varies with the actual architecture.

The third approach requires describing the algorithms in terms of standard data manipulation and data movement operations. Having done so, the programmer need only implement these standard operations on each architecture once. This technique is analogous to the task of compiler writing, where a single lexical analyzer/parser is developed for use on all machines and individual code generators are developed for each machine. Although this approach is the most flexible and easiest to implement, it too has drawbacks in algorithm execution efficiency, as pointed out by the author. The standard operations may not take full advantage of the features of a given machine or of processing shortcuts applicable to a given algorithm.

Each of the approaches is studied in depth using the three named algorithms as research vehicles. The author concludes that in the foreseeable future, programmers will not be relieved of the burden of fully understanding the target parallel processor architecture. To meet the performance requirements of computer vision systems the programmer must know both the algorithm and the architecture.

3.4.2 Levitan

Levitan's stated goal is to develop and evaluate a set of measures, or metrics, for parallel communication structures and architectures [56]. His primary interests lie in classifying and characterizing parallel systems and, ultimately, the development of a theory of parallel computation complexity.
The approach taken is to examine several algorithms as implemented on several parallel processor architectures. The algorithms studied are well understood as serial algorithms and involve conditional decision making rather than numerical computation. As stand-alone algorithms their usefulness is limited, but they encompass techniques often used in larger, application-specific programs. The algorithms are: 1) broadcasting, the task of sending a single message to all processing elements; 2) reporting, the task of gathering a response to a query from every processing element; 3) extrema finding, searching a set of values distributed among processing elements for a minimum or maximum; 4) packing, the movement of data items among processing elements to eliminate spaces but retain relative ordering; 5) sorting; and 6) minimum spanning tree computation of a graph. These algorithms were also selected because they span a wide range of complexity and communication requirements.

The parallel processor architectures were chosen to be representative of current or proposed machines. They include: 1) the Content Addressable Parallel Processor, which is a SIMD machine with additional broadcast and response circuitry; 2) a star connected machine; 3) the Broadcast Protocol Multiprocessor, which is a star connected machine with the hub being a single register rather than a processing element; 4) a linear connected machine; 5) a tree connected machine; 6) a shuffle connected machine; 7) a fully connected machine; and 8) a fully connected machine with CAPP, where each processing element has a Content Addressable Parallel Processor as part of its local memory.

Each machine is classified based on metrics of diameter and bandwidth. A performance analysis is provided for each algorithm on each machine, and a ranking of the machines based on these analyses is derived.
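The diameter metric — the worst-case number of links on a shortest path between two PEs — can be tabulated for several of the topologies above. The closed forms below are the standard ones for these networks, assumed here rather than taken from the reference.

```python
import math

def diameter(topology, n):
    """Worst-case hop count between any two of n processing elements."""
    if topology == "linear":
        return n - 1                 # end to end of the chain
    if topology == "star":
        return 2                     # PE -> hub -> PE
    if topology == "full":
        return 1                     # every pair directly linked
    if topology == "cube":
        return int(math.log2(n))     # hypercube, n a power of two
    if topology == "mesh":
        side = math.isqrt(n)         # 2-D mesh, n a perfect square
        return 2 * (side - 1)
    raise ValueError(f"unknown topology: {topology}")

# A machine with 64 PEs under each interconnection scheme:
for t in ("linear", "star", "mesh", "cube", "full"):
    print(t, diameter(t, 64))
```

The spread — from 63 hops on a linear chain down to a single hop on a fully connected machine — is exactly what makes diameter a useful first-order predictor of an algorithm's communication cost on a given machine.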
As one might expect, different algorithms were suited to different machines. The author proceeds to relate the architecture metrics to algorithm performance and thus to evaluate the ability of the metrics to predict the performance of a given architecture on a given algorithm.

The final conclusion is that the selected algorithms and metrics can form the basis for a characterization of the communication structures of parallel architectures and the communication needs of parallel algorithms. The author hopes that this will ultimately lead to the formulation of a parallel complexity theory.

3.5 Summary

In summary, research on the parallel implementation of computer vision algorithms has emphasized that of low and mid-level algorithms [84] [59] [48] [98] [56]. We believe that the reasons for this emphasis are: 1) these algorithms must process a large amount of data from the imaging sensor (e.g. a 512 x 512 matrix of 8-bit intensity data), often at video rates (60 Hz); 2) these algorithms are well
This is due to the variety of semantic interpretations available, such as 3-D surfaces or 2-D regions of homogeneous intensity, as well as our limited understanding of high-level processing.

Research on the parallel implementation of high-level vision algorithms is scarce. Perhaps this is due to the specific nature of the algorithms: a parallel implementation of a particular algorithm may not be of interest to anyone but its developer. Perhaps it is due to the evolutionary nature of these algorithms: as there is no generally accepted approach to high-level vision problems, algorithms are in a continual state of change, and investment in a parallel implementation may not be cost-effective. Perhaps it is due to the perceived difficulty of parallelizing these algorithms: the complexity of high-level vision tasks necessitates complex algorithms when cast as serial implementations, and parallelizing them further contributes to this complexity, resulting in implementations that may be difficult to conceive as well as maintain. Regardless of the cause, parallel implementations of high-level vision algorithms are scarce.

Another emphasis in the parallel implementation of computer vision algorithms has been on the computational performance of the resulting implementation. Issues of algorithm speedup and processor efficiency are of primary concern. System life-cycle issues such as development cost and maintenance cost are neglected. For research centered around low and mid-level vision tasks this trend (emphasized performance issues, neglected life-cycle issues) is less serious, as the algorithms are conceptually simple and typically not subject to change.
In high-level vision, where processing is complex and algorithms tend to evolve, this trend can result in a major undertaking to get a parallel implementation running initially (design, coding and debugging) as well as to modify it when algorithm "refinements" are specified.

The scarcity of parallel implementations of high-level vision algorithms and the importance of system life-cycle costs provide the motivation for our research. We are interested in investigating the inherent complexities of high-level vision algorithms and how those complexities can be accommodated in a parallel implementation. Specifically, we are interested in four implementation characteristics.

• Conventional, quantitative characteristics.

1. Algorithm Speedup as previously defined.

2. Processor Efficiency as previously defined.

• Proposed, qualitative characteristics.

3. System Complexity is characteristic of the amount of algorithm restructuring required in order to delineate the inherent parallelism of the algorithm. A good implementation, in terms of system complexity, is one that requires minimal control logic (not directly attributable to the algorithm) in order to produce the parallel implementation.

4. Programmer Burden is characteristic of the degree of difficulty in developing and maintaining the parallel algorithm implementation. A good implementation, in terms of programmer burden, is one that contains minimal differences between the software of the serial implementation of an algorithm and that of the parallel implementation.

The first two characteristics are of interest because they provide a means for evaluating the implementation with regard to run-time performance. The last two characteristics are of interest because they provide a means for evaluating the implementation with regard to system life-cycle issues.
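The two quantitative characteristics carry their conventional definitions: speedup is serial time over parallel time, and efficiency is speedup normalized by processor count. A minimal sketch, assuming those standard textbook formulas rather than anything specific to this dissertation:

```python
# Hypothetical sketch of the conventional definitions; the function and
# variable names are illustrative, not taken from the dissertation.
def speedup(t_serial, t_parallel):
    """Algorithm speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_processors):
    """Processor efficiency: speedup per processor.
    A value of 1.0 corresponds to ideal (linear) speedup."""
    return speedup(t_serial, t_parallel) / n_processors

# A run taking 100 s serially and 8 s on 16 processors:
s = speedup(100.0, 8.0)         # 12.5
e = efficiency(100.0, 8.0, 16)  # 0.78125
```

The two qualitative characteristics, by contrast, have no closed-form measure; the dissertation treats them as design objectives rather than quantities.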
In the "classical" approach to developing a parallel algorithm implementation, a parallel processor architecture is specified and then the algorithm is mapped onto it. In our approach, we perform some basic analysis steps in order to identify the inherent parallelism of the algorithm. We then specify the components of a parallel processor architecture that is well suited to the requirements of the algorithm. That is, rather than map the algorithm onto the parallel processor architecture, we match a parallel processor architecture to the algorithm. Through this approach we are able to address the issues of system complexity and programmer burden as well as algorithm speedup and processor efficiency.

In the following chapters we present the "classical" approach to developing a parallel algorithm implementation, our methodology for developing a parallel algorithm implementation, and results of the application of our methodology to various high-level vision algorithms. We also present the application of our methodology to a low/mid-level algorithm suite and a low/mid/high-level algorithm suite. The reason for doing so is that high-level vision algorithms cannot exist on their own; they must be provided with data from the low and mid-levels. The inclusion of all of the processes in the design of a parallel processor architecture introduces additional constraints on the problem. Therefore some algorithm suites representative of these computer vision systems are included.

Chapter 4

The Methodology

4.1 Technical Problem

The determining factor with regard to the algorithm speedup and processor efficiency achieved by a parallel algorithm implementation is the solution of the
mapping problem [7]. A formal statement of the mapping problem is:

the search for a correspondence between the interaction pattern of the algorithm processes and the communication network topology of the architecture.

A good solution, or mapping, is one that minimizes the communication overhead and thus maximizes the algorithm speedup and processor efficiency. In the following sections we discuss the "classical" approach to solving the mapping problem and how it applies to computer vision algorithms. We then present a reformulation of the mapping problem and a discussion thereof which leads to our methodology for the parallel implementation of high-level computer vision algorithms.

4.1.1 The Mapping Problem

4.1.1.1 Description

Given a target parallel processor architecture and an algorithm to be implemented, the task at hand is conceptually simple: write code that implements the algorithm on the architecture so as to maximize algorithm speedup and processor efficiency. This task is broken down into two fundamental steps. The first step is solving the mapping problem for the given architecture and algorithm. The second step is the software development.

In its "classical" formulation, the mapping problem addresses the situation: given an algorithm and a target parallel processor architecture, develop an implementation of the algorithm that achieves maximal algorithm speedup and processor efficiency on the target architecture (figure 4.1).

Figure 4.1: Application of the classical mapping problem.

Solution of the mapping problem proceeds in two steps: the first is partitioning the algorithm into independent processes and the second is assignment of the processes to individual processing elements.
Briefly stated, partitioning involves determining the set of independent processes that constitute the algorithm, and assignment is the placement of those processes onto processing elements so as to reduce the physical distance between processes that must communicate with one another.

Of primary influence on the partitioning of an algorithm and the assignment of its independent processes to processing elements is the target parallel processor architecture. Its organization (programming model, communication protocols, synchronization requirements, processing elements, communication network topology, etc.) places constraints on the design of the parallel implementation. For example, a SIMD machine of N processing elements constrains the design to N processes that must execute the same instructions in lock step and must communicate simultaneously and synchronously. Conversely, a MIMD machine of N processing elements constrains the design to N processes, but each may execute a unique set of instructions and pairs of processes may communicate asynchronously and independent of the others.

Mapping problem solutions fall into two categories, static and dynamic. In static mapping, process partitions and processing element assignments are determined by the designer prior to program execution. That is, once a process is assigned to a processing element, it remains there throughout program execution. In dynamic mapping, the process partitions are determined by the designer prior to execution but the processing element assignments are determined during execution, either by an operating system or by logic within the program itself.
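The two mapping categories can be contrasted with a small sketch. This is a hypothetical illustration, not a scheme from the dissertation: the static policy fixes assignments before execution, while the dynamic policy (here, a simple least-loaded heuristic standing in for an operating system scheduler) decides placement as processes arrive.

```python
# Hypothetical sketch contrasting static and dynamic mapping for n_proc
# processing elements. Names and the load-balancing policy are illustrative.
def static_mapping(processes, n_proc):
    """Assignments fixed before execution: process i -> PE (i mod n_proc)."""
    return {p: i % n_proc for i, p in enumerate(processes)}

def dynamic_mapping(processes, n_proc, cost):
    """Assignments made at run time: each process goes to the PE with the
    least accumulated load so far (a simple load-balancing policy)."""
    load = [0] * n_proc
    placement = {}
    for p in processes:
        pe = load.index(min(load))   # least-loaded PE right now
        placement[p] = pe
        load[pe] += cost(p)
    return placement

procs = ["p1", "p2", "p3", "p4"]
fixed = static_mapping(procs, 2)     # {'p1': 0, 'p2': 1, 'p3': 0, 'p4': 1}
# With one expensive process, the dynamic policy routes the cheap ones away:
balanced = dynamic_mapping(procs, 2,
                           cost=lambda p: {"p1": 9, "p2": 1, "p3": 1, "p4": 1}[p])
```

With the costs above, static mapping leaves PE 0 with a load of 10 against PE 1's 2, while the dynamic policy places all three cheap processes on PE 1, illustrating why data-dependent workloads favor dynamic mapping.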
That is, the system designer has no a priori knowledge of what processes will be assigned to what processing elements and, furthermore, an individual process may be moved from one processing element to another during program execution.

The constraints placed by the target parallel processor architecture on the algorithm mapping can place great difficulties on the designer. In order to achieve significant algorithm speedup and processor efficiency it may be necessary to devise an obscure, unintuitive solution to the mapping problem. Such a solution tends to be plagued with system overhead due to process communication and data dependencies and thus achieves far less than optimal speedup. Furthermore, it may require a great deal of algorithm restructuring, so that a bug-free serial implementation provides no assistance in debugging the parallel implementation. Finally, algorithm modifications may require additional restructuring, thus requiring a major redesign of the parallel implementation.

4.1.1.2 Computer Vision and the Mapping Problem

The mapping problem is of special interest to those implementing computer vision systems on parallel processor architectures. As described above, computer vision systems comprise a variety of algorithms. The algorithms are diverse, ranging from simple, repetitive processing to complex data-dependent processing. Furthermore, a variety of data structures, ranging from 2-D arrays of scalar values to structures of multi-field records, are utilized. Not surprisingly, each algorithm favors a particular architectural organization but still, the tendency has been to partition the algorithm based on the 2-D image array, as it is the basis for computer vision. In doing so, each processing element is assigned an area of the array on which to operate.
This is acceptable for low-level vision tasks, where data dependencies exist primarily between adjacent pixels within the array. It may not be the best choice for partitioning the processes of mid and high-level vision tasks, where perceptual groupings of pixels give rise to objects that span the image array, thus deeming pixel adjacency relationships insignificant.

Static versus dynamic mapping is also an issue in the parallel implementation of computer vision systems. In low-level vision tasks the algorithms are repetitive, of relatively short duration, and are not data dependent. Thus, load balancing is trivial and a solution using static mapping is sufficient. On the contrary, high-level vision tasks use algorithms that are highly data dependent and require complex, time-consuming processes that do not require a lot of interprocess communication. A static mapping can lead to an unbalanced work load and a system that achieves poor performance. A solution based on dynamic mapping may provide better measures of algorithm speedup and processor efficiency in spite of the additional communication overhead.

Due to the complexity and heterogeneous nature of the processing in computer vision, the difficulties introduced by the mapping problem are exaggerated when confronting such algorithms. Since an algorithm comprises a variety of process control and data structures, each structure must be considered when solving the mapping problem. Each structure may favor a different set of architecture organizational parameters, none of which matches those of the given parallel processor architecture. Therefore, trade-offs must be made to determine the best mapping. The end result is a design that is difficult to implement and maintain and achieves less than optimal performance.
4.1.2 Software Development

While the cost of computer hardware is on the decline, the cost of software continues to soar, especially that of custom software [6]. As computer vision is a topic of continuous research, software to implement such systems definitely falls under the label of custom. Due to the complexity of the algorithms and the diversity of the fields from which they are derived, the software to implement such algorithms is necessarily complex. It is not out of the ordinary for a single high-level vision algorithm to contain processes that perform signal, image, statistical, geometrical, and symbolic processing. Software development for such an algorithm is a major undertaking on a serial computer, where the programmer need only be concerned with the algorithm itself. The degree of difficulty for developing a parallel implementation on a given architecture is dramatically higher. As stated earlier, one must be cognizant of all of the organizational parameters of the target parallel processor architecture.

Boehm [6] gives various methods for increasing software productivity (decreasing software development and maintenance time). Among them are:

• provide efficient software development environments,
• eliminate steps,
• build simpler products, and
• reuse components.

An additional issue, pertinent to parallel processing, is the "dusty deck problem," a special case of reusable components. It addresses the issues involved in moving an algorithm implementation from a serial computer to a parallel processor where the goal is to minimize modifications to the code (ideally no more than a recompilation). Due to the complexity of some computer vision algorithms, this minimization is desirable to ascertain that algorithm functionality is not altered in order to utilize particular features of the parallel processor architecture.
Boehm was considering sequential computing in deriving his list of software issues and the methods for addressing them. Confronting these issues in the realm of parallel processing is not as easy as in serial processing, due mainly to the diversity of available parallel processor architectures. Most efforts focus on the development of tools to assist the user in solving a specific instance of the mapping problem. These include tools for specific machines or classes of machines [107], the development of a set of fundamental primitives used in parallel programming and the porting of these primitives to various machines so as to promote program portability [23], and the development of generic simulation tools to model various architectures [3]. In [103] [82] [105] the authors discuss tools designed specifically for parallel implementation of computer vision algorithms (low and mid-level). Some of these tools assist in the initial design and implementation of parallel software, whereas others provide a "comfortable" maintenance (debugging) environment.

Except for the development of parallelizing compilers, which are machine specific, the existing tools for parallel programming require the user to solve the mapping problem, at least partially. As stated previously, this requires intimate knowledge of the algorithm as well as the target parallel processor architecture, and probably major coding activities. The "dusty deck" problem is typically not addressed. Furthermore, if an algorithm is not well suited to the target architecture, the resulting software structure is often quite obscure, in that it comprises code that is representative of the architectural features more so than the algorithm that it implements. Such an implementation is difficult to program initially, difficult to debug, and difficult to maintain and modify at a later date.
While the basic structure of the parallel software is obtained in solving the mapping problem, quite often the task of programming and debugging the design is monumental. A mapping between an algorithm and a parallel processor architecture may be conceptually simple in its most abstract form but many details must be "filled in" during the actual programming. Details pertaining to process communication, synchronization, and programming model must be dealt with. Because high-level vision algorithms are complex and evolutionary, these details can pose grave difficulties in algorithm implementation. Examples of the difficulties in dealing with these details in the realm of high-level computer vision algorithms are cited in [83] and [59]. Each presents a mapping of a high-level vision algorithm onto a parallel processor architecture. As testimony to the difficult nature of software development for a parallel algorithm implementation, neither attempted to actually write and debug code. Authors of the latter work even admit that "there are many thorny issues ... yet to be analyzed."

4.1.3 Summary

The design of a parallel algorithm implementation under the constraints of the architecture can be difficult if the structure of the inherent parallelism of the algorithm is dissimilar to that of the architecture. A complex algorithm tends to magnify this difficulty. The resulting implementation is one that achieves a degree of algorithm speedup that is less than the desired optimal or linear speedup and requires software that is difficult to design, debug, and modify.

In light of these difficulties and the lack of robust automated tools available to contend with them, we have chosen a "middle of the road" approach between automatic parallelization and brute force programming. The basis for our approach is a reformulation of the mapping problem.
Although our reformulation lacks some of the mathematical rigor of the "classical" mapping problem, it does lend itself to designs that are empirically pleasing. In the following sections we present our reformulation along with the methodology for developing parallel implementations of high-level computer vision algorithms based on the reformulation.

4.2 An Algorithm Driven Approach

4.2.1 Overview

To perform the tasks of algorithm partitioning and process assignment (i.e. solve the mapping problem), the designer must be cognizant of both the algorithm and the target parallel processor architecture. The solution may be relatively simple if the architecture is well suited to the algorithm. But, in the event that the architecture is not well suited to the algorithm, the features of the architecture become the driving factor in the search for a good solution to the mapping problem. This often results in an extremely complex implementation that does not achieve a significant amount of speedup. That is, the difficulties in developing parallel algorithms are brought about by the constraints placed on the mapping problem solution by the parallel processor architecture organization.

To remove the influence of the architectural features from the design process and, in turn, alleviate the difficulties in developing and maintaining parallel algorithm implementations, we utilize a reformulation of the mapping problem. This reformulation and its utilization are discussed in the following sections.

4.2.2 The Mapping Problem — Reformulated

Our version of the mapping problem addresses the situation: given an algorithm, develop a parallel implementation of the algorithm that achieves maximal algorithm speedup and processor efficiency and requires minimal system complexity and programmer burden (figure 4.2).
We have removed all mention of the parallel processor architecture from the problem statement and included objectives pertaining to the system life cycle. The reformulated mapping problem may now be stated as:

the search for a parallel processor architecture that satisfies the requirements of the algorithm processes.

By removing the specification of the target parallel processor architecture from the problem statement, we encourage the designer to investigate all of the parallel characteristics of the algorithm without any constraints. Furthermore, rather than place primary emphasis on the communication network topology of the architecture and the communication patterns of the algorithm, as does the classical formulation of the mapping problem, our reformulation places equal emphasis on all of the organizational parameters of an architecture. Through the identification of the algorithm's inherent parallelism the designer will implicitly specify a parallel processing architecture that is well suited to the algorithm. Together, the algorithm's parallel characteristics and the architecture will specify an implementation. Because the algorithm and architecture are well suited to one another, the implementation will achieve significant algorithm speedup and processor efficiency as well as low system complexity and programmer burden.

Figure 4.2: Application of the reformulated mapping problem.
Specifically, addressing the issues brought out by Boehm, this reformulation of the mapping problem

• provides an efficient software development environment, in that the software for the parallel implementation is similar to that of the serial implementation and, therefore, software development can take place on a serial computer,

• eliminates steps, in that the software for the parallel implementation is not written from scratch but is a modification of the serial software,

• creates software that is simple, in that it is based on the algorithm that it implements rather than the parallel processor architecture on which it is implemented, and

• creates software that is reusable, in that it is portable (with slight modifications) between the serial computer and the parallel processor architecture.

Furthermore, because of the similarities between the serial and the resulting parallel code, the "dusty deck" problem is implicitly addressed. Although the ultimate solution of a mere recompilation is not achieved, a reasonable solution requiring much less than a total code rewrite is achieved. In the next section we describe the steps that make up the methodology.

4.2.3 The Methodology

To achieve a parallel implementation of a high-level computer vision algorithm that possesses the qualities of significant algorithm speedup and processor efficiency and minimal system complexity and programmer burden, we have devised a two stage process (figure 4.3). The goal of the first stage is to expose the primary drivers of the algorithm's complexity along with the inherent parallelism of those primary drivers. We refer to this first stage as the coarse grain analysis stage. The result of the coarse grain analysis stage is a set of independent processes

P = {p_i | 1 <= i <= n}

of the algorithm A such that

p_1 U p_2 U ... U p_n = A

and a description of the communication patterns among the processes p_i.

In the second stage we investigate each of the independent processes p_i to identify any inherent parallelism contained within them. This stage is referred to as the fine grain analysis stage. The result of the fine grain analysis stage is a set of sets of independent subprocesses

S = {{p_ij | 1 <= j <= m_i} | 1 <= i <= n}

of the processes P such that

p_i1 U p_i2 U ... U p_im_i = p_i for all i,

and a description of the communication patterns among the subprocesses p_ij. Note that this communication is restricted to be between subprocesses of the same set.

Figure 4.3: The two fundamental stages of our methodology.

To determine the "coarse" independent processes that make up the algorithm (coarse grain analysis stage) and the "fine" independent processes that make up the "coarse" processes (fine grain analysis stage), each stage comprises four basic steps (figure 4.4). They are:

• Control Structure Analysis

In this step we identify the independent processes that constitute the algorithm through inspection of the processing constructs. Of primary interest are iterative constructs (loops) that determine the overall complexity of the algorithm and offer potential for parallelization. This step results in the identification of the inherent parallelism contained within the algorithm.

• Data Structure Analysis

In this step we determine the data requirements of each process identified above. The result of this step is the specification of which data structures to partition and how to partition them (distribute them among processes).
This step, together with the previous step, is roughly analogous to the partitioning step of the classical mapping problem.

Figure 4.4: Steps involved in each of the fundamental stages of the methodology.

• Communication Analysis

Identification of the independent processes and the data structure partitioning scheme will determine the communication requirements between the processes. That is, a data structure may be distributed among processes such that one process is assigned a data item required by another process to complete its task. In this step such requirements are determined, as well as the appropriate communication protocols for their implementation, such as synchronous message passing among all processes, asynchronous message exchanges between two processes, and message broadcasting and reduction. The result of this process will lead to the specification of the communication network topology of the architecture. This step is roughly analogous to the assignment step of the classical mapping problem.

• Architecture Specification

Given the results of the previous steps, this step is where we specify the architecture in terms of its organizational parameters. The result is the specification of a parallel processor architecture that is well suited to the requirements of the specified algorithm in terms of algorithm speedup, processor efficiency, system complexity, and programmer burden.
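The coarse/fine decomposition produced by the two analysis stages can be sketched as a simple data structure. This is a hypothetical illustration of the union constraints only (the task names and sets are invented, not from the dissertation): coarse grain analysis yields processes p_1..p_n that together cover the algorithm A, and fine grain analysis splits each p_i into subprocesses that together cover p_i.

```python
# Hypothetical sketch of the decomposition constraints; everything here is
# illustrative. The algorithm A is modeled as a set of tasks.
algorithm = {"t1", "t2", "t3", "t4", "t5", "t6"}           # A

coarse = [{"t1", "t2", "t3"}, {"t4", "t5", "t6"}]          # P = {p_1, p_2}
fine = [[{"t1"}, {"t2", "t3"}],                            # subprocesses of p_1
        [{"t4", "t5"}, {"t6"}]]                            # subprocesses of p_2

def covers(parts, whole):
    """The union constraint: the parts must exactly reconstruct the whole."""
    return set().union(*parts) == whole

assert covers(coarse, algorithm)                           # union of p_i == A
assert all(covers(fine[i], coarse[i]) for i in range(len(coarse)))
```

The communication-pattern descriptions that accompany each stage are not modeled here; they would annotate edges between the sets above.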
4.2.4 Summary

Again, this methodology differs from the "classical" application of the mapping problem in that it does not assume any parallel processor architecture. Thus, the designer can, and is encouraged to, take full advantage of any inherent parallelism within the algorithm. We have found that when applied to algorithms representative of typical high-level computer vision, this approach produces implementations that achieve significant algorithm speedup and processor efficiency via software that resembles the serial implementation of the algorithm and is therefore no more difficult to develop and maintain. This is due to the inclusion of the life-cycle issues of system complexity and programmer burden as design objectives rather than tasks required to implement a design. Furthermore, although our primary interest is in the parallel implementation of high-level vision algorithms, this approach lends itself to the design of parallel implementations of complete computer vision systems (heterogeneous algorithm suites), which can be implemented via a reconfigurable or a heterogeneous parallel processor architecture. It also encourages a thorough analysis prior to a potentially costly investment in a parallel algorithm implementation on an architecture that may not be well suited to the algorithm.
This is of obvious importance in a production environment, where time and money are critical, as well as in a research environment, where machines and algorithms are being developed or where a choice of machines is available and the best machine for the job is to be selected.

In the following chapters we present the results of applying our methodology to two representative high-level vision algorithms, relaxation labelling for image matching and tree search for model-based object recognition, to a heterogeneous algorithm suite for the detection of linear features, and to a computer vision system, perceptual organization.

Chapter 5

Image Matching By Relaxation Labelling

5.1 Introduction

Many problems in computer vision require the assignment of labels to objects. It is often the case that these assignments cannot be made deterministically and, therefore, the problem is reduced to one of finding a consistent labelling. Relaxation algorithms are often used to obtain these consistent labellings. Informally stated, relaxation labelling is an iterative process that attempts to assign labels to objects based on local constraints (i.e. how well the object's description matches the object or class represented by the label) and on global constraints (i.e. how the assignment of a label to an object affects other label/object assignments). The assignments may be made either probabilistically or discretely.

Relaxation techniques have been applied to all levels of processing in computer vision. In low-level vision they have been used for region segmentation [14] and thresholding [89], in mid-level vision for shape segmentation [91] and line labelling [102] [87], and in high-level vision relaxation techniques have been used for object recognition [64] and scene description [20].
One common theme among these implementations is the inherent computational complexity of the relaxation process, O(n^2 m^2) or greater, where n is the number of objects to be assigned labels and m is the number of labels from which to choose. For this reason, researchers have turned to parallel implementations of relaxation algorithms, which are natural candidates for parallelization due to their iterative nature. A variety of parallel architectures for relaxation processing have been proposed, including systolic processing (pipelining) [10] [32] [14] [31], SIMD processing with various interprocessor connection networks [45] [83], and MIMD processing with various interprocessor connection networks [63]. Common to all of these studies (with the exception of [83]) is the simplicity of the "relaxation operator", that is, the operation used in determining the global consistency of a labelling. It is either assumed to be precomputed, is restricted to a local neighborhood surrounding a given object, or utilizes simple logic operations.

In our work we study the parallel implementation of a relaxation algorithm which utilizes higher level, more complex operations. Specifically, the algorithm utilizes symbolic and geometric constraints in its attempt to match line segments extracted from one image to those from another image [64]. The basic technique has been applied to the stereo correspondence problem [65], the motion correspondence problem [74] [27], and the image/model matching problem [64]. Our interest in this particular algorithm stems from its utility, its computational complexity, and the generality of the operations of which it comprises; i.e. we believe that it possesses characteristics common to other algorithms of this class and, therefore, produces results that are generalizable to the implementation of other algorithms.
In the following sections we describe the relaxation algorithm to the detail required by the present context. We then present the steps of our methodology as applied to this algorithm and describe a mapping that achieves significant speedup and efficiency with low system complexity and programmer burden. We describe our implementation of the parallel algorithm and present a performance analysis of the implementation. Finally, we summarize the work. Throughout this chapter we discuss how our analysis and implementation compare to those of the researchers cited above.

5.2 Algorithm Description

The objective of this algorithm is to match objects within two images or an image and a model [64]. The approach used is one of discrete relaxation utilizing symbolic and geometric constraints between objects and labels. We provide an overview of the algorithm with enough detail to discuss our analysis and parallel implementation. For details and explanations beyond the scope of our discussion, the reader should see the referenced work.

The primitives used by the image matching algorithm are linear segments, represented symbolically by their end point coordinates, orientation, and average contrast. Given two sets of linear segments extracted from two images (or an image and a map), the objective is to find correspondences between the segments of each set based on their symbolic descriptions (local constraints) and on the geometric relationships between segments of the same image (global constraints). The set of primitives A = {a_i | 1 ≤ i ≤ n} from one image is called the SCENE and the primitives a_i are called OBJECTS. The set of primitives L = {l_j | 1 ≤ j ≤ m} from the other image is called the MODEL and the primitives l_j are called LABELS. The algorithm proceeds to compute the quantity p(i,j) ∈ {0,1}, which is the POSSIBILITY that object a_i corresponds to label l_j.
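The segment primitive and the symbolic screening that seeds the possibilities can be sketched in C. The struct fields follow the description above; the similarity rule and its thresholds are hypothetical, since the dissertation does not specify them:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Linear-segment primitive: end point coordinates, orientation, and
   average contrast, as described above. Field names are illustrative. */
typedef struct {
    double x1, y1, x2, y2;   /* end point coordinates */
    double orientation;      /* radians */
    double contrast;         /* average contrast */
} Segment;

/* Hypothetical symbolic (local-constraint) test used to seed p^0(i,j):
   an object/label pair is initially possible if orientation and contrast
   are close enough. The thresholds are free parameters, not values taken
   from the dissertation. */
int initially_possible(const Segment *obj, const Segment *lab,
                       double max_dtheta, double max_dcontrast)
{
    double dtheta = fabs(obj->orientation - lab->orientation);
    if (dtheta > M_PI)
        dtheta = 2.0 * M_PI - dtheta;    /* shortest angular distance */
    return dtheta <= max_dtheta &&
           fabs(obj->contrast - lab->contrast) <= max_dcontrast;
}
```

A pair passing this test gets p^0(i,j) = 1; the geometric (global) constraints described next then prune that initial set.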
It is possible that an object has no corresponding label due to occlusion or scene change, that several objects correspond to the same label due to fragmentation, or that an object corresponds to several labels due to merging. The method for computing p(i,j) relies on geometrical constraints; that is, when a label l_j is assigned to an object a_i, we expect to find an object a_h, with a label l_k, in an area defined by i, j, and k. The area is called a WINDOW and is denoted w(i,j,k).

The method for computing w(i,j,k) is as follows. The object a_i is represented by the two dimensional vector A_iB_i and the label l_j by P_jQ_j. By "sliding" l_j over a_i an area is described by the corresponding motion of label l_k, P_kQ_k, figure 5.1. This parallelogram shaped area is the window w(i,j,k). Two object/label assignments, (i,j) and (h,k), are COMPATIBLE, denoted (i,j)C(h,k), if and only if object a_h lies within window w(i,j,k) and object a_i lies within window w(h,k,j).

Figure 5.1: Window construction.

Using these definitions, the algorithm searches for object/label correspondences by first identifying all possible correspondences based on the symbolic descriptions of the objects and labels. This set of correspondences constitutes the possibilities at iteration step 0, p^0(i,j). Subsequent values of p^t(i,j) are computed by the iteration formula:

∀(i,j), p^{t+1}(i,j) = 1 if p^t(i,j) = 1 AND ∃ a subset S of [1,m] (labels) with q elements such that ∀s ∈ S, ∃k ∈ [1,n] (objects) such that p^t(k,s) = 1 and (i,j)C(k,s).

The algorithm halts when ∀(i,j), p^{t+1}(i,j) = p^t(i,j).

The value q is the fit parameter. If a perfect match is desired then its value should be set to m, the number of labels. Otherwise it should be set to a value defined by the desired degree of match between the two images. A flow diagram of the image matching algorithm is provided in figure 5.2.
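Since the window w(i,j,k) is a parallelogram, the test "object a_h lies within window w(i,j,k)" reduces to a point-in-parallelogram check. A minimal sketch, with the window given as a corner point and two edge vectors (the construction of those vectors from a_i, l_j, and l_k is omitted here, and the names are illustrative):

```c
/* Point-in-parallelogram test: the window is described by a corner c and
   edge vectors u, v. Solve p - c = s*u + t*v by Cramer's rule and require
   0 <= s <= 1 and 0 <= t <= 1. */
typedef struct { double x, y; } Vec2;

int in_parallelogram(Vec2 p, Vec2 c, Vec2 u, Vec2 v)
{
    double det = u.x * v.y - u.y * v.x;   /* zero => degenerate window */
    if (det == 0.0)
        return 0;
    double dx = p.x - c.x, dy = p.y - c.y;
    double s = (dx * v.y - dy * v.x) / det;
    double t = (u.x * dy - u.y * dx) / det;
    return s >= 0.0 && s <= 1.0 && t >= 0.0 && t <= 1.0;
}
```

The compatibility relation (i,j)C(h,k) would then apply such a test twice, once for a_h against w(i,j,k) and once for a_i against w(h,k,j).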
5.3 Coarse Grain Analysis

5.3.1 Overview

Recall that the objective of the coarse grain analysis is to identify the primary source of computational complexity within the algorithm, identify parallel constructs that contribute to that complexity, and specify a parallel processor architecture that suits those parallel constructs. In specifying a parallel processor architecture we must address various organizational parameters: Programming model; Processing element type; Processing element coupling; Processor homogeneity; Processor synchronization; and Communication network topology. In the following sections we present the results of the four steps of the coarse grain analysis, 1) control structure analysis, 2) data structure analysis, 3) communication analysis, and 4) architecture specification, and how they influence the specification of each organizational parameter for the relaxation labelling algorithm.

Figure 5.2: Image matching algorithm primary flow.

5.3.2 Control Structure Analysis

In analyzing the control structure of an algorithm our objective is to determine its overall time complexity and to identify the specific structures that dictate that time complexity, typically loop constructs. We call these constructs primary control structures. Identification of the primary control structures will help us to identify independent processes and thus identify areas where parallelism can be applied to provide significant algorithm speedup.

The time complexity of the image matching algorithm is determined as follows. Given a scene containing n objects and a model containing m labels, the maximum number of possible object/label pairs is nm, which occurs when every object is similar to every label.
At each iteration at most one object/label pair is discarded, that is, its possibility is set to 0; therefore, the process converges in at most nm iterations. During each iteration the algorithm computes the possibility of the object/label pair, which is a measure of how well it "fits" with the remaining object/label pairs. In the worst case, this requires investigating nm pairs, where each pair requires, at worst, investigation of nm other pairs to determine global consistency. Therefore, the complexity of the algorithm is O(n^3 m^3). We can assume, without loss of generality, an equal number of objects and labels, m, and then the algorithm time complexity can be expressed as O(m^6). Figure 5.3 shows, shaded, the nested loops that give rise to this time complexity. These constitute the primary control structures.

Figure 5.3: Image matching primary control loops.

Nested within the loops is the possibility computation. As described above, it consists of checking whether or not a given object/label pair has any compatible object/label pairs. This, in turn, requires the computation of a window and the search for an object within it.
Once a candidate o b ject/lab el pair, (a,-,/j), has been queued, the possibility com putation, p(i,j), can proceed as m 2 independent com putations. Each com putation is structured so th at it operates on an isolated d a ta set, th a t is, successive passes through th e inner loops (the possibility com- j p utation) are independent of one another. Thus, the possibility com putation can J 59 Client Servers Figure 5.4: C lient/Server algorithm partitioning. proceed in parallel and has the potential to provide significant algorithm speedup. For these reasons, it provides the basis for our process partitioning scheme. Having selected the possibility com putation as the process on which to p arti tion th e algorithm , we have produced a client/server m odel. T h at is, one process will queue possible ob ject/lab el pairings via the outer two prim ary control loops, constituting th e client, and a set of independent processes will determ ine the possibility of th a t pairing via execution of th e inner two control loops and their encom passed procedures in a distributed fashion, constituting th e servers. Fig ure 5.4 shows the client/server algorithm partitioning. 5 .3 .3 D a ta S tru ctu re A n a ly sis Having identified the possibility com putation as th e task on which to p artitio n the algorithm into processes, we m ust now determ ine the d ata requirem ents of each process. In doing so we will identify th e primary data structures and determ ine an appropriate partitioning of these structures. 60 . L IA a / i W POS SIE LIT E S Figure 5.5: Image m atching prim ai'y d ata structures. For th e image m atching algorithm , three prim ary d a ta structures can be iden tified. T he first two are linear arrays of to symbolic records, one array each for storage of the set of objects and the set of labels. The th ird d a ta stru ctu re is an to x m m atrix of logical values th a t store th e results of the possibility com puta tion, pt(i,j), for each iteration, t. 
Figure 5.5 shows the primary data structures pictorially. Each possibility computation (process) requires two entries from the object array, a_i and a_h, and two entries from the label array, l_j and l_k. All processes receive the same (a_i, l_j) pair, the candidate object/label pair, and each receives a unique (a_h, l_k) pair, an object/label pair that is used in determining the global consistency of the candidate pair. From these inputs the windows w(i,j,k) and w(h,k,j) are formed. The relation (i,j)C(h,k) is then computed by determining whether or not a_h lies within w(i,j,k) and a_i lies within w(h,k,j). A value of 1 is returned if the relation holds; otherwise a value of 0 is returned. The value of p^{t+1}(i,j) is determined by summing the results from all of the individual processes and comparing that sum to the fit parameter q.

If we assume the availability of N = m^2 processing elements, the obvious way of partitioning the data structures is to assign each PE, 0 ≤ p ≤ N − 1, an object/label pair (a_h, l_k) ∈ A × L. If the number of processing elements available is less than m^2, that is, N ≪ m^2, then the most intuitive way to partition the data structures, from a programmer's viewpoint, is to assign each PE, 0 ≤ p ≤ N − 1, a 1/N sized portion of the label array and the entire object array, thus giving each a set

S_p = {(a_h, l_k) | 1 ≤ h ≤ m, p·(m/N) ≤ k ≤ p·(m/N) + m/N − 1}, ∀p : 0 ≤ p ≤ N − 1,

of objects and labels. This creates N horizontal swathes through the possibility matrix, as depicted in figure 5.6. These horizontal swathes constitute our data partitioning scheme.

Figure 5.6: Image matching horizontal swath partitions.

5.3.4 Communication Analysis

Having designed our process and data partitions, we must now identify the inter-process communication required to complete the parallel implementation.
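Under the horizontal-swath partition above, the server that owns a given label follows from a single integer division. A minimal sketch (`owner_of_label` is an illustrative helper, not code from the dissertation, and it assumes N divides m and 0-based label indices):

```c
/* Owner of label k under the horizontal-swath partition: PE p holds
   labels p*(m/N) .. p*(m/N) + m/N - 1 together with every object, so
   ownership is one integer division. Assumes N divides m evenly. */
int owner_of_label(int k, int m, int N)
{
    int labels_per_pe = m / N;
    return k / labels_per_pe;
}
```

Under this scheme the point-to-point send of an updated p(i,j), described in the communication analysis, would target owner_of_label(j, m, N).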
As described previously, a possibility computation requires access to the candidate object/label pair (a_i, l_j), provided by the client process, and the set of possible object/label pairs from which the server processes compute a degree of support. The set of possible pairs is statically distributed among the server processes once, upon algorithm initiation, as described above. Conversely, the pair (a_i, l_j) must be provided to each server dynamically by the client process. This is achieved via a broadcast operation from the client to every server.

Having received (a_i, l_j), each server process computes a degree of support for the pair based on its set of possible object/label pairs (its data partition). Upon completion, each server reports its degree of support to the client, where the individual degrees of support are combined into a single result and the possibility computation p^t(i,j) is completed. This is achieved via a reduction operation from every server to the client.

Finally, the client must report p^t(i,j) to the server process whose data partition includes the pair (a_i, l_j) so that it can update its possibility value. This is achieved via a point-to-point send/receive operation from the client to the particular server.

In summary, our process/data partitioning scheme requires three types of communication: 1) broadcast; 2) reduction; and 3) point-to-point send/receive.

This concludes our analysis of the image matching algorithm. We have described the algorithm, identified its primary control structures, identified its primary data structures, partitioned it into independent processes, and identified all required inter-process communication. Our remaining task is to specify a parallel architecture that is well suited to the requirements identified by our analysis. This is presented in the following section.
We then present an evaluation of the system design through architecture simulation and actual implementation.

5.3.5 Architecture Specification

In specifying a parallel processor architecture we must address various organizational parameters: Programming model; Processing element type; Processing element coupling; Processor homogeneity; Processor synchronization; and Communication network topology. We base our specification of these parameters on the results of our algorithm analysis. In the following paragraphs we address each of these organizational parameters and discuss how they are influenced by the processing requirements of the image matching algorithm.

Programming Model - The image matching algorithm (more specifically, the possibility computation) contains various processing steps that are data dependent; that is, all data items are not processed identically. The Multiple Instruction Multiple Data protocol is best suited to this situation. In this model each processing element can execute code dictated by its particular data items. Conversely, the algorithm could be implemented under the Single Instruction Multiple Data protocol, as demonstrated in [83], but the algorithm speedup and processor efficiency achieved would be reduced due to the synchronous nature of the SIMD protocol.

Processing Element Type - Computation of the compatibility relationship between two pairs of object/label correspondences requires creation of two windows and a search for the objects within their respective windows. These computations require the use of transcendental functions as well as floating point arithmetic (unless integerization is performed). Therefore, the processor utilized must support these computations.
Furthermore, to reduce system complexity and programmer burden, the processor must be programmable in a high-level language that allows specification of the primary data structures in a natural way, that is, via multi-field records. Processors best suited to these constraints are of the complex instruction set variety, such as a general purpose microprocessor.

Processing Element Coupling - As the communication between processes occurs in bursts, that is, at the beginning of each possibility computation (the broadcast) and at the end of each possibility computation (the reduction), a tightly coupled or shared memory system would not suffice because of memory access conflicts. Without special arbitration logic to allow concurrent reading and writing of memory, a communication bottleneck would exist. Better suited to the algorithm is a loosely coupled or message passing architecture. These systems facilitate high bandwidth communication without the requirement of special purpose hardware.

Processor Homogeneity - Our partitioning scheme provides each server process with identical tasks. The client process is computationally similar to the server processes in that it utilizes the same data structures as well as similar logic. Therefore, the parallel architecture should be homogeneous, that is, should consist of a set of identical processing elements. This facilitates programming (reduction of programmer burden) as well as hardware interfacing of processing elements (reduction of system complexity).

Processor Synchronization - In light of the fact that there is computational similarity between all of the identified processes, as well as data dependent processing, the parallel architecture should operate in loosely synchronous mode.
That is, all processes incorporate identical code, with the exception of the client process, but execute under control of their own program counter. Synchronization occurs only at points of communication. As we shall see, this also facilitates programmability of the implementation, which reduces system complexity and programmer burden.

Communication Network Topology - Perhaps the most interesting aspect of a parallel processor architecture is its communication network topology, the processing element interconnect pattern. As we showed via the communication analysis, the image matching algorithm places three constraints on the communication network topology. The first is that it must facilitate an efficient broadcast operation, the second is that it must facilitate an efficient reduction operation, and the third is that it must facilitate an efficient point-to-point send/receive operation. In the following paragraphs we consider each of these constraints.

With regard to the broadcast operation, the ideal message passing architecture is one containing a single common bus to which all processing elements are connected. In this topology a broadcast operation is completed in O(1) time.

With regard to the reduction operation, the ideal algorithm requires O(log n) time, that is, "order no less than log n time", assuming concurrent read and write operations are forbidden [13]. This ideal time is achieved by an algorithm that utilizes a divide and conquer paradigm. The result is obtained by dividing the data set into two halves, finding the two partial results, and combining the partial results to get the final result. The dividing is done recursively until the data sets are indivisible. Such a divide and conquer scheme yields a binary tree with 2n − 1 nodes, with the data items starting at the leaves.
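The pairwise combining schedule of such a reduction tree can be simulated in plain C; the loop below performs the log2(N) combining steps over an array of per-server partial support counts (N is assumed to be a power of two, and the function is an illustration of the schedule, not communication code):

```c
/* Tree reduction of per-server support counts: at each step the element
   at index p + stride is folded into index p, mirroring a message from a
   child to its parent in the binary tree. The stride doubles each step,
   so a power-of-two N takes log2(N) steps overall. */
int tree_reduce_sum(int counts[], int N)
{
    for (int stride = 1; stride < N; stride *= 2)
        for (int p = 0; p + stride < N; p += 2 * stride)
            counts[p] += counts[p + stride];   /* partner folds into p */
    return counts[0];                          /* root holds the total */
}
```

On a real message-passing machine each inner-loop body would be one receive-and-add at a parent node; the outer loop counts the communication rounds, which is where the O(log N) reduction time comes from.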
For the image matching algorithm the data items, the objects and labels used to determine the global validity of a candidate object/label pair, can be distributed among all nodes of the tree, not just the leaves.

With regard to the point-to-point send/receive operation, the ideal message passing architecture is, again, one containing a single common bus to which all processing elements are connected. In this topology a send/receive operation is completed in O(1) time, since the communication within the algorithm is coarse grain and, therefore, bus contention is not an issue.

Figure 5.7: Two logarithmic diameter architectures - Hypercube and Binary Tree.

The reduction operation produces the most stringent constraint dictated by the image matching algorithm. A communication network topology that facilitates this operation will also facilitate the other two, as they are of lower order complexity. Therefore, for parallel implementation of the image matching algorithm, the processing elements should be connected via a binary tree topology. Another suitable, and more popular, architecture is the hypercube. It has qualities similar to those of the binary tree in that it too possesses a logarithmic diameter and high bandwidth and, therefore, can also facilitate the constraint dictated by the reduction operation. Both topologies are depicted in figure 5.7.
To summarize, the organizational parameters of a parallel processor architecture that is well suited to the image matching algorithm should be specified as follows:

• Programming model - MIMD
• Processing elements - Complex Instruction Set Computers
• Processor coupling - Loosely Coupled
• Processor homogeneity - Homogeneous
• Processor synchronization - Loosely Synchronous
• Communication network topology - Logarithmic Diameter

5.4 Performance Analysis

Having completed our parallel implementation of the image matching algorithm, we now present an evaluation of the implementation in terms of algorithm speedup, processor efficiency, system complexity, and programmer burden. The evaluation is performed on the basis of four "data points." First, we use a serial implementation of the algorithm as a baseline with which comparisons can be made. Second, we use a simulation to validate the parallel implementation. Third, we use an actual system implementation utilizing INMOS Transputers [42]. And fourth, we use an actual system implementation utilizing a 32 node Intel iPSC2 Hypercube [44]. Due to the limited number of Transputers available, the Transputer implementation was included to demonstrate the ease of porting the serial code to parallel implementations rather than to quantitatively evaluate algorithm speedup and processor efficiency.

5.4.1 Complexity Analysis

Previously we determined the complexity of the image matching algorithm to be O(m^6), assuming, without loss of generality, an equal number of objects and labels, m. This is due to the nested loop structure of the algorithm, where every object/label pair (a_i, l_j) is checked against every other object/label pair (a_h, l_k) for compatibility within each iteration of the relaxation process.
In our partitioning strategy, we distribute the m^2 compatibility computations for each object/label pair possibility computation evenly among N processing elements. Therefore, m^2 is the maximum number of processing elements that can be utilized. Barring the existence of any data dependencies or overhead, we expect to achieve O(m^2) speedup and complete processor utilization, that is, an efficiency of 1. Unfortunately, both data dependencies and overhead exist.

The data dependencies are such that data items (object/label pairs) are separated into two classes. In one class are the pairs (i,j) that are possible matches, p^t(i,j) = 1, and in the other are the pairs that are not possible, p^t(i,j) = 0. For the class of pairs that are possible, the compatibility computation (i,j)C(h,k) is performed with the candidate object/label pair (h,k). This includes the formation of windows, etc., as described previously. For the class of pairs that are not possible, no computations are performed. Also, note that it is a simple test to determine to which class an object/label pair belongs.

If N = m^2 processing elements are used, one for each (i,j) pair, the system will be dramatically underutilized in all but the extreme problem instance where all objects match all labels. That is, in a typical case a large number of processing elements will be idle throughout the process. Thus, algorithm speedup and processor efficiency will be extremely low.

On the other hand, consider using N = m processing elements and assigning each to a set of object/label pairs (i,j) : 1 ≤ i ≤ n, for a fixed label j. In this case, each processing element is responsible for a single label and all objects that form possible matches with that label. If the distribution of labels and objects is isotropic then all processing elements will have the same amount of work and linear speedup will be achieved.
In general, the algorithm speedup achievable is bounded as described in the following argument, which applies to any number of processing elements used. Let us define P_j to be the set of possible object correspondences for each label l_j, 1 ≤ j ≤ m. For N = m^2 PEs this value will be 0 or 1. For N = m PEs this value will be between 0 and the number of objects. We can then define

P_max = max_{1≤j≤m} |P_j|,    P_avg = (1/m) Σ_{j=1}^{m} |P_j|,    and    k = P_max / P_avg.

The value k is an indication of how evenly the object/label correspondences are distributed. For instance, if every label forms possible correspondences with the same number of objects, k will be 1. Conversely, if one label forms possible correspondences with a large number of objects and the remaining labels form possible correspondences with a small number of objects, then k will be large. Using these definitions, values for algorithm speedup and processor efficiency are bounded by

S^b_N = N/k    and    E^b_N = 1/k.

As an appeal to one's intuition, consider the following two cases. In the first case all sets of object/label pairs contain the same number of possible correspondences, which corresponds to the extreme problem instance mentioned above if N = m^2. Then P_max = P_avg ⇒ k = 1 ⇒ S^b_N = N and E^b_N = 1. This implies that each processing element is assigned the same amount of work. In the second case one set contains more possible correspondences than all of the rest, ∃j : |P_j| ≫ |P_i|, i ≠ j; then P_max ≫ P_avg ⇒ k ≫ 1 ⇒ S^b_N ≪ N and E^b_N ≪ 1. This implies that the processing element assigned the set P_j must do more work than any of the other processing elements.

These values are bounds on the speedup and efficiency due to overhead incurred by the implementation due to data dependencies.
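The imbalance factor k can be computed directly from a hypothesized distribution of the per-label correspondence counts |P_j|; a minimal sketch (the input array is illustrative, not measured data from the dissertation):

```c
/* Load-imbalance factor k = P_max / P_avg over the per-label counts
   |P_j|. The speedup and efficiency bounds then follow as N/k and 1/k. */
double imbalance_k(const int *Pj, int m)
{
    int max = 0;
    double sum = 0.0;
    for (int j = 0; j < m; ++j) {
        if (Pj[j] > max)
            max = Pj[j];
        sum += Pj[j];
    }
    return (double)max / (sum / m);   /* k = P_max / P_avg */
}
```

For a uniform distribution such as {3, 3, 3, 3} this yields k = 1 (the first case above), while a skewed distribution such as {9, 1, 1, 1} yields k = 3, cutting the speedup bound to N/3.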
One must remember that the actual distribution of possible object/label correspondences is dynamic, as it is the goal of the algorithm to reduce this to a canonical set of correspondences via the relaxation operation. Given these bounds, the number of processing elements that can be effectively utilized is equal to the number of labels in the model.

5.4.2 Observed Performance

To measure the actual values of algorithm speedup and processor efficiency we devised three test cases. The first consists of lines extracted from real imagery and matched against themselves; that is, the set of objects is the same as the set of labels. The second is an "extreme" case consisting of two identical sets of vertical lines. In this case k = 1 initially. The third is a set of hand picked labels matched to a set of computer extracted lines. In this case the model consisted of 35 labels and the scene consisted of 643 objects. This is a typical model matching case.

As a preliminary step we ran the test cases on a simulation of the parallel implementation running on a serial machine. In doing so we were able to validate the parallel implementation (the process and data partitioning schemes) in an environment which lends itself to code development and debugging.

We then performed tests using the parallel implementation running on an Intel iPSC2 Hypercube consisting of up to 32 processing elements. The system is expandable to incorporate any number of processing elements without any system redesign. These measurements reflect algorithm speedup and processor efficiency as affected by our process and data partitioning schemes, data dependencies, inherently serial operations, and inter-processor communication overhead. Figures 5.8, 5.9, and 5.10 show the measured speedup and efficiency for each of the cases when instantiated with various problem sizes.
Figure 5.8: Test results for Hypercube implementation when a model is matched against itself. (a) Speedup. (b) Efficiency.

In addition, figure 5.10 shows the labels and objects so that the reader can visualize the problem being solved.

In the two "contrived" cases the "flattening" of the curves is due primarily to the artificial (atypical) distribution of data items. The effects are less prevalent as the problem sizes are increased. For the real model matching scenario, where the number of processing elements is comparable to the number of model labels, the effects are not drastic in spite of the small problem size (35 labels).

In all cases, the speedup achieved is less than optimal. This is due to the communication between processes, the inherent sequential processing of the algorithm, and the very nature of the algorithm. One must remember that the relaxation process is dynamic. That is, in reducing the initial set of possible object/label pairings to the canonical set, pairs will be deemed "not possible" and the processing elements assigned to work on those pairs will become idle for the remainder of the execution time. Although some parallel implementations may fare better than others, the effects caused by data dependencies cannot be avoided when faced with a static data partitioning scheme. Furthermore, analysis
and simulation show that a dynamic scheme is not feasible due to the granularity of the processing, that is, the interleaving of the broadcast, processing, and reduction operations.

Figure 5.9: Test results for Hypercube implementation when all labels match to all objects. (a) Speedup. (b) Efficiency.

Figure 5.10: Test results for Hypercube implementation of a typical model matching scenario. (a) Model labels. (b) Scene objects. (c) Speedup. (d) Efficiency.

Our measurements indicate the following trends in terms of algorithm speedup and processor efficiency for our parallel implementation of the image matching algorithm:

• Speedup and efficiency are dependent on the data distribution. Extreme cases lead to low performance, whereas typical (real) cases lead to significant speedup and efficiency.

• Significant algorithm speedup and processor efficiency are achieved when the number of processing elements is less than or equal to the number of labels (when data dependencies are taken into account).

• In light of the previous items, inter-processor communication does not dominate the implementation.

5.4.3 System Development and Maintenance

System complexity and programmer burden are measures of how closely the parallel implementation of an algorithm resembles the serial implementation. They can also be viewed as the amount of effort (cost) required to realize and maintain the parallel implementation of the algorithm.

In all four of our implementations, the serial implementation, the parallel simulation, the Transputer (binary tree) implementation, and the iPSC2 (hypercube) implementation, the programming language used was 'C' [47].
The implementations were developed incrementally in four steps: 1) code the serial algorithm; 2) modify the serial algorithm code to form the parallel simulation; 3) add Transputer communication routines to the parallel simulation code to form the binary tree implementation; and 4) add iPSC2 communication routines to the parallel simulation code to form the hypercube implementation.

Each step involved only minor modifications to the code developed at the previous step. All algorithm-specific constructs and control structures were left "intact" from serial to parallel implementation. The original serial code was portable to the extent that it only required the inclusion of the machine specific subroutine calls to perform inter-processor communication for parallelization. Figure 5.11 shows the primary code segment for the serial implementation and figures 5.12 and 5.13 show the primary code for the client/server parallel implementation. From the similarities in the two implementations one can conclude that the complexity of, as well as the effort required to implement and maintain, the parallel software is no greater than that of the serial implementation. This is attributable to the fact that the parallel implementation was designed based on the structure of the algorithm, and not on the structure of a prespecified parallel processor architecture.

5.5 Fine Grain Analysis

5.5.1 Overview

Recall that the object of the fine grain analysis is to identify parallelism within the independent processes identified via the coarse grain analysis. The objectives and the technique employed for meeting them are the same as for the coarse grain analysis, the difference being the input to the analysis.
Whereas for the coarse grain analysis the algorithm was the input, for the fine grain analysis the independent processes identified in the coarse grain analysis are the inputs. In the following sections we present the results of the four steps of the fine grain analysis, 1) control structure analysis, 2) data structure analysis, 3) communication analysis, and 4) architecture specification, and how they influence the architecture specified during the coarse grain analysis of the relaxation labelling algorithm.

5.5.2 Control Structure Analysis

In the coarse grain analysis we identified the possibility computation as the process on which to partition the algorithm. Recall that each processing element is

    while (flag) {
        flag = 0;
        /* Iteration of possibilities/compatibilities. */
        for (i = 0; i < number_of_objects; ++i)
            for (j = 0; j < number_of_labels; ++j)
                if (p[j][i]) {  /* if P(t)(i,j) == 1 */
                    card_s = 0;
                    /* Compute the degree of support for */
                    /* the object/label assignment.      */
                    for (k = 0; k < number_of_labels; ++k) {  /* for all labels */
                        make_window(objects[i], labels[j], labels[k], win_ijk);
                        h = 0;
                        found = 0;
                        while ((h < number_of_objects) && (!found)) {
                            if (p[k][h] && in_window(objects[h], win_ijk))
                                found = compatible(objects[i], labels[j],
                                                   objects[h], labels[k]);
                            ++h;
                        }  /* while ((h < number_of_objects) ... */
                        if (found)
                            ++card_s;
                    }  /* for (k = ... */
                    if (card_s < q) {
                        flag = 1;
                        p[j][i] = 0;
                    }  /* if (card_s ... */
                }  /* if (p[j][i] ... */
    }  /* while (flag) ... */

Figure 5.11: Serial code for the image matching algorithm.
    flag = 1;
    while (flag) {
        flag = 0;
        /* Iteration of possibilities/compatibilities. */
        for (i = 0; i < number_of_objects; ++i)
            for (j = 0; j < number_of_labels; ++j)
                if (p[j][i]) {  /* if P(t)(i,j) == 1 */
                    card_s = 0;
                    BROADCAST CANDIDATE OBJECT/LABEL PAIR.
                    /* Compute the degree of support for the object/label        */
                    /* assignment provided by this PE's 1/nth of the label table. */
                    for (k = 0; k < my_share; ++k) {  /* for my share of labels */
                        make_window(objects[i], labels[j], labels[k], win_ijk);
                        h = 0;
                        found = 0;
                        while ((h < number_of_objects) && (!found)) {
                            if (p[k][h] && in_window(objects[h], win_ijk))
                                found = compatible(objects[i], labels[j],
                                                   objects[h], labels[k]);
                            ++h;
                        }  /* while ((h < number_of_objects) ... */
                        if (found)
                            ++card_s;
                    }  /* for (k = ... */
                    REDUCE CONTRIBUTIONS.
                    card_s += reduce('+');
                    if (card_s < q) {
                        flag = 1;
                        p[j][i] = 0;
                    }  /* if (card_s ... */
                    SEND CHANGES TO THE PE WITH PAIR (i,j).
                }  /* if (p[j][i] ... */
    }  /* while (flag) ... */

Figure 5.12: Client code for the parallel image matching algorithm.

    done = 0;
    while (!done) {
        RECEIVE OBJECT/LABEL PAIR VIA BROADCAST.
        /* Compute the degree of support for the object/label        */
        /* assignment provided by this PE's 1/nth of the label table. */
        contribution = 0;
        for (k = 0; k < my_share; ++k) {  /* for my share of labels */
            make_window(objects[i], labels[j], labels[k], win_ijk);
            h = 0;
            found = 0;
            while ((h < number_of_objects) && (!found)) {
                if (p[k][h] && in_window(objects[h], win_ijk))
                    found = compatible(objects[i], labels[j],
                                       objects[h], labels[k]);
                ++h;
            }  /* while ((h < number_of_objects) ... */
            if (found)
                ++contribution;
        }  /* for (k = ... */
        REDUCE CONTRIBUTIONS.
        contribution += reduce('+');
        RECEIVE CHANGES WHEN APPLICABLE.
    }  /* while (!done) ... */

Figure 5.13: Server code for the parallel image matching algorithm.

assigned a set of labels within which it must search for a compatible object. Also recall that each computation requires counting the number of labels lk that can be assigned to objects ah while maintaining global geometric compatibility with the candidate object/label pair, (ai, lj). Geometric compatibility is determined by forming windows and searching for objects within them. Given a candidate object/label pair, (ai, lj), and a label being tested for possible assignment, lk, all objects, ah, are to be tested for compatible assignment of lk, (i,j)C(h,k).

The loop over each object gives rise to the computational complexity of the possibility computation and, therefore, constitutes the primary control structure. But each object can be tested for compatibility independent of all other objects, so each possibility computation can be partitioned into n, the number of objects, independent processes. We shall refer to each of these independent processes as a compatibility computation.
The partitioning of the possibility computation into multiple compatibility computations constitutes the process partitioning for our fine grain analysis step.

5.5.3 Data Structure Analysis

Input to each compatibility computation is the candidate object/label pair, (ai, lj), the label being tested, lk, and an object, ah (one per fine grain processing element p), where 1 ≤ p ≤ n and n is the total number of objects. Thus, each horizontal swath created by the coarse grain partitioning is partitioned into n fine grain partitions to be processed independently.

An interesting observation to be made here is that not all objects ah need be processed. Object/label pairs that have been determined to be "not possible", pt(i,j) = 0, are not processed. Thus, fine grain processors assigned those objects will be idle, resulting in less than optimal algorithm speedup and processor efficiency given N = n fine grain processing elements. In the case where there are N < n processing elements, object/label pairs with pt(i,j) = 0 can be identified and "filtered out" by the coarse grain MIMD processing element prior to being assigned to a fine grain processing element; thus only the object/label pairs which are possible will be processed. This creates a dynamic partitioning scheme.

5.5.4 Communication Analysis

In this partitioning scheme, the only communication required is the broadcast of the candidate object/label pair, (ai, lj), the broadcast of the label under consideration, lk, and the report of the compatibility between the object ah and the candidate pair, (i,j)C(h,k). This communication takes place between the possibility computation and each of its constituent compatibility computations, and no communication is required among compatibility computations.
5.5.5 Architecture Specification

In specifying a parallel processor architecture to match the results of the fine grain analysis we are actually specifying a "sub-architecture" to that specified as a result of the coarse grain analysis. Again, all of the organizational parameters of the architecture must be addressed, but this time with regard to constraints dictated by both the coarse grain architecture and the fine grain analysis.

Programming Model - The compatibility computations (i,j)C(h,k) for all objects ah within a given possibility computation p(i,j) are identical given that the object/label assignments (h,k) are possible prior to the possibility computation. In the event that (h,k) is not possible then processing of that particular pair is bypassed via our dynamic "filtering" scheme. Such processing is ideally suited to the SIMD protocol. With N = n processing elements, some processing elements will not be assigned any work. With N < n processing elements, only object/label pairs that are possible are processed, as described above.

Processing Element Type - The compatibility computation comprises processing that requires arithmetic manipulation of scalar values. Such processing can be achieved by use of relatively simple processing elements with a small amount of local memory.

Processing Element Coupling - As communication between processes is not required, processing element coupling is not a critical issue. But, as we shall see later, a tightly coupled (shared memory) system will be the best choice with regard to the communication requirements between the MIMD processing element, specified by the coarse grain analysis, and the SIMD processing elements specified here.
Processor Homogeneity - As is the case with all SIMD architectures, the architecture for the possibility computation should be homogeneous.

Processor Synchronization - Again, by definition, the SIMD architecture for the possibility computation will execute synchronously.

Communication Network Topology - As specified, the only communication required in partitioning the possibility computation into independent compatibility computations is between the "parent" MIMD process (the possibility computation) and the "children" SIMD processes (the compatibility computations). This is restricted to passing the object ai and the labels lj and lk to each of the SIMD PEs, and an object ah to each of the SIMD PEs, and then AND'ing the results. This can be achieved efficiently through common memory (registers) shared amongst the MIMD processing element and the SIMD PEs without experiencing communication bottlenecks.

To summarize, the organizational parameters of a parallel processor architecture that is well suited to the possibility computation are:

• Programming model - SIMD
• Processing elements - Simple Instruction Set Computers
• Processor coupling - Tightly Coupled
• Processor homogeneity - Homogeneous
• Processor synchronization - Synchronous
• Communication network topology - Shared Memory

Simulations of this implementation show that near linear speedup is obtained. This is typical of applications that operate synchronously, require little or no communication, and receive data via non-conflicting shared memory access.

5.6 A Heterogeneous Architecture

Partitioning schemes similar to that derived via our coarse grain analysis have been used by other researchers in mapping relaxation algorithms onto pure SIMD protocol architectures [10] [32] [14] [31] [45] [83].
The broadcast and reduction communication operations are typically performed systolically to avoid communication bottlenecks. The bounds on algorithm speedup and processor efficiency as derived previously are still applicable. That is, barring any communication overhead, they are S_N = N/k and E_N = S_N/N = 1/k, where N is the number of processing elements and k is an indication of the object/label correspondence distribution as previously defined.

But data dependencies will play a significant role in the actual speedup and efficiency obtained by a pure SIMD architecture. Given m^2 processing elements with one object/label pair assigned to each, many processing elements will be idle due to an object/label pair that doesn't match based on symbolic constraints or is deemed not possible in reducing the initial set of assignments to the canonical set. This will produce significantly less than optimal algorithm speedup and processor efficiency, as described earlier.

Given m or fewer processing elements where each is assigned a set of object/label pairs, it is possible (and probable) that the sets will be "out of phase", that is, element i of one PE's set is "possible" and element i of another PE's set is "not possible." Since the SIMD protocol dictates synchronicity among PEs, all PEs whose element i is "not possible" will remain idle during step i. Once again, this situation will worsen as the initial set is reduced to the canonical set.

In each of these situations the speedup and efficiency will be far less than optimal. Furthermore, the synchronous data movements required by the pure SIMD architectures can be difficult to program due to the strict timing relationships among PEs.
To summarize, a pure SIMD architecture suffers from less than optimal algorithm speedup and processor efficiency and requires high degrees of system complexity and programmer burden when faced with a complex algorithm implementation.

[Diagram: a MIMD processing element coupled with SIMD processing elements.]

Figure 5.14: The structure of a processing element within the heterogeneous architecture.

For this type of algorithm, a MIMD architecture has a clear advantage over a SIMD architecture in that it can effectively handle the "phasing" and "programmability" problems. On the contrary, although a pure MIMD architecture will achieve significant algorithm speedup in spite of data dependencies, it will not take advantage of the potential speedup available due to the fine grain parallelism inherent to the algorithm. Therefore, an architecture well suited to the image matching algorithm in terms of algorithm speedup, processor efficiency, system complexity, and programmer burden should comprise both protocols.

The basic structure of our architecture, as specified in our coarse grain analysis, is of the MIMD protocol. Each of the MIMD processing elements is to be equipped with a vector processor to perform the SIMD style operations of the compatibility computations as specified in our fine grain analysis, figure 5.14. Parallel processor architectures utilizing this type of heterogeneous processing element are currently available (Intel iPSC2/VX [44]).

In this implementation, the SIMD protocol is applied only after data dependencies have been "removed" by the MIMD processing elements. Unlike a pure SIMD implementation that uses static data partitions, which make them especially prone to data dependencies, with this scheme the MIMD processing element can dynamically "filter out" objects that cannot possibly be assigned a given label rather than leaving a SIMD PE idle while others are processing.
And, since each MIMD PE is working on a subset of the original problem and communication between a MIMD PE and its SIMD PEs is via shared memory, the scheme is not overwhelmed by additional, non-algorithm related processing or coding.

With this approach we can effectively cope with data dependencies (via the MIMD architecture) yet provide an efficient means to compute object/label assignment compatibilities (via the SIMD architectures) while maintaining a simple communication structure. Furthermore, the specialized SIMD code is restricted to a small part of the algorithm where it is most effective. The overall structure of the algorithm (serial implementation) is not altered.

Although we did not implement our heterogeneous architecture, we expect it to maintain performance curves of similar shape to those presented previously but shifted "upward", that is, higher degrees of algorithm speedup and processor efficiency due to exploitation of the fine grain parallelism. This is because simulation of the "fine grain" parallelism showed that near linear speedup is achieved independent of the coarse grain processing, thus increasing throughput. Of course, it is difficult to quote algorithm speedup and processor efficiency for a heterogeneous architecture as both measures are dependent on the number of processing elements, but it is not clear how simple and complex processing elements should be weighted in the count.

5.7 Summary

Parallel implementation of relaxation labelling algorithms has been performed by various researchers [10] [32] [14] [31] [45] [83], some utilizing the SIMD protocol, others utilizing the MIMD protocol, all but [83] incorporating simple relaxation operators. All claim to achieve optimal or near optimal algorithm speedup and processor efficiency. As a result of our work, we have shown that data dependencies do not allow this due to the very nature of the problem.
We confirmed this through simulation and actual parallel implementation. In order to cope with these data dependencies we derived a parallel processor architecture, via application of our methodology, that is well matched to the inherent data dependencies within the relaxation labelling algorithm. The particular algorithm that we studied utilizes a complex relaxation operator and is applicable to a variety of computer vision problems. We achieved significant algorithm speedup and processor efficiency with low system complexity and programmer burden. We did not alter the structure of the algorithm or devise elaborate data partitioning and communication schemes. Such results are significant when considering the computational complexity of the algorithm as well as its complex structure.

By using our two stage approach we were able to design (specify) an architecture that: 1) facilitates the iterative and dynamic nature of all relaxation labelling algorithms; 2) facilitates the complex relaxation operator specific to our algorithm; and 3) meets our design objectives of significant algorithm speedup and processor efficiency with minimal system complexity and programmer burden. Furthermore, the resulting architecture can be constructed from readily available components and is applicable to other processes that may constitute a complete computer vision system incorporating this relaxation labelling algorithm.

Chapter 6

Object Recognition By Tree Searching

6.1 Introduction

Perhaps the most fundamental problem in computer vision is object recognition, that is, specification of the identity of an unknown object. Object recognition is often performed by comparing the descriptive features of the unknown object to those of a set of known objects called models.
T he unidentified object is assigned th e identity of th e m odel object whose features and relations between features m ost resem ble its own. Two m ajor factors influence th e perform ance and com petence of a recognition system . One is th e m ethod em ployed to describe the object and m odels, the other is th e m ethod employed to perform and quantify the com parisons between the object and models. These factors can be sum m arized i as follows (from [17]): t j • Description — How do we represent th e 3-D objects? W hat features should ! be extracted from a scene (originally viewer centered) in order to describe physical properties of objects and th eir spatial inter-relationships? i • Matching — How do we establish the correspondences betw een scene fea tures and object models in order to recognize objects in a complex scene? In the context of a com plete com puter vision system , object description is per form ed by low and mid-level processes and m atching (recognition) is perform ed by high-level processes. We focus our attention on the m atching process. j | i i I 85 i . - - _ * | In general, th e m atching of an object to a m odel is perform ed by generating a 1 j search tree, either im plicitly or explicitly, whose nodes represent possible feature i . i | correspondences betw een the two then searching th e tree for a consistent set i ; of feature correspondences which is sufficient to determ ine w hether or not the object and the m odel represent th e same physical entity. Various approaches to th e search problem have been form ulated. These include depth first search, breadth first search, best first search, and m ethods which incorporate heuristics in order to system atically reduce the size of th e search tree [93]. j T he m ethod used in the search process is generally a function of th e problem 1 a t#hand. T h at is, some problems m ay only require a solution th a t is “good ! ! 
enough” and therefore, heuristics m ay be applied to elim inate branches of th e ] search tree th a t do not “appear prom ising” and others m ay be form ulated such ' 1 th a t th e “p o ten tial” of a branch can be com puted prior to traversing it and one | branch th a t “appears m ore prom ising” th an another can be searched first. B ut, if | th e “b est” solution to a problem is to be found, th e entire tree m ust be searched j and all possible solutions exam ined. In any case the search problem is, in general, . a tim e consum ing process as, in its purist form , it is a m em ber of th e NP-complete ■ class of problem s [24]. • The instance of search th a t we investigate here utilizes a tree search to com- j pare two a ttrib u ted graphs in order to determ ine a common subgraph betw een j th e two th a t is “good enough” based on a predefined threshold. T he search tree | is built by com paring symbolic attrib u tes of the nodes of each graph to deter- \ m ine candidate correspondences. T he search proceeds by traversing th e tree in search of a branch containing a set of consistent node correspondences where consistency is determ ined by the sim ilarity of symbolic attrib u tes on the links (edges) of each graph and the geom etric alignm ent of the objects th a t th e nodes j represent. A match value is assigned to th e set and com pared to a threshold to I determ ine w hether or not to continue the search. The threshold is representative of th e degree of m atch required in order to m ake a confident recognition decision. If th e use of such a threshold is undesirable, the tree is searched exhaustively, I retaining the set of node correspondences w ith th e largest m atch value. 8 6 Use of tree search, a ttrib u ted graphs, symbolic prim itives, and geom etric alignm ent in this m anner is a common approach to th e problem of object recog nition in com puter vision [30] [75] [73]. 
To tolerate the inherent time complexity of the search, researchers often resort to the use of domain specific heuristics to reduce the search space and accept "good" approximations of the actual solution [17]. An alternative approach is to utilize parallel processing.

Attributable to its fundamental applicability to computer vision, graph matching (which utilizes search techniques) was included in the proposed set of tasks for the DARPA Image Understanding Architecture evaluation effort [104]. The formulation of the problem utilizes symbolic attributes on the graph nodes (scene objects) to determine similarity as well as symbolic attributes of the graph links (spatial relationships) to determine a subgraph isomorphism. Parallel depth first search, as it applies to solution of the 15-puzzle problem, is investigated in [50] [69] [70]. Although this is not a vision problem, the basic structure of the search algorithm is applicable to vision problems. The authors present a scheme for dynamic work distribution (load balancing) and show results on a variety of distributed memory MIMD machines. A parallel depth first search algorithm implemented on a shared memory MIMD machine is described in [41]. The system was applied to a set of combinatorial optimization problems. In [49] the authors cast the search space problem as a branch and bound formulation and provide two parallel implementations of such problems. Results of searching a game tree obtained via simulation of a distributed memory MIMD machine are presented. In [100] the authors specify a method for solving branch and bound in parallel and present a specialized architecture for implementation of the method. Finally, in [59] the authors present a scheme for performing graph matching on the connection machine.
The work presented was only a preliminary design and the authors admit that an actual implementation would require the solution of some very difficult problems.

Common to all of these studies is the simplicity of the consistency checks in determining a set of consistent node correspondences, that is, simplicity of the link attributes. They are either logical comparisons of two scalar values or merely a check for the existence of a link between two nodes. Such simplicity allows the system designers to use the knowledge that all nodes have the same "cost to expand" to devise a straightforward work distribution (load balancing) scheme. It has been demonstrated that vision systems require more complex relationships to be represented by the graph links in order to provide robust recognition capabilities [30] [75] [73]. Such relationships do not lend themselves to simple, straightforward solutions.

In our work we study the parallel implementation of a graph matching algorithm that uses complex link attributes as well as simple node and link attributes. Specifically, the algorithm utilizes simple attributes, such as object size and orientation, to determine candidate node correspondences, then applies geometric constraints to determine consistent correspondences. Computation of the geometric constraints is nontrivial and consumes a large amount of time in expanding a node of the search tree. This approach is similar to those of [30] [75] [73]. Therefore, we believe that this algorithm is representative of the class of visual recognition algorithms that utilize tree search techniques and contend that our results are generalizable to the implementation of these other algorithms.

In the following sections we describe the 3-D object recognition algorithm to the detail required by the present context.
We then present the steps of our methodology as applied to this algorithm and describe a mapping that achieves significant speedup and efficiency with low system complexity and programmer burden. We describe our implementation of the parallel algorithm and present a performance analysis of the implementation. Finally, we summarize the work. Throughout this chapter we discuss how our analysis and implementation compares to that of the researchers cited above.

6.2 Algorithm Description

The objective of this algorithm is to recognize 3-D objects within a scene by comparing them to a database of models [19]. The approach used is multistage attributed graph matching. We provide an overview of the algorithm with enough detail to discuss our analysis and parallel implementation. For details and explanations beyond the scope of our discussion, the reader should see the referenced work.

The primitives used to represent 3-D objects within the scene and models are surface patches. Using dense range data as input, extraction and description algorithms describe the surface patches by a second order polynomial [18] [19] and identify and classify their boundaries as either Jump or Crease. The result of the surface patch extraction and description algorithms is a symbolic representation of the scene in the form of an attributed graph whose nodes represent the surface patches and whose links express geometric relationships between surface patches. From this representation, individual surface patches can be grouped into objects and refined descriptions of the patches and their boundaries provided. Specifically, each surface patch is represented by a node, ni, with attributes:

• Visible Area, A(i)
• Orientation, Ω(i)
• Average of Principal Curvatures, K1(i) and K2(i)
• Estimated Ratio of Occlusion, R(i)
• Centroid of Inertia, C(i).
The geometric relationships between pairs of surface patches connected by one or more boundaries (represented by nodes ni and nj) are represented by a link, lij, with attributes:

• Type of Adjacency, t(i,j) (occlusion, convex, concave)
• Connection Possibility, p(i,j)

Detailed descriptions of the attributes can be found in the referenced work. The point to be made here is that objects within the scene and the models are represented by attributed graphs made up of nodes and links containing both symbolic and geometric attributes.

[Diagram: views of models and a scene object Sj enter the screening stage, which outputs an ordered list {M1(j), M2(j), ..., Mp(j)}, p ≤ 5, of top candidates for Sj.]

Figure 6.1: Screener functional block diagram.

To achieve 3-D object recognition in the presence of occlusion, multi-view surface models are used. Each model consists of several views taken so that most of the significant surfaces of the model objects are contained in at least one view. Thus, the database of known objects, the models, consists of model objects M1, M2, ..., MN. Each model, Mi, consists of several views, Mi1, Mi2, ..., where a view is represented by an attributed graph as described above. In [17] each model is represented by approximately four views (graphs) and each view contains an average of twelve surfaces (nodes). The model database comprises eight models. With this database of models, the problem of recognizing scene objects is reduced to finding, within the database, the model view that is most similar to the scene object. The scene object is then recognized as the model which contains the most similar view.

The recognition process is performed by three modules: the screener, the graph matcher, and the analyzer. Functional block diagrams of these modules are shown in figures 6.1, 6.2, and 6.3, respectively.
Screener: This module identifies candidate model views that are most likely to match each object in the scene by comparing properties of the attributed graphs. These properties include:

• Number of Nodes (Surface Patches)
• Number of Planar Nodes (Planar Surface Patches)
• The Visible 3-D Area of the Largest Node (Largest Surface Patch).

Figure 6.2: Graph matcher functional block diagram. (Model view M and scene object S are compared: do they match? Yes, No, or Maybe; on failure the next model view is tried, otherwise searching stops.)

Figure 6.3: Analyzer functional block diagram. (Poor matches for object S_j: try to merge with existing matches. Unmatched objects S_j: try to split S_j into subgraphs and process each subgraph.)

Output from the screener is a list of candidate views (at most 5) sorted in ascending order of difference from the scene graph.

Graph Matcher: This module performs an extensive comparison between the graphs of the candidate model views and the scene object. Graph comparisons are performed by the steps: 1) compute all pairs of similar model and scene nodes, (m_i, s_j), to form the search tree; 2) perform a shallow search of the tree to identify a "core" common subgraph, of cardinality 4, between the model and scene graphs; 3) perform a deep search of the tree to identify a common subgraph between the model and scene graphs which contains the core subgraph found previously and whose match value exceeds the predetermined recognition threshold; 4) perform fine modifications of the largest common subgraph to include or eliminate matched pairs based on strong constraints.

The sequential algorithm incorporates a depth first search as a means for performing both the shallow and deep searches for the following reasons:

1. The tree is of finite (bounded) depth.
2. Depth first search makes efficient use of storage (linear in the depth of the tree).
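A minimal screener sketch: rank model views by a difference measure over the three properties listed above and keep the most similar candidates. The normalized difference used here is an assumption for illustration; the dissertation does not give the exact measure, and the property values are made up.

```python
def screen(scene_props, model_views, k=5):
    """Rank model views by a (hypothetical) normalized difference over
    (node count, planar-node count, largest visible 3-D area) and
    return the k most similar views, most similar first."""
    def diff(a, b):
        return sum(abs(x - y) / max(x, y, 1e-9) for x, y in zip(a, b))
    ranked = sorted(model_views.items(), key=lambda kv: diff(scene_props, kv[1]))
    return [name for name, _ in ranked[:k]]

# (nodes, planar nodes, largest area) per model view -- illustrative values.
views = {'M1_v1': (12, 5, 40.0), 'M1_v2': (11, 4, 38.0), 'M2_v1': (30, 1, 5.0)}
top = screen((12, 5, 39.0), views, k=2)
```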
The constraints used to determine the common subgraph are:

• Compatibility Between Nodes of the Model View and the Scene Graph (ξ0): Model node m_i and scene node s_i are ξ0-compatible if they possess
a) Similar 3-D Visible Area
b) Similar Average Curvature K1
c) Similar Average Curvature K2
Similarity between attributes is determined using a normalized difference measure.

• Compatibility Between Two Pairs of Matching Nodes: When a pair of matching nodes, (m_i, s_i), is selected, it is compared to each of the previously selected matched pairs, (m_j, s_j), using a compatibility constraint made up of the following consistency checks:

a) Uniqueness Consistency (ξ1): (m_i, s_i) and (m_j, s_j) are ξ1-compatible if and only if m_i ≠ m_j and s_i ≠ s_j.

b) Connection Consistency (ξ2): Let l_1 and l_2 represent the links between m_i and m_j and between s_i and s_j, respectively. (m_i, s_i) and (m_j, s_j) are ξ2-compatible if and only if l_1 and l_2 are of similar type.

c) Direction Consistency (ξ3): Let θ_1 and θ_2 denote the angles between the orientations of m_i and m_j and between s_i and s_j, respectively, and let θ = |θ_1 − θ_2|; then (m_i, s_i) and (m_j, s_j) are ξ3-compatible if and only if θ < θ_atol, a predefined, fixed angular tolerance.

d) Distance Consistency (ξ4): Let L_1 and L_2 denote the distances between the centroids of inertia of m_i and m_j and between s_i and s_j, respectively. Let

L = |L_1 − L_2| / max(L_1, L_2);

then (m_i, s_i) and (m_j, s_j) are ξ4-compatible if and only if L < L_dtol, a predefined, fixed distance tolerance.

e) 3-D Geometry Consistency (ξ5): For all matched pairs (m_k, s_k), k ≠ i and k ≠ j, let u_ij, v_ij, u_ik, and v_ik represent the vectors connecting the centroids of m_i to m_j, s_i to s_j, m_i to m_k, and s_i to s_k, respectively. Let θ_1 and θ_2 denote the directed angle from u_ij to u_ik and from v_ij to v_ik, respectively, and let θ = |θ_1 − θ_2|.
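Three of the cheaper consistency checks translate directly into code. The function names are illustrative; the tolerances are the θ_atol and L_dtol parameters of the text, and the predicates follow the definitions above.

```python
def uniqueness_ok(mi, si, mj, sj):
    """xi_1: (m_i, s_i) and (m_j, s_j) are compatible iff m_i != m_j
    and s_i != s_j."""
    return mi != mj and si != sj

def direction_ok(theta1, theta2, theta_atol):
    """xi_3: the angle difference |theta_1 - theta_2| must stay below
    the fixed angular tolerance."""
    return abs(theta1 - theta2) < theta_atol

def distance_ok(L1, L2, L_dtol):
    """xi_4: the normalized centroid-distance difference
    L = |L_1 - L_2| / max(L_1, L_2) must stay below the fixed
    distance tolerance."""
    return abs(L1 - L2) / max(L1, L2) < L_dtol
```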
Then the three pairs (m_i, s_i), (m_j, s_j), and (m_k, s_k) are ξ5-compatible if and only if θ < θ_gtol, a predefined, fixed geometry tolerance.

f) Enclosure (ξ6): (m_i, s_i) and (m_j, s_j) are ξ6-compatible if and only if m_i encloses (is enclosed by) m_j and s_i encloses (is enclosed by) s_j. Any two pairs, (m_i, s_i) and (m_j, s_j), that are ξ6-compatible are accepted regardless of whether they fulfill the consistency conditions listed above, with the exception of ξ1, the uniqueness condition.

If these constraints are met, we say that (m_i, s_i) and (m_j, s_j) are mutually consistent.

• Geometric Transform (ξ7): Computing the geometric transform between matched objects not only indicates how to bring them into correspondence, but also helps to verify the matching process. Based on the location and orientation of all matched node pairs, (m_j, s_j), the geometric transform which brings the model view into registration with the scene can be computed. This transformation can then be applied to matched node pair (m_i, s_i) to determine whether or not it is consistent with the geometry of the previously matched pairs. If it is, (m_i, s_i) is said to be ξ7-compatible with the set of previously matched nodes and added to the set. That is, the geometric transform between the matched model view and the scene is computed incrementally as surface patches are matched. Details on how the geometric transform is computed can be found in [19]. In short, it involves computing directed angles between vectors, intersections of vectors, intersections of planes, and determination of the angle θ that minimizes the equation

E_θ = Σ_i Θ(R(θ)P_i, Q_i),

where Θ(a, b) denotes the directed angle between vectors a and b and R(θ)P_i denotes the resultant vector when vector P_i is rotated by θ about a given axis.

If the compatibility constraints are not satisfied, the chosen pair, (m_i, s_i), is discarded.
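The θ-minimization in ξ7 can be illustrated by a brute-force search in 2-D; the dissertation rotates 3-D vectors about a given axis, and restricting attention to the plane perpendicular to that axis gives the same structure. The grid search below is an assumption for illustration, not the method of [19].

```python
import math

def angle_between(a, b):
    """Directed angle Theta(a, b) between 2-D vectors a and b, in radians."""
    return math.atan2(a[0] * b[1] - a[1] * b[0], a[0] * b[0] + a[1] * b[1])

def rotate(v, theta):
    """R(theta)v: rotate 2-D vector v by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def best_rotation(P, Q, steps=3600):
    """Grid search for the theta minimizing
    E_theta = sum_i |Theta(R(theta) P_i, Q_i)|."""
    def err(theta):
        return sum(abs(angle_between(rotate(p, theta), q)) for p, q in zip(P, Q))
    return min((k * 2.0 * math.pi / steps for k in range(steps)), key=err)

# Q is P rotated by 0.5 rad, so the search should recover roughly 0.5.
P = [(1.0, 0.0), (0.0, 1.0)]
Q = [rotate(p, 0.5) for p in P]
theta = best_rotation(P, Q)
```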
Output from the graph matcher is a list of corresponding nodes (surface patches) between the matching model view and the scene objects and the geometric transformation between the two.

Analyzer: This module performs a final analysis of the subgraph produced by the graph matcher. Its purpose is to rectify any unsatisfactory matches resulting from inaccuracies in the matching constraints by splitting objects that were inadvertently joined or by merging pieces of a single object that was split due to occlusion, shadowing, or viewing angle. The method used is to hypothesize object splits and merges based on the relationships between nodes and then apply the graph matcher to verify the hypotheses.

Output from the analyzer is similar to that of the graph matcher.

6.3 Coarse Grain Analysis

6.3.1 Overview

Recall that the objective of the coarse grain analysis is to identify the primary source of complexity within the algorithm, identify parallel constructs that contribute to that complexity, and specify a parallel processor architecture that suits those parallel constructs. In specifying a parallel processor architecture we must address various organizational parameters: programming model; processing element type; processing element coupling; processor homogeneity; processor synchronization; and communication network topology. In the following sections we present the results of the four steps of the coarse grain analysis, 1) control structure analysis, 2) data structure analysis, 3) communication analysis, and 4) architecture specification, and how they influence the specification of each organizational parameter for the tree search algorithm.

6.3.2 Control Structure Analysis

The time complexities of each component of the 3-D object recognition algorithm are derived as follows.
• Screener - Let N be the number of models to be compared to the scene and V be the number of views per model. Then the total number of model views to be compared to the scene is M = NV. For the examples presented in [17], N = 8, V ≈ 4 → M ≈ 32. The objective of the screener is to reduce the number of candidate model views to five by comparing the properties of the attributed graphs that represent each model view to those of the scene. This is performed by computing a set of values for each model view/scene graph pair, resulting in a time complexity of O(M) steps, each step consisting of simple scalar value comparisons.

• Graph Matcher - As a result of the screener task, the graph matcher is applied at most five times, once for each candidate model view. Without loss of generality, let the number of nodes in the candidate model and scene graphs be the same, and equal to n. In [17] n ≈ 12. Computation of similar model/scene graph node pairs is performed by comparing all model graph nodes to all scene graph nodes in terms of simple attributes. This is performed, in the worst case where all nodes are similar, in O(n^2) steps, each consisting of simple scalar comparisons. In [17] n^2 ≤ 144, but the system restricted the number of node expansions to 100 to limit processing time. Let p ≤ n^2 be the number of possible model/scene node correspondences identified by comparing simple node attributes. This leads to a complete p-ary search tree of depth p, of size

(p^p − 1) / (p − 1)

nodes. The shallow search (to identify a core subgraph) is performed by traversing the tree to a depth of four. The time complexity of the traversal is O(P_shallow), where

P_shallow ≤ (p^4 − 1) / (p − 1)

node expansions, each consisting of simple scalar comparisons and complex geometric computations.
The search may be terminated early via the use of the recognition threshold, although one reason for performing the search in parallel may be to avoid such usage. For the database used in [17] typical values are 4 ≤ P_shallow ≤ 20. Having identified a core subgraph between the model and scene graphs via the shallow search, the number of matching pairs is reduced to p_t = p − 4 and the size of the search tree is reduced to

(p_t^p_t − 1) / (p_t − 1)

nodes. The deep search (to identify a "good enough" common subgraph) is then performed by traversing the pruned tree in O(P_deep) node expansions, also consisting of simple scalar comparisons and complex geometric computations. Again, the search may be terminated early through the use of the recognition threshold. For the database used in [17] typical values are 2 ≤ P_deep ≤ 100. Finally, the fine modifications to the best common subgraph are performed by checking the node correspondences included in the subgraph, as well as those not included, against a set of strong constraints. This is performed in O(p) steps consisting of simple scalar comparisons and complex geometric computations.

• Analyzer - The time complexity analysis of the analyzer component follows that of the graph matcher in that it hypothesizes various alternative graph configurations (by splitting and merging objects) then applies the graph matcher to validate the hypotheses. The hypotheses are formed via simple comparisons between matched node pairs (split hypothesis) and unmatched node pairs (merge hypothesis).

As the discussion shows, the time complexity of the 3-D object recognition algorithm is driven by the graph matcher component. This is confirmed by timing results presented in [17]. Not only does it have the largest time complexity of the three components, but it also comprises the most complex operations.
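The tree-size expressions above are easy to sanity-check numerically. The value p = 10 below is an arbitrary example; [17] reports p values up to about 100.

```python
def tree_size(p, depth=None):
    """Nodes in a complete p-ary search tree: (p**d - 1) // (p - 1),
    where d is the cut-off depth (d = p for the full tree)."""
    d = p if depth is None else depth
    return (p ** d - 1) // (p - 1)

p = 10                                 # candidate node correspondences
shallow_bound = tree_size(p, depth=4)  # bound on shallow-search expansions
pt = p - 4                             # pairs left once the core is fixed
pruned_size = tree_size(pt)            # pruned tree for the deep search
```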
It performs symbolic comparisons, simple arithmetic computations, and complex geometric constructions in determining the compatibility of node correspondences. In performing the tree search (traversal), each node of the search tree can be expanded independent of all others. Thus the tree traversal can proceed in parallel and has the potential to provide significant algorithm speedup. Therefore, it provides the basis for our process partitioning scheme.

With this partitioning scheme, each independent process will determine the best common subgraph that can be obtained from its node. Upon completion of the expansion of all nodes along a single branch, the match value of the branch is computed. When a branch (common subgraph) with a match value larger than the recognition threshold is found, the search terminates.

6.3.3 Data Structure Analysis

Having identified the tree traversal as the task on which to partition the algorithm into processes, next we must identify the data requirements of each process. The primary data structure for the graph matching component is the search tree. This is created (implicitly) from a linear array of size p, the number of matched node pairs, for the shallow search and of size p_t = p − 4 for the deep search. Each process requires a copy of the entire array. As the number of nodes in each of the graphs is generally of order of magnitude 10 (per test results in [17]), this does not place a serious memory constraint on the processing elements. If we assume N = P_shallow (N = P_deep) processing elements, then each PE_i, 0 ≤ i ≤ N − 1, receives a single node to expand. If we do not have this many processing elements, each PE is assigned ⌈P_shallow/N⌉ (⌈P_deep/N⌉) nodes to expand. As we shall see, the distribution of nodes is critical in determining algorithm speedup and processor efficiency, and this static partitioning does not lead to an optimal solution.
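The static scheme just described assigns each PE a fixed block of ⌈nodes/N⌉ root nodes; a minimal sketch:

```python
def static_partition(nodes, n_pes):
    """Assign each PE a contiguous block of ceil(len(nodes)/n_pes) nodes."""
    per_pe = -(-len(nodes) // n_pes)   # ceiling division
    return [nodes[i * per_pe:(i + 1) * per_pe] for i in range(n_pes)]

# 10 tree nodes spread over 4 PEs: blocks of 3, 3, 3, and 1.
parts = static_partition(list(range(10)), 4)
```

Note that the scheme balances node counts, not node expansion costs, which is exactly the weakness the performance analysis later exposes.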
6.3.4 Communication Analysis

Given the process and data partitions described above and the two stage search algorithm employed by the graph matcher, the required communication operations are a reduction and a broadcast. Upon completion of all node expansions along a branch, distributed results must be combined to determine the match value of the subgraph along that branch. This is accomplished via a reduction operation utilizing a summation operation. Upon completion of all branch traversals and determination of all common subgraphs, the best common subgraph can be determined via a reduction operation utilizing a maximum operation (rather than retaining any subgraph whose match value exceeds the recognition threshold). Also, upon completion of the shallow search, the core subgraph must be broadcast to all processing elements so that it can be used as the basis for tree pruning and the deep search.

6.3.5 Architecture Specification

As stated earlier, in specifying a parallel processor architecture we must address various organizational parameters: programming model; processing element type; processing element coupling; processor homogeneity; processor synchronization; and communication network topology. Specification of these parameters is based on the results of our algorithm analysis. In the following paragraphs we address each of these organizational parameters and discuss how they are influenced by the processing requirements of the 3-D object recognition algorithm.

Programming Model - The graph matching component of the 3-D object recognition algorithm contains various processing steps that are data dependent; that is, a node expansion within the search tree is dependent on data values at the node. The Multiple Instruction Multiple Data (MIMD) protocol is best suited to this situation.
Processing Element Type - Determination of the compatibility of two node correspondences requires the computation of a set of compatibility constraints ξ1-ξ7. These computations rely on the use of transcendental functions as well as floating point arithmetic. Therefore, the processor utilized must support these computations. Furthermore, to reduce system complexity and programmer burden, the processor must be programmable in a high-level language that allows specification of the primary data structures in a natural way, that is, via multi-field records. Processors best suited to these constraints are of a powerful nature, such as a general purpose microprocessor.

Processing Element Coupling - The communication between processes is coarse grain; that is, processes spend much more time performing algorithmic computations than they do performing communication. Therefore, a loosely coupled, message passing architecture is well suited to this algorithm.

Processor Homogeneity - Our partitioning scheme provides each process with identical tasks. Therefore, the parallel architecture should be homogeneous; that is, it should consist of a set of identical processing elements. This facilitates programming (reduction of programmer burden) as well as hardware interfacing of processing elements (reduction of system complexity).

Processor Synchronization - In light of the fact that there is computational similarity between all of the identified processes as well as data dependent processing, the parallel architecture should operate in loosely synchronous mode. That is, all processes incorporate identical code, with the exception of the one process that receives the final result of the reduction operation and then broadcasts it back out. Synchronization occurs only at points of communication.
As we shall see, this also facilitates programmability of the implementation, which reduces system complexity and programmer burden.

Communication Network Topology - The graph matching component of the 3-D object recognition algorithm places two constraints on the communication network topology. The first is that it must facilitate an efficient reduction operation and the second is that it must facilitate an efficient broadcast operation. In the following paragraphs we consider each of these constraints. With regard to the reduction operation, in the previous chapter we showed that the lower bound time complexity is Ω(log n), which is achieved via a divide and conquer approach. With regard to the broadcast operation, we showed that it could be performed in O(1) steps.

Once again the reduction operation produces the most stringent communication constraint. A communication network topology that facilitates this operation will also facilitate the broadcast operation, as it is of lower order complexity. Therefore, for parallel implementation of the 3-D object recognition algorithm, the processing elements should be connected via a logarithmic diameter topology.

To summarize, the organizational parameters of a parallel processor architecture that is well suited to the 3-D object recognition algorithm should be specified as follows:

• Programming model - MIMD
• Processing elements - Complex Instruction Set Computers
• Processor coupling - Loosely Coupled
• Processor homogeneity - Homogeneous
• Processor synchronization - Loosely Synchronous
• Communication network topology - Logarithmic Diameter

In addition to the items listed above, the graph matcher component of the 3-D object recognition algorithm is a candidate for a pipelined architecture, as it incorporates a two stage process, the shallow search and the deep search.
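The divide and conquer reduction behind the Ω(log n) bound can be simulated directly: each pass combines neighbouring pairs, so n values take ⌈log2 n⌉ communication steps when each combine runs on a different PE. The summation and maximum reductions of section 6.3.4 differ only in the combining operator; the sketch below is illustrative, not the actual hypercube code.

```python
def tree_reduce(values, op):
    """Pairwise (divide-and-conquer) reduction: repeatedly combine
    neighbours; with one value per PE, each pass is one communication
    step, giving ceil(log2 n) steps in total."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        vals = [op(vals[i], vals[i + 1]) if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

partials = [3, 1, 4, 1, 5, 9, 2, 6]                       # per-PE partial results
total, s1 = tree_reduce(partials, lambda a, b: a + b)     # branch match value
best, s2 = tree_reduce(partials, max)                     # best common subgraph
```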
As there are multiple model views to be compared to the scene view, one stage of the pipeline will be performing the shallow search on model view M_{i+1} while the other stage is performing the deep search on model view M_i. The two stages of the graph matcher process, the shallow and deep searches, comprise identical algorithms. The difference between them is that they operate on different size search trees (due to pruning after the shallow search) and utilize different cut-off depths. Thus, the number of processing elements that each can effectively utilize is different. Although actual values of P_shallow and P_deep cannot be predicted prior to algorithm execution, the relationship between them is defined by the algorithm, as presented earlier. Clearly the relation P_shallow ≤ P_deep holds as the number of matched pairs increases. Therefore, the number of processing elements that can be effectively utilized in each of the searches is different. Figure 6.4 shows the resulting two stage pipelined hypercube architecture.

Note that the reduction operation proceeds across the first stage, with the final result, the best core subgraph from the shallow search, ending up in the "common" processing element. The broadcast operation then emanates from the common node to the second stage of the pipeline. Finally, a reduction operation to determine the best common subgraph proceeds across the second stage, leaving the result in the "end" processing element.

Figure 6.4: Two stage pipelined architecture for the 3-D object recognition algorithm.

6.4 Performance Analysis

Having completed our parallel implementation of the 3-D object recognition algorithm, we now present an evaluation of the implementation in terms of algorithm speedup, processor efficiency, system complexity, and programmer burden.
As was done for the relaxation labelling algorithm, the evaluation is performed on the basis of four implementations. First, we use a serial implementation of the algorithm as a baseline with which comparisons can be made. Second, we use a simulation to validate the parallel implementation. Third, we use an actual system implementation utilizing INMOS Transputers [42]. And fourth, we use an actual system implementation utilizing a 32 node Intel iPSC2 Hypercube [44]. Again, due to the limited number of Transputers available, this implementation was included to demonstrate the portability of the serial code to parallel implementations as opposed to quantitatively evaluating algorithm speedup and processor efficiency.

6.4.1 Complexity Analysis

The time complexity of the 3-D object recognition algorithm is O(P_shallow) (O(P_deep)), as derived above. This is due to the tree search used in determining a common subgraph between the model and scene attributed graphs whose match value exceeds the recognition threshold.

In partitioning the tree search, we distribute the P_shallow (P_deep) tree nodes evenly among the N processing elements. Therefore, barring the existence of any data dependencies or overhead, we expect to achieve O(N) speedup and complete processor utilization. But, as was the case in the relaxation labelling algorithm, both data dependencies and overhead exist. Unlike the relaxation labelling algorithm, where the data dependencies could be expressed quantitatively, the data dependencies that exist in the 3-D object recognition algorithm are a function of the varying degrees of difficulty in computation of the compatibility constraints, ξ1-ξ7, and thus are better discussed qualitatively.
Recall that the compatibility constraints are used to determine the compatibility of a node correspondence with other node correspondences already included in the common subgraph being created. In other words, the consistency checks are used to determine whether or not the nodes along a branch of the tree are all mutually compatible. Computation of the compatibility constraints ranges from a simple scalar comparison of two integers to the determination of the 3-D geometric transformation between the matched model/scene graph nodes. In the event that a processing element receives many nodes that are incompatible based on the simple constraints, that PE will complete its task relatively quickly, since there is no need to perform the computations required to check the complex constraints. Conversely, a PE that receives nodes that are compatible, or can only be deemed incompatible based on complex constraints, will require more time to complete its task.

Not only are these data dependencies difficult to quantify, but they cannot be predicted prior to tree traversal. This means that the loading across processing elements will, most likely, be unbalanced, thus providing less than O(N) speedup. One possibility for coping with this situation is to design a partitioning strategy based on preprocessing the search tree, utilizing the simple compatibility constraints for pruning, and then using only the complex constraints during the actual search. Although this may produce better results for algorithm speedup and processor efficiency, it will require algorithm restructuring that will increase the system complexity and programmer burden, which is inconsistent with our objectives.
As we will show in the next section, there is another solution that will provide significant algorithm speedup and processor efficiency but minimal increase in system complexity and programmer burden.

6.4.2 Observed Performance

Prior to actually running the algorithm on a parallel processor architecture, we simulated the parallelism on a serial machine in order to validate the implementation. We then ported the parallel simulation to an Intel iPSC2 Hypercube consisting of up to 32 processing elements. The implementation is such that any number of processing elements can be used without any redesign or restructuring. The measurements reflect algorithm speedup and processor efficiency as affected by our process and data partitioning schemes, data dependencies, inherently serial operations, and inter-processor communication overhead.

The test cases used include a single model view matched against an identical scene view in which the only differentiating attribute between nodes is the surface orientation. These test cases are extreme as compared to those used in [17] in that they cause a larger number of nodes to be expanded; that is, they contain more ambiguity. But this graph matching algorithm is applicable to problems other than that of 3-D object recognition, e.g., aerial scene analysis, where ambiguities do exist and large numbers of nodes do have to be expanded in order to determine the final result. Therefore, these test cases are actually not so extreme. Take, for instance, the "real" image test case presented as input to the relaxation labelling algorithm. In that case the model consisted of approximately 30 labels and the scene consisted of over 600 objects. Although the attributes must be altered, the graph matching algorithm is a valid approach to the solution of that problem.
Objects with varying numbers of surfaces were tested, of which results for the two extreme cases are depicted in figure 6.5. The figure shows that good speedup was achieved for the small problem while poor speedup was achieved for the large problem. Again, the only difference in the two test cases is the size of the graphs.

Figure 6.5: Test results for hypercube implementation of the tree search. ((a) Speedup and (b) efficiency versus number of processing elements, for small and large problem sizes.)

This inconsistency is due to the data dependencies described earlier. Table 6.1 shows timing statistics for the same two test cases. The wide variance in PE execution times is indicative of the unbalanced load and is attributable to the data dependencies. This, in turn, accounts for the inconsistent speedup and efficiency results.

As stated earlier, one (unacceptable) approach to achieving consistency in the algorithm speedup and processor efficiency measures is to preprocess the search tree using the simple constraints, partition this preprocessed tree, and then perform the tree search using the complex constraints. This solution was deemed unacceptable because of the increased system complexity and programmer burden that it dictated. Another (acceptable) approach exists. Recall that the communication pattern of the parallel implementation is coarse grain. This means the communication network is not being used very often. This, coupled with the fact that a static data partitioning provides an unbalanced processing load, leads us to an approach utilizing dynamic load balancing.
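A processor farm of the kind adopted here can be modelled as greedy list scheduling: the next node to expand always goes to the PE that falls idle first. The task costs below are made-up numbers chosen to mimic the data-dependent imbalance; the point of the sketch is only that pulling work dynamically evens out the load that a static split leaves skewed.

```python
def farm_schedule(task_costs, n_workers):
    """Simulate a processor farm: each node expansion is handed to the
    worker that goes idle first; returns per-worker total busy time."""
    busy = [0] * n_workers
    for cost in task_costs:
        w = busy.index(min(busy))      # worker that becomes idle first
        busy[w] += cost
    return busy

costs = [9, 1, 9, 1, 9, 1, 9, 1]       # cheap and expensive node expansions
static_split = [sum(costs[0::2]), sum(costs[1::2])]  # worst-case static split
dynamic_split = farm_schedule(costs, 2)
```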
PEs   Mean   Std-Dev   Variance   Maximum   Minimum   Range
1     545    0         0          545       545       0
2     272    5         25         276       269       7
4     136    3         9          138       132       6
8     68     1         1          70        66        4
16    34     1         1          35        33        2
32    17     2         4          20        14        6

1     3475   0         0          3475      3475      0
2     1734   306       93825      1951      1517      434
4     871    206       42640      1180      759       421
8     437    148       21970      803       365       438
16    218    105       11038      611       181       430
32    109    51        2696       325       86        239

Table 6.1: Timing statistics from static partitioning of the tree search.

In moving from the static to the dynamic data partitioning scheme, each step of our analysis is still valid. Realize that it was this analysis that ultimately led us to this approach. Therefore, we use the same architecture specification (the algorithm requirements have not changed) but in a different manner. Instead of assigning a predetermined set of nodes to each processing element, we use the PEs as a processor farm. All nodes to be expanded are kept in one processing element and, as PEs become idle, they request a node for expansion. When node branches are expanded, the reduction operation is performed. In this scheme only the communication software is modified; the algorithm related software is left unchanged.

Figure 6.6 shows the results of the dynamic partitioning implementation and table 6.2 shows the timing statistics. For the small problem set, the (initially acceptable) results are left essentially unchanged as the number of processing elements increases. For the large problem set, the (initially poor) results are dramatically improved. The initial dip in the efficiency curves occurs because in the dynamic scheme one PE is responsible for serial processing and data management and does not participate in the "algorithmic" processing. As the number of PEs increases, the amount of "algorithmic" processing taking place in parallel at any
given time overshadows the amount of "farm" processing.

Figure 6.6: Test results for dynamic hypercube implementation of the tree search. ((a) Speedup and (b) efficiency versus number of processing elements, for small and large problem sizes.)

Based on the limited ambiguity among objects and the number of nodes in the graphs, for the database of models and scenes presented in [17] the number of processing elements that can be effectively utilized is approximately 8 for the shallow search and 16 for the deep search. For larger, more ambiguous problems the graphs show effective utilization beyond 32 processing elements. The actual number should be kept less than the total number of nodes in the search tree to assure that the PEs spend more time computing than communicating with one another.

6.4.3 System Development and Maintenance

As previously stated, system complexity and programmer burden are measures of how closely the parallel implementation of an algorithm resembles the serial implementation. This can be equated with the amount of effort (cost) required to realize and maintain the parallel implementation of the algorithm.

PEs   Mean   Std-Dev   Variance   Maximum   Minimum   Range
1     545    0         0          545       545       0
2     545    0         0          545       545       0
4     182    0         0          182       182       0
8     78     2         4          79        76        3
16    37     2         4          38        34        4
32    18     2         4          19        15        4

1     3475   0         0          3475      3475      0
2     3475   0         0          3475      3475      0
4     1160   4         16         1163      1154      9
8     497    4         16         501       492       9
16    232    5         25         240       226       14
32    113    4         16         120       106       14

Table 6.2: Timing statistics from dynamic partitioning of the tree search.

As was the case with the relaxation labelling algorithm, the programming language used for all four of our implementations (the serial implementation, the parallel simulation, the Transputer implementation, and the iPSC2 implementation) was 'C' [47].
Also, the implementations were developed incrementally in four steps as was done for the relaxation labelling algorithm: 1) code the serial algorithm; 2) modify the serial algorithm code to form the parallel simulation; 3) add Transputer communication routines to the parallel simulation code; and 4) add iPSC2 communication routines to the parallel simulation.

Each step involved only minor modifications to the code developed at the previous step. All algorithm-specific constructs and control structures were left "intact" from serial to parallel implementation. The original serial code was portable to the extent that it only required the inclusion of the machine-specific subroutine calls to perform inter-processor communication for parallelization. From the portability of the code between different processor architectures (both sequential and parallel) one can conclude that the complexity of, as well as the effort required to implement and maintain, the parallel software is no greater than that of the serial implementation. This is attributable to the fact that the parallel implementation was designed based on the structure of the algorithm, and not on the structure of a prespecified parallel processor architecture.

6.5 Fine Grain Analysis

6.5.1 Overview

Recall that the objective of the fine grain analysis is to identify parallelism within the independent processes of the coarse grain analysis. The objectives and the technique employed for meeting them are the same as for the coarse grain analysis, the difference being the input to the analysis. Whereas in the coarse grain analysis the algorithm was the input, in the fine grain analysis the independent processes identified in the coarse grain analysis are the inputs.
In the following sections we present the results of the four steps of the fine grain analysis, 1) control structure analysis, 2) data structure analysis, 3) communication analysis, and 4) architecture specification, and how they influence the architecture specified during the coarse grain analysis of the tree search algorithm.

6.5.2 Control Structure Analysis

In the coarse grain analysis we identified the graph matcher component to be the determining factor of the computational complexity of the 3-D object recognition algorithm. Within the graph matcher we determined the tree search to be partitionable into independent processes and showed that significant algorithm speedup and processor efficiency is achieved using this scheme.

In expanding a node each process must apply the compatibility constraints to determine its mutual consistency with other nodes along its branch. Recall that those constraints range from simple comparisons to complex calculations. The most complex of the constraints is the computation of the Geometric Transformation between the scene and model objects, GT. Computation of the geometric transformation requires finding directed angles between vectors, intersections of vectors, intersections of planes, and a search for the angle θ that minimizes the equation

    ε = Σ_i Θ(R(θ)P_i, Q_i)

where Θ(a, b) denotes the directed angle between vectors a and b, R(θ)P_i denotes the resultant vector when vector P_i is rotated by θ about a given axis, and Q_i is the corresponding target vector. In computing GT each pair of matched model and scene objects, p_k = (m_i, s_j), in the current set of mutually consistent matched pairs, MCMP = { p_k | 1 ≤ k ≤ N }, where N is the current size of the set, is considered. The size of the set is dynamic as it is the goal of the graph matching algorithm to incrementally build the set, culminating in the best set for a given branch.
There are various subprocesses involved in performing the computation, ranging in complexity from O(N) to O(N^4). The subprocesses are performed sequentially as the computation of subprocess S_i is dependent on the results of subprocess S_{i-1}, but within each subprocess, each p_k is processed independently. Upon completion of the last subprocess, results derived independently from each p_k are combined to form a single result, the scene to model object transformation. Since it is computationally complex and comprises independent processing (possesses potential for significant algorithm speedup and processor efficiency), the Geometric Transformation constraint computation is the process to be partitioned as a result of our fine grain analysis.

6.5.3 Data Structure Analysis

Input to the geometric transformation constraint computation is the current set of matched pairs, MCMP, as described above. In processing each matched pair, p_k ∈ MCMP, independently, each process requires access to every other matched pair, p_j ∈ MCMP, j ≠ k, in order to compute appropriate vector orientations, plane intersections, etc. But, access to the entire MCMP set at one time is not required. Furthermore, access to each p_j is order independent, that is, it does not matter in what order process k accesses pair p_j. This provides for a very simple partitioning of, and access to, the current mutually consistent matched pairs set, MCMP.

6.5.4 Communication Analysis

With the partitioning scheme described, processes k : 1 ≤ k ≤ N need not communicate with one another. The only communication required is between the "parent" GT computation and the constituent processes.
6.5.5 Architecture Specification

In specifying a parallel processor architecture to match the results of the fine grain analysis we are actually specifying a "sub-architecture" to that specified as a result of the coarse grain analysis. Again, all of the organizational parameters of the architecture must be addressed, but this time with regard to constraints dictated by both the coarse grain architecture and the fine grain analysis.

Programming Model - In computing the geometric transformation, GT, between matched scene and model objects each matched pair, p_k, in the current set of mutually consistent matched pairs is treated identically. Therefore, the operations required to compute the geometric transformation are well suited to the SIMD protocol for parallel computation. There are no data dependencies.

Processing Element Type - The computation of GT comprises processing that requires arithmetic manipulation of scalar values. Such processing can be achieved by use of relatively simple processing elements with a small amount of local memory.

Processing Element Coupling - As communication between processes is not required, processing element coupling is not a critical issue. But, as we shall see later, a tightly coupled (shared memory) system will be the best choice with regard to the communication requirements between the MIMD processing element, specified by the coarse grain analysis, and the SIMD processing elements specified here.

Processor Homogeneity - As is the case with all SIMD architectures, the architecture for the geometric transformation computation should be homogeneous.

Processor Synchronization - Again, by definition, the SIMD architecture for the geometric transformation computation will execute synchronously.
Communication Network Topology - As specified, the only communication required in partitioning the geometric transformation computation into independent compatibility computations is between the "parent" MIMD process (the computation of GT) and the "children" SIMD processes (one for each p_k). This is restricted to supplying each SIMD processing element with inputs and receiving their outputs. As described earlier, each process k does require access to the p_j, j ≠ k, values assigned to other processes, but that access is sequential and order independent. Therefore, communication between the MIMD process and the SIMD processes can be efficiently achieved via common memory (registers) shared amongst the MIMD processing element and the SIMD PEs by allowing each SIMD process k to access values p_j, j ≠ k, in an interleaved manner, thus avoiding memory location contention. Results of the SIMD processes can then be combined by the MIMD process producing the final result of the GT computation.

To summarize, the organizational parameters of a parallel processor architecture that is well suited to the geometric transformation constraint computation are:

• Programming model - SIMD
• Processing elements - Simple Instruction Set Computers
• Processor coupling - Tightly Coupled
• Processor homogeneity - Homogeneous
• Processor synchronization - Synchronous
• Communication network topology - Shared Memory

Simulations of this implementation show that near linear speedup is obtained. Again, as we saw with the fine grain processing of the relaxation labelling algorithm, this is typical of applications that operate synchronously, require little or no communication, and receive data via non-conflicting shared memory access.
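The interleaved access pattern can be written as a simple index schedule: at step t, process k reads matched pair (k + t) mod N, so in any given step no two processes touch the same shared memory location, and over N - 1 steps each process visits every pair other than its own. The function name and the indexing convention are illustrative assumptions.

```c
/* Hedged sketch of the interleaved shared-memory schedule described
   above: process k, at step t, reads pair (k + t) mod n.  For t > 0 the
   indices produced across all k in a given step are pairwise distinct,
   which is what avoids memory location contention. */
int interleaved_index(int k, int t, int n)
{
    return (k + t) % n;
}
```

With this schedule the parent MIMD process never needs to copy the MCMP set; it only gathers the per-pair outputs once the last step completes.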
6.6 A Heterogeneous Architecture

As we showed in previous sections, for the graph matcher component significant algorithm speedup and processor efficiency can be achieved through the parallelization of the tree search process and the use of dynamic data partitioning. Other researchers have also demonstrated this [50] [69] [70] [41] [49] [100]. The primary difference between the applications of other researchers and ours is the heterogeneous nature of our algorithm. Most researchers have considered problems composed of large search trees with simple compatibility constraints, e.g., the 15-puzzle, whereas our algorithm contains a wide variety of constraints with respect to complexity of calculation (typical of high-level vision). Knowledge of these simple constraints allows them to statically allocate work loads to each processing element as the time to expand each tree node is the same and known a priori. Even in the dynamic partitioning scheme presented in [50] the "splitting strategy", a critical parameter in determining algorithm speedup and processor efficiency, can be predetermined knowing that the cost of expanding a node is identical for every node.

Also, others have considered problems where the solution can be recognized as soon as it is found and all solutions are equally good. For instance, any series of tile movements that produces the sequential order is considered a good solution to the 15-puzzle. Thus, the search can be terminated early without concern for having found an optimal solution. Various heuristics and optimal search methods can be applied, leading to super-linear algorithm speedup (often referred to as speedup anomalies [57]). This occurs when, in parallel depth first search, one of the processing elements encounters the goal state and terminates the search earlier than the sequential processor does.
This is demonstrated in [50], where the authors present parallel implementations of depth first and IDA* searches. In our problem a good solution is determined by an empirically defined recognition threshold. Some good solutions may be better than others and, in performing the search in parallel, a different solution may be found than was found by the sequential algorithm. This is a situation that may or may not be acceptable, to be determined by the algorithm designer. Such non-deterministic behavior can be avoided by finding the best (optimal) solution through searching the tree exhaustively. This also eliminates the reliance on the recognition threshold. The search can proceed efficiently in a hybrid breadth/depth first manner (as opposed to a pure depth, breadth, or best first).

Although the primary computational complexities of the search algorithms studied by us and others are the same, in the simple constraint problems it is solely the magnitude of the search tree that drives the execution time upward. In our problem (and high-level vision problems in general), where complex constraints are utilized, both the computation of the constraints and the magnitude of the search tree contribute to the execution time. Therefore, a parallel processor architecture that facilitates the search of the tree must be specified, as was done via our coarse grain analysis, as well as one that facilitates the computation of the constraints, as was done via our fine grain analysis.

We showed that the traversal of the tree is well matched to the MIMD hypercube architecture. We also showed that the computation of the geometric transformation constraint is well matched to a SIMD architecture without any requirements for communication between processing elements. Therefore, the
processing elements well suited to the requirements of the graph matcher component of the 3-D object recognition algorithm are the same as those for the relaxation labelling algorithm. Each of the MIMD processing elements is to be equipped with a vector processor to perform the SIMD style operations of the compatibility computations as specified in our fine grain analysis, figure 6.7. Again, communication between the MIMD PE and its associated SIMD PEs is via shared memory. Parallel processor architectures utilizing this type of heterogeneous processing element are currently available (Intel iPSC2/VX [44]).

Figure 6.7: The structure of a processing element within the heterogeneous architecture.

In this implementation the SIMD protocol is applied to the computation of compatibility constraints, specifically the geometric transformation constraint, while the MIMD protocol is used to implement the dynamic data partitioning required by the tree traversal process.

As with the relaxation labelling algorithm, we did not implement our heterogeneous architecture. But, based on results of the simulation of the "fine grain" architecture, we expect to maintain the same algorithm speedup and processor efficiency curves presented previously but to reduce our overall execution time. That is, the fine grain implementation achieves near linear speedup independent of the coarse grain processing and, therefore, should not alter the shape of its performance curves. Also, as the SIMD portions of the algorithm are restricted to algorithmic constructs that are data intensive rather than computationally intensive, the basic structure of the code is not altered and therefore system complexity and programmer burden are not increased.
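The division of labor within such a heterogeneous processing element can be sketched as follows. The shared buffer, the stand-in constraint compat_cost, and the combining rule are illustrative assumptions, not the actual constraint computations; the point is only the control flow: a data-parallel phase applied identically to every pair, followed by an MIMD-side combine over shared memory.

```c
#define MAX_PAIRS 32

/* Hedged sketch of the heterogeneous PE of figure 6.7: the MIMD side
   drives the tree search and hands per-pair constraint work to a routine
   written in a data-parallel (SIMD/vector) style over a shared buffer. */
typedef struct {
    double in[MAX_PAIRS];   /* shared memory: per-pair inputs  */
    double out[MAX_PAIRS];  /* shared memory: per-pair results */
    int    n;               /* current number of matched pairs */
} shared_buf;

/* Stand-in for one compatibility-constraint evaluation. */
static double compat_cost(double x) { return x * x; }

/* "SIMD" phase: one identical operation applied across all pairs. */
void simd_constraint(shared_buf *b)
{
    for (int i = 0; i < b->n; i++)
        b->out[i] = compat_cost(b->in[i]);
}

/* "MIMD" phase: combine the independent results into a single value. */
double mimd_combine(const shared_buf *b)
{
    double sum = 0.0;
    for (int i = 0; i < b->n; i++)
        sum += b->out[i];
    return sum;
}
```

Because the SIMD phase touches only the shared buffer and each index exactly once, replacing the serial loop with a vector unit changes the execution time but not the structure of the surrounding search code, which is the basis of the low-burden claim above.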
6.7 Summary

Parallel processing of the tree search process has been studied by various researchers [50] [69] [70] [41] [49] [100] as it is an integral part of a variety of systems ranging from puzzle solving to object recognition. Each of these studies, as well as our own, results in partitioning of the problem into independent subprocesses that each traverse a set of branches of the tree. Of primary difference between the algorithms considered by others and that considered by us is the complexity of the compatibility constraints used in determining a problem solution. Others have studied systems that utilize simple comparisons whereas our system utilizes a range of computations from simple comparisons to complex calculations. In the realm of visual object recognition, it has been shown that systems must utilize complex constraints in order to be robust in what they can recognize as well as tolerant to expected scene distortions [30] [75] [73].

To determine the processing requirements of the algorithm, the regular processing of the tree traversal and the data dependent processing of the constraint computations, we applied our two stage analysis and, through it, specified a parallel processing architecture that is well suited to those requirements. In doing so we arrived at conclusions similar to those of other researchers (relative to the tree traversal) but were able to adapt those conclusions to the more difficult problem at hand (the diverse nature of the compatibility constraints). We were able to achieve significant algorithm speedup and processor efficiency while maintaining low system complexity and programmer burden.

Such results are important to the general class of search problems described but are especially important to the specific instance of the problem we studied.
Our algorithm makes extensive use of heuristics in order to increase processing throughput (reduce execution time) [17]. In general, this is the reason for incorporating heuristics into an algorithm design. The algorithm's author showed the "robustness" of the heuristics empirically by comparing results of the sequential heuristic based implementation with those of a simulated parallel non-heuristic based implementation on a given data set. There is no guarantee that the heuristics will be equally robust for a different data set. This is very common when applying heuristics to the solution of a search problem. Therefore, speedup and efficiency via parallel implementation of the algorithm are necessities if a non-heuristic implementation is to be considered feasible, that is, one that incorporates an exhaustive search. Also, one must realize that parallel solutions to heuristic based search algorithms operate non-deterministically. That is, many solutions may be "good enough" and the first one found is what is reported by the sequential algorithm. In the parallel implementation, the order in which solutions are found may change due to processing requirements along the branches containing the solutions. Therefore, a different "good enough" solution may be reported. Again, this can be alleviated via exhaustive search for the "best" solution.

We have contended that as high-level vision algorithms are complex and intricate, it is necessary that their parallel implementation not dictate alterations to the structure of the algorithm that will increase the possibility of altering its performance. In this case one might argue that the performance has been altered due to the non-determinism. But, given that the solution produced is still within algorithm specifications, we can also argue to the contrary.
Furthermore, the implementation provides enough algorithm speedup that the entire argument can be avoided by moving to an exhaustive search solution. Also, this is achieved with low cost, that is, with low amounts of system complexity and programmer burden.

Chapter 7

Linear Feature Extraction

7.1 Introduction

In the previous two chapters we presented parallel implementations of two high-level vision algorithms, relaxation labelling and tree search. The algorithms were considered in isolation, that is, were treated as stand-alone tasks without regard to how they would interface to low and mid-level vision tasks. In this chapter we consider a task that comprises two basic algorithms. The task is to extract linear segments from the input image, a fundamental process to many computer vision applications. The two algorithms that make up the task are edge detection and linear segment approximation, with each algorithm consisting of various sub-algorithms.

Edge detection algorithms have been proposed by many researchers; [5] provides a survey of this research. Common to these are the detection of edges via a convolution operation, the removal of "false" edges via a thinning operation, and the removal of "weak" edges via a thresholding operation. Upon detection of edges, a linear approximation algorithm is performed to group the detected edges into line segments. The algorithm proceeds by linking neighboring edges of similar orientation into contours, then approximating the shape of the contours with piecewise linear segments. This process is generally performed via an iterative end-point fit algorithm [15]. Edge detection and linear segment approximation are classified as low and mid-level vision respectively. And, holding true to form, the edge detection process overwhelms a sequential computer via the vast amount of data to be processed.
Similarly, the linear approximation process overwhelms a sequential computer via the large amount of data and the required numeric and symbolic operations. Therefore, the process of linear feature extraction is a good candidate for parallel implementation.

Various researchers have presented parallel implementations of stand-alone edge detection algorithms [59] [54] [3] [104] but few have investigated parallel implementation of the entire linear feature extraction process. In [95] the authors present an implementation of the entire process on the Image Understanding Architecture [106]. The implementation utilizes the custom designed gated-connection network (GCN) of the CAAPP (SIMD) level to perform the actual algorithm steps and the ICAP (MIMD) level to provide control signals. The implementation, although based on the typical approach described above, is very complex and is fully dependent on custom hardware designs to provide global communication between otherwise locally connected SIMD processing elements. The results presented are based on analysis and simulation.

In [99] the authors present an implementation based on a small number of powerful MIMD processing elements. Their approach is to divide the image into N equal size strips and assign one strip to each of N PEs. Through actual implementation, their findings show that the process of linking edge pixels into contours is the most time consuming step as well as the most difficult to perform in parallel.

Our approach to developing a parallel implementation of the linear feature extraction process is to apply our methodology to each of the constituent algorithms to determine suitable parallel architectures. We then utilize the methodology to determine a suitable interface between the algorithms.
In the following sections we describe the linear feature extraction algorithms to the detail required by the present context. We present the application of our methodology to the individual algorithms that constitute the process and discuss how those individual results affect the design of a parallel processor architecture well suited to the system as a whole. Finally, we present results of simulation of our implementation and summarize the work.

7.2 Process Description

The objective of the linear feature extraction process is to extract linear segments from the input image. Input to the process is the 2-D image array of pixels and output is a list of linear segments with attributes of end-point locations, length, orientation, and contrast. The approach employed is one of multiple (heterogeneous) steps including edge detection, edge thinning, edge linking, contour extraction, and piecewise linear approximation. Details of the entire process can be found in [72]. Briefly stated, the following steps are performed:

• Edge detection - convolve the input image with a kernel or set of kernel masks.

• Edge thinning - compare each edge to its neighbors (in the directions orthogonal to the edge's orientation) and retain the edge if its magnitude is greater than the magnitudes of its neighbors and greater than a fixed threshold.

• Edge linking - compare each edge to its neighbors (in the direction of the edge's orientation) and form a link to the neighbors if they are of similar orientation.

• Contour extraction - extract (from the 2-D image array) the edge locations ((x, y) coordinates) and save them in an ordered (contour) list.
• Linear approximation - join the end points of the contour list with a single line approximation, mark the point of maximum error for this approximation (creating two approximating line segments), and iterate the process on the two new segments until the maximum error is within an acceptable bound.

The processing steps and resultant data representations are depicted in figure 7.1. Figure 7.2 shows an example of an input image, the detected/thinned edges (contours), and the resultant linear segments.

Figure 7.1: Processing steps and data representations of the linear feature extraction process.

Figure 7.2: Input image (airfield), contours, and resultant linear features. (b) Detected edges (contours). (c) Extracted lines.

Of primary interest is the contour extraction step. It is this step that bridges the gap between the numeric (pixel) representation and the symbolic (linear segment) representation. Prior to this step, data is represented as numbers in a 2-D array.
Our justification j for this is th a t the coarse grain analysis defines th e stru ctu re of the “p rim ary” ; I parallelism of the algorithm s and this structure is not affected by th e results of th e fine grain analysis. It is this prim ary stru ctu re th a t defines th e interfaces i betw een interacting algorithm s. 7 .3 .2 C o n tro l S tru ctu re A n a ly sis i Table 7.1 shows th e tim e com plexities of each of th e constituent algorithm s. T he , tim e com plexities are predicated on the following concepts. T he edge detection, ! edge thinning, and edge linking steps require every pixel in the im age to be processed in a uniform m anner. Edge detection requires a convolution w ith a > fixed size kernel, edge thinning and linking require com parisons of an edge w ith ! each of its neighbors. T he contour extraction step requires a scan of the image ; i plane to detect contour start points followed by a traversal of each contour from its start to its end point. T he linear approxim ation step requires traversing the length of each contour detecting points of inflection. T he prim ary control structures for each processing step are presented in the following pseudo code. 123 1 A lgorithm Com plexity Com m ents Edge D etection 0 (z 2) i : i x i pixel image Edge Thinning 0 ( i 2) i : i x i pixel im age Edge Linking 0(* 2) i : i x i pixel im age Contour E xtraction 0(i2l) i : i x i pixel im age I : average contour length Linear A pproxim ation 0(nl) I : average contour length n : num ber of contours Table 7.1: Tim e com plexities of th e steps th a t constitute th e linear feature ex traction process. 
(edge detection)
for each pixel
    convolve with kernel

(edge thinning)
for each edge
    compare neighbor edges and retain if it is of greater magnitude

(edge linking)
for each edge
    link to neighbor edges of similar orientation to form contours

(contour extraction)
for each contour
    extract constituent edge coordinates to form contour lists

(linear approximation)
for each contour list
    fit piece-wise linear segments
Figure 7.3 depicts the two prim ary d ata structures used by th e linear feature extraction process. I 7 .3 .4 C o m m u n ica tio n A n a ly sis C om m unication among independent processes of th e edge detection, edge th in ning, and edge linking steps is restricted to local neighborhoods. Each process m ust obtain the pixel value of its neighbors in order to com plete its com putation. For th e edge detection step th e size of th e neighborhood is dependent on th e size of the convolution kernel, 5 x 5 for the N evatia-Babu linear feature extractor [72], I selectable for the Canny edge detector [9]. For the edge thinning and edge linking j i 1 1 1 1 1 1 1 1 IE / HEAD 2 1 2 r m 8T8 r NULL T Figure 7.3: N um eric and symbolic d ata representations (structures) used in the linear feature extraction process. ! steps th e neighborhood is 3 x 3. Messages are single valued and com m unication can proceed synchronously among all processes. For th e contour extraction step, no com m unication among processes is re quired b u t, recall th a t access to the entire 2-D array is required. As we shall see later, when we specify the parallel architecture for the entire process, this situation will change. P artitioning of the low-level algorithm s (edge detection, thinning, and linking) will have adverse effects on th e contour extraction algo rithm . For th e linear approxim ation step, no com m unication among processes is re quired. Each contour can be processed independently of all others. 
7.3.5 A r c h ite c tu r e S p e cifica tio n Given the local neighborhood com m unication requirem ents, the synchronous na tu re of the com m unications, the sim plicity of the required processing, and the d ata independence of the algorithm s, the architecture for the edge detection, edge thinning, and edge linking steps of th e linear feature extraction process should com prise sim ple processing elem ents th a t operate in SIMD m ode and are 1 2 6 (a) Mesh topology to support edge detec tion, thinning, and linking. (b) Unconstrained topology to support contour extrac tion and linear approxim a tion. Figure 7.4: Two com m unication topologies to support th e linear feature extrac tion algorithm s. connected via a 2-D m esh, figure 7.4. This concurs w ith the findings of other researchers [54] [59]. For the contour extraction and linear approxim ation steps, since the algo rithm s are d ata dependent, comprise relatively sim ple operations, and do not re quire inter-process com m unication an architecture consisting of m oderately pow erful processing elem ents operating in MIMD mode will suffice, figure 7.4. The com m unication topology is, as depicted, non-critical. In utilizing such a heterogeneous approach to the problem of linear feature extraction, one m ust be cognizant of the interface betw een th e two distinct ar chitectures. It is at this interface where d ata is transitioned from a num eric representation, the 2-D array of pixel values, to a symbolic representation, the linked-lists of contours. The contour extraction algorithm perform s this transi tion. T he num ber of contours detected in a scene will be significantly less th an the num ber of edges detected. This implies th a t the num ber of processing elem ents th a t can be effectively utilized by th e contour extraction and linear approxim ation algorithm s will be significantly less th an th a t used by th e edge detection, edge thinning, and edge linking algorithm s. 
Therefore, the interface between the two architectures is a pyramid structure. This is the same scheme used in the design of the Image Understanding Architecture to interface the CAAPP (SIMD) and ICAP (MIMD) layers. Each MIMD processing element is directly connected to a group of SIMD PEs, and the groups are non-overlapping.

With this arrangement a single contour may be distributed across many groups of SIMD PEs, or equivalently, across many MIMD PEs. Therefore, after each MIMD processing element performs the contour extraction algorithm on its partition (data residing in its associated SIMD PEs), and before each performs the linear approximation algorithm on its set of contours, a contour merging step must be performed. Since the distribution of partial contours is arbitrary, the MIMD processing elements must perform a complete exchange operation in order to gather all of their partial contours.

Recall that the contour extraction and linear approximation algorithms do not place any constraints on the communication topology of the architecture. Therefore, we can specify a topology utilizing the complete exchange operation of the contour merging step as the constraint. This operation is performed efficiently on a ring topology by "circulating" data values through the ring systolically. Each processing element is made responsible for a set of contours and, as sections of those contours are received, they are saved in local memory. Sections of contours that do not belong to the set are passed on. Upon receipt of all sections of its set of contours, each processing element must reconstruct the complete contours by performing the contour merging step. Upon completion, linear approximation can proceed. The contour merging step requires O(N) communication steps for a system consisting of N processing elements.
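The ring-based complete exchange can be simulated in a few lines; the following sketch is illustrative only, and the modulo ownership rule and data layout are assumptions, not taken from the dissertation:

```python
# Simulation of the systolic ring exchange: N MIMD PEs hold arbitrary
# contour fragments; contour c is "owned" by PE c % N (an assumed rule).
# At each step, every PE keeps the fragments it owns and forwards the
# rest to its neighbor, so all fragments reach their owner in O(N) steps.

def ring_exchange(partitions):
    n = len(partitions)
    gathered = [[] for _ in range(n)]
    in_transit = [list(p) for p in partitions]
    for _ in range(n):
        nxt = [[] for _ in range(n)]
        for pe in range(n):
            for contour_id, fragment in in_transit[pe]:
                if contour_id % n == pe:            # this PE owns the contour
                    gathered[pe].append((contour_id, fragment))
                else:                                # pass it along the ring
                    nxt[(pe + 1) % n].append((contour_id, fragment))
        in_transit = nxt
    return gathered
```

After the exchange, each PE holds every fragment of the contours it is responsible for and can begin merging locally.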
The additional execution time complexity introduced by the contour merging step is discussed in the following section. The resulting interface between the two architectures of the linear feature extraction algorithm is depicted in figure 7.5.

Figure 7.5: Heterogeneous architecture for the linear feature extraction algorithm.

7.4 Performance Analysis

7.4.1 Complexity Analysis

As shown previously (in tabular form) the time complexity of the entire linear feature extraction process is

    LSE_seq = O(i²) + O(i²) + O(i²) + O(i²l) + O(nl)

where the terms correspond to the edge detection, edge thinning, edge linking, contour extraction, and linear approximation algorithm complexities respectively.

For synchronous, data independent (SIMD) operations such as those performed by the edge detection, edge thinning, and edge linking algorithms, researchers have shown that high degrees of algorithm speedup and processor efficiency are readily achieved [59] [85] [77]. The determining factor for speedup and efficiency is the "degree of match" between the pattern of communication between processes and the interconnect topology of the architecture (solution of the mapping problem). Our specification of the 2-D mesh for the local communication patterns of the linear feature extraction algorithm has been shown by others to perform very well [54] [3] [104]. High degrees of algorithm speedup and processor efficiency are readily achieved.

The operations performed by the contour extraction and linear approximation algorithms do not suffer any overhead due to process communication and, therefore, will achieve high degrees of algorithm speedup and processor efficiency so long as the number of processors is kept in a range bounded by the problem size.
Thus, our estimate for the time complexity of the parallel implementation of the five constituent algorithms is

    LSE_par ≈ O(1) + O(1) + O(1) + O(i²l/N) + O(nl/N)

(near linear speedup) given an i x i SIMD mesh and N MIMD processing elements.

The open issue regarding the performance of the parallel implementation of the linear feature extraction algorithm is the time complexity introduced by the inclusion of the contour merging step and its associated communication requirements. This is where we focus our attention. The algorithm to perform this step is as follows:

    (contour merging)
    for each sub-contour, i
        change ← TRUE
        while change
            change ← FALSE
            for each sub-contour, j ≠ i
                if head(i) = tail(j)
                    connect(i, j)
                    change ← TRUE
                else if tail(i) = head(j)
                    connect(j, i)
                    change ← TRUE

The algorithm traverses the list of partial contours (sub-contours), combining spatially adjacent sub-contours and reducing the length of the list by one every iteration. Given that a contour is partitioned into n sub-contours, the time complexity of the contour merging algorithm is

    MERGING = O( Σ_{i=1}^{n} i ) = O( n(n+1)/2 ).

Then the estimate made above (near linear speedup) for the time complexity of the entire parallel implementation of the linear feature extraction process is amended with this new term to

    LSE_par = O(1) + O(1) + O(1) + O(i²l/N) + O(n(n+1)/2) + O(nl/N).

The open issue is thus reduced to one of comparing

    O(i²l/N) + O(n(n+1)/2) + O(nl/N)

with

    O(i²l) + O(nl)

for various numbers of processing elements, N, and contours, n. Although the length and number of sub-contours that a single contour is partitioned into are also critical, they are bounded by the image size, i, and the number of MIMD processing elements.
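The merging pseudocode above can be rendered as a runnable sketch. Representing each sub-contour as a list of points, with head(i) as its first point and tail(i) as its last, is an assumption made here for illustration:

```python
# Minimal Python rendering of the contour-merging pseudocode: spatially
# adjacent sub-contours are spliced end-to-end until no further change
# occurs, mirroring the O(n(n+1)/2) worst-case pairwise scan.

def merge_contours(subs):
    subs = [list(s) for s in subs]
    change = True
    while change:
        change = False
        i = 0
        while i < len(subs):
            j = i + 1
            while j < len(subs):
                if subs[i][-1] == subs[j][0]:      # tail(i) meets head(j)
                    subs[i] = subs[i] + subs[j][1:]
                    del subs[j]
                    change = True
                elif subs[i][0] == subs[j][-1]:    # head(i) meets tail(j)
                    subs[i] = subs[j] + subs[i][1:]
                    del subs[j]
                    change = True
                else:
                    j += 1
            i += 1
    return subs
```

Two fragments of the same contour collapse into one list, while unrelated fragments are left untouched.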
Figure 7.6 shows the comparison of these terms along with the estimated algorithm speedup and processor efficiency for the parallel implementation. We have assumed that all contours are the same length and that each contour is "maximally" fragmented (partitioned into N, the number of MIMD processing elements), which is the worst case. The shape of these curves indicates that for a 256 x 256 processing element SIMD mesh, 64 MIMD PEs can be effectively utilized before performance begins to degrade. We use these curves for comparison to simulation data in the next section.

Figure 7.6: Estimated performance curves for the contour extraction, contour merging, and linear approximation algorithm suite. (a) Processing time (dashed: sequential; solid: parallel). (b) Algorithm speedup. (c) Processor efficiency.

7.4.2 Observed Performance

Figure 7.7 shows the partitioning of the airfield image (shown earlier) into areas covered by the individual MIMD processing elements for various values of N. Figure 7.8 shows a second example of an input image, the detected/thinned edges (contours), and the resultant linear segments. Figure 7.9 shows the partitioning of this image.

Figure 7.7: Image plane partitioning for various numbers of processing elements. (a) 4 PEs. (b) 16 PEs. (c) 64 PEs. (d) 256 PEs.

Figure 7.8: Input image (freeway), contours, and resultant linear features. (b) Detected edges (contours). (c) Extracted lines.

Figure 7.9: Image plane partitioning for various numbers of processing elements. (b) 16 PEs. (c) 64 PEs. (d) 256 PEs.

In each of these figures one can see how increasing the
number of MIMD processing elements fragments long contours, thus affecting the performance of the contour merging step.

Figures 7.10 and 7.11 show the observed performance curves (from simulation of the parallel implementation) for the airfield and freeway images respectively. The simulation reflects time required by the algorithm as well as overhead due to communication and data dependencies. Since this overhead is included in the simulation, the shape of the curves differs from that of the estimated curves, but the same trends exist. That is, performance degrades when the number of MIMD processing elements is increased beyond 64 processing elements.

Figure 7.10: Performance curves from simulation of the parallel linear feature extraction implementation, airfield image. (a) Algorithm speedup. (b) Processor efficiency.

Figure 7.11: Performance curves from simulation of the parallel linear feature extraction implementation, freeway image. (a) Algorithm speedup. (b) Processor efficiency.

7.4.3 System Development and Maintenance

With regard to system complexity and programmer burden, our primary concern for the linear feature extraction process is the addition of the contour merging step. This addition is contrary to our desire of keeping system complexity and programmer burden to a minimum. But a trade-off is made that is beneficial to the run-time performance and not too detrimental to the life cycle performance. The addition of the contour merging step is such that it allows the original algorithms (edge detection, edge thinning, edge linking, contour extraction, and linear approximation) to achieve significant algorithm speedup and processor efficiency while maintaining low degrees of system complexity and programmer burden.
Furthermore, the additional step does not affect the implementations of the originals; it is strictly a "mutually exclusive" addition. Therefore, the structure of the parallel implementation of the entire process still resembles the structure of the sequential implementation with the addition of the contour merging step. Thus, the effort (cost) required to realize and maintain the parallel implementation is only slightly greater than that of the sequential implementation. This is attributable to the fact that the parallel implementation of each algorithm is based on the structure of the algorithm, and not on the structure of a prespecified parallel processor architecture.

7.5 Summary

Parallel implementation of edge detection algorithms has been studied by various researchers [59] [54] [3] [104]. All implementations achieve high degrees of algorithm speedup and processor efficiency and, presumably, low degrees of system complexity and programmer burden, as they are typically based on a 2-D mesh connected SIMD machine which is well suited to the task. Parallel linear feature extraction (line finding) is another topic. Various researchers have considered parallel implementations of line "finders" based on the Hough transform but fall short of actually extracting the end points of the lines [104] [33]. Few have studied parallel implementations of the entire linear feature extraction process. In [95] the authors present an implementation for the Image Understanding Architecture that is tightly coupled with the custom hardware designs of the SIMD CAAPP level. The implementation is very complex and has not actually been implemented. In [99] the authors present an implementation based on a few MIMD processing elements.
The implementation is straightforward, but run-time performance suffers in having the MIMD PEs perform the inherently SIMD processing of the edge detection, edge thinning, and edge linking steps.

In our implementation we have combined the "best of both worlds." Through the use of a heterogeneous architecture we are able to achieve significant algorithm speedup and processor efficiency while maintaining low degrees of system complexity and programmer burden for the constituent algorithms. Furthermore, the entire system also possesses these characteristics in spite of the additional algorithm, contour merging, required to "bridge" the two architectures. Our resulting architecture specification concurs with the decision to use a 64-to-1 reduction among the CAAPP (SIMD) and ICAP (MIMD) levels of the Image Understanding Architecture. But we are able to achieve significant algorithm speedup and processor efficiency without being dependent on special purpose hardware and complex algorithm implementations.

Chapter 8

Perceptual Organization System

8.1 Introduction

In the previous chapter we presented the parallel implementation of a heterogeneous process, linear feature extraction. The process comprises tasks from low and mid-level computer vision. In this chapter we consider a system that receives, as input, the line segments produced by the linear feature extraction process, builds a hierarchical description of the imaged scene by successively grouping the linear segments into composite objects, then selects pertinent composite objects, those that fit a given model. Various algorithms are required to build the description and they must interact with one another in order to complete the task.
In this chapter we investigate the structure of the parallelism within each of the algorithms as well as how those individual structures affect the required inter-algorithm communication interfaces. Our intention is to investigate the interactions of the algorithms that constitute a computer vision system, as opposed to a stand-alone algorithm. We do not provide implementation or simulation of the system, but instead restrict our study to high level design and analysis based on algorithm specifications.

A few researchers have considered the parallelization of heterogeneous computer vision systems, that is, systems that comprise a series of algorithms. In [12] the author considers a system for motion analysis. The system includes algorithms for convolution, zero crossing detection, stereo correspondence, motion correspondence, motion parameter estimation, and object recognition. In [4] the
No d ata reorganization is required betw een algorithm s. For instance, in th e m otion analysis system investigated in [12] all processes operate on some form of the 2-D image plane array. Each algorithm receives a 2-D array as input, perform s its operations, and produces a transform ed 2-D array. Process n of one algorithm operates on the same locations w ithin the array as process n of the prior and subsequent algorithm s. T he basic stru ctu re of th e architecture is a “m ulti-threaded” pipeline architecture. In [55] a sim ilar m odel is assum ed on which th e authors propose a fram ework for analysis based on queuing systems. Homogeneous structures as such are useful for studying the load balancing aspects of a system design but deem phasize th e topologies of each of the individual algorithm s as well as the interfaces between algorithm s and provide little insight into th e design of heterogeneous com puter vision systems com prising low, mid, and high-level processes. In our work we study the parallel im plem entation of a system th a t comprises algorithm s of varying com putational structure, th a t utilize various d ata stru c tures, and require global access to, and reorganization of d ata structures in order to perform their tasks. A system of sim ilar characteristics is proposed in [104] as a benchm ark suite for im age understanding architectures but it consists of 140 well defined algorithm s th a t contain lim ited application specific details th a t of ten arise in com puter vision system s and tend to influence th e parallel processor architecture. The system we consider is a perceptual organization system. It builds a hi erarchical scene description by grouping simple objects into com posite collated features, then selects pertinent collated features via a constraint satisfaction net work [66]. 
The system has been applied to problem s of stereo correspondence [66], 3-D shape extraction [67], and the detection of structures in aerial images [68]. Due to its utility and the diversity of algorithm s th a t it utilizes, we believe th a t this system is representative of a large class of com puter vision system s and | I therefore, th a t our results are generalizable. j In th e following sections we describe th e perceptual organization system to j th e detail required by th e present context. We present the application of our m ethodology to the individual algorithm s th a t constitute th e system and discuss how those individual results affect the design of a parallel processor architecture well suited to th e system as a whole. Finally, we sum m arize our work and discuss how it com pares to th a t of the researchers cited above. 8.2 System Description 8.2.1 O v erv iew i The objective of th e perceptual organization system is to ex tract structural in form ation from an image [66]. The approach used is to build a hierarchical description of the scene by successively grouping objects of one level of abstrac tion to form objects of a higher level of abstraction. We provide an overview of th e system w ith enough detail to discuss our analysis and parallel im plem enta tion. For details and explanations beyond th e scope of our discussion, the reader should see the referenced works. T he system consists of two phases, the detection of collated features, and the selection of collated features. To detect collated features, the system comprises six distinct algorithm s: 1) linear feature extraction; 2) line collation formation-, 141 3) parallel collation formation; 4) U collation formation; 5) rectangle collation formation; and 6) network formation. T he detection phase often produces m ul tiple collations for the same underlying im age tokens (objects.) It is the task of th e selection phase to determ ine th e best collated features for each token. 
This is accom plished via a constraint satisfaction network. In the following paragraphs we describe, briefly, each of these com ponent algorithm s of th e system. 8.2.2 System Components L in ear F ea tu re E x tr a c tio n - This algorithm detects edges in the input image then links them to form linear segments. T he m ethod used is th e “N evatia-Babu line finder” [72]. The result is a list of linear segments each described by its end points, length, and orientation (contrast.) Details of th e parallel im plem entation of this algorithm were presented in the previous chapter. They will only be repeated in th e current context when necessary for clarity. L in e C o lla tio n F o rm a tio n — Due to various image characteristics, poor contrast, noise, .. .th e lines produced by th e linear feature extraction algorithm m ay be fragm ented. Single lines w ithin th e scene are often represented by m ultiple line segments w ithin the image. T he line collation form ation algorithm detects collections of parallel lines th a t are closely bunched along the same linear axis and then represents the parallel lines as a single line collation. A dditionally, corners (line intersections) of type “L” and “T ” are detected and lines are extended to th e vertex point of the corner. T he resultant line collations are one level of abstraction higher than the linear segments received as input. P a ra lle l C o lla tio n F o rm a tio n - This algorithm groups pairs of parallel and overlapping lines (w ithin specified tolerances) into a single parallel collation. As a side effect, each of the lines m ay be extended or contracted based on th e am ount of overlap betw een them , thus m odifying the list of line collations received as input. The resultant parallel collations are one level of abstraction higher th an the line collations received as input. 
U C o lla tio n F o rm a tio n - As pairs of parallel line collations evolve into parallel collations (w ith their ends aligned by extension or contraction), the U 142 collation form ation algorithm detects th e presence or absence of an orthogonal line connecting th eir ends. If the line is present, a U collation is created. If the line is absent, an appropriate line collation is created followed by th e creation of a U collation. Therefore, as a side effect, th e form ation of U collations also modifies the list of line collations. T he resultant U collations are one level of abstraction higher th an the parallel collations received as input. R e c ta n g le C o lla tio n F orm ation - The rectangle collation form ation algo rith m detects pairs of U collations th a t share common lines. Such pairs form a rectangle collation. Furtherm ore, the existence of a rectangle collation proposes th e presence of two new U collations, orthogonal to those th a t form ed th e rectan gle. Therefore, as a side effect, the form ation of rectangle collations also forms U collations. The resultant rectangle collations are one level of abstraction higher th an th e U collations received as input. N etw o rk F orm ation - The network form ation algorithm builds a network, j or graph, by extracting “part-of” (supporting inform ation) and “utilizes” (con- j flirting inform ation) relationships from the collation form ation results. S upport ive links exist between collations of different levels of abstraction such as lines th a t form parallels and U ’s th a t form rectangles. Conflicting links exist between collations of the same level of abstraction such as two parallel collations th a t share a common line. The result is a n x n m atrix of ternary valued links be tween pairs of nodes where n is the to tal num ber of collations formed. T he three values taken on by the links are 1, — 1 and 0 representing support, conflict, and no relationship, respectively. 
C o n stra in t S a tisfa ctio n N etw o rk - T he network created above is used as a constraint satisfaction network via slight m odifications to the m odel proposed by Hopfield [36]. Once created, the netw ork is “relaxed” and allowed to iterate to convergence via Hopfield’s form ulation. The result is a set of selected nodes (collations) th a t are m utually com patible and representative of the stru ctu re (as defined by lines, parallels, U ’s, and rectangles) w ithin the scene. 143 8.2.3 S u m m a ry A lthough th e system , as described, utilizes only linear features, it has been ex tended to detect other types of collations [66]. T he basic structure of the system is th e same b u t the individual detection algorithm s vary. Our objective is to ana lyze th e entire system w ith regard to its parallel im plem entation, rath er th an the details of th e parallel im plem entation of its constituent algorithm s, and therefore, our analysis is also applicable to th e extended system w ith m inor modifications. 8.3 M ethodology Application 8.3.1 O v erv iew As our reason for studying the perceptual organization system is the investigation of th e interaction pattern s of various algorithm s th a t m ake up a com puter vision system , we present only an abbreviated application of our m ethodology to the individual algorithm s. We sum m arize the results of the coarse grain analysis for each algorithm and om it th e fine grain analysis. O ur justification for this is th a t th e coarse grain analysis defines the structure of the “prim ary” parallelism of th e algorithm s and this structure is not affected by th e results of the fine grain analysis. It is this prim ary structure th a t defines the interfaces between juxtaposed algorithm s. 8 .3 .2 C o n tro l S tru ctu re A n a ly sis Table 8.1 (adapted from [66]) shows the tim e com plexities of each of the con stituent algorithm s. Details and derivations of the tim e com plexities can be found in [66]. 
Briefly, they are predicated on the following concepts. L in e C o lla tio n F o rm a tio n - Prim ary control for the line collation form ation algorithm is provided by the following pseudo code: for each linear segment for each linear segment if segments are parallel and in close proximity 144- A lgorithm Com plexity Com m ents Line Collations 0 ( n 2) n: linear segm ents Parallel Collations 0 ( n 2) n: line collations U Collations 0(npu) n: line collations p: parallels to each line collation u: distance betw een parallel lines Rectangle Collations 0{np) n: line collations p: parallels to each line collation Network Form ation 0 (n p 2) n: line collations p: parallels to each line collation C onstraint Satisfaction 0(n p 2) n: line collations p: parallels to each line collation Table 8.1: Tim e com plexities of the processes th a t constitute the perceptual organization system. combine segments create a line collation Given n linear segments, the tim e com plexity is 0 ( n 2). Each linear segment (iteration of the outer loop) can be processed independently and processing is d ata dependent. P a ra lle l C o lla tio n F o rm a tio n - Prim ary control for the parallel collation form ation algorithm is provided by the following pseudo code: for each, line collation for each line collation if lines are parallel and overlapping create parallel collation extend/contract lines Given n line collations, th e tim e com plexity is 0 ( n 2). Each line collation (iter ation of the outer loop) can be processed independently and processing is d ata dependent. 
145 U C o lla tio n F orm ation - P rim ary control for the U collation form ation algorithm is provided by the following pseudo code: for each parallel collation if no line connects its aligned edges create a line collation create a U collation Assum ing an average of p parallel lines for each linear segment resulting in np parallel collations w ith an average distance of u between parallel lines, th e tim e j com plexity is O(npu). Each parallel collation can be processed independently j and processing is d ata dependent. I R e c ta n g le C o lla tio n F orm ation - P rim ary control for th e rectangle colla tion form ation algorithm is provided by the following pseudo code: for each parallel collation create a rectangle collation create two U collations Given np parallel collations, the tim e com plexity is 0{np). Each parallel collation can be processed independently and processing is d ata dependent. N etw o rk F orm ation - Prim ary control for th e network form ation algorithm is provided by the following pseudo code: for each line collation for each parallel collation containing the line make supporting link between line and parallel make supporting link between parallel and U make supporting link between U and rectangle make supporting link between line and U make supporting link between line and rectangle make supporting link between parallel and rectangle for each parallel collation i i 146 i for each parallel collation make conflicting link for each U collation for each U collation make conflicting link for each rectangle collation for each rectangle collation make conflicting link Each collation is “linked” to a fixed num ber of other collations in the network. Also, there are at m ost p (the average num ber of parallel lines to any given line) alternate collations for any given collation. Therefore, form ation of all links is accom plished in 0(p(n-\-np-\-np-\-np)) or 0 (n p 2). 
This is the num ber of links th a t m ust be created for th e Ofa + np + np+np) nodes of th e network. T he supporting links th a t em anate from each line collation can be created independently. The conflicting links betw een pairs of parallel collations can be created independently as can those betw een pairs of U collations and pairs of rectangle collations. C o n s tr a in t S a tis fa c tio n N e tw o rk - Prim ary control for the constraint satisfaction network algorithm is provided by the following pseudo code: for a fixed number of iterations for each collation for each linked collation perform multiply/accumulate operation apply activation (threshold) function Following the discussion presented above for the network form ation algorithm , each iteration of th e network requires 0(p(n-\-np+np + np)) or 0(n p 2) steps. The network is relaxed for a constant num ber of iterations thus the tim e com plexity of th e entire relaxation process is 0(n p 2). Each node, or collation, can be processed independently and processing is independent of d ata values. 147 8 .3 .3 D a ta S tr u c tu r e A n a ly sis T he following paragraphs briefly describe the prim ary d a ta structures of the algorithm s th a t constitute the perceptual organization system . L in e C o lla tio n F orm ation - The input d ata stru ctu re to the line collation form ation algorithm is the list of linear segments. In order to process each linear segment independently, each process m ust have access to all other linear segments in the list. T he resultant d a ta structure is the list of line collations. P a ra llel C o lla tio n F orm ation - T he input d ata stru ctu re to the parallel collation form ation algorithm is the list of line collations. In order to process each line collation independently, each process m ust have access to all other line collations in the list. T he resultant d ata structures are the list of parallel collations and a modified list of line collations. 
U C o lla tio n F orm ation - The input d ata structures to th e U collation form ation algorithm are the list of parallel collations and the list of line collations. T he resultant d ata structures are modified versions of th e list of line collations and the list of parallel collations augm ented w ith U collation inform ation. In order to process each parallel collation independently, each process m ust have access to the list of line collations. R e c ta n g le C o lla tio n F orm ation - The input d ata stru ctu re to the rectan gle collation form ation algorithm is the list of parallel collations augm ented w ith U collation inform ation. T he resultant d ata structure is a modified version of the list of parallel collations augm ented w ith rectangle collation inform ation. Each parallel collation is processed independently not requiring access to any other data. N etw o rk F orm ation - The input d ata structures to the netw ork form ation algorithm are th e list of line collations and the list of parallel collations (aug m ented w ith inform ation regarding U and rectangle collations.) Form ation of the links em anating from every node can proceed independently, as described above, provided th a t each process has access to all other collations (line and augm ented parallel collation lists.) 148 C onstraint Satisfaction N etw ork - T he prim ary d ata stru ctu re for the constraint satisfaction network is the network of collations. In order to process each node independently, each will require inputs from and link weights between all nodes connected to it. 8 .3 .4 C o m m u n ica tio n A n a ly sis T he following paragraphs briefly describe the com m unication requirem ents of th e algorithm s th a t constitute the perceptual organization system . Descriptions are based on the control and d ata structure analyses and the resultant process partitionings presented above. 
Line Collation Formation - Two possibilities exist for the implementation of this algorithm, one favoring reduced execution time, the other efficient memory utilization, a standard trade-off. Previously we determined that each linear segment can be processed independently so long as it has access to the list of all other linear segments. One way to provide this access is to give each process a copy of the entire list, thus eliminating the need for communication between processes. This method allows each process to proceed independently and asynchronously, thus providing a processing environment, relative to execution time, that is suited to the data dependent nature of the processing. The disadvantage is that the amount of memory local to a process is a function of the specific problem instance; that is, it is dependent on the number of linear segments detected in the image. A complex, textured scene will require more local memory per process than will a simple, bland scene. This may be undesirable.

The alternative implementation assigns each process its linear segment and communicates information regarding all other linear segments systolically. For example, given n processes, one for each linear segment, connect the processes via a ring structure and circulate the linear segments through the ring until each process has "seen" each linear segment. This will require n fundamental communication steps. At each step, a process will access one linear segment, perform the required task, and then pass the linear segment on in the subsequent step. Such a method makes efficient use of local memory, independent of the number of linear segments detected. The disadvantage is that processing must proceed synchronously and, therefore, will not execute as fast as it would if run asynchronously in light of the data dependent nature of the processing.
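The systolic alternative just described can be sketched as runnable code. This is an illustrative sketch only; the segment representation and the visit callback (standing in for the per-segment comparison work) are assumptions for the example.

```python
# Sketch of the memory-efficient systolic scheme: n processes in a ring each
# hold one linear segment and pass segments around the ring until every
# process has "seen" every segment, in n fundamental communication steps.
def ring_circulate(segments, visit):
    n = len(segments)
    held = list(segments)                # process i currently holds held[i]
    for _ in range(n):                   # n fundamental communication steps
        for i in range(n):               # each process examines its visitor
            visit(i, held[i])            # perform the required task
        # synchronously pass each segment to the next process in the ring
        held = [held[-1]] + held[:-1]
    return held                          # after n steps, back where it started

# Example: record which segments each of three processes has seen.
seen = {i: [] for i in range(3)}
ring_circulate(["s0", "s1", "s2"], lambda i, s: seen[i].append(s))
```

After the n steps, every process has visited every segment exactly once, using only constant local memory per process, the property argued for above.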
Parallel Collation Formation - The discussion presented above for line collation formation also applies to parallel collation formation.

U Collation Formation - The discussion presented above for line collation formation also applies to U collation formation.

Rectangle Collation Formation - No communication among the independent processes of the rectangle collation formation algorithm is required.

Network Formation - The discussion presented above for line collation formation also applies to the network formation process. That is, each processing element is assigned a collation for which to form the links emanating from it to all other collations. To do so, access to the entire lists of line and augmented parallel collations is required as described above. This access can be supplied via sufficient memory in each PE to store the entire tables (and processing can proceed asynchronously) or via systolic message passing (with processing proceeding synchronously.)

Constraint Satisfaction Network - The pattern of communication between independent processes of the constraint satisfaction network is arbitrary, depending on the problem instance. That is, it cannot be predicted prior to system execution but, once determined (after the collated feature detection process is complete), it remains fixed. Messages are single valued and communication proceeds synchronously across all processing elements.

8.3.5 Architecture Specification

The following paragraphs briefly describe the architecture specifications for each of the algorithms that constitute the perceptual organization system. Descriptions are based on the analyses presented above.
Line Collation Formation - As described above, two alternative implementations exist for the line collation formation algorithm, one that reduces execution time, and one that reduces the required amount of local memory.

Figure 8.1: Two alternative communication topologies to support the line collation detection algorithm. (a) Time optimal. (b) Memory optimal.

For the implementation optimized for execution time, no communication between processes is required, thus the communication network topology is unconstrained and processing is asynchronous. For the implementation optimized for local memory usage, communication proceeds systolically around a ring topology and processing is synchronous. Both topologies are depicted in figure 8.1. The complexity of operations required by the algorithm suggests that both implementations should incorporate complex, powerful processing elements. For the time optimal implementation, a loosely synchronous MIMD protocol is sufficient. For the memory optimal implementation, the protocol is SIMD.

Parallel Collation Formation - The discussion presented above for line collation formation also applies to parallel collation formation.

U Collation Formation - The discussion presented above for line collation formation also applies to U collation formation.

Rectangle Collation Formation - As there is no required communication between processes, the communication network topology for the rectangle collation formation algorithm is unconstrained, figure 8.1 (time optimal portion.) The protocol is MIMD to facilitate the data dependent nature of the processing.

Network Formation - The same discussion presented above with regard to the line collation formation algorithm applies.
One of the two solutions can be selected dependent on the amount of local memory available to processing elements.

Figure 8.2: Processing element structure for constraint satisfaction network (neuron i computes x_i = f(sum_j w_ij x_j) over its synaptic connections).

Constraint Satisfaction Network - Processing of the constraint satisfaction network is based on the connectionist model of computation [90]. The computation is performed by a set of simple processing elements modeled after the biological neuron, figure 8.2. Operations are limited to multiply, accumulate, and threshold. This style of processing suggests simple processing elements that operate in SIMD mode [52]. The arbitrary nature of the communication pattern between processing elements suggests the topology should be fully connected to provide all possible communication paths, figure 8.3. But, given the requirement of a large number of processing elements, a fully connected topology is not feasible due to physical limitations of VLSI design. Also, Mohan has shown that for typical scenes the resultant network is sparsely connected, bounded by the number of lines parallel to any given line. The sparsity of the network is attributable to the utilization of high-level primitives, the object collations, as input to the network as opposed to other, more typical "neural network" systems that utilize image pixel binary pattern data [22] [36]. Therefore, the robust communications provided by the fully connected topology are not necessary. Alternatively, communications can be achieved via a feasible (and available) implementation through either a dynamic, reconfigurable topology or a static topology that supports efficient data routing algorithms such as shown in figure 8.4.

Figure 8.3: Fully connected topology to support the communication requirements of the constraint satisfaction network.
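The per-neuron processing just described (multiply, accumulate, threshold) can be sketched as runnable code following the control pseudocode given earlier. This is an illustrative sketch only; the threshold value and the dictionary layout for link weights are assumptions for the example, not taken from the system.

```python
# Sketch of the constraint satisfaction relaxation: each node multiplies and
# accumulates over its linked collations, then applies a threshold activation.
# The threshold (theta) and dict-based weight layout are illustrative choices.
def relax(activations, links, weights, iterations=10, theta=0.0):
    """activations: list of node values; links[i]: nodes linked to node i;
    weights[(i, j)]: weight on the link from node j into node i."""
    for _ in range(iterations):              # fixed number of iterations
        updated = []
        for i in range(len(activations)):    # each collation (node)
            acc = 0.0
            for j in links[i]:               # each linked collation
                acc += weights[(i, j)] * activations[j]  # multiply/accumulate
            updated.append(1.0 if acc > theta else 0.0)  # threshold activation
        activations = updated                # synchronous update of all nodes
    return activations

# Two mutually supporting active nodes stay on; an unlinked node switches off.
print(relax([1.0, 1.0, 0.0], {0: [1], 1: [0], 2: []},
            {(0, 1): 1.0, (1, 0): 1.0}, iterations=3))   # [1.0, 1.0, 0.0]
```

A negative (conflicting) link weight suppresses a node in the same way a positive (supporting) weight reinforces it.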
Figure 8.4: Two alternative communication topologies that can emulate a fully connected topology. (a) Dynamic switching network. (b) An example fixed topology with routing.

Due to the large number of connections between processes as well as the large number of short messages to be passed (fine granularity), a dynamic topology based on a circuit switching network is not a suitable solution because of the overhead involved in the reconfiguration process and routing delays [40]. Algorithm speedup and processor efficiency would be data dependent. These systems are better suited to algorithms that require bulk messages passed between processing elements (coarse granularity.) One way to make better use of such a system is to supply enough switches so that the system can be configured once, during set up of the network. This is feasible, from the algorithm point of view, in that once the constraint satisfaction network is created, its connections are static. Such a system may require complex custom hardware in order to facilitate the potentially large number of connections required of each processing element.

A static topology that supports efficient data routing procedures is better suited to the implementation of such "neural networks" [28] [52] [80]. The simplest of such architectures is the ring systolic topology, figure 8.5 [52]. In this scheme, for each network iteration data values are circulated around the ring until each processing element (neuron) has "seen" every data value, performing its required multiply and accumulate operations. After "seeing" every data value, the activation function is applied. Thus, one iteration of the relaxation process requires O(N) steps for a network of N neurons, regardless of the number of synaptic connections between neurons.
Advantages are its simple design and the ease with which networks consisting of more neurons (processes) than available processing elements can be mapped onto it. The cost of such simplicity is an inefficient implementation given a sparsely connected network. When the network is sparsely connected, as is the case for our algorithm, this topology commits a fair amount of time to unnecessary message passing. Another disadvantage is that it assumes the availability of enough local memory in each processing element to store the entire table of synaptic connections (link weights.) The required amount of local memory increases as the square of the problem size (number of neurons.)

Figure 8.5: Systolic ring architecture to support the constraint satisfaction network.

In the event that local memory to each processing element is limited, a mapping based on a 2-D mesh topology is presented in [80]. There, the authors present a general mapping scheme that maps each neuron as well as each synaptic connection onto a processing element, thus alleviating the need for each PE to store the table of synaptic weights. An algorithm for circulating the data through the network is also presented. It requires 24(sqrt(n + e) - 1) communication steps, where n is the number of neurons, e is the number of synaptic connections, and n + e is the number of processing elements. Mapping of networks containing more neurons than available processing elements is also presented. The disadvantage of this system is the large number of processing elements required for a densely connected network (although this is not a concern in our case) as well as the complexity of the data movements.

A third alternative topology that facilitates the communication requirements of connectionist computations is the hypercube.
It provides a more robust set of communication paths than either the ring systolic or mesh topologies. A hypercube of dimension D, with N = 2^D processing elements, provides a communication bandwidth of N, like the ring and the mesh, but provides a diameter of log N, compared to N for the ring and sqrt(N) for the mesh. In [28] the authors present a scheme for mapping neural networks onto architectures that utilize the hypercube topology. They investigate techniques for clustering network nodes and mapping them onto neighboring processing elements of the architecture. Then, communication between nodes can proceed as either point-to-point message passing or by using many "partial broadcasts." Their conclusion is that the high bandwidth and low diameter of the hypercube make it suitable for the implementation of connectionist networks. This is also emphasized in our sparsely connected network in that clustering of nodes is more viable than for a densely connected network.

The conclusion to be drawn regarding a communication network topology that is well suited to the requirements of the constraint satisfaction network is that it is dependent on the connectivity of the network, that is, the expected number of synaptic connections per neuron. A dynamic, reconfigurable topology is feasible but requires custom designed hardware, a scenario that we wish to avoid. A ring systolic topology always achieves O(N) (N is the number of neurons) communication complexity regardless of the number of synaptic connections per neuron. In the event of a densely connected network, this solution is very attractive but requires an amount of local PE memory that increases as O(N^2). If the network is sparsely connected, this solution commits a fair amount of time to unnecessary message passing and the memory increase still applies.
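These asymptotic counts can be compared numerically. The sketch below contrasts the ring's O(N) circulation with the O(k log N) cost of routing over cube connections (discussed next); constant factors and message sizes are ignored, so the crossover point is illustrative only.

```python
# Back-of-the-envelope comparison of per-iteration communication costs for
# the two candidate topologies (asymptotic step counts only).
import math

def ring_steps(num_neurons):
    # Ring systolic: circulate all N values past every PE.
    return num_neurons                          # O(N)

def hypercube_steps(num_neurons, k):
    # Cube connections used directly, k synaptic connections per neuron.
    return k * math.log2(num_neurons)           # O(k log N)

N = 1024
# The hypercube wins once k < N / log N (= 102.4 for N = 1024).
print(hypercube_steps(N, 16) < ring_steps(N))   # sparse network: True
print(hypercube_steps(N, 256) < ring_steps(N))  # dense network: False
```

For the sparsely connected networks produced by the perceptual organization system, k stays well below this threshold, which is the basis of the conclusion drawn below.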
The hypercube topology works well in both densely and sparsely connected networks in that a ring can be embedded onto it to achieve the O(N) communication complexity for the densely connected network, again, given enough local PE memory (although much VLSI area is wasted by implementing unnecessary communication paths.) If the required amount of local PE memory is not available, the hypercube still achieves O(N log N) communication complexity through utilization of the cube connections. In this case, the amount of local memory required increases only linearly with the number of neurons in the network. Also, the cube connections can be used directly for sparsely connected networks, achieving O(k log N) communication complexity given k synaptic connections per neuron. This is superior to the ring systolic topology when k < N / log N. In this case, the amount of local memory is merely a function of the number of synaptic connections per neuron.

As the network utilized by the perceptual organization system is sparsely connected, bounded by the number of lines parallel to any given line, the hypercube topology is better suited to its implementation than are other topologies.

Figure 8.6: Architecture for the linear feature extraction algorithm.

8.4 Putting It All Together

In the previous section we presented the results of applying the coarse grain analysis stage of our methodology to each individual algorithm of the perceptual organization system. The analysis resulted in multiple parallel processor architecture specifications, one for each algorithm. In this section we discuss the interfaces and data movements required to integrate the various architectures into a single parallel processor architecture for end-to-end processing of the perceptual organization system. The technique employed is to discuss each algorithm interface individually, in order of application.
8.4.1 Linear Feature Extraction

Derivation of the heterogeneous (pyramid) parallel processor architecture for the linear feature extraction (LSE) process was presented in the previous chapter. The resultant architecture is reiterated in figure 8.6.

8.4.2 Linear Feature Extraction to Line Collation Formation

Upon completion of the linear feature extraction (LSE) algorithm the list of linear segments detected in the image is distributed across the MIMD processing elements that perform the linear approximation step of the LSE architecture.

Figure 8.7: Inverted pyramid interface between the LSE and LCF architectures.

Upon completion, the LSE processing elements must send the linear segments to the processing elements of the line collation formation (LCF) architecture. Since each edge contour forms one or more linear segments, the number of processing elements that can be effectively utilized by the LCF architecture will be greater than that of the LSE architecture.

Upon receipt of the line segments from their associated LSE processing elements, the list of line segments is now distributed across LCF processing elements. In order for the LCF PEs to perform the line collation formation algorithm, processing can proceed in one of two ways, synchronously or asynchronously, as previously described. Recall that the asynchronous solution does not place any constraints on the communication network topology of the architecture so long as each PE has a complete list of line segments. In order to provide the complete list, a complete exchange operation must be performed. Therefore, we can specify a topology utilizing the complete exchange operation as the constraint. This operation is efficiently performed by a ring architecture. For the synchronous solution a ring connected topology also provides an efficient implementation.
Thus, the resulting interface between the LSE and the LCF is identical, regardless of the solution we select. Since there is an increasing number of processing elements required in moving from the LSE to the LCF architecture, the interface should be an inverted pyramid, figure 8.7.

Figure 8.8: Direct connection interface between LCF and PCF architectures.

8.4.3 Line Collation Formation to Parallel Collation Formation

Upon completion of the line collation formation (LCF) algorithm the line collations are distributed among the processing elements of the LCF architecture. Each LCF PE is responsible for the formation of one line collation emanating from a single linear segment. Also, our partitioning scheme for the PCF algorithm is such that each line collation constitutes an independent process. Therefore, the number of processing elements required by the parallel collation formation (PCF) algorithm will be the same as the number for the LCF algorithm. Thus, the two algorithms can effectively utilize the same number of processing elements.

As was the case for the line collation formation algorithm, two alternative solutions exist for the parallel collation formation algorithm, one asynchronous, the other synchronous. The same discussion applies, leading us to once again specify a ring connected topology. Since the number of processing elements that can be effectively utilized by the LCF and PCF algorithms is the same, the interface should be directly connected, figure 8.8.

8.4.4 Parallel Collation Formation to U Collation Formation

Upon completion of the parallel collation formation (PCF) algorithm the parallel collations are distributed evenly among the processing elements of the PCF architecture.

Figure 8.9: Direct connection interface between PCF and UCF architectures.
Also, each PE has new contributions to the list of line collations, those that were extended or contracted during the parallel collation formation algorithm. The PCF processing elements must send the parallel collations and the line collations to the processing elements of the U collation formation (UCF) architecture. Since each process in the UCF will operate on a single parallel collation, the number of processing elements that can be effectively utilized by each is the same.

As with the LCF and PCF algorithms, once again two solutions to the UCF algorithm implementation apply. Thus, the same discussion applies, leading us to specifying a ring connected topology for the required communications of the UCF algorithm. Also, since the number of PEs is the same for the PCF and UCF architectures, they too can be directly connected, figure 8.9.

8.4.5 U Collation Formation to Rectangle Collation Formation

Upon completion of the U collation formation (UCF) algorithm the U collations (augmented parallel collations) are distributed evenly among the processing elements of the UCF architecture. Each PE may have new contributions to the list of line collations, those that were created during the U collation formation algorithm. The UCF processing elements must send the augmented parallel collations and the line collations to the processing elements of the rectangle collation formation (RCF) architecture. Since each process in the RCF will operate on a single U collation (parallel collation), the required communication between the UCF and the RCF architectures can be achieved via direct connections between corresponding processing elements. Each U (parallel) collation can be processed by the rectangle collation formation algorithm without any additional information.

Figure 8.10: Direct connection interface between UCF and RCF architectures.
Therefore, communication between PEs of the RCF is not required and paths need not be supplied. The resulting interface between the UCF and the RCF architectures is depicted in figure 8.10.

8.4.6 Rectangle Collation Formation to Network Formation

Upon completion of the rectangle collation formation (RCF) algorithm the rectangle collations (augmented parallel collations) and the line collations are distributed evenly among the processing elements of the RCF architecture. The RCF processing elements must pass all of the collations (line and augmented parallel collations) to the processing elements of the network formation (NF) architecture. In order to create the network (form the links between collations), each NF processing element must compare its collations (line, parallel, U, and rectangle) to every other collation. As described previously, this can be done asynchronously or synchronously. Therefore, a ring connected topology is well suited to the requirements of the NF algorithm. Also, since the number of processing elements that can be effectively utilized by the NF and the RCF algorithms is the same, they too can be directly connected, figure 8.11.

Figure 8.11: Direct connection interface between RCF and NF architectures.

8.4.7 Network Formation to Constraint Satisfaction Network

Upon completion of the network formation (NF) algorithm, the network nodes representing line, parallel, U, and rectangle collations are distributed evenly across the NF processing elements, each PE containing some of each type of collation. Also, stored with each node is the list of (synaptic) weights that connect it to every other node.
Each NF PE will have multiple network nodes; therefore, in order to distribute the nodes to the constraint satisfaction network (CSN) architecture, one node per CSN PE, an inverted pyramid structure is required between the NF and CSN architectures, figure 8.12. This will produce the distribution of objects required by the constraint satisfaction network without further data movement operations, thus placing no additional constraints on the communications required to perform the CSN algorithm (hypercube, systolic ring, or mesh architecture as described above.)

Figure 8.12: Inverted pyramid interface between NF and CSN architectures.

8.5 The Heterogeneous Architecture

In the previous sections we specified, utilizing our methodology, an architecture for each of the individual algorithms that constitute the perceptual organization system while omitting specific implementation details. We then specified, again via our methodology (communication requirements analysis), the interfaces between each architecture required to integrate the entire system. How the design of the end-to-end system meets our four design goals is discussed in the following sections.

8.5.1 Algorithm Speedup and Processor Efficiency

Measures of algorithm speedup and processor efficiency are defined in terms of execution times and the number of available processing elements. The method of computing these values for an algorithm running on a homogeneous parallel processor architecture is trivial: 1) measure the execution time on a single PE; 2) measure the execution time on all available PEs; and 3) insert the measured values into the formulas. It is not entirely clear how to compute the values for an algorithm running on a heterogeneous architecture. Since processing elements are of a range of "power", should they be weighted appropriately when computing efficiency? If so, how is an appropriate weighting to be determined?
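For the homogeneous case, the three steps above reduce to two one-line formulas, speedup S = T1/Tp and efficiency E = S/p. The sketch below is a minimal illustration of those formulas only; it leaves open the heterogeneous weighting question just raised.

```python
# Standard measures for a homogeneous parallel machine: speedup is the ratio
# of single-PE to p-PE execution time; efficiency normalizes by PE count.
def speedup(t_one_pe, t_p_pes):
    return t_one_pe / t_p_pes

def efficiency(t_one_pe, t_p_pes, num_pes):
    return speedup(t_one_pe, t_p_pes) / num_pes

# Example: 120 s on one PE, 10 s on 16 PEs.
print(speedup(120.0, 10.0))          # 12.0
print(efficiency(120.0, 10.0, 16))   # 0.75
```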
For example, if a bit-serial PE is idle throughout portions of the algorithm, should that reduce the processor efficiency measure the same amount as when a powerful 32-bit microprocessor is idle? Because of these difficulties, we discuss algorithm speedup and processor efficiency for each individual algorithm/architecture and only algorithm speedup for the entire system (it is dependent only on execution time.)

For the perceptual organization system, we did not implement or simulate our designs. Our study was restricted to the analysis of algorithm specifications and therefore we do not compute values for algorithm speedup and processor efficiency. Instead, in the following sections we discuss the factors that most influence individual algorithm and system performance.

8.5.1.1 The Individual Algorithms

Performance of the parallel linear feature extraction implementation was discussed in the previous chapter. Through simulation and analysis we showed that the heterogeneous pyramid structure performed well with regard to algorithm speedup and processor efficiency.

In general, synchronous, data independent (SIMD) processes lend themselves to high degrees of algorithm speedup and processor efficiency [59] [85] [77]. The determining factor for speedup and efficiency is the "degree of match" between the pattern of communication among processes and the interconnect topology of the architecture. For general neural network models, [52] and [80] show ways to minimize the effects of communication overhead via the use of ring systolic and 2-D mesh topologies, respectively. Following their reasoning along with that of [28], we showed that our instance of the problem is well suited to a hypercube topology. The analyses indicate that our implementation will achieve significant degrees of algorithm speedup and processor efficiency.

The remainder of the processes are data dependent.
For each of these we presented two solutions, one asynchronous and time efficient, the other synchronous and memory efficient. Our designs are scalable as to the number of processing elements used in each architecture. This number is critical in determining algorithm speedup and processor efficiency for data dependent processing. Use of too many PEs will achieve high degrees of speedup but low degrees of efficiency, leaving PEs idle, performing no useful work. Alternatively, use of too few PEs will provide high degrees of efficiency but less than the potential speedup, leaving all PEs overworked.

The number of processing elements in the system is critical as their utilization is dependent on the distribution of data items, that is, is scene dependent. In each of our algorithms, the data dependencies cause data items to be divided into two classes. The first requires that a set of simple operations (comparisons and arithmetic computations) be performed, whereas the second requires processing to be bypassed. This is typical behavior of mid- and high-level vision algorithms, as previously witnessed in the tree search algorithm. Determination between the two classes is performed by data value comparisons. High degrees of algorithm speedup and processor efficiency are achieved when the number of processing elements is approximately equal to the number of data items requiring the maximum amount of processing (assuming an even distribution of those data items among processing elements.)

But in the (likely) event that the number of processing elements is not approximately equal to the number of items requiring the maximum amount of execution time, our designs are such that multiple items can be assigned to each processing element.
Since the nature and amount of processing required by the first class (comparisons and arithmetic computations) is simple, only in extreme data distributions will performance be drastically degraded. An example of such a distribution is when one processing element is assigned a set of data items all requiring maximum processing while the remainder of the PEs are assigned items where processing is "bypassed." As was shown for the linear approximation step of the linear feature extraction process, the number of PEs that can be effectively utilized by each of these processes is O(100).

Another situation that affects performance of the synchronous solutions is as follows. Given fewer processing elements than the total number of items, and given the nature of the algorithms (each item is compared to every other item and processing proceeds or is bypassed dependent on the outcome of each comparison), the synchronous solutions may suffer from phasing problems. Each PE will be responsible for processing a set of data items. In the extreme case it is possible that only one processing element is active at any given time due to the order that items are presented to each PE. Execution time will degrade to the sum of the execution times for the items requiring maximum processing. This may be only slightly better than on a sequential machine in that the only parallelism is in performing object comparisons to determine whether or not further processing is required. But, again due to the fact that the "maximal processing" is not excessively complex, this situation will only degrade system performance in such an extreme case.

8.5.1.2 The Heterogeneous Algorithm Suite

With regard to the entire algorithm suite, each algorithm transition essentially represents an algorithm of its own, one that reorganizes data.
We have considered each algorithm transition and its required data restructuring individually, applying the same analysis techniques as we did to the vision algorithms themselves. In the case of the linear feature extraction process, we needed to "invent" an algorithm to perform the restructuring. The operations required by these "transition algorithms" are matched to architecture organization parameters (communication topologies) just as the vision algorithms were and, therefore, will provide the required data restructuring in an efficient manner.

In light of the dependency of performance on data distribution, as discussed above, it may also be possible that these "transition algorithms" could be designed to incorporate knowledge of the two processes that they bridge and thus provide a "desirable" distribution. Techniques for doing so are described in [11]. There, the techniques are applied to algorithms which operate on a common data structure, the 2-D image plane array. Processing loads are balanced at every algorithm transition so as to provide good processor utilization. In our application, it may be possible to apply similar techniques based on analysis of the data distributions and structures coming from and going to each vision algorithm.

8.5.2 System Complexity and Programmer Burden

With regard to individual (stand-alone) algorithms, we showed in previous chapters that our methodology provides implementations of low system complexity and programmer burden for the relaxation labelling, tree search, and linear feature extraction algorithms. By following the same analysis techniques (our methodology) we specified architectural parameters that are well suited to each individual algorithm of the perceptual organization system. Furthermore, we apply the same techniques to the specification of the algorithm interfaces, the transition algorithms.
By doing so we are able to separate the communication required to perform data restructuring between algorithms from the actual vision algorithms, thus not altering the structure of the vision-algorithm-related software. This is worth noting for two reasons. First, the vision algorithms are not cluttered with extraneous code unrelated to the actual algorithm; thus algorithm implementation can be performed modularly, and algorithm modifications can be made strictly on "familiar" code, again modularly. Second, since the transition algorithms are separate, they can be developed to their fullest from the point of view of dynamic load balancing, to provide high degrees of algorithm speedup and processor efficiency without modifying the vision algorithm code.

8.6 Summary

Stand-alone computer vision algorithms are interesting but, in achieving the ultimate goal of a vision machine, they provide only a portion of the big picture. The same is true of parallel implementations of vision systems. Many researchers have provided parallel implementations of stand-alone vision algorithms that achieve high performance marks through elegant architectures [77] [95] and through elegant algorithm mappings [59] [83] [80]. These results are an important first step in the parallel implementation of heterogeneous computer vision systems.

Also, parallel processor architectures have been proposed as vision machines [106] [48] but, to date, concentration has been on VLSI implementation of the low-level portions of the architectures. Also, most algorithms implemented either on the actual machine or on machine simulations have been of the stand-alone variety. Again, another important portion of the big picture.

Finally, various researchers have considered parallel implementations of vision algorithm suites (vision systems) [11] [4] [55].
But, as computer vision is an evolving field, the algorithms that comprise these systems may not be representative of the kind that are currently being developed or that are necessary to "solve" the vision problem. The tendency has been to study numeric (pixel-based) algorithms, whereas much of mid- and high-level vision processing is symbolic. Also, they utilize a similar data structure throughout processing, whereas many systems require the use of a variety of abstract data structures to perform a vision task. But these are still important steps toward the parallel implementation of computer vision systems, as they expose various issues that arise in designing such implementations.

In our study, although we do not build a vision machine, we do point out the issues that must be addressed in designing such a machine and propose our design methodology as a means for addressing those issues. We consider a system that comprises techniques representative of low-, mid-, and high-level vision algorithms. They utilize techniques of image processing, symbolic and geometric reasoning, and constraint satisfaction to perform a complex visual task, perceptual organization.

In studying the parallel implementation of the perceptual organization system, we used a two-stage approach. First we specified a parallel processor architecture for each individual algorithm using a "non-committal" approach. That is, if one of the architecture organizational parameters was not constrained by the algorithm, we left it unspecified. Then we studied the interfaces between algorithms and specified appropriate communication topologies to support the required data restructuring. We found that the requirements of these interfaces may be facilitated by specification of organizational parameters left unspecified in (unconstrained by) the individual algorithm implementations.
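The two-stage, "non-committal" specification can be sketched as a merge of partial parameter assignments (the parameter and value names below are invented for illustration): each algorithm constrains only the organizational parameters it needs, leaving the rest unspecified, and the interface requirements may later fix a deferred parameter, provided nothing conflicts.

```python
def merge_specs(*specs):
    """Combine partial architecture specifications: dicts mapping an
    organizational parameter (e.g. 'topology') to a value, or to None
    when the algorithm leaves it unconstrained. A parameter constrained
    by two specs must agree; None defers to any later constraint."""
    merged = {}
    for spec in specs:
        for param, value in spec.items():
            if value is None:
                merged.setdefault(param, None)   # stays unspecified
            elif merged.get(param) is None:
                merged[param] = value            # first real constraint
            elif merged[param] != value:
                raise ValueError(f"conflict on {param}: "
                                 f"{merged[param]} vs {value}")
    return merged

algorithm_spec = {"control": "SIMD", "topology": None}   # hypothetical
interface_spec = {"control": None, "topology": "mesh"}   # hypothetical
print(merge_specs(algorithm_spec, interface_spec))
# {'control': 'SIMD', 'topology': 'mesh'}
```

Here the interface fixes the topology that the algorithm left open, mirroring the observation that interface requirements may be satisfied by parameters unconstrained by the individual algorithms.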
Although we did not build a system based on our resultant design, our analysis indicates that such a system will achieve high degrees of algorithm speedup and processor efficiency while maintaining low degrees of system complexity and programmer burden. As demonstrated in previous chapters, details of the algorithm implementations can be extracted directly from available serial implementations. The operations for interfacing algorithms are performed separately from the algorithms and, therefore, can incorporate data modeling techniques that may provide load balancing among processing elements, thus avoiding degraded performance due to data dependencies.

We do not claim to have solved the problem of implementing heterogeneous computer vision systems on parallel processor architectures. We merely claim to have taken further steps towards this end.

Chapter 9

Conclusions and Future Research

In light of the complex operations employed by high-level computer vision algorithms, processing throughput is a critical issue. It is not unusual for algorithms to require hours of CPU time on a sequential processor. Researchers have turned to parallel processor architectures to reduce this burden and have shown that they are an effective solution. But the price normally paid for the reduced processing time is a long development time, as well as difficulty in incorporating algorithm modifications. Given the recent rapid increase in processing power of sequential computers, the question arises of whether or not it is cost effective to invest in a parallel implementation. By the time the parallel implementation is complete (designed, developed, and debugged), a "next generation" sequential processor is introduced that surpasses the performance of the parallel processor architecture. The implementation effort then becomes nothing more than a learning experience.
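The speedup and efficiency figures referred to throughout can be computed from timings in the usual way; the small helpers below are a sketch of those standard definitions, including the ceiling given by Amdahl's law [2]:

```python
def speedup(t_serial, t_parallel):
    """S = T1 / Tp: how many times faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_pes):
    """E = S / p: fraction of the ideal p-fold speedup achieved."""
    return speedup(t_serial, t_parallel) / num_pes

def amdahl_bound(parallel_fraction, num_pes):
    """Amdahl's-law ceiling on speedup for a program whose
    parallelizable fraction is parallel_fraction."""
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / num_pes)
```

For example, a 100-hour serial run reduced to 10 hours on 16 PEs gives a speedup of 10 but an efficiency of only 0.625, and a program that is 90% parallelizable can never exceed a speedup of about 5.26 on 10 PEs.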
In this thesis we have put forth the proposition that the parallel implementation of high-level computer vision algorithms does not necessitate an undue amount of development cost in order to reduce execution time. Furthermore, the implementations can be designed to facilitate (inevitable) algorithm modifications. The key issue is the specification of a parallel processor architecture that matches the structure of the inherent parallelism contained within the algorithm. We have proposed and demonstrated a methodology that produces an architecture specification yielding a parallel implementation characterized by

• Significant algorithm speedup
• Significant processor efficiency
• Low system complexity
• Low programmer burden.

The resulting parallel processor architecture specification can be used as 1) a way of selecting between various existing architectures; 2) a top-level design for a custom-built architecture; or 3) a configuration specification for a flexible architecture.

The feature of our approach that allows for rapid prototyping while maintaining high performance is the reformulation of the mapping problem. Whereas the classical mapping problem is stated as

the search for a correspondence between the interaction pattern of the algorithm processes and the communication network topology of the architecture,

our reformulation reads

the search for a parallel processor architecture that satisfies the requirements of the algorithm processes.

The difference between the two is the movement of the architecture specification from the "input" to the "output" of the problem statement. Our methodology yields parallel processor architectures and software that are robust enough to support algorithm requirements yet flexible enough to endure algorithm evolution.
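The reformulated problem statement can be illustrated with a minimal sketch: rather than mapping a fixed algorithm onto a fixed machine, one searches a space of candidate architectures for those satisfying the algorithm's stated requirements. The candidate machines and requirement names here are invented for illustration:

```python
def satisfies(architecture, requirements):
    """True if the architecture meets every stated requirement;
    parameters the algorithm does not mention are unconstrained."""
    return all(architecture.get(k) == v for k, v in requirements.items())

candidates = [
    {"name": "A", "control": "SIMD", "topology": "mesh"},
    {"name": "B", "control": "MIMD", "topology": "hypercube"},
]
# Requirements derived from analysis of a hypothetical algorithm.
requirements = {"control": "MIMD", "topology": "hypercube"}

matches = [a["name"] for a in candidates if satisfies(a, requirements)]
print(matches)  # ['B']
```

The same predicate supports all three uses of the specification: selecting among existing machines, seeding a custom top-level design, or configuring a flexible architecture.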
With the definition of our methodology in place, we were able to develop parallel implementations of some very complex high-level computer vision algorithms as well as some heterogeneous vision algorithm suites. Unlike many of the algorithms studied to date, each of these algorithms has a long history of development, is very intricate, and places high demands on a sequential CPU. Given these characteristics, it was important to us to provide improved CPU utilization without altering their structure, and possibly their functionality, yet to do so in a timely manner.

Aside from our approach to parallel algorithm development, the algorithm implementations we derived are important in and of themselves. Each has been applied to a variety of computer vision problems. Furthermore, each comprises characteristics common to a wide range of computer vision algorithms.

In summary, we have shown that parallel implementation of computer vision algorithms and systems need not be a complex task if one is given the freedom to specify the organization of the architecture (as opposed to having it prespecified). We have essentially proposed a computer-aided design (CAD) system for performing such a task. To this end, avenues for future research include:

• Development of a library of "standard" architectural characteristics from which a designer can select components. Such a library would include the various categories of parallel processor architecture organizational parameters as outlined in chapter 2.

• Development of an interactive user interface to assist a designer in identifying algorithm parallelism. This would correspond to a visual aid in performing the control and data structure analysis steps of our methodology described in chapter 4.

• Merging of the methodology with parallel compiler technologies as a step toward automation of the design task.
• Combining of the above items via an expert system shell to provide a fully automated parallel algorithm/architecture design process.

One further avenue of research is the development of the interfaces between system components, as discussed in chapter 8. Complete computer vision systems comprise a variety of component algorithms that utilize a variety of data structures. Significant algorithm speedup and processor efficiency can only be achieved if data items are appropriately distributed among processing elements, as also shown in [12]. Achieving this appropriate distribution will require more than just communication paths between architectural stages.

Many computer users believe that the ultimate goal for parallel programming is a fully automated system, one that accepts an algorithm description, extracts the inherent parallelism, and produces an implementation: a solution of the so-called "dusty deck problem." The primary roadblock in achieving such a system lies in the nonexistence of a robust parallel compiler, one that extracts algorithm parallelism, thus relieving the user of the burden. This is exemplified by the heterogeneous and complex nature of computer vision systems. Through this research we hope to have provided a path to achieving such automation by presenting an alternative formulation of the problem.

Reference List

[1] N. Ahuja and S. Swamy. Multiprocessor pyramid architectures for bottom-up image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(4):463-475, July 1984.

[2] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of AFIPS, volume 30, pages 483-485, Spring 1967.

[3] H. S. Barad. The SCOOP pyramid: An object-oriented prototype of a pyramid architecture for computer vision. Technical Report SIPI 115, University of Southern California, December 1988.
Signal and Image Processing Institute.

[4] T. B. Berg, S. Kim, and H. J. Siegel. Impact of temporal juxtaposition on the isolated phase optimization approach to mapping an algorithm to mixed-mode architectures. Submitted for publication, 1991.

[5] A. P. Blicher. Edge Detection and Geometric Methods in Computer Vision. PhD thesis, Stanford University, February 1985. Department of Computer Science.

[6] B. W. Boehm. Improving software productivity. IEEE Computer, pages 43-57, September 1987.

[7] S. H. Bokhari. On the mapping problem. IEEE Transactions on Computers, C-30(3):207-215, March 1981.

[8] R. A. Brooks. Symbolic reasoning among 3-D models and 2-D images. Artificial Intelligence Journal, 17:285-348, 1981.

[9] J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):679-698, November 1986.

[10] Z. Chen. A flexible parallel architecture for relaxation labeling algorithms. Technical Report CAR-TR-474, University of Maryland, November 1989. Computer Vision Laboratory.

[11] A. N. Choudhary. Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. PhD thesis, University of Illinois at Urbana-Champaign, September 1989. Coordinated Science Laboratory.

[12] A. N. Choudhary and R. Ponnusamy. Parallel implementation and evaluation of a motion estimation system algorithm using several data decomposition strategies on a shared memory multiprocessor. Submitted for publication, 1991.

[13] R. Cole and U. Vishkin. Approximate and exact parallel scheduling with applications to list, tree, and graph problems. In Proceedings of the 27th Symposium on Foundations of Computer Science, pages 478-491, 1986.

[14] H. Derin and C. Won. A parallel image segmentation algorithm using relaxation with varying neighborhoods and its mapping to array processors. Computer Vision, Graphics, and Image Processing, 40(1):54-78, October 1987.
[15] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.

[16] R. Engelmore and T. Morgan. Blackboard Systems. Addison-Wesley Publishing Company, Massachusetts, 1988.

[17] T. J. Fan. Describing and Recognizing 3-D Objects Using Surface Properties. PhD thesis, University of Southern California, August 1988. Institute for Robotics and Intelligent Systems.

[18] T. J. Fan, G. Medioni, and R. Nevatia. Segmented descriptions of 3-D surfaces. IEEE International Journal of Robotics and Automation, pages 527-538, December 1987.

[19] T. J. Fan, G. Medioni, and R. Nevatia. Recognizing 3-D objects using surface descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(11):1140-1157, November 1989.

[20] O. D. Faugeras and K. Price. Semantic description of aerial images using stochastic labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-3(6):633-642, November 1981.

[21] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, September 1972.

[22] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-469, 1982.

[23] E. Gabber. VMMP: a practical tool for the development of portable and efficient programs for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(3):304-317, July 1990.

[24] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco, 1979.

[25] J. L. Gaudiot, M. Dubois, L. T. Lee, and N. G. Tohme. The TX16: A highly programmable multi-microprocessor architecture. IEEE Micro, 6(5):18-31, October 1986.

[26] J. L. Gaudiot, R. W. Vedder, G. K. Tucker, D. Finn, and M. L. Campbell. A distributed VLSI architecture for efficient signal and data processing.
IEEE Transactions on Computers, C-34(12):1072-1087, December 1985.

[27] S. L. Gazit and G. Medioni. Contour correspondences in dynamic imagery. In Proceedings of the DARPA Image Understanding Workshop, pages 423-432, April 1988.

[28] J. Ghosh and K. Hwang. Mapping neural networks onto message-passing multicomputers. Journal of Parallel and Distributed Computing, 2(6):291-330, 1989.

[29] R. C. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley Publishing Company, Massachusetts, 1987.

[30] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(4):469-482, July 1987.

[31] J. Gu, W. Wang, and T. C. Henderson. A parallel architecture for discrete relaxation algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(6):816-831, November 1987.

[32] C. Guerra. Systolic algorithms for local operations on images. IEEE Transactions on Computers, C-35(1):73-77, January 1986.

[33] C. Guerra and S. Hambrusch. Parallel algorithms for line detection on a mesh. Journal of Parallel and Distributed Computing, 6(2):1-19, 1989.

[34] A. R. Hanson and E. M. Riseman. VISIONS: a computer system for interpreting scenes. In A. R. Hanson and E. M. Riseman, editors, Computer Vision Systems, pages 303-333. Academic Press, 1978.

[35] D. Hillis. The Connection Machine. MIT Press, Massachusetts, 1985.

[36] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Science, pages 2554-2558, May 1982.

[37] B. K. P. Horn. Robot Vision. MIT Press, Massachusetts, 1986.

[38] P. V. C. Hough. Method and means for recognizing complex patterns. U.S. Patent 3,069,654, December 1962.

[39] A. Huertas, W. Cole, and R. Nevatia. Detecting runways in aerial images.
In Proceedings of the DARPA Image Understanding Workshop, pages 272-297, February 1987.

[40] K. Hwang and F. A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, New York, 1984.

[41] M. Imai, Y. Yoshida, and T. Fukumura. A parallel searching scheme for multiprocessor systems and its application to combinatorial problems. In Proceedings of the Sixth International Conference on Artificial Intelligence, pages 416-418, August 1979.

[42] INMOS Corporation. The Transputer Databook, 1989.

[43] INMOS Limited. OCCAM 2 Reference Manual, 1988.

[44] Intel Corporation. iPSC 2 User's Guide, 1989.

[45] M. Kamada, K. Toraichi, R. Mori, K. Yamamoto, and H. Yamada. A parallel architecture for relaxation operations. Pattern Recognition, 21(2):175-181, 1988.

[46] E. W. Kent, M. O. Shneier, and R. Lumia. PIPE: pipeline image processing engine. Journal of Parallel and Distributed Computing, 2(1):50-78, 1985.

[47] B. W. Kernighan and D. M. Ritchie. The C Programming Language. Prentice Hall, New Jersey, 1978.

[48] J. T. Kuehn, H. J. Siegel, D. L. Tuomenoksa, and G. B. Adams III. The use and design of PASM. In S. Levialdi, editor, Integrated Technology for Parallel Image Processing, pages 133-152. Academic Press, 1985.

[49] V. Kumar and L. N. Kanal. Parallel Branch-and-Bound formulations for AND/OR tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):768-778, November 1984.

[50] V. Kumar and V. Nageshwara Rao. Scalable parallel formulations of depth-first search. In V. Kumar, P. S. Gopalakrishnan, and L. N. Kanal, editors, Parallel Algorithms for Machine Intelligence and Vision, pages 1-41. Springer Verlag, 1990.

[51] H. T. Kung and O. Menzilcioglu. Warp: A programmable systolic array processor. In SPIE Real Time Signal Processing VII, pages 130-136, 1984.

[52] S. Y. Kung and J. N. Hwang. Neural network architectures for robotic applications.
IEEE Transactions on Robotics and Automation, 5(5):641-657, October 1989.

[53] T. Kushner, A. Y. Wu, and A. Rosenfeld. Image processing on ZMOB. IEEE Transactions on Computers, C-31(10):943-951, October 1982.

[54] S. Y. Lee and J. K. Aggarwal. Parallel 2-D convolution on a mesh connected array processor. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(9):590-594, July 1987.

[55] S. Y. Lee and J. K. Aggarwal. A problem-driven approach to parallel processing: System design/scheduling and task mapping. Technical Report CVRC TR-87-7-39, University of Texas at Austin, June 1987. Laboratory for Image and Signal Analysis.

[56] S. P. Levitan. Parallel Algorithms and Architectures: A Programmer's Perspective. PhD thesis, University of Massachusetts, 1984. Computer and Information Sciences.

[57] G. J. Li and B. W. Wah. Coping with anomalies in parallel branch-and-bound algorithms. IEEE Transactions on Computers, C-35(6):568-573, June 1986.

[58] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 50-78, April 1987.

[59] J. J. Little, G. Blelloch, and T. Cass. Parallel algorithms for computer vision on the Connection Machine. In Proceedings of the DARPA Image Understanding Workshop, pages 628-638, February 1987.

[60] M. J. Little and J. Grinberg. The third dimension. BYTE, pages 311-319, November 1988.

[61] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.

[62] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London, B, 207:187-217, 1980.

[63] J. T. McCall, J. G. Tront, F. G. Gray, R. M. Haralick, and W. M. McCormack. Parallel computer architectures and problem solving strategies for the consistent labeling problem. IEEE Transactions on Computers, C-34(11):973-980, November 1985.

[64] G. Medioni and R. Nevatia. Matching images using linear features.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):675-685, November 1984.

[65] G. Medioni and R. Nevatia. Segment-based stereo matching. Computer Vision, Graphics, and Image Processing, 31(1):2-18, July 1985.

[66] R. Mohan. Perceptual Organization for Computer Vision. PhD thesis, University of Southern California, August 1989. Institute for Robotics and Intelligent Systems.

[67] R. Mohan and R. Nevatia. Perceptual grouping with applications to 3-D shape extraction. In Proceedings of the IEEE Computer Society Workshop on Computer Vision, pages 158-163, December 1987.

[68] R. Mohan and R. Nevatia. Perceptual grouping for the detection and description of structures in aerial images. In Proceedings of the DARPA Image Understanding Workshop, pages 512-526, April 1988.

[69] V. Nageshwara Rao and V. Kumar. Parallel depth first search. Part I: implementation. International Journal of Parallel Programming, 16(6):479-499, 1987.

[70] V. Nageshwara Rao and V. Kumar. Parallel depth first search. Part II: analysis. International Journal of Parallel Programming, 16(6):501-519, 1987.

[71] R. Nevatia. Machine Perception. Prentice Hall, New Jersey, 1982.

[72] R. Nevatia and K. R. Babu. Linear feature extraction and description. Computer Vision, Graphics, and Image Processing, 13:257-269, 1980.

[73] R. Nevatia and T. O. Binford. Description and recognition of curved objects. Artificial Intelligence Journal, 8:77-98, 1977.

[74] R. Nevatia and K. Price. Research in knowledge-based vision techniques for the autonomous land vehicle program. Technical Report IRIS 201, University of Southern California, September 1986. Institute for Robotics and Intelligent Systems.

[75] M. Oshima and Y. Shirai. Object recognition using three-dimensional information. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(4):353-361, July 1983.

[76] T. Poggio, J. Little, E. Gamble, W. Gillett, D. Geiger, D.
Weinshall, M. Villalba, N. Larson, T. Cass, H. Bulthoff, M. Drumheller, P. Oppenheimer, W. Yang, and A. Hurlbert. The MIT vision machine. In Proceedings of the DARPA Image Understanding Workshop, pages 177-198, April 1988.

[77] V. K. Prasanna-Kumar and D. Reisis. Parallel architectures for image processing and vision. In Proceedings of the DARPA Image Understanding Workshop, pages 609-619, April 1988.

[78] W. K. Pratt. Digital Image Processing. John Wiley & Sons, New York, 1978.

[79] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer, New York, 1985.

[80] K. W. Przytula, V. K. Prasanna-Kumar, and W. M. Lin. Algorithmic mapping of neural network models onto parallel SIMD machines. Submitted for publication, 1991.

[81] S. Ranka and S. Sahni. Image transformations on hypercube and mesh multicomputers. In V. K. Prasanna-Kumar, editor, Parallel Architectures and Algorithms for Image Understanding, pages 227-248. Academic Press, Inc., 1991.

[82] A. P. Reeves. Software computer vision environments for parallel computers. In V. K. Prasanna-Kumar, editor, Parallel Architectures and Algorithms for Image Understanding, pages 453-472. Academic Press, Inc., 1991.

[83] D. Reisis and V. K. Prasanna-Kumar. Parallel processing of image and stereo matching using linear segments. Technical Report IRIS 204, University of Southern California, February 1987. Institute for Robotics and Intelligent Systems.

[84] T. A. Rice and H. Jamieson. Parallel processing for computer vision. In S. Levialdi, editor, Integrated Technology for Parallel Image Processing, pages 57-78. Academic Press, 1985.

[85] A. Rosenfeld. A report on the DARPA image understanding architectures workshop. In Proceedings of the DARPA Image Understanding Workshop, pages 298-302, February 1987.

[86] A. Rosenfeld. Computer vision. In M. C. Yovits, editor, Advances in Computers, pages 265-308. Academic Press, Inc., 1988.

[87] A.
Rosenfeld, R. A. Hummel, and S. W. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, SMC-6(6):420-433, June 1976.

[88] A. Rosenfeld, B. Simpson, and S. Squires, editors. Proceedings of the DARPA Workshop on Architectures for Image Understanding, McLean, Virginia, November 1986.

[89] A. Rosenfeld and R. C. Smith. Thresholding using relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-3(5):598-606, September 1981.

[90] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. MIT Press, Cambridge, Massachusetts, 1986.

[91] W. S. Rutkowski, S. Peleg, and A. Rosenfeld. Shape segmentation using relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-3(4):368-375, July 1981.

[92] C. L. Seitz. The cosmic cube. Communications of the ACM, 28(1):22-33, January 1985.

[93] S. C. Shapiro. Encyclopedia of Artificial Intelligence. John Wiley & Sons, New York, 1987.

[94] D. E. Shaw. The NON-VON supercomputer. Technical report, Columbia University, August 1982. Computer Science Department.

[95] D. B. Shu, J. G. Nash, M. M. Eshaghian, and K. Kim. Straight-line detection on a gated-connection VLSI network. In Proceedings of the Tenth International Conference on Pattern Recognition, pages 456-461, June 1990.

[96] B. Soucek and M. Soucek. Neural and Massively Parallel Computers. John Wiley & Sons, New York, 1988.

[97] G. L. Steele. Common Lisp. Digital Press, 1984.

[98] Q. F. Stout. Mapping vision algorithms to parallel architectures. Proceedings of the IEEE, 76(8):982-995, August 1988.

[99] R. Vaillant, R. Deriche, and O. Faugeras. 3D vision on the parallel machine CAPITAN. In International Workshop on Industrial Applications of Machine Intelligence and Vision, pages 326-331, April 1989.

[100] B. W. Wah and Y. W. E. Ma. MANIP: a multicomputer architecture for solving combinatorial extremum-search problems.
IEEE Transactions on Computers, C-33(5):377-390, May 1984.

[101] R. S. Wallace and M. D. Howard. HBA vision architectures: Built and benchmarked. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(3):227-232, March 1989.

[102] D. Waltz. Generating Semantic Descriptions from Drawings of Scenes with Shadows. PhD thesis, Massachusetts Institute of Technology, November 1972. Artificial Intelligence Laboratory.

[103] J. A. Webb. Machine independent parallel image processing: Apply and Adapt. In V. K. Prasanna-Kumar, editor, Parallel Architectures and Algorithms for Image Understanding, pages 499-524. Academic Press, Inc., 1991.

[104] C. Weems, E. Riseman, and A. Hanson. The DARPA image understanding benchmark for parallel computers. Journal of Parallel and Distributed Computing, 11(1):1-24, 1991.

[105] C. C. Weems and J. H. Burrill. The image understanding architecture and its programming environment. In V. K. Prasanna-Kumar, editor, Parallel Architectures and Algorithms for Image Understanding, pages 525-562. Academic Press, Inc., 1991.

[106] C. C. Weems and S. P. Levitan. The Image Understanding Architecture. In Proceedings of the DARPA Image Understanding Workshop, pages 483-496, February 1987.

[107] M. Y. Wu and D. D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1(3):330-343, July 1990.