SCALABLE DATA PARALLEL ALGORITHMS AND IMPLEMENTATIONS FOR OBJECT RECOGNITION

by

Ashfaq Ahmad Khokhar

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

May 1993

Copyright 1993 Ashfaq Ahmad Khokhar

UMI Number: DP22867

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI DP22867
Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC. All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Ashfaq Ahmad Khokhar under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: April 16, 1993
DISSERTATION COMMITTEE

To the advancement of higher education in Pakistan

Acknowledgments

Like many others, I wanted to write the acknowledgments in a unique style, but this exercise has turned out to be much more difficult than the writing of the dissertation itself. Writing this dissertation has not been a single-handed accomplishment. Several years of effort and various kinds of help from family, friends, and colleagues need to be acknowledged.
To begin with, all praises and thanks to Almighty Allah for blessing me with the strength to survive through all the times, both easy and hard ones. My parents invested everything they had into educating a family of six sons and four daughters; a very tough job but an extremely well accomplished one. Their only asset is our education. My deepest gratitude to them for their continuous support and encouragement in every academic endeavor I pursued. Their unparalleled and unconditional caring kept me going through all the ups and downs in my life. My intellectual mentors, my brothers and sisters, and in particular my eldest sister Kalsoom and brother Ijaz: their glorious successes provided motivation for all of us to strive for whatever we dream. Being in the middle of my family, I found myself facing outstanding performances in the field of education from elder brothers and sisters and the striving attitudes of younger ones. That equipped me with all that was needed to go through the rigorous endeavor of obtaining the degree of Doctor of Philosophy! I am thankful to all of them, including my friend Usman, for their innumerable ways of helping and encouraging me. Also, thanks to uncle Nawaz for nurturing the trend of higher education in the family. I can't forget the days at Columbia University, New York, when I was vigorously looking for admission in schools doing vision and parallel processing. I had a craze for this research field in those times! Thanks to Dr. Hussein Ibrahim and Dr. Ajit Singh for sticking with me and providing all the moral support. Professor Viktor Prasanna (V.K. Prasanna-Kumar at that time!) came through and became the reason for my choosing USC for PhD studies. I have found in him a friendship that will go a long way.
I owe him thanks in many different ways: for advising and guiding me through the PhD, for showing me the practicalities and politics of academia, and for encouraging me and providing the opportunities to be active in various technical forums. Thanks to Professor Jean-Luc Gaudiot and Professor Michel Dubois for guiding me through the PhD qualifier. Also, thanks to Professor Gerard Medioni and Professor Ellis Horowitz for being on my dissertation committee and for many insightful discussions on the latest problems related to my research interests. I appreciate the help of Cho-Li Wang and Hyoung-Joong Kim in writing the code for my algorithms on parallel machines. In addition, thanks to my colleagues Muhammad Shaaban, Anil Rao, Heonchul Park, Ju-wook Jang, Cho-Chin Lin, Manav Misra, Kichul Kim, Wei-Ming Lin, Mourad Zerroug, Mary Eshaghian, and Hussein Alnuweiri for their support and help in dissertation related matters, or otherwise. Also, on several occasions after a long day of hard work, those awesome and relaxed dinners at Mary's parents' often fulfilled the desire for gourmet food away from HOME. I deeply appreciate the hospitality and affection of her family. Over the past several years in the US, I got the opportunity to befriend some extremely caring people. They have shared my happiest times and have provided me support through the worst. Among those, Zafar Singhera, Sohail Zafar, Tanuja Pethe, and Athar Salim Mian stood by me through every crisis, be it academic or otherwise. In the midst of academics, countless moments of homesickness faded away thanks to the true friendship of Neeraj Rai, Kuldip Kaur, Madhuri Samel, Abhay Joshi, Imran Ahmad, Mohammad Zafar, Farrukh Raza, Zahida Sharief, and Irfan Khursheed, and a home away from home was because of Shahab and Niaz.
At USC, many others who extended their friendship to help me in numerous ways include: Naveed Alam, Farooq-e-Azam, Murshed Khan, Mushtaq Khan, Anoopama Prasanna, Faheem Uddin, and Shahid Zia. I deeply appreciate the continuous support of Arnold Clomera and Aniekan Akpaffiong and other colleagues at University Computing Services who kept me on the job as a Unix Consultant in spite of my very selective schedule. My special thanks to Barbra Edling, who always managed to keep in touch with me, provided all the moral support during depressed times (which came very often!), and with her amicable personality, enlightened me about mid-western culture and values. During the last part of dissertation writing, I had a chance to meet a wonderful person, Nikoletta Lendvai. We shared an interest in vision and neural networks. Thanks for her soothing phone calls from St. Paul, Minnesota during the looooong nights of dissertation writing. She has proven to be an asset as a friend. The efficient and friendly staff at USC, including Lucille Stivers, Irene Olotoa, Regina Morton, Juliet Mendoza, Lea Vasquez, Bill Bates, and Diane Demetres, came through to meet sometimes unreasonable requests related to late paychecks, academic requirements, and the preparation and mailing of manuscripts and technical reports. Lastly, once again I wish to thank all the friends and family members back home and here who have always made me feel proud of what I do, and all the teachers who ever taught me from childhood to this age. Without their support and blessings from Almighty Allah, this accomplishment would not have been possible.

Ashfaq Khokhar
Los Angeles
April 16, 1992

Contents

Dedication ii
Acknowledgments iii
List of Figures viii
List of Tables xi
Abstract xiv

I The Premise 1

1 Introduction 2
1.1 Object Recognition 4
1.2 Contributions of this Dissertation 7
1.3 A View of Things to Come 11

2 Background 12
2.1 Techniques for Object Recognition 12
2.2 Problem Statement 16
2.3 Related Work 17
2.4 Computational Models 19
2.5 Performance Metrics 22

II Research Contributions 24

3 Parallel Object Recognition 25
3.1 Image Matching using Relaxation Labelling 26
3.1.1 Matching Technique 27
3.1.2 A Fast Sequential Image Matching Algorithm 28
3.1.3 Parallel Image Matching 31
3.2 Object Recognition using Geometric Hashing 38
3.2.1 Geometric Hashing Algorithm 38
3.2.2 Parallel Geometric Hashing 41
3.3 Object Recognition using Structural Indexing 51
3.3.1 Structural Indexing Technique 51
3.3.2 Parallel Structural Indexing 55

4 Applications of Parallel Techniques 65
4.1 Stereo Matching 66
4.2 Related Work 67
4.3 Stereo Matching using Zero Crossings 70
4.3.1 Sequential Technique 71
4.3.2 A Fast Sequential Algorithm 73
4.3.3 Partitioned Parallel Implementation 73
4.4 Stereo Matching using Linear Features 79
4.4.1 Sequential Technique 79
4.4.2 Partitioned Parallel Implementations 83

5 Parallel Implementations 94
5.1 Parallel Machines for Implementations 95
5.1.1 The Connection Machine CM-5 95
5.1.2 The MasPar MP-1 96
5.2 Experimenting with CM-5 and MP-1 97
5.3 Geometric Hashing 103
5.3.1 Partitioning and Mapping 106
5.3.2 Experiments and Summary of Performance Results 108
5.4 Structural Indexing 111
5.4.1 Partitioning and Mapping 114
5.4.2 Experiments and Summary of Performance Results 114
5.5 Scalability of Implementations 117
5.6 Comparison: MP-1 vs CM-5 119

III The Surmise 120

6 Conclusions and Future Work 121
6.1 Contribution of this Dissertation 121
6.2 Directions for Future Research 122

Appendices 125
A Additional Performance Results 125
Bibliography 132

List of Figures

1.1 A hierarchy of sequential approaches for object recognition. 6
2.1 A fixed size linear array. 21
2.2 A fixed size mesh array. 21
3.1 Determining a match-window. 27
3.2 A sequential algorithm for image matching. 29
3.3 A parallel algorithm for image matching. 32
3.4 Structure of a PE. 33
3.5 Partitioned implementation on a fixed size linear array. 34
3.6 Initialization. 35
3.7 Implementing Id broadcast on the linear array. 36
3.8 Update procedure. 37
3.9 A general scheme for the geometric hashing technique. 39
3.10 A sequential procedure to construct the hash table. 41
3.11 Outline of steps in sequential recognition. 42
3.12 A parallel procedure to process a probe of the recognition phase. 44
3.13 Contention for a single hash bin. 45
3.14 Congestion at a PE while accessing different hash bins stored in a PE. 45
3.15 A parallel procedure to merge quantized coordinates of feature points of the scene. 46
3.16 A fat tree model. 47
3.17 Two dimensional super segments of cardinality 4. 53
3.18 Representation of a model in a hash table (courtesy Stein and Medioni). 54
3.19 An algorithm to record models in the hash table. 54
3.20 Retrieval of candidate super segments (courtesy Stein and Medioni). 55
3.21 Verification of hypotheses (courtesy Stein and Medioni). 56
3.22 An algorithm to recognize models present in a scene. 57
3.23 A parallel algorithm to process super segments and record them in a hash table. 59
3.24 A parallel algorithm to route the data to corresponding bins in the hash table. 60
3.25 A parallel algorithm for the recognition phase. 60
3.26 A parallel algorithm to construct the correspondence table. 61
3.27 A parallel algorithm to check consistency of hypotheses for models present in the scene. 62
4.1 Imaging geometry of stereo cameras. 67
4.2 Stereo matching: known sequential approaches. 68
4.3 Finding a candidate match. 71
4.4 Sequential algorithm for stereo matching using zero crossings as matching primitives. 73
4.5 Algorithm for computing initial weights for candidate matches. 74
4.6 Algorithm for updating probabilities of candidate matches. 74
4.7 Algorithm for voting for candidate matches. 75
4.8 A faster algorithm for updating probabilities of candidate matches. 75
4.9 A parallel algorithm for stereo matching using zero-crossings as matching primitives. 76
4.10 A parallel algorithm for computing initial weights for candidate matches. 77
4.11 A parallel algorithm for updating probabilities of candidate matches. 78
4.12 A parallel algorithm for voting for candidate matches. 78
4.13 Determining a stereo-window. 80
4.14 Sequential algorithm for stereo matching. 81
4.15 Finding partially preferred matches for the first image. 82
4.16 Updating sets of preferred matches. 83
4.17 Parallel algorithm for stereo matching. 83
4.18 Mapping windows of the left image onto a fixed size linear array. 86
4.19 Parallel-i-pref(i) algorithm for execution on a fixed size linear array. 87
4.20 Execution of Parallel-i-pref() on a fixed size linear array. 90
4.21 Parallel-i-pref(i) on a mesh array. 91
4.22 Parallel-Q-update for updating sets of preferred matches. 91
4.23 Data flow for procedure Parallel-i-pref(i). 93
5.1 The organization of the Connection Machine CM-5. 96
5.2 A virtual CM-5 organization from a programmer's perspective. 97
5.3 The MasPar MP-1 system block diagram (courtesy MasPar Computer Corporation). 98
5.4 A parallel procedure to compute the coordinates of feature points of the scene. 104
5.5 A parallel procedure to vote for the possible presence of a model in the scene. 104
5.6 A parallel procedure to compute the winning (model, basis) pair. 105
5.7 Hash bin access time vs number of hash table copies for Algorithm B and Algorithm C on a 512 PE CM-5. 112
5.8 Hash bin access time vs number of hash table copies for Algorithm B and Algorithm C simulating a worst-case scenario on a 512 PE CM-5. 113
5.9 Voting time vs number of hash table copies for Algorithm B and Algorithm C on a 512 PE CM-5. 114
5.10 Voting time vs number of hash table copies for Algorithm B and Algorithm C simulating a worst-case scenario on a 512 PE CM-5. 115
5.11 Total execution time vs machine size for various algorithms. 116

List of Tables

1.1 Comparison with earlier parallel implementations of object recognition using geometric hashing.
5.1 Timing analysis for integer computations on a 256 processor CM-5.
5.2 Timing analysis for point-to-point data communication on a 256 processor CM-5.
5.3 Timing analysis for data broadcasting on the control network.
5.4 Timing analysis for basic arithmetic operations on a 1K processor MP-1.
5.5 Timing analysis for basic communication operations on a 1K processor MP-1.
5.6 Execution times (in msec) of various subtasks in a probe using different partitioning algorithms for a scene consisting of 1024 feature points on a 256 CM-5 and 1K MP-1.
5.7 Execution times (in msec) of various algorithms on a scene consisting of 1024 feature points.
5.8 Execution times (in msec) of Algorithm C on a scene consisting of 1024 feature points with concurrent processing of multiple probes on various sizes of MP-1. Average bin size is 8.
5.9 Comparison with previous implementations.
5.10 Execution times (in msec) of various subtasks in the recognition phase using different partitioning algorithms for a scene consisting of 1024 super segments on a 256 CM-5 and 1K MP-1.
5.11 Execution times (in msec) of various algorithms on a scene consisting of 1024 super segments.
5.12 Comparison with geometric hashing implementations.
A.1 Execution times of Algorithm A on a 256 processor CM-5.
A.2 Execution times of Algorithm B on a 32 processor CM-5.
A.3 Execution times of Algorithm B on a 64 processor CM-5.
A.4 Execution times of Algorithm B on a 128 processor CM-5.
A.5 Execution times of Algorithm B on a 256 processor CM-5.
A.6 Execution times of Algorithm B on a 512 processor CM-5. 128
A.7 Execution times of Algorithm B corresponding to various data granularities of the hash table on a scene consisting of 256 feature points, on a 256 processor CM-5. 129
A.8 Execution times of Algorithm A corresponding to various data granularities of the hash table on a scene consisting of 1024 feature points, on a 1K processor MP-1. 129
A.9 Execution times of Algorithm A corresponding to various data granularities of the hash table on a scene consisting of 256 feature points, on a 1K processor MP-1. 130
A.10 Execution times of Algorithm C on worst-case data with 16 copies of the hash table on a 512 processor CM-5. 130
A.11 Execution times of Algorithm C on worst-case data with 8 copies of the hash table on a 512 processor CM-5. 131
A.12 Execution times of Algorithm C on semi-worst-case data with 16 copies of the hash table on a 512 processor CM-5. 131
A.13 Execution times of Algorithm C on semi-worst-case data with 8 copies of the hash table on a 512 processor CM-5. 131

Abstract

Object recognition involves identifying known objects in a given scene. It plays a key role in image understanding and is of significant interest to the vision community.
A wide variety of techniques for object recognition is available. In addition to the primitives used for matching, these techniques differ with respect to the matching constraints and the search methods employed. In these approaches, the amount of computation is tremendous and the data structures employed are complex. These characteristics qualify image understanding as a Grand Challenge problem. In order to achieve fast and real time solutions to image understanding tasks, parallel processing becomes a necessity.

In this dissertation, parallel solutions based on well known sequential approaches to object recognition and their implementations on state of the art commercially available parallel machines are studied. Based on various sequential approaches, scalable data parallel algorithms are designed. Novel partitioning and mapping strategies are developed to parallelize the serial solutions to these problems. These techniques lead to optimal and scalable implementations on the parallel machines. For asymptotic time analysis, the fixed size processor array is used as the computational model. The object recognition techniques studied include image matching using relaxation labelling, object recognition using geometric hashing, and object recognition using structural indexing.

For image matching using relaxation labelling, data parallel algorithms are developed. On a P processor linear array, the algorithm runs in O((nm/P + P)·nm) time, where 1 ≤ P ≤ nm. This leads to a processor-time optimal solution for 1 ≤ P ≤ √nm. An O((nm/P² + P)·nm) time performance is achieved on a P² processor mesh array. The solution is processor-time optimal for 1 ≤ P ≤ (nm)^(1/3). The sequential algorithm takes O(n²m²) time.

For the geometric hashing technique, data parallel algorithms are developed for the recognition phase.
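The processor-time optimality range quoted for the linear-array image matching algorithm can be checked directly (a sketch, assuming the running time reconstructed above, O((nm/P + P)·nm), against the sequential O(n²m²) bound):

```latex
P \cdot T(P) \;=\; P \cdot \left(\frac{nm}{P} + P\right) nm \;=\; \left(nm + P^{2}\right) nm ,
```

which stays within a constant factor of the sequential \(O(n^{2}m^{2})\) work exactly when \(P^{2} = O(nm)\), i.e., for \(1 \le P \le \sqrt{nm}\). The same calculation on the P² processor mesh gives \((nm + P^{3})\,nm\), yielding the stated range \(1 \le P \le (nm)^{1/3}\).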
Compared with earlier parallel implementations, the proposed solutions lead to superior time performance. Also, the number of processors employed in our solution is significantly less. Given a scene with S feature points, the algorithm for the recognition phase takes O(S/√P) time to process a probe on a P processor mesh array, where 1 ≤ P ≤ S. A probe corresponds to the execution of the recognition phase for one basis pair. The sequential geometric hashing algorithm takes O(S) time per probe. For structural indexing, the proposed algorithm takes O(S²/P) time on a P processor mesh array, where S is the number of super segments in the scene, and 1 ≤ P ≤ S. The sequential algorithm for structural indexing takes O(S²) time.

The parallel techniques developed for object recognition are extended to derive parallel solutions to other image understanding tasks. For example, scalable data parallel algorithms are designed for stereo matching using zero-crossings and for stereo matching using linear features. These solutions lead to processor-time optimal implementations on fixed size processor arrays.

Finally, the algorithms proposed for geometric hashing are implemented on the MasPar MP-1, a Single Instruction Multiple Data (SIMD) machine, and on the Connection Machine CM-5 operating in the Single Program Multiple Data (SPMD) mode. Our results show that a probe for a scene consisting of 1024 feature points takes less than 50 msec on a 1024 processor MP-1 and less than 10 msec on a 256 processor CM-5. The database used in these experiments contains 1024 models and each model is represented using 16 feature points. The implementations developed in this dissertation require a number of processors independent of the size of the model database and are scalable with the machine size.
For structural indexing, it is shown that the execution of the recognition phase on a scene consisting of 1024 super segments takes less than 100 msec on MP-1 and less than 50 msec on a 256 processor CM-5.

Part I

The Premise

Chapter 1

Introduction

Since the inception of the human race, inquisitive and intelligent minds have attempted to unveil the unknown. The greatest challenge is the human mind itself. The functioning of the mind, its ability to adapt to changes, and its capability to use inputs from its various senses in an integrated fashion have been subject to speculation. The most perplexing phenomenon, the ability of humans to perceive objects based on visual cues, has been a matter of utmost amazement to many. Since the dawn of the computer era, efforts have been focused on building machines which would mimic the human mind. Computer vision, or machine perception, is one such endeavor, striving to model the process of human perception using computers. It attempts to enable computers to perceive their environment by sensory means as human beings do. It deals with the construction of explicit, meaningful descriptions of physical objects from images. In general, the challenge lies in finding out what exactly can be extracted from images using only very basic operations. Specifically, what computations must be performed? At what stage should domain-dependent prior knowledge about the world be incorporated into the understanding process [BB82]? From a computational perspective, computer vision processing is usually organized as follows:

• Early processing of the raw image, often termed low-level processing. At this level, input is an image and output is an image of the same size. Computations consisting of simple arithmetic/logic operations are performed on each pixel simultaneously. The data communication among the pixels is local to each pixel.
• Interface between low-level and image understanding problems, often termed intermediate-level processing. Input to this level is an image and the output is an array or table of data items or an image of reduced size. The operations performed on each data item can be nonlocal and the communication is also irregular compared to that in low-level image processing.

• Image understanding, using the data acquired from the above processing (for example, geometric features such as shape, orientations, moments, etc.) to infer semantic attributes of an image. Processing at this level can be classified as knowledge processing and/or symbolic processing. Search based techniques are widely used at this level.

As a result of the immense potential application of computer vision in areas such as medicine, navigation, manufacturing, and remote sensing, interest in this field is growing rapidly. Several approaches are being studied to solve vision problems efficiently. Independent of the approach taken, computers capable of executing millions of operations per second are required for real time response. For example, consider the interpretation of a changing scene at 30 frames per second. The volume of data to be handled is about 25 million bytes per second, assuming a moderate resolution of 512 x 512 pixels per frame, with three bytes per pixel (one byte for each primary color with 256 grey levels). The number of operations for simple image transformations may amount to 100-10,000 billion instructions per second [WHRR88]. Most computers built over the past decade have been designed to execute one instruction at a time. Thus, the processing of even a simple vision task on such computers requires massive data funneling through a single processor, resulting in slow response time of the system. This limitation has led researchers to investigate parallel processing for computer vision [Kum91].
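The data-rate figure quoted above is a back-of-the-envelope estimate; it can be reproduced with a few lines of arithmetic (the 25 million bytes per second in the text is a rounded value):

```python
# Check of the video data-rate estimate: 512 x 512 frames, 3 bytes/pixel, 30 fps.
width, height = 512, 512        # pixels per frame
bytes_per_pixel = 3             # one byte per primary color, 256 grey levels each
frames_per_second = 30

bytes_per_frame = width * height * bytes_per_pixel
bytes_per_second = bytes_per_frame * frames_per_second

print(bytes_per_frame)   # 786,432 bytes per frame
print(bytes_per_second)  # 23,592,960 bytes/s, i.e. roughly the 25 MB/s cited
```

The exact product is just under 24 million bytes per second, consistent with the text's "about 25 million bytes per second."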
Parallel processing provides a breakthrough to this computational logjam by replacing the single processor with a large number of processors working cohesively on several computations in parallel. Computer vision and image understanding tasks employ a wide variety of techniques from several areas such as image and signal processing, discrete mathematics, graph theory, relational algebra, and artificial intelligence. Hence, not only is the volume of computations for solving vision problems tremendous; computers capable of handling a wide variety of computation/communication characteristics are also needed [CP90].

Most of the parallel algorithms developed so far have addressed problems in low level and intermediate level vision, which deal with image processing [Kum91]. Image understanding is very different from image processing, which studies image-to-image transformations without explicit descriptions. Descriptions are a prerequisite for recognizing, manipulating, and thinking about objects. Most of the algorithms in image understanding are symbolic in nature and processing is not entirely local. Use of parallelism for image understanding offers many challenges. These include the complex data structures employed in the serial solutions to these problems, the symbolic nature of the computations requiring efficient run time support for interprocessor communication, and the lack of well defined metrics to evaluate the performance of the solutions.

This dissertation investigates parallel solutions for various image understanding tasks. In particular, parallelism for object recognition is studied. In the following, a brief introduction to object recognition is presented.

1.1 Object Recognition

In image understanding, object recognition plays a key role.
Object recognition can be simply defined as: "given an environment, find what known objects it contains, where they are located, and identify their pose." In order to address these problems, many other aspects need to be considered:

• What sensory cues are used to recognize candidate objects?
• How are individual objects represented?
• How large are the object libraries and how are they stored?
• What methods are used to extract the objects from the library?
• How is the correspondence established between sensory cues and features of an object?
• How are new objects learned and added to the library?

In addition to the above issues, real-time processing requirements make it imperative to study parallel solutions for object recognition. In this dissertation, we investigate efficient parallel solutions based on well-known sequential techniques for object recognition. Several techniques have been developed for object recognition. These techniques can be divided into two classes: non-correspondence matching and correspondence matching [Gri90]. A pictorial representation of the sequential techniques for object recognition and their inter-relationships is shown in Fig. 1.1. Non-correspondence matching, also called global matching [WZ88, RPAK88, MST85, WMF81], involves finding a transformation from a model to an image without determining the correspondence between individual parts or features of the model and image [Gri90]. In correspondence matching [Gri90, Bro81, MN84], also known as feature matching, correspondence is established between the features extracted from the image and corresponding local features of the object model. The work in this thesis deals with object recognition using two-dimensional feature matching (correspondence matching) techniques only. In feature matching, several techniques have been developed.
Such techniques differ not only in terms of matching primitives but also in terms of matching criteria and search methods. The several degrees of freedom available in object recognition make this problem more comprehensive but at the same time extremely complex. In the simplest approaches to feature matching, matching is performed at the pixel level, and the available descriptors are pixel properties such as color and intensity. These techniques are similar to the area-based approaches in stereo matching [BB81]. At the other extreme, images can be matched by deriving object descriptions, i.e., recognizing semantic structures such as roads and buildings [MN84].

Figure 1.1: A hierarchy of sequential approaches for object recognition.

In search methods, various techniques have been employed. These include relaxation labelling [MN84, RHZ76, Pri82], alignment [HU88, DRLR89, Low87], interpretation trees [GP87, Bai85, Ett88], generalized Hough transforms [GH90, Skl78, IK88], geometric hashing [LW88, Wol90], and structural indexing [SM90].

1.2 Contributions of this Dissertation

The contributions of this dissertation are twofold: the design of fast parallel solutions for object recognition, and their parallel implementations on state-of-the-art commercially available parallel machines. In particular, the problem of 2-dimensional object recognition is studied and scalable data parallel algorithms are designed. The proposed parallel techniques are further extended to devise parallel solutions for other image understanding tasks. The implementations are carried out on the Connection Machine CM-5 and on the MasPar MP-1. These implementations are developed after carefully studying the characteristics of the underlying architectures of these machines, i.e., the fat tree for the CM-5 and the mesh array for the MP-1.
Various experiments are conducted to fine-tune the partitioning and mapping strategies to suit the communication and computation capabilities of these machines. Based on these experiments, data parallel algorithms are designed to efficiently utilize the architectural and programming features. This experimentation has assisted in achieving a uniform distribution of work load in the machines during the execution of the algorithms, leading to fast and scalable implementations. For asymptotic time analysis, the fixed size processor array is used as the underlying computational model. In the following, the major contributions of this dissertation are outlined.

A. Parallel Algorithms

We consider several well-known approaches for object recognition and show scalable data parallel solutions for these.

• Image Matching using Relaxation Labelling: Assume a scene consisting of n features and a model consisting of m features. The image matching problem deals with finding, for each feature in the scene, a set of feasible matches in the model, such that all the matches are consistent with each other. First, a fast sequential algorithm is shown which runs in O(n²m²) time. Previous approaches to this problem take O(n³m³) time. A processor-time optimal, data parallel algorithm is developed using the proposed sequential algorithm. An O((nm/P + P)nm) time performance is achieved on a linear array of size P, 1 ≤ P ≤ nm. This leads to a processor-time optimal solution for P ≤ √(nm). Also, an O((nm/P² + P)nm) time algorithm is shown on a P² processor mesh array, where 1 ≤ P ≤ √(nm). This leads to a processor-time optimal solution for P ≤ (nm)^(1/3).

• Object Recognition using Geometric Hashing: Geometric hashing has recently been proposed as a technique for model-based object recognition in occluded scenes. Data parallel algorithms are developed for the preprocessing and for the recognition phase.
In earlier parallel implementations, the number of processors employed is independent of the size of the scene but depends on the size of the model database, which is O(Mn³), where M is the number of models in the database and n is the number of feature points in each model. We significantly improve upon the number of processors employed while at the same time achieving much superior time performance. One probe of the recognition phase takes O(S/√P) time on a P processor mesh array, and O((S/P) log S) time on a fat tree based architecture. The algorithms are scalable over the range 1 ≤ P ≤ √S. The sequential algorithm for the recognition phase takes O(S) time, where S is the number of feature points in the scene.

• Object Recognition using Structural Indexing: In this method, a set of line segments called a super segment is used as the basic matching primitive. The super segments present in a given scene are used to index into the model database. Data parallel algorithms for the representation (preprocessing) and for the recognition phase are derived. The algorithm for the recognition phase takes O(S/P) time on a P processor mesh array, where S is the number of super segments in the scene. The algorithm is processor-time optimal for 1 ≤ P ≤ S.

The techniques developed for object recognition are extended to devise parallel solutions for the following stereo matching problems.

• Stereo Matching using Zero-Crossings: For stereo matching using zero-crossings, a processor-time optimal algorithm is shown. It takes O(nm/P) time on a P processor mesh array, where n is the number of zero-crossing points in the left image, m is the number of possible candidate points in the right image for a given zero-crossing point, and 1 ≤ P ≤ n. The earlier sequential algorithm takes O(nm²) time. The sequential algorithm shown in this dissertation runs in O(nm) time.
• Stereo Matching using Linear Features: For stereo matching using linear features, a processor-time optimal algorithm is shown. It takes O(Nn³/P) time on a P processor linear array, where N is the number of line segments in one image, n is the number of line segments in a window determined by the object size, and P ≤ n. The sequential algorithm takes O(Nn³) time. An O(Nn³/P²) time performance is shown on a P² processor mesh array such that 1 ≤ P ≤ n.

The algorithms proposed for image matching and structural indexing are processor-time optimal and scale linearly with the number of processors. The algorithms for geometric hashing are optimal for the underlying computational model and are scalable. The algorithms developed for stereo matching are also processor-time optimal and scale linearly with the number of processors.

B. Parallel Implementations

Several parallel algorithms proposed for geometric hashing in this dissertation have been implemented on the MasPar MP-1 and on the Connection Machine CM-5. Earlier implementations claim 700 to 1300 msec for one probe of the recognition phase, assuming 200 feature points in the scene, on an 8K processor CM-2. Our implementations run on a P processor machine, such that 1 ≤ P ≤ S, where S is the number of feature points in the scene. Our results show that one probe for a scene consisting of 1024 feature points takes less than 50 msec on a 1K processor MP-1 and less than 10 msec on a 256 processor CM-5. The model database used in the implementations contains 1024 models, and each model is represented using 16 feature points. Our implementations require a number of processors independent of the size of the model database and are scalable with the machine size. Results of concurrent processing of multiple probes of the recognition phase are also reported.
As an example, a comparison of the running time of one of our implementations with those of earlier implementations is given in Table 1.1. For structural indexing, it is shown that the execution of the recognition phase takes less than 100 msec on the MP-1 and less than 50 msec on a 256 processor CM-5. The scene considered for structural indexing contains 1024 super segments.

Methods                  # of Models          Machine     # of Scene   Total Time
                         (16 points/model)    Size/Type   Points
Our Method               1024                 256/CM-5    1024         6.68 msec
Our Method               1024                 1K/MP-1     1024         53.37 msec
Our Method               1024                 1K/MP-1     200          49.50 msec
Our Method               1024                 256/CM-5    200          4.50 msec
Hummel et al. [RH91]     1024                 8K/CM-2     200          800 msec
Medioni et al. [BM88]    X                    8K/CM-2     X            2.0-3.0 sec

Table 1.1: Comparison with earlier parallel implementations of object recognition using geometric hashing.

1.3 A View of Things to Come

In Chapter 2, techniques for 2-dimensional object recognition are briefly introduced. A detailed discussion of the previous work in parallelizing object recognition problems is presented. For asymptotic performance analysis of the parallel algorithms, the computational models used in this dissertation are introduced. To evaluate the parallel solutions, various performance metrics are defined. Chapter 3 presents our parallel algorithms for several well-known techniques for object recognition. These include image matching using relaxation labelling, object recognition using geometric hashing, and object recognition using structural indexing. In Chapter 4, the parallel techniques are extended to derive parallel solutions for other image understanding tasks. These include stereo matching using zero-crossings and stereo matching using linear features. Implementations of several parallel algorithms proposed in Chapter 3 are carried out on the Connection Machine CM-5 and on the MasPar MP-1.
In Chapter 5, based on various partitioning strategies, results are reported and compared against earlier implementations. Chapter 6 presents concluding remarks and directions for future research.

Chapter 2
Background

Application of parallel processing techniques in deriving fast parallel solutions for computer vision problems has been an active area of research since the last decade. However, most of the work in bringing together parallel processing and computer vision has focused on devising parallel solutions for problems in low-level and intermediate-level vision. In this dissertation, parallelism for object recognition, a high-level vision task, is studied. This chapter provides a brief introduction to object recognition and parallel processing. Various sequential techniques for object recognition are outlined. Section 2.2 describes the problem statement. A discussion of earlier work in parallelizing object recognition tasks is carried out in Section 2.3. For asymptotic time analysis of the parallel algorithms, the computational models used in this dissertation are introduced in Section 2.4. To evaluate parallel solutions, various performance metrics are defined in Section 2.5.

2.1 Techniques for Object Recognition

Object recognition involves identifying known objects in a given scene. It plays a key role in an image understanding system and has recently gained significant interest in the vision community. Mainly, the sequential techniques for object recognition are divided into two classes: non-correspondence matching and correspondence matching [Gri90]. Non-correspondence, or global, matching [WZ88, RPAK88, MST85, WMF81] involves finding a transformation from a model to an image without determining the correspondence between individual parts or features of the model and image [Gri90].
In correspondence matching [Gri90, Bro81, MN84], also known as feature matching, correspondence is established between the features extracted from the image and corresponding local features of the object model. The work in this thesis deals with object recognition using two-dimensional feature matching techniques only. In feature matching, several techniques have been developed. Such techniques differ not only in terms of matching primitives but also in terms of matching criteria and search methods. The several degrees of freedom available in object recognition make this problem more versatile but at the same time extremely complex. In the following, various two-dimensional techniques are outlined.

Bolles and Cain [BC82]
This technique is based on the local-feature-focus method. During the acquisition of a model into the model database, a cluster of local features in a relative configuration is searched for. It is assumed that such a configuration does not occur elsewhere in the model or in other possible models. One feature in the cluster is selected as the focus feature. During the recognition process, the focus feature of the model under consideration is searched for in the scene. If it is found, its neighborhood is searched for the remaining features in the cluster. If they are found and their relative configuration is consistent with that of the model, the model is chosen as a candidate for the match.

Medioni and Nevatia [MN84]
This technique is based on relaxation labelling. The matching features are segments and are described by the coordinates of their end points, orientation, and average contrast. The scene is assumed to have n features, called objects, and the model is assumed to have m features, called labels. The technique computes a quantity in {0, 1}, which is the possibility of assigning label p to object i, such that 1 ≤ i ≤ n and 1 ≤ p ≤ m.
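As a rough illustration of this style of computation, the skeleton of a discrete relaxation loop can be sketched as follows. This is pure mutual-consistency pruning; the geometric constraints described next, and the complexity improvements developed in this dissertation, are deliberately left out, and the function names are illustrative assumptions:

```python
def discrete_relaxation(n, m, compatible):
    """Prune infeasible (object, label) pairs until a fixed point.

    compatible(i, p, j, q) -> bool says whether assigning label p to
    object i is consistent with assigning label q to object j.
    """
    # feasible[i][p]: label p is still possible for object i
    feasible = [[True] * m for _ in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for p in range(m):
                if not feasible[i][p]:
                    continue
                # keep (i, p) only if every other object still has at
                # least one label consistent with assigning p to i
                supported = all(
                    any(feasible[j][q] and compatible(i, p, j, q)
                        for q in range(m))
                    for j in range(n) if j != i
                )
                if not supported:
                    feasible[i][p] = False
                    changed = True
    return feasible
```

A naive version of this loop already shows where the cost comes from: each pass examines object-label pairs against object-label pairs, which is what the O(n³m³) and O(n²m²) bounds discussed elsewhere in this dissertation count.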
The method relies on geometric constraints, which means that when a label p is assigned to object i, we expect to find an object j with assigned label q in an area depending on i, p, and q.

Grimson and Lozano-Perez [GH90]
This technique defines recognition methods based on interpretation trees. Assuming an object of n segments and k data points, n^k possible combinations for that object are tested. A tree search based method is devised. However, based on the geometric constraints, tree pruning is carried out during the search process.

Sharir et al. [KSSS86]
In their scheme, boundary parts of a model are segmented such that the segments are likely to belong to one model. These segments are called footprints. These footprints are used to match a scene against a database of models. A heuristics-based hashing scheme, called geometric hashing, is used to search the database. Scaling of model objects is assumed to be fixed.

Ayache and Faugeras [AF86]
A recognition system called HYPER is introduced. It uses polygonal approximations for the representation of scenes and models. Model segments are matched against scene segments, and refinement is applied recursively.

Knoll and Jain [KJ86]
Similarities and differences among model types are used to form groups of models. Features common to several models in a group are chosen to represent that group. With each feature, a list of models where the feature occurs is associated. During recognition, if a match for a model feature is found in the scene, the models contained in the associated list are hypothesized.

Lamdan et al. [LW88]
This method is also based on the geometric hashing scheme. Given a set of features extracted from an image, a subset of the features is designated as a basis set. The coordinates of all the features are computed relative to the basis set. These coordinates are then used as indices into a hash table. The records in the hash table comprise model features.
The location of these features in the hash table is decided with respect to some basis set chosen from the model. Each hashed image feature votes for a set of possible models and corresponding basis sets stored at the hashed location in the table. An important aspect of the geometric hashing paradigm is that an object is multiply encoded, using various different basis set selections. Conceptually, the method can be seen as a sequence of measurements and maps, where each map projects to locations determined from the previous map by a process of fixed links and measured values.

Ettinger [Ett88]
A hierarchical library of object structures is used as the model database. The hierarchy defines the granularity of the structure (coarse to fine). Each object is represented in a hierarchical fashion in the library. The subparts of the object define the level of representation of that object in the library.

Huttenlocher and Ullman [HU88]
The ORA (Object Recognition by Alignment) system [HU87] has been designed based on the alignment technique. In feature based image matching, the search space tends to grow exponentially with the number of features in an image or model. The basic idea behind the alignment technique is, first, to identify the minimum amount of information needed to solve for a possible position and orientation, and second, to minimize the amount of search required in matching local features from model and image.

Stein and Medioni [SM90]
A structural indexing method is proposed. A set of linear line segments, called a super segment, is used as the basic matching feature. During the representation process, a model is hashed into a hash table using the super segments of the model as keys, and for each hashed location, the corresponding model number and super segment are stored. The super segments present in the scene are used to index into the hash table.
The super segments contained in the locations hashed during the recognition process form hypotheses. For each candidate model, a discrete relaxation based approach is applied to form consistent clusters of super segments. Final verification is carried out using linear transformations.

Califano and Mohan [CM91]
An indexing algorithm comprising two-stage processing is proposed. First, short-range autocorrelation operators are used to map the image pixels into a small set of simple localized shape descriptors. At the second stage, a global autocorrelation operator is used to map combinations of these local descriptors into invariant indices. The indices are used to address the cells of a global lookup table which contains the shape model representations.

In this dissertation, we study several of the above mentioned approaches from a parallel processing perspective.

2.2 Problem Statement

A wide variety of sequential techniques for object recognition is available. In addition to the different matching primitives used, the sequential techniques for such a task also differ in terms of matching constraints. Independent of the sequential approach used in solving such problems, the amount of computation is tremendous and the data structures employed are complex. Also, in many cases, the computations are symbolic in nature. In order to obtain fast solutions to these problems, parallel processing techniques need to be employed. In this dissertation, the problem of parallelizing various sequential techniques for object recognition, and of implementing them on commercially available parallel machines, is addressed. We shall be interested in developing scalable data parallel algorithms for a set of well-known object recognition approaches. These include image matching using relaxation labelling, object recognition using geometric hashing, and object recognition using structural indexing.
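The indexing flavor of the last of these approaches can be sketched in a few lines. Here a super segment is reduced to the sequence of angles between its consecutive line segments, quantized to form a hash key; this is a deliberately simplified encoding (the actual keys and verification steps in [SM90] carry more structure), and all names are illustrative assumptions:

```python
def super_segment_key(angles, quant=15):
    """Quantized inter-segment angles of a super segment, as a hash key."""
    return tuple(round(a / quant) for a in angles)

def build_index(models):
    # representation phase: hash every super segment of every model
    index = {}
    for name, super_segments in models.items():
        for ss in super_segments:
            index.setdefault(super_segment_key(ss), []).append(name)
    return index

def hypothesize(scene_super_segments, index):
    # recognition phase: scene super segments index into the table,
    # and each hit votes for the models stored at that location
    votes = {}
    for ss in scene_super_segments:
        for name in index.get(super_segment_key(ss), []):
            votes[name] = votes.get(name, 0) + 1
    return votes
```

Because the recognition phase is one independent table lookup per scene super segment, it distributes naturally across processors, which is what the data parallel algorithms of Chapter 3 exploit.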
For asymptotic time analysis of the parallel algorithms, the fixed size processor array is used as the underlying computational model. Also, the proposed parallel techniques are further extended to devise parallel solutions for various stereo matching techniques. These include stereo matching using zero-crossings and stereo matching using linear features. Algorithmic, architectural, and implementation issues that arise during the parallelization of the above mentioned tasks shall be addressed. Time and space complexities of the proposed algorithms shall be analyzed, and new routing and mapping techniques shall be devised. Also, the scalability of the parallel algorithms shall be evaluated. Finally, implementations of several proposed algorithms are carried out on the Connection Machine CM-5 and on the MasPar MP-1, and results are reported.

2.3 Related Work

Some work has been done in parallelizing sequential algorithms for object recognition. Most of these implementations are machine specific, and the techniques used in such approaches are not extendable to other general models of parallel computation.

Among the early works in parallelizing object recognition algorithms, Flynn and Harris [FH86] have presented a parallel algorithm to implement a tree interpretation method [GLP84] for recognizing objects on the Connection Machine. It is claimed that an object with n faces and k data points can be recognized in O(k) steps using an O(n^k) processor Connection Machine. The size of the machine is assumed to be large enough to hold all the n^k interpretations of the objects.

Cooper and Hollbach [CH87] have implemented the problem of recognizing pure structural objects on a connectionist network simulator. A Tinker Toy world of objects is assumed as object models. Such an assumption simplifies the general theory of shape representation by representing objects in the form of primitive parts (rods). Mainly, a scene consisting of one toy (scene object) is considered for recognition. They have devised a parallel algorithm which requires n⁴ connectionist cells to process a scene consisting of n primitive parts and k models in the model database.

Tucker et al. [TFF88] have implemented algorithms on the Connection Machine using hypothesis generation, a variant of the interpretation trees based recognition techniques [GH90]. Local boundary features that constrain an object's position and orientation have been used to provide a basis for hypothesis generation. The implementation claims that the rate of increase in execution time is much slower than either the number of objects in the database or the number of objects in the scene. No bounds have been provided on this parameter. The authors show experimental results using databases containing 10 to 100 object models.

Cass [Cas88] has implemented an object recognition algorithm on the Connection Machine CM-1. A method called transformation sampling is used to determine the optimal transformation of model features to scene features by sampling the space of possible transformations. The search space is curtailed by placing matching constraints on possible transformations. The algorithm runs in O(log² Kmn) time on an O(Kmn) processor CM-1, where n is the number of features in the scene, m is the number of model features, and K depends on the size of the input image. For values of n = 31, m = 259, and K = 121, they have achieved a 5 second implementation time on a 16K processor CM-1.

Recently, Bourdon and Medioni [BM88] have implemented an object recognition algorithm on the Connection Machine using geometric hashing.
The implementation is based on the sequential algorithm introduced by Lamdan and Wolfson [LW88]. The algorithm consists of two phases, preprocessing and recognition. The parallel implementation could be optimized only for a small number of models. The number of processors required in the implementation is O(Mn³), where M is the number of models and n is the number of feature points in each model. Using a scene consisting of 29 feature points and a database of 4 models, such that each model comprises 16 feature points, their implementation runs one probe of the recognition phase in 1.2 seconds on a 16K CM-2.

An improved implementation of the geometric hashing algorithm on the CM-2, as compared with that of Bourdon and Medioni [BM88], has been provided by Rigoutsos and Hummel in [RH91]. They have used synthesized data to generate models and scene points. Their implementations also require O(Mn³) processors. They have achieved a 700 msec run time to process one probe of the recognition phase on a 16K CM-2. The scene contained 200 feature points and the model database contained 1K models, such that each model is represented using 16 feature points.

Reinhart, in [Rei91], has studied parallel implementations of the image matching algorithm using relaxation labelling [MN84] and of tree search based object recognition techniques [FMN89]. The objectives claimed in the parallel implementation are different from the classical approaches to parallelization. The objectives emphasized in his implementations are reducing the system complexity and reducing the programmer burden. The system complexity is defined as the amount of restructuring required to parallelize a sequential algorithm, particularly in terms of control logic.
Similarly, the programmer burden is defined in terms of the degree of difficulty in developing and maintaining the parallel algorithm implementation.

2.4 Computational Models

In this dissertation, fixed size processor arrays have been used to analyze the asymptotic running times of the proposed parallel algorithms. In the following, we discuss the motivation behind this choice and also present the models in detail. Since the inception of parallel processing in the early 1970s, several SIMD parallel architectures have been proposed [Hwa85]. Such architectures differ either in terms of the interconnect topology among the processors or in terms of the memory access (local vs. global) pattern. Parallel synchronous computers based on fixed size arrays with nearest neighbor connections have been proposed for a wide range of applications. There are several advantages in choosing fixed size arrays as a computational model. From an algorithmic point of view, linear and mesh array (1-D and 2-D fixed size array) architectures are easy to visualize, which helps in designing efficient data routing techniques for parallel algorithms. From a VLSI perspective, processor arrays with nearest neighbor connections have several features that make them attractive for VLSI implementation. These features include simplicity of interconnection, modularity, amenability to simpler fault tolerance techniques, scalability, and low I/O bandwidth requirements as compared with other array processors with more complex interconnection networks [AP91]. Also, due to the fixed degree of each node, such architectures can easily be extended to larger sizes. On the other hand, architectures such as hypercubes cannot be extended to larger sizes without redesigning their VLSI layouts. In a system with a fixed number of processors, a large memory is usually provided for storing the data and final results.
The memory can be configured as a global shared memory or as a distributed memory (distributed among the processors) such that only a processor itself has access to its local memory. Many parallel machines have been built based on mesh or linear array configurations, such as CLIP7 [Fou87], the Scan Line Array Processor [Fis86], PIPE [KSL85], and the CMU WARP [AAG+86]. Also, several other architectures have been derived from these models, including the Mesh of Trees [Lei82, NMB83], Mesh with Multiple Buses [KR87], and Mesh with Reconfigurable Buses [RK87]. For our implementations and analysis, we assume the following models for the fixed size (mesh and linear) arrays.

Fixed Size Linear Array:
A fixed size linear array is a one dimensional array of P processors, where P is less than or equal to the problem size. The processors are indexed 0 through P - 1 (from left to right), with each processor connected to its immediate right and left neighbors, if they exist. The processors are connected through bidirectional local buses, and the array operates in SIMD mode. Processor i is denoted as PE_i, where 0 ≤ i ≤ P - 1. Each PE_i is attached to an external memory module MM_i. The architecture is shown in Fig. 2.1.

Figure 2.1: A fixed size linear array.

Fixed Size Mesh Array:
A fixed size mesh array is a two dimensional array of P x P processors, where P² is less than or equal to the problem size. Each processor PE_{i,j} is connected to PE_{i+1,j}, PE_{i-1,j}, PE_{i,j-1}, and PE_{i,j+1}, if they exist. A memory plane of P x P memory modules (MMs) is provided. Each PE_{i,j} is attached to memory module MM_{i,j}. The architecture is shown in Fig. 2.2.

Figure 2.2: A fixed size mesh array.

In both models, processors are connected through bidirectional local links and the arrays operate in SIMD mode.
The following assumptions are made regarding the computations in these models:

• An arithmetic/logic operation performed in a processor takes O(1) time.
• In a linear array (mesh array), a parallel memory access by PE_i (PE_{i,j}) to memory module MM_i (MM_{i,j}), 0 ≤ i, j ≤ P - 1, takes O(1) time.
• In a linear array, a unit data transfer from PE_i to PE_j, 0 ≤ i, j ≤ P - 1, takes O(|i - j|) time.
• In a mesh array, a unit data transfer from PE_{i,j} to PE_{i,j-1}, or to PE_{i,j+1}, or to PE_{i-1,j}, or to PE_{i+1,j} takes O(1) time.
• The PEs have indirect addressing capability.

2.5 Performance Metrics

In this section we briefly define several well-known metrics used to analyze data parallel algorithms. A data parallel model is an architecture-independent model that allows an arbitrary number of virtual processors to operate on large amounts of data in parallel. This model has been shown to be efficiently implementable on both SIMD and MIMD machines. Currently it is supported by several programming languages including C*, data-parallel C, Fortran 90, Fortran D, and CM Fortran. An algorithm developed assuming such a model is called a data-parallel algorithm.

Time Optimality:
A parallel algorithm for a given architecture is called optimal if the running time T(n) for a problem of size n matches the lower bound time of any non-trivial problem of the same size on the architecture. For example, a non-trivial problem on a √P x √P mesh is bounded by the diameter of the mesh, i.e., √P [J92]. Therefore, an algorithm is called optimal on a √P x √P mesh if its running time is O(√P).

Processor-Time Optimality:
A parallel algorithm will be evaluated in terms of two parameters: the parallel time T(n, P), representing the number of parallel steps applied on
Assume the sequential time complexity of the input problem of size n is T*(n). A parallel algorithm to solve the same problem of size n is called work-time optimal if the total work done is O(T*(n)), regardless of the running time T(n, P) of the parallel algorithm.

Speed-Up: The speed-up of a parallel algorithm is the ratio of the best known sequential time T*(n) to the running time of the parallel algorithm T(n, P), i.e., S_P = T*(n) / T(n, P). If the speed-up is equal to the number of processors P employed in the parallel solution, then the algorithm is processor-time optimal in the strong sense [J92].

Scalability, an algorithmic perspective: Consider an algorithm that runs in T(n, P) time on a P processor architecture for an input of size n. The algorithm is considered scalable on the architecture if T(n, P) increases or decreases linearly with a linear increase or decrease in the input size. The range of scalability is determined by the sequential component of the parallel algorithm. There are many other notions of scalability, and most of them use efficiency and work size to determine the scalability of an architecture-algorithm pair [GK91, LP93, Hwa93].

Scalability, an implementation perspective: An implementation is considered scalable if the same code, without modifications, can be used for various machine sizes and for various input sizes, and linear speed-ups are achieved.

Part II Research Contributions

Chapter 3 Parallel Object Recognition

This chapter presents scalable data parallel algorithms for several well known sequential techniques for object recognition. A large number of sequential techniques based on different matching primitives and search methods have been proposed. Several of these techniques have been discussed in Chapter 2.
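As a quick numeric illustration of the metrics defined in Section 2.5 (speed-up and work-time optimality), the following toy computation uses made-up numbers, not figures from this dissertation:

```python
def speed_up(t_seq, t_par):
    """Speed-up S_P = T*(n) / T(n, P)."""
    return t_seq / t_par

def is_work_time_optimal(t_seq, t_par, P, c=1.0):
    """Work-time optimal if the total work P * T(n, P) is within c * T*(n)."""
    return P * t_par <= c * t_seq

# Illustrative: T*(n) = 1,000,000 steps, P = 100 PEs, T(n, P) = 10,000 steps.
t_seq, t_par, P = 1_000_000, 10_000, 100
print(speed_up(t_seq, t_par))                 # 100.0 == P: optimal in the strong sense
print(is_work_time_optimal(t_seq, t_par, P))  # True
```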
Independent of the approach used, the computational characteristics of object recognition present a variety of challenges and depend significantly upon the nature of the object to be recognized. These include irregular and complex data structures requiring efficient run time interprocessor communication support, and symbolic computations which largely depend on the geometric constraints applied. A brute force parallelization of such problems suffers from performance degradation.

In this chapter, a set of sequential techniques is studied and parallel solutions based on these are designed. These techniques are based on the feature matching approach to object recognition. In particular, Section 3.1 presents parallel algorithms for image matching using relaxation labelling [MN84]. Based on the computational models presented in Chapter 2, processor-time optimal solutions have been developed for image matching. The algorithms presented take O((nm/P + P)nm) time on a P processor linear array, where P ≤ nm. An O((nm/P² + P)nm) time performance is achieved on a P² processor mesh array, where P² ≤ nm. Section 3.2 presents optimal parallel algorithms for object recognition using geometric hashing. Compared with the earlier parallel implementations [BM88, RH91], while achieving much superior time performance, a significant improvement upon the number of processors employed is made. On a P processor mesh array, the algorithm for the recognition phase takes O(S/√P) time to process one probe (a probe corresponds to the execution of the recognition phase for one basis pair), where S is the number of feature points in the scene and 1 ≤ P ≤ S. On a fat tree based architecture, one probe takes O((S/P) log S) time, and the algorithm is scalable over the range 1 ≤ P ≤ √S.
Earlier implementations require O(Mn³) processors, where M is the number of models in the database and n is the number of feature points in each model. In Section 3.3, data parallel algorithms for object recognition using structural indexing are developed. The algorithms for the recognition phase take O(S/√P) time on a P processor mesh array, where S is the number of super segments in the scene and 1 ≤ P ≤ S. The work presented in this chapter has appeared in [KLP91, KL93, KKP93, KP93].

3.1 Image Matching using Relaxation Labelling

In the past, several approaches have been proposed for image matching [Pri82, SH81, CLM78], which, in general, differ with respect to the primitives used for matching. In this section, we consider image matching using relaxation labelling [MN84] for parallel implementation. Relaxation techniques have been applied to solve several problems in computer vision. The inherent computational complexity of this technique is assumed to be O(m³n³), where m is the number of objects and n is the number of labels. In addition to such a high computational complexity, when used in object recognition the relaxation operator involves complex data structures and also consists of symbolic computations. In the proposed technique [MN84], linear features have been used as matching primitives. Use of linear features as primitives has the advantage of not requiring 3-D and point to point transformations between the two images. Readers can refer to [MN84] for additional details of the matching technique. We begin with the basic idea of this approach and then present a fast sequential algorithm for image matching which is an extension of the discrete relaxation algorithm presented in [LP91].

3.1.1 Matching Technique

Generally, in the image matching problem, we have n objects, {o_1, o_2, ..., o_n}, in the scene and m labels, {l_1, l_2, ..., l_m}, in the model.
Here, the objects are segments in the scene derived from edge detectors and are described by the coordinates of their end points, orientation and average contrast. The technique computes the quantity v_ip ∈ {0, 1}, which is the possibility of assigning label l_p to object o_i. The method relies on geometric constraints, which means that when a label l_p is assigned to object o_i, we expect to find an object o_j with assigned label l_q in an area depending on i, p, q. The match-window W(i, p, q) denotes the area described by the parameters i, p, q. By representing the object o_i with a vector A_iB_i, the label l_p with C_pD_p and the label l_q with C_qD_q, we can determine the four extreme points, W_1, W_2, W_3, W_4, of the induced match-window W(i, p, q) using the following relations (β denotes the scaling factor, known beforehand):

• A_iW_1 = β · C_pC_q,  W_1W_2 = β · C_qD_q
• B_iW_3 = β · D_pC_q,  W_3W_4 = β · C_qD_q

Fig. 3.1 shows the relationship between the window and the segments.

Figure 3.1: Determining a match-window.

The meaning of compatibility is defined as follows: <i,p> is compatible with <j,q> iff o_i is in W(j, q, p) and o_j is in W(i, p, q). Let Ω_ij[p,q] denote the compatibility of assigning label l_p to object o_i and label l_q to object o_j. A weak notion of consistency is also used to determine whether an assignment is feasible. A predetermined confidence factor, δ ≤ m, is used to decide the feasibility of <i,p> as in the following update statement during an iteration:¹ for every i, p,

    v'_ip ← v_ip AND 'condition A',

where condition A = true if (∃ S ⊆ {1, 2, ..., m} with ||S|| = δ such that ∀ q ∈ S, ∃ j ∈ {1, 2, ..., n} with v_jq = 1 and Ω_ij[p,q] = 1), and condition A = false otherwise. The algorithm stops when v'_ip = v_ip for all i, p. We can rewrite the above update statement as

    v'_ip ← v_ip · σ_δ( x_ip[1], ..., x_ip[m] ),  where x_ip[q] = ⋁_{j=1}^{n} ( v_jq ∧ Ω_ij[p,q] ),    (3.1)

where

    σ_δ( x[1], ..., x[m] ) = 1 if Σ_{q=1}^{m} x[q] ≥ δ, and 0 otherwise.    (3.2)
Note that the operation ⋁_{j=1}^{n} in equation 3.1 is a logical OR operation, while the operation Σ_{q=1}^{m} in equation 3.2 is an arithmetic ADD operation.

3.1.2 A Fast Sequential Image Matching Algorithm

With the modified update statement given by equation 3.2, we design a faster sequential algorithm which is easier to parallelize compared with the one proposed in [MN84]. This algorithm is an extension of the discrete relaxation algorithm developed in [LP91]. The algorithm is shown in Fig. 3.2.

{ Initialization }
1.  for i = 0 to n - 1 do
2.    for p = 0 to m - 1 do
3.      T_ip ← 0; v_ip ← 1;
4.      for q = 0 to m - 1 do
5.        N_ip[q] ← 0;
6.        for j = 0 to n - 1 do  // initialize counter variables
7.          if (Ω_ij[p,q] = 1) then N_ip[q] ← N_ip[q] + 1
8.        end;
9.        if (N_ip[q] ≠ 0) then T_ip ← T_ip + 1;
10.     end;
11.     if (T_ip < δ) then do  // check feasibility
12.       pushstack(<i,p>, S);
13.       v_ip ← 0;
14.     end
15.   end
16. end;
{ Iteration }
17. while ((<i,p> ← popstack(S)) ≠ nil) do
18.   for j = 0 to n - 1 do
19.     for q = 0 to m - 1 do
20.       if ((v_jq = 1) AND (Ω_ij[p,q] = 1)) then do
21.         N_jq[p] ← N_jq[p] - 1;
22.         if (N_jq[p] = 0) then T_jq ← T_jq - 1;
23.         if (T_jq < δ) then do  // <j,q> becomes infeasible!
24.           pushstack(<j,q>, S);
25.           v_jq ← 0;
26.         end
27.       end
28.     end
29.   end
30. end {while}

Figure 3.2: A sequential algorithm for image matching.

¹ ||S|| denotes the cardinality of S.

We associate each v_ip with m + 1 counter variables. These are N_ip[q], 0 ≤ q ≤ m - 1, and T_ip. These counter variables have the following definitions:

• N_ip[q] denotes the number of 1's in the n entries Ω_ij[p,q], 0 ≤ j ≤ n - 1, and
• T_ip denotes the number of nonzero N_ip[q] variables, 0 ≤ q ≤ m - 1.

For v_ip, each of the m N_ip[q] counter variables is used to determine an object for label l_q, i.e., whether we can find any object to be labelled with l_q when o_i is labelled with l_p.
T_ip is used to determine if there are at least δ such compatible labelings when o_i is labelled with l_p. During the initialization procedure (lines 1-16 in Fig. 3.2), all the m N_ip[q] variables as well as the T_ip variable of each assignment <i,p> are initialized, and the feasibility of each assignment is also determined (line 11). A stack S is used to store infeasible assignment pairs for further update operations. During each iteration (lines 18-28), an infeasible assignment is retrieved from the stack and the necessary 'modification' to the Ω_ij is performed. Notice that the reset operations of the affected Ω_ij entries are not actually carried out but are implemented by decrementing the corresponding N_jq[p] values. It can easily be shown that the 'modification' to the Ω_ij entries is not necessary as long as the counter values are kept track of, since each entry of the Ω_ij is inspected (in line 20) at most once. This is due to the fact that an assignment can be placed on the stack at most once throughout the algorithm. Since during each iteration at most one N_ip[q] variable is decremented by 1 for each assignment <i,p>, it is easy to check whether this <i,p> assignment remains feasible by simply checking if T_ip < δ. In other words, the increased efficiency of the sequential algorithm is due to the fact that once an assignment of an image feature to an object label is determined to be infeasible, it can never be made feasible again. Since there are at most nm assignment pairs that can be placed on the stack, the number of iterations is upper bounded by nm. It is easy to verify that the algorithm in Fig. 3.2 runs in O(n²m²) time. Each time unit corresponds to a simple arithmetic/logic operation. The original algorithm [MN84] runs in O(n²m²dw) time, where d is the density of the segments and w is the window size. In the worst case, d and w can be n and m respectively.
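For readers who prefer an executable form, the counter-based scheme of Fig. 3.2 can be sketched as follows. This is an illustrative re-implementation under our own naming, not the thesis code: Omega[i][j][p][q] plays the role of Ω_ij[p,q], and compatibility is assumed symmetric, as follows from the definition in Section 3.1.1.

```python
def image_match(n, m, Omega, delta):
    """Discrete relaxation with counters: returns the final v_ip matrix."""
    v = [[1] * m for _ in range(n)]
    # N[i][p][q] = number of 1's among Omega[i][j][p][q] over j
    N = [[[sum(Omega[i][j][p][q] for j in range(n)) for q in range(m)]
          for p in range(m)] for i in range(n)]
    # T[i][p] = number of nonzero N[i][p][q] over q
    T = [[sum(1 for q in range(m) if N[i][p][q] > 0) for p in range(m)]
         for i in range(n)]
    stack = [(i, p) for i in range(n) for p in range(m) if T[i][p] < delta]
    for i, p in stack:           # initially infeasible assignments
        v[i][p] = 0
    while stack:
        i, p = stack.pop()       # retrieve an infeasible assignment
        for j in range(n):
            for q in range(m):
                if v[j][q] == 1 and Omega[i][j][p][q] == 1:
                    N[j][q][p] -= 1
                    if N[j][q][p] == 0:
                        T[j][q] -= 1
                        if T[j][q] < delta:   # <j,q> becomes infeasible
                            stack.append((j, q))
                            v[j][q] = 0
    return v
```

Each assignment enters the stack at most once (its v bit is cleared on push), which is exactly the property that bounds the iteration count by nm in the analysis above.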
3.1.3 Parallel Image Matching

In this section, we present a parallel implementation of the algorithm outlined in Fig. 3.2. We first introduce a parallel algorithm for image matching, assuming the number of PEs is equal to the problem size, and then present a partitioned implementation of this algorithm on fixed-size linear and mesh arrays.

To begin with, assume a fixed size array of nm PEs is available. PE_ip, 0 ≤ i ≤ n - 1, 0 ≤ p ≤ m - 1, is responsible for determining the feasibility of assignment <i,p>. The parallel algorithm is shown in Figure 3.3. In the initialization process, each of the PEs initializes its counter variables in parallel and sets a flag, Send_ip in PE_ip, to 1 if the assignment is infeasible. During each iteration, PE_ip sends its identification (Id) to all the PEs if the flag Send_ip is set. A broadcast operation is performed to have only one such Id, if there is any, acknowledged by all the PEs during each iteration. If there is no Id broadcast during an iteration, the procedure is terminated, since no more infeasible assignments can be found. Each iteration can be performed in constant time if each broadcast operation is assumed to take constant time. The matching process is completed in O(mn) time.

Partitioned Implementation on a Fixed Size Linear Array

Since the size of a match-window determined by any object and two labels is much smaller than the size of the complete image, the number of initially assignable segments in any of the match-windows for each object is much smaller than the total number of objects in the image; that is, N_ip[q] ≪ n, for all i, p, q. This allows us to obtain a partitioned implementation in which each processor assumes the responsibility of more than one <object, label> pair, with a relatively small size counter in each PE. With a fixed number of PEs, say P, we are able to design a partitioned
W ith a fixed num ber of PEs, say P , we are able to design a partitioned 31 { Initialization } 1. Initialize all [p, q] ’s in parallel; 2. p ara llel do (in PE*p, 0 < i < n — 1, 0 < p < m — 1) 3. < 1; 4. for q = 0 to m — 1 do 5. Wtpfa] *- 0; 6 . Tip <- 0 7. for j = 0 to n — 1 do 8 . if ( f q) = 1) th e n JV tp[g] <- N ip[q] + 1; 9. end; 10. if (Nip[q] £ 0) th e n Tip ^ Tip + 1; 1 1 . end; 1 2 . if (Tip < 6 ) th e n do 13. Sendip « — 1 ; 14. Vip < — 0; 15. end; 13. p a ralle l en d ; { Iteration } 16. re p e a t 17. p ara llel do (in PE,p, 0 < i < n — 1, 0 < p <m — 1 ) 18. if (Sendip = 1 ) th e n 19. send Id < i,p > to all the PEs; 20. { if (no broadcast Id acknowledged) 2 1 . th e n stop; 22. else a broadcast Id, say < j, q >, is acknowledged by all P E ’s;}* 23. if (< i,p > = < j, q > ) th e n Sendip • * — 0 ; 24. if ((%, = 1) A N D (Hij[p, g] = 1)) th e n do 25. Nip[q] <- N ip[q] - 1; 26. if (Nip[q] — 0) th e n Tip <- T , - p - 1; 27. if (Tip < S) th e n do 28. Sendip < — 1 ; 29. Vip 0; 30. end 31. en d 32. end 33. p arallel end 34. fo rev er *: the code inside braces is the broadcast operation. Figure 3.3: A parallel algorithm for im age m atching. 32 im plem entation by enabling each P E to process ^ distinct V{p values. The d a ta stored in each of the P MMs includes “ Vip values. In addition, for each Vip, the corresponding n m Qij [p, q] values, the m Nip[q] and the Tip counter values are stored in th e attached MM. Also, th e flag Send{p for vip is stored in MM to indicate w hether the infeasibility of this assignm ent has been acknowl edged by all th e PEs. Such an acknowledgment will then trigger the necessary m odification to all th e corresponding counter values in each P E as defined in th e previous section. PE*,, 0 < k < P — 1, has an additional flag Sdk to indi- , cate if any such infeasible assignm ent among its ^ assignm ents is yet to be acknowledged. 
Thus, Sd_k is set to 1 iff there is at least one Send_ip equal to 1 among all the nm/P Send_ip variables in MM_k. The structure of the PEs is shown in Figure 3.4. In PE_k, the Id to be broadcast is stored in register R_k.

Figure 3.4: Structure of a PE.

The complete parallel algorithm is described in Fig. 3.5. The three major procedures, Initialize, Id-broadcast and Update, are given in Fig. 3.6, Fig. 3.7 and Fig. 3.8 respectively. Id(w) (in line 3 as well as in line 12 of Fig. 3.5) represents the index of the w-th variable stored in the PE.

Linear-Array()
{ Initialization }
1.  parallel do (in PE_k, 0 ≤ k ≤ P - 1)
2.    for w = 1 to nm/P do
3.      <i,p> ← Id(w);
4.      Initialize(i, p; i', p');
5.    end;
6.    <i,p> ← <i',p'>;
7.  parallel end;
{ Iteration }
8.  repeat
9.    parallel do (in PE_k, 0 ≤ k ≤ P - 1)
10.     Id-broadcast(i, p; j, q);
        { execution stops according to a condition in this subroutine }
11.     for w = 1 to nm/P do
12.       <i,p> ← Id(w);
13.       Update(i, p, j, q; i', p');
14.     end;
15.     <i,p> ← <i',p'>;
16.   parallel end
17. forever

Figure 3.5: Partitioned implementation on a fixed size linear array.

The procedure Initialize(i, p; i', p') initializes the m + 1 counter variables for v_ip and returns the
Sendip * — 1 ; Vip < — 0; 14. S d k 4 — 1; < i',p r > 4 — < i,p > 15. e n d 16. W rite(T,p, Sendip) Figure 3.6: Initialization. T he Id-broadcast procedure allows th e Id broadcast operation to be com pleted in 2 P — 2 tim e units, assum ing bidirectional links betw een adjacent PEs and no w raparound connection. T he Id in register R k of the largest indexed P E (if any), will reach PEo by th e end of th e (P — l)-th tim e unit. It is retained in PEo and represents the acknowledged Id in the current iteration. This Id is then sent to all th e PEs in th e following P — 1 tim e units. By accom panying th e Id w ith th e corresponding S d k value during the d ata transfer (lines 7 and 8 in Fig. 3.7), each P E can also determ ine w hether th e execution can term inate (line 10 in Fig. 3.7). A t the end of the initialization process (lines 1-7) in Fig. 3.5, each P E retains th e Id of the last infeasible assignm ent, if any. This Id is th en broadcast to all th e PE s for acknowledgm ent during th e first iteration. 35 Id-broadcast(i, p; j, q) {The Id < i,p > to be broadcast is stored in Rk} 1. for / = 1 to P — 1 do 2. if (Sdk = 1) then 3. Sdk - 1 * — Sdk; {write flag to left PE} 4. R k-i Rk {write Id to left PE} 5. end; 6. for / = 1 to P - 1 do 7. Sdk < — S d k -i; {read flag from left PE} 8 . Rk * — Rk - 1 {read Id from left PE} 9. end; 1 0 . if (Sdk = 0 ) then stop 11. else < j,q > < — Rk Figure 3.7: Im plem enting Id broadcast on th e linear array. T he procedure U p d ate(* ,p ,j, < 7; perform s th e u p d ate operation. Since each Id-broadcast operation takes 0 ( P ) tim e and the com plete u p d ate opera tion in each iteration takes O ( ^ ) tim e, the to tal execution tim e is < 9 (( ^ + P)nm). This leads to processor-tim e optim al solution for P < y/nm. P artition ed Im p lem en tation on a F ixed Size M esh A rray Based on th e algorithm shown in Fig 3.3, a partitioned im plem entation on a fixed size m esh array is obtained. 
Each of the P² PEs processes nm/P² distinct v_ip values. The data stored in each of the P² MMs includes nm/P² v_ip values, the corresponding nm Ω_ij[p,q] values, the m N_ip[q] counter values and the T_ip variable. Also, a flag is stored in each MM for each v_ip to indicate whether the infeasibility has been acknowledged by all the PEs. Such an acknowledgment triggers the necessary update of the corresponding counters in each PE. Also, in each PE, an extra flag is used to indicate if at least one such infeasible assignment (among its nm/P² assignments) is yet to be acknowledged. An initialization procedure is executed in each PE to initialize the m counter variables for each of its v_ip values. Based on the condition defined in Section 4.1, each PE sets the corresponding flag for an infeasible assignment and retains the Id of one such assignment for later broadcast. This can be performed in O(nm/P²) time.

Update(i, p, j, q; i', p')
1.  Read(v_ip, Send_ip, Ω_ij[p,q], N_ip[q], T_ip);
2.  if (<i,p> = <j,q>) then Send_ip ← 0;
3.  else if ((v_ip = 1) AND (Ω_ij[p,q] = 1)) then do
4.    N_ip[q] ← N_ip[q] - 1;
5.    if (N_ip[q] = 0) then T_ip ← T_ip - 1;
6.    if (T_ip < δ) then do  // <i,p> becomes infeasible?
7.      Send_ip ← 1; v_ip ← 0
8.    end
9.  end;
10. if (Send_ip = 1) then do
11.   Sd_k ← 1;
12.   <i',p'> ← <i,p>
13. end;
14. Write(v_ip, Send_ip, N_ip[q], T_ip)

Figure 3.8: Update procedure.

During each iteration, a 'collect' operation is first executed. The purpose of this operation is to gather the Ids retained in all the PEs at the end of the previous iteration. A designated PE (say PE_00) is responsible for collecting these Ids and retaining one of them. This Id collection process is executed in each row by moving each infeasible Id to the PEs in the leftmost column and then moving it up to PE_00. The retained Id in PE_00 is then broadcast to all the PEs.
The broadcast operation can be executed in O(P) time. Once each PE receives such an Id (of an infeasible assignment), an update procedure is carried out to modify the corresponding counter variables and to set the corresponding infeasibility flag, if necessary. The algorithm terminates if there is no Id retained in PE_00. Since it takes constant time to update the affected counter variable of each assignment, O(nm/P²) time is sufficient for all the PEs to complete the update operation in parallel. Thus, each iteration can be performed in O(nm/P² + P) time. The total execution time is O((nm/P² + P)nm), since there can be at most nm iterations. This implementation leads to a processor-time optimal solution when P ≤ (nm)^(1/3).

3.2 Object Recognition using Geometric Hashing

In a model-based recognition system, a set of objects is given and the task is to find instances of these objects in a given scene. The objects are represented as sets of geometric features, such as points or edges, and their geometric relations are encoded using a minimal set of such features. The task becomes more complex if the objects overlap in the scene and/or other occluded unfamiliar objects exist in the scene. Many model based recognition systems are based on hypothesizing matches between scene features and model features, predicting new matches, and verifying or changing the hypotheses through a search process. Geometric hashing, introduced by Lamdan and Wolfson [Wol90], offers a different and more parallelizable paradigm. It can be used to recognize flat objects under weak perspective. For the sake of completeness, we briefly outline the geometric hashing technique in Section 3.2.1. Additional details can be found in [Wol90].

3.2.1 Geometric Hashing Algorithm

Figure 3.9 shows a schematic flow of the geometric hashing algorithm.
The algorithm consists of two procedures, preprocessing and recognition. These are shown in Figures 3.10 and 3.11 respectively. P reprocessing: T he preprocessing procedure is executed off-line and only once. In this procedure, the m odel features are encoded and are stored in a hash table d ata structure. However, the inform ation is stored in a highly redundant m ultiple-view point way. Assume each m odel in th e database has n feature points. For each ordered pair of feature points in the m odel chosen as basis, th e coordinates of all other points in the m odel are com puted in the orthogonal coordinate fram e defined by th e basis pair. Each such 38 PREPROCESSING RECOGNITION NO YES BAD GOOD Scene Basis Choice Feature Extraction Vote Box for (model,basis) pair (model, basis) with high votes Find the best Transformation Eliminate Object and Proceed Verify Object against Scene Computation of Feature Coordinates Model Objects Feature Extraction Transformation to Invariant Coordinate System HASH TABLE {(coordinates) = (model, basis) ure 3.9: A general scheme for th e geom etric hashing technique. coordinate is quantized and is used as an entry to a hash table, where th e {model, basis) pair, at which th e coordinate was obtained, is recorded. T he com plexity of this preprocessing procedure is 0 ( n 3) for each m odel, hence 0 ( M n 3) for M models. R ecogn ition : In th e recognition procedure, a scene consisting of S feature points is given as input. An arb itrary ordered pair of feature points in th e scene is chosen. Taking this pair as a basis, the coordinates of th e rem aining feature points are com puted. Each such coordinate is used as a key to enter the hash table (constructed in th e preprocessing phase), and for every recorded {model, basis) pair at the corresponding location, a vote is collected for th a t pair. T he pair winning th e m axim um num ber of votes is taken as a m atching candidate. 
T he execution of the recognition phase corresponding to one basis pair is term ed as a probe. Finally, edges of th e m atching candidate m odel are verified against the scene edges. If no {model, basis) pair scores high enough, another basis from th e scene feature points is chosen and a probe is perform ed. Therefore, th e worst case tim e com plexity of the recognition procedure is 0 { S 3). However, if some classification for choosing a basis from th e scene is available, the com plexity can be reduced to 0 { S ) [LW8 8 ]. T he tim e taken per probe depends on the hash function employed. T he vi sion com m unity has experim ented w ith various hash functions and hash func tions distributing th e feature points uniform ally into th e hash table are known [RH92]. We will be using these hash functions in our im plem entations. As sum ing th a t S feature points of th e input scene leads to 0 { S ) to tal num ber of votes, th e voting process in a probe of the recognition phase can be im ple m ented in 0 { S log S ) tim e using sorting. O ther parts of th e com putation are tim e consum ing, even though they do not contribute to the tim e com plexity; this makes th e serial im plem entations to be tim e consuming. Note th a t the to tal num ber of {model, basis) pairs is 0 { M n 2). T he voting tim e can be reduced to 0 { S + M n 2) by em ploying 0 { M n 2) boxes to collect th e votes. Through out 40 P rep rocessin g() for each m odel i such th a t 1 < * < M do E x tract n feature points from th e model; for j = 1 to n for k = 1 to n - C om pute the coordinates of all other features points in th e m odel by taking this pair as basis. - Q uantize each of th e above com puted coordinates and use it as a key to enter into a hash table where th e pair [model, basis), i.e., (i , j k ), is recorded. n ex t k n ex t j n ex t i end Figure 3.10: A sequential procedure to construct th e hash table. this section, we assum e S « M n 2. 
For example, S = 1K, M = 1K, and n = 16.

3.2.2 Parallel Geometric Hashing

In this section, we present parallel techniques to implement the recognition phase on a P processor machine. The algorithms presented in this section are implemented on the Connection Machine CM-5 operating in SPMD mode and on the MasPar MP-1, an SIMD array. In an SIMD machine, each processor executes a stream of instructions in lock-step mode on the data available in its local memory. The instructions are broadcast by the control unit. The SPMD mode of execution combines the characteristics of SIMD and MIMD modes. In this mode, the control processor broadcasts a section of the data parallel program to the processing nodes, rather than broadcasting one instruction at a time (as in a typical SIMD machine). At the start of the execution of a program, the complete program is sent to all the nodes with pseudo synchronization instructions embedded in the code. Each node executes the program independently of the others until an embedded synchronization instruction is reached.

Recognition()
1. Extract S feature points from the scene.
2. Selection: Select a pair of feature points as a basis.
3. Probe:
   a. Compute the coordinates of all other feature points in the scene relative to the selected basis.
   b. Quantize each of the above computed coordinates and use it as a key to access the hash table containing entries of the (model, basis) pairs.
   c. Vote for the entries in the hash table.
   d. Select the (model, basis) pair with maximum votes as the matched model in the scene.
4. Verification: Verify the candidate model edges against the scene edges.
5. If the model wins the verification process, remove the corresponding feature points from the scene.
6. Repeat steps 2, 3, 4, and 5 until no features are left in the scene.
end

Figure 3.11: Outline of steps in sequential recognition.

It resumes
the execution of the program only after all the nodes reach the synchronization barrier. The control unit assists in enforcing the synchronization barriers embedded in the program. This operation mode is also referred to as synchronized MIMD mode [et.92].

We will not elaborate on parallelizing the preprocessing phase, since it is a one time procedure and can be carried out off-line. Each Processing Element (PE) in the array is assumed to have O(Mn³/P) memory. Previously, there have been two separate efforts in parallelizing the geometric hashing algorithm [BM88, RH91]. Both implementations were performed on SIMD hypercube based machines. These implementations are among the early efforts in using parallel techniques to solve high-level vision problems. One of the major problems in both implementations is the requirement of a large number of processors. In our results, we exploit the fact that the number of votes cast in
T he use of radix sort in histogram m ing is advantageous only if th e num ber of levels in the histogram is m uch less than th e num ber of d a ta points [LK90]. In the case of geom etric hashing, this is not true. In th e following section, parallel algorithm s for the recognition phase has been developed. P arallel Im p lem en tation o f th e R ecogn ition P roced u re We use P processors such th a t 1 < P < S, where S is the num ber of feature points in a scene. Each Processing Elem ent (PE ) in the array is assum ed to have O ( ^ r - ) memory. In the recognition phase, possible occurrence of th e m odels (stored in th e database) in th e scene is checked. T he models are available in th e hash table created during preprocessing. All th e models are allowed to go under rigid and or sim ilarity transform ations. An arb itrary ordered pair of feature points in the scene is chosen. Taking this pair as a basis, a probe of th e m odel d ata base is perform ed. T he m ain steps of a parallel algorithm to process a single probe of th e recognition phase are given in Figure 3.12. As we are using less num ber of processors th an th e size of th e hash table, each P E will have several hash table bins stored in its local memory. Two issues arise during th e execution of the procedure ParalleLProbeQ. .1. M ore th an one feature point in th e scene m ay cast th eir votes to the same location in th e hash table, resulting in a contention for a single m em ory 43 P arallel JP rob e(S ,P ) /* S is th e num ber of features in th e input scene and P is th e num ber of processors. Initially each P E is assum ed to have S / P distinct scene feature points stored in a local array FP[]. * / - Choose an arb itrary pair of feature points in th e scene as a basis and broadcast it to all the PEs. - Compute_Keys() - VoteQ - Com pute_W inner() end Figure 3.12: A parallel procedure to process a probe of the recognition phase. location in a P E (see Figure 3.13). 2. 
More than one feature point in the scene may cast votes to different bins stored in a PE, resulting in congestion at that PE (see Figure 3.14).

In the worst case, both the contention and the congestion are O(S). Such a scenario can lead to no speedup at all. On the other hand, other researchers have used a large number of processors to avoid the congestion and contention problems. However, in their implementations the processor utilization is extremely low. Also, such solutions result in enormous communication overheads when performing global operations, such as global max and histogramming, as evident in the implementations proposed in [BM88, RH92]. In the following, we address these issues and present efficient mapping and routing techniques to resolve the contention and congestion problems arising in performing a probe, while using a small number of processors.

Figure 3.13: Contention for a single hash bin.

Figure 3.14: Congestion at a PE while accessing different hash bins stored in a PE.

In order to eliminate the memory contention problem in the array, we introduce a Merge_Keys() procedure, shown in Figure 3.15. This procedure sorts the hash table keys corresponding to the input scene. All keys having the same value reside in a block of PEs, and in each block the least indexed PE holds the leader key. The leader key carries the number of elements in its block. During the voting process, each leader key accesses the PE holding the corresponding location of the hash table and casts a vote on behalf of all the keys in its block, i.e., if there are m elements in the block, m votes are registered for the corresponding location in the hash table.
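The leader-election idea behind this procedure can be sketched serially as follows; the function name and data layout here are illustrative, not the dissertation's PE-level pseudocode:

```python
def merge_keys(candid):
    """Serial sketch of Merge_Keys(): sort the hash-table keys, then
    mark one leader per block of equal keys together with the block
    size (VCOUNT), so a single access carrying m votes replaces m
    contending accesses to the same hash bin."""
    keys = sorted(candid)
    leaders = []  # (leader key, number of keys in its block)
    i = 0
    while i < len(keys):
        j = i
        # Scan to the end of the block of keys equal to keys[i].
        while j < len(keys) and keys[j] == keys[i]:
            j += 1
        leaders.append((keys[i], j - i))
        i = j
    return leaders
```

Applied to the keys produced by Compute_Keys(), the returned (key, count) pairs are exactly the votes each leader casts on behalf of its block.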
This reduces the number of accesses to the hash table stored in a PE and thus reduces the traffic over the network. Similarly, in order to address the processor congestion problem, we propose storing multiple copies of the hash table in the array. In the worst case, a copy of the hash table can be assigned to each subarray of suitable size. This increases the amount of memory required within each processor. Each PE restricts its search for the hash table bins to the corresponding subarray. This solution localizes the congestion problem to the subarrays.

Merge_Keys(CANDID, P)
- Sort(CANDID, P)
- In parallel, each PE_i, 1 ≤ i ≤ P, for each distinct key j, 1 ≤ j ≤ S/P: identify the leader key and mark it in the array CANDID[j].
- In parallel, each PE_i, 1 ≤ i ≤ P, for each leader key: count the number of keys equal to the leader key and store the count in VCOUNT[].
end

Figure 3.15: A parallel procedure to merge quantized coordinates of feature points of the scene.

Analysis of the running time of a probe:
In the following analysis, we ignore the initialization costs, such as loading the scene points into the processor array, loading the hash table into the processor array, and initializing the memory locations used inside each PE. These assumptions are also made in the previous implementations reported in [BM88, RH92].

A. Analysis of the running time of a probe on CM-5:
For asymptotic time analysis on CM-5, we assume SIMD mode of operation. The fat tree [Lei85b] is the underlying interconnection network of CM-5.

Figure 3.16: A fat tree model.

On the fat tree model having P leaf nodes, we make the following assumptions:

• A permutation of S items takes O((S/P) log P) time, where P ≤ S. Initially each PE is assumed to have S/P items, and each PE receives S/P items. The permutation is assumed to be known a priori.

• A sort of S items takes O((S/P) log S) time.
The time bound for the sort can be achieved by simulating well known sort algorithms which employ a constant number of steps, each step consisting of a sort on local data and a permutation of the data. For example, the row-column sort procedure described in [Lei85a] can be used. The procedure states that if n = rs, r mod s = 0, and r ≥ 2(s − 1)², then n numbers can be sorted in a constant number of iterations, with each iteration involving one of shifting, shuffling of n numbers, or s parallel sorts of r numbers. In our case, n = S and r = S/P, i.e., the number of elements in each PE. Each PE, in parallel, sorts S/P items in (S/P) log(S/P) time. A shuffle permutation is then performed on the sorted elements in all the PEs. This sort-shuffle sequence is applied a constant number of times. A shuffle of S items corresponds to a permutation; therefore, the total time for sorting S items is:

O((S/P) log(S/P) + (S/P) log P) = O((S/P) log S)

The condition 1 < P < S^(1/3) can be further relaxed to 1 < P < S^(1/2) by using the procedure in [MG87] for sorting an S^(1/2) × S^(1/2) array. Based on the above, it is easy to verify the following:

Lemma 1: Given a fat tree with P processors such that each subarray of size log P is assigned a copy of the hash table, the voting can be performed in O((S/P) log² P) time.

In the worst case, all the PEs in a subarray of size log P cast their votes to locations stored in a single PE. Therefore, the total access time for the (S/P) log P keys in each subarray is O((S/P) log² P). This leads to the time bound claimed in the above lemma.

Lemma 2: Given a fat tree with P processors, such that each processor has O(S/P) (model, basis) pairs, computing a winner pair takes O((S/P) log S) time.

We assume that the data elements in the hash table are uniformly distributed such that each bin has a constant number of elements. The hash function f() in the Compute_Keys() procedure ensures this property (see Section 4 for details).
We can use a procedure similar to the Merge_Keys() procedure to find the total number of votes for each (model, basis) pair voted for during the voting process. Pairs receiving zero votes are not considered. The pair(s) receiving the maximum number of votes is chosen as the winner. This leads to O((S/P) log S) time performance for computing a winner. Note that even if the voting procedure results in an uneven distribution of votes, the votes can be redistributed such that each PE has O(S/P) of them without increasing the asymptotic complexity of computing the winner. The total execution time for one probe of the recognition phase is:

= time to compute the keys {O(S/P)}
+ voting time {O((S/P) log² P)}
+ time to compute the winner {O((S/P) log S)}

Theorem 1: Given a fat tree architecture consisting of P leaf nodes, one probe of the recognition phase can be processed in O((S/P) log S) time on a scene consisting of S feature points, such that log² P ≤ log S.

Note that the restriction on P can be relaxed if a sufficient number of copies of the hash table is available. Based on the above theorem, the algorithm for the recognition phase is processor-time optimal and scales linearly with P for 1 < P < S^(1/2).

B. Analysis of the running time of a probe on MP-1:
We use a two-dimensional mesh array of P processors as the underlying architecture, such that P = S. We do not consider the case of P < S, as the size of the machines in the MP-1 series is at least 1024. The following results on the mesh model are used in our algorithms:

• A permutation of P items can be routed in O(√P) time [KR87].
• A sort of P items takes O(√P) time [NS81].

Lemma 3: Given a mesh array of √P × √P processors with P keys sorted in row-major order, identification of the leader key for each distinct set of keys, with the count of the total number of keys in each set, takes O(√P) time.
Proof of Lemma 3: The execution time of each step is given in parentheses. Assume row-major indexing of the processors in the mesh array.

• In each row of √P processors, for each sequence of identical keys, mark the key in the least indexed processor as the leader key and associate with it a count of the number of keys in the sequence. As the keys are sorted in row-major order, a serial scan through the processors in the row suffices. (O(√P))
• Enumerate the leader keys. (O(√P))
• Route leader key i to processor i. (O(√P))
• Merge the leader keys which correspond to the same sequence of identical keys into one leader key and update the count accordingly. In the worst case, all S keys are identical; each row then has one leader key for its own row elements. Thus there are at most √P identical sequences and therefore at most √P leader keys. (O(√P))

The identification process takes O(√P) time. □

Thus, the procedure Merge_Keys() takes O(√P) time.

Lemma 4: Given a mesh array with S processors such that each subarray of size √S is assigned a copy of the hash table, the voting can be performed in O(√S) time.

Lemma 5: Given a mesh array with S processors such that each processor has O(1) (model, basis) pairs, the winner pair can be computed in O(√S) time.

First of all, we assume that the data elements in the hash table are uniformly distributed such that each bin has a constant number of elements. We use the Merge_Keys() procedure to find the total number of votes for each (model, basis) pair which received vote(s) during the voting process. Note that pairs with zero votes are not considered in this process. This step considerably reduces the congestion and hence the execution time. The input to the Merge_Keys() procedure is the CANDID[] array. In our implementations, we refer to this step as the "computation of local max".
After this step, the global maximum is found over the leader elements in all the processors. □

Total execution time for one probe of the recognition phase:
= time to compute the keys
+ voting time
+ time to count votes for candidate (model, basis) pairs
+ time to compute the winner

Theorem 2: Given a mesh array of √P × √P processors, such that P = S, one probe of the recognition phase can be processed in O(√P) time on a scene consisting of S feature points. □

For architectures with P < S, we present the following theorem.

Theorem 3: Given a mesh array of √P × √P processors, such that 1 < P < S, one probe of the recognition phase can be processed in O((S/P)√P + (S/P) log(S/P)) time on a scene consisting of S feature points. □

3.3 Object Recognition using Structural Indexing

In this section, we present data parallel algorithms for the object recognition technique proposed by Stein and Medioni [SM90]. In this technique, models are acquired and initially approximated by polygons with multiple line tolerances for robustness. Groups of consecutive segments (super segments) are then quantized and entered into a hash table. This provides the essential mechanism for indexing and fast retrieval of the stored data. Once the database is built, recognition proceeds by extracting super segments from the scene. These super segments are quantized and used to retrieve model hypotheses from the hash table. Then, mutually consistent hypotheses are clustered to represent an instance of a model. This technique allows models to be recognized in the presence of noise, occlusion, scale, rotation, translation, and weak perspective. Unlike most other current systems, its complexity grows as O(S²), where S is the number of super segments in the scene. For the sake of clarity, the sequential algorithm is outlined in Section 3.3.1. Details can be found in [SM90].
The sequential algorithm for the recognition phase takes O(S²) time, where S is the number of super segments in the scene. The parallel implementation takes O(S²/P) time, where 1 < P < S, and scales linearly with the machine size.

3.3.1 Structural Indexing Technique

The basic underlying algorithm is as follows:

1. Find all the super segments (token elements) of all models and store them in a table.
2. Collect all the super segments of the scene.
3. Retrieve from the table all the tokens which appear in the scene.
4. All corresponding tokens generate hypotheses.
5. The hypotheses are grouped with respect to constraints which assign them membership to a model.

In detail, the above algorithm is divided into two phases, the object representation phase (preprocessing phase) and the object recognition phase. In the next subsections, we briefly outline the two phases. Formal algorithms for both phases are given in Figures 3.19 and 3.22. Details of the sequential approach can be found in [Ste92].

Object Representation Phase

In this approach, the representation of a model or a scene is based on polygonal approximations. The polygonal approximation of a curve captures some of the curvature information in the form of angles between consecutive segments. These curvature angles are invariant to scale, rotation, and translation. Also, as the polygonal approximation is not unique, several approximations with different line-fitting tolerances are used simultaneously for robustness. A fixed number of consecutive segments, henceforth called a super segment, is used as the feature point in the matching process. In order to manipulate and encode super segments, the curvature angles (as shown in Figure 3.17) and the eccentricity of the super segment (the ratio of the lengths of the short and the long axis) are used.
For example, a super segment ss of cardinality n is encoded as follows:

code(ss) = (quant(α₁), quant(α₂), …, quant(α_{n−1}), quant(eccentricity))

All the encoded super segments serve as keys into a hash table, and corresponding to each key, its super segment number and model number are recorded. We assume that all the super segments are uniformly distributed over the hash table. Therefore, the number of entries in each bin of the hash table is constant. A pictorial representation of the model database is shown in Figure 3.18.

Figure 3.17: Two dimensional super segments of cardinality 4.

Object Recognition Phase

The scene is preprocessed to generate all the super segments. These super segments are encoded and the corresponding codes are used as keys into the hash table to retrieve the matching hypotheses between the super segments of the models and the scene. A hypothesis consists of a super segment from the scene and its corresponding model super segment retrieved from the hash table. This concept is shown pictorially in Figure 3.21. These, say n, hypotheses h = {h₁, h₂, …, hₙ} are then grouped according to the model number they belong to. For the purpose of analysis, we assume that there are at most k hypotheses per model. This information is stored in a correspondence table where model numbers serve as keys and the k hypotheses hᵢ = {h₁ⁱ, h₂ⁱ, …, hₖⁱ} (with hᵢ ⊆ h) serve as entries. Next, within each model, all the hypotheses are checked for consistency. Three consistent hypotheses are considered sufficient to instantiate the recognition of the model in the scene. The remaining hypotheses in the hᵢ subset are checked for consistency with any one of the three selected hypotheses. After grouping the consistent hypotheses, the transformation from the model coordinates to the scene coordinates is computed.
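The encoding above can be sketched as follows; the bin count, the angle convention, and the bounding-box estimate of eccentricity are assumptions of this sketch rather than the exact quantization used in [SM90]:

```python
import math

def encode_super_segment(points, n_bins=16):
    """code(ss) sketch: quantized curvature angles between consecutive
    segments plus quantized eccentricity. A super segment of
    cardinality n is given as its n + 1 polygon vertices."""
    def quant(value, lo, hi):
        # Map value in [lo, hi) to one of n_bins discrete levels.
        t = min(max((value - lo) / (hi - lo), 0.0), 1.0 - 1e-9)
        return int(t * n_bins)

    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        # Curvature angle between segments ab and bc: invariant to
        # scale, rotation, and translation.
        a1 = math.atan2(b[1] - a[1], b[0] - a[0])
        a2 = math.atan2(c[1] - b[1], c[0] - b[0])
        angles.append(quant((a2 - a1) % (2 * math.pi), 0, 2 * math.pi))

    # Eccentricity estimated here as the bounding-box axis ratio
    # (an illustrative stand-in for the paper's short/long axis ratio).
    xs = [p[0] for p in points]; ys = [p[1] for p in points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    ecc = min(w, h) / max(w, h) if max(w, h) > 0 else 0.0
    return tuple(angles) + (quant(ecc, 0.0, 1.0),)
```

The resulting tuple is directly usable as a dictionary key, mirroring its role as a hash-table key in the representation phase.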
The overall time complexity of the object recognition phase is O(S²), assuming n = S in the worst case and that the length of each bin is constant.

Figure 3.18: Representation of a model in a hash table (courtesy Stein and Medioni).

Object_Representation()
for each model i such that 1 ≤ i ≤ M do
  Extract all s super segments from the model image and store them in an array.
  for each super segment j = 1 to s
    - Quantize each M_i[j] and use it as a key into the hash table, recording the (model, super segment) pair, i.e., (i, j), at the corresponding hash bin.
  next j
next i
end

Figure 3.19: An algorithm to record models in the hash table.

Figure 3.20: Retrieval of candidate super segments (courtesy Stein and Medioni).

3.3.2 Parallel Structural Indexing

In this section, we provide data parallel algorithms for the object representation and object recognition phases discussed in Section 3.3.1.

Parallel Algorithm for the Object Representation Phase

In the object representation phase, the super segments of each model i, 1 ≤ i ≤ M, are encoded according to their curvature angles and eccentricity. The resulting code is used as a key to store the corresponding super segments in a hash table. Each model can be processed independently of the other models. We use a P processor machine, such that each processor is assigned M/P models to process. A parallel procedure for this phase is given in Figure 3.23. After computing the hash table code (key) for each super segment in a model, all the keys are sorted. This step helps in determining the amount of space that needs to be allocated for each bin in the hash table, as multiple keys may be mapped onto the same hash bin. The sorting step also transforms the communication pattern among processors into a permutation.
For each distinct key, each processor sends a request for memory allocation to the processor responsible for storing the corresponding hash bin. This is determined by the mapping function g(). Note that the function g() is different from the hash function used in determining the hash bins for the keys; g() defines the mapping of hash table bins onto the processor array. The Construct_Hash_Table() procedure is used to route the data to the corresponding bins of the hash table. For a database of M models and s super segments per model, the hash table is assumed to have O(Ms) entries.

Figure 3.21: Verification of hypotheses (courtesy Stein and Medioni).

Object_Recognition()
- Extract all S super segments from the scene and store them in the array SS[].
- for each super segment j = 1 to S
  - Quantize SS[j] and use the resulting code as a key into the hash table.
  - Retrieve the corresponding entries from the hash table, generate hypotheses, and store them in the array TEMP_TABLE[].
  next j
- Based on the hypotheses in the array TEMP_TABLE[] (say n of them), group them according to the model they belong to, and for each model store its hypotheses in the array CORRESP_TABLE[], where the model number serves as the index.
- For each model present in the array CORRESP_TABLE[], check the consistency among its hypotheses.
/* A model with three consistent hypotheses is considered a candidate model present in the scene. */
end

Figure 3.22: An algorithm to recognize models present in a scene.
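The construction just described, requesting an allocation per distinct key and then routing each record into its bin, can be sketched serially as a two-pass build; the bin layout and hash function here are illustrative assumptions:

```python
def build_hash_table(entries, num_bins, hash_fn):
    """Two-pass sketch of the allocate-then-route scheme: the first
    pass counts keys per bin (the 'memory allocation requests'), the
    second places each (model, super segment) record into its bin's
    contiguous, pre-allocated block. `hash_fn` stands in for the
    hash function that maps a key to a bin."""
    counts = [0] * num_bins
    for key, _record in entries:
        counts[hash_fn(key) % num_bins] += 1
    # Allocate one contiguous block per bin, sized by the counts.
    bins = [[None] * c for c in counts]
    fill = [0] * num_bins  # next free slot within each bin's block
    for key, record in entries:
        b = hash_fn(key) % num_bins
        bins[b][fill[b]] = record
        fill[b] += 1
    return bins
```

Counting before placing mirrors the parallel scheme: once every bin's size is known, records can be routed independently into disjoint, pre-allocated blocks.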
Parallel Algorithm for the Recognition Phase

The recognition phase consists of the following basic steps:

• Compute keys for all the super segments of the scene and access the corresponding data in the hash table,
• Generate hypotheses,
• Construct the correspondence table,
• Cluster consistent hypotheses for each voted model, and
• Select the models to be recognized.

In our parallel implementation, the parallelism embedded in each of the above steps is exploited. Each PE_i, 0 ≤ i ≤ P − 1, is assigned S/P distinct super segments present in the scene. Within each PE, these super segments are encoded and the corresponding codes are used as keys into the hash table to retrieve matching hypotheses. The hash table is spread over the whole array using the mapping function g(), such that each PE holds O(Ms/P) bins of the table. A detailed parallel procedure for the recognition phase is shown in Figure 3.25. The procedure Construct_Correspondence_Table(), shown in Figure 3.26, is used to construct the correspondence table. Next, the procedure Check_Consistency() processes the hypotheses, in parallel, for each model present in the scene. For each hypothesis of a model, the array CONSIST_COUNT[] is used to count the number of hypotheses it is consistent with. All the models with ≥ 3 consistent hypotheses are considered candidate models present in the scene. These models are then verified for final recognition.

Parallel_Object_Representation()
Assign M/P models to each PE_i, 1 ≤ i ≤ P.
In parallel, for all PE_i, 1 ≤ i ≤ P, do
  for each model j in PE_i, j = 1 to M/P
    Extract all s super segments of model j, along with their curvature angles and eccentricity values, and store them in the array M_j[].
    for each super segment k = 1 to s
      Compute code(M_j[k]) and store it in the array KEY[j, k].
    next k
  next j
Parallel-end
Sort(KEY)
In parallel, each PE_i, 1 ≤ i ≤ P, counts the number of similar keys for each distinct key j, 1 ≤ j ≤ Ms/P, and stores the value in count[key].
In parallel, each PE_i, for each distinct key, sends the value count[key] to the PE_j assumed to store the corresponding hash bin entry.
In parallel, each PE_j allocates a contiguous memory space for the corresponding hash table bin entry.
Construct_Hash_Table()
end

Figure 3.23: A parallel algorithm to process super segments and record them in a hash table.

Construct_Hash_Table(KEY)
In parallel, each PE_i, 1 ≤ i ≤ P, do
  for each model j in PE_i, j = 1 to M/P
    for each feature k = 1 to s
      Send the (model, super segment) pair (j, k) to PE_f(KEY[j,k]).
      In parallel, each PE_f(KEY[j,k]) stores the corresponding (model, super segment) pair at the next available space in its allocated block of memory.
    next k
  next j
Parallel-end
end

Figure 3.24: A parallel algorithm to route the data to the corresponding bins in the hash table.

Object_Recognition(P)
- Assume S super segments from the scene are available.
- Assign S/P super segments to each PE and store them in the array SS[].
- Access_Hash_Table(SS, P)
- Construct_Correspondence_Table(TEMP_TABLE, P)
- Check_Consistency(CORRESP_TABLE, P)
end

Figure 3.25: A parallel algorithm for the recognition phase.

Construct_Correspondence_Table(TEMP_TABLE, P)
In parallel, for each PE_i, 1 ≤ i ≤ P
  for j = 1 to S/P
    Split the TEMP_TABLE[j] entry into individual hypotheses for each model present, and store them in CORRESP_TABLE[].
  next j
Parallel-end
Sort(CORRESP_TABLE)
Group all the PEs holding hypotheses for the same model.
/* All the entries of the array CORRESP_TABLE[] within a group correspond to the total number of hypotheses for the model represented by the group. */
end

Figure 3.26: A parallel algorithm to construct the correspondence table.
Check_Consistency(CORRESP_TABLE, P)
In parallel, for each PE_i, 0 ≤ i ≤ P − 1
  Check the consistency of all the elements of CORRESP_TABLE[] with each other and update the CONSIST_COUNT[] counter for all successful checks.
  for each j = 1 to n/P
    - Send CORRESP_TABLE[j] to the PEs in the group.
    - Check the consistency of the local hypotheses with those received.
    - For each successful consistency check, increment CONSIST_COUNT[] appropriately.
  next j
Parallel-end
/* All the models with CONSIST_COUNT[] ≥ 3 are considered candidates for recognition. */
end

Figure 3.27: A parallel algorithm to check the consistency of hypotheses for models present in the scene.

Analysis of the Running Time

For the sake of analysis, we assume a P processor mesh array, such that 1 < P < S.

Object Representation: It is easy to show that the procedure Comp_Basis() takes O(Ms/P) time. The time for sorting the keys is O(Ms/√P) (see Section 3.2.2). Assuming a uniform distribution of the hash bins over the entire hash space, it is easy to show that the time taken by the procedure Construct_Hash_Table() on a P processor mesh array is O(Ms/√P): as the entries are sorted and there are at most Ms/P entries per PE, Ms/P permutations need to be routed. Therefore, the total time to execute the representation phase on a P processor mesh array is O(Ms/√P).

Object Recognition: In the procedure Access_Hash_Table(), S/P keys are computed in O(S/P) time. Applying Lemma 1, the hash table access can be completed in O(S/√P) time on a P processor mesh array. The procedure Construct_Correspondence_Table() requires grouping the retrieved hash table entries according to the model number they belong to. We use a variant of the Merge_Keys() procedure (see Figure 3.15) to group the entries.
It is assumed that the total number of hypotheses generated is O(S) and that there are at most M_s models present in the scene, such that M_s ≪ M, where M is the total number of models in the database. These assumptions are also made in [Ste92]. For each model i, i ≤ M_s, found in the scene, a distinct block of P/M_s processors is assigned to check the consistency. The procedure Construct_Correspondence_Table() also groups the processors for each model present in the scene. It is easy to show that the time taken by the procedure Construct_Correspondence_Table() is dominated by the sorting time over the O(S) entries of the table. Therefore, the total time to execute this procedure on a P processor mesh array is O(S/√P).

The procedure Check_Consistency() is used to check the consistency of the hypotheses stored within each PE. As assumed earlier, there are at most S hypotheses; therefore, the number of hypotheses per PE is O(S/P), and the time taken by this step is O((S/P)²). Also, as assumed previously, the number of hypotheses per model present in the scene is O(S/M_s). To check consistency with the other hypotheses of the same model, the PEs belonging to the same group are linked in a linear array fashion through a linked list. Each PE sends its hypotheses to its successor PEs. Each PE checks the consistency of its local hypotheses with those received from its ancestor PE in the linked list. It is easy to show that after O(S/P × S/M_s) steps all the PEs have computed consistencies over all the possible pairs of hypotheses of each model. Therefore, it takes O(S²/(P·M_s)) time to perform the consistency check.
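The CONSIST_COUNT bookkeeping can be sketched serially as follows; the predicate `consistent` stands in for the distance/angle/direction constraints, and the exact counting convention (each hypothesis counted against the other hypotheses of its model) is an assumption of this sketch:

```python
def candidate_models(corresp_table, consistent, threshold=3):
    """Serial sketch of Check_Consistency(): for each model,
    consist_count[i] counts how many other hypotheses of the same
    model hypothesis i is consistent with; a model holding a
    hypothesis with count >= threshold becomes a candidate."""
    candidates = []
    for model, hyps in corresp_table.items():
        consist_count = [0] * len(hyps)
        for i in range(len(hyps)):
            for j in range(len(hyps)):
                if i != j and consistent(hyps[i], hyps[j]):
                    consist_count[i] += 1
        if any(c >= threshold for c in consist_count):
            candidates.append(model)
    return candidates
```

The pairwise loop makes the O((S/P)² )-style cost of this step explicit: with h hypotheses for a model, h² consistency checks are performed.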
Total execution time for the recognition phase:
= time to compute the keys and to access the hash table
+ time to construct the correspondence table
+ time to perform the consistency check

Therefore, the total execution time for the recognition phase on a P processor mesh array is O(S/√P + (S/P)² + S²/(P·M_s)).

Chapter 4
Applications of Parallel Techniques

The parallel techniques developed in Chapter 3 mainly deal with the partitioning and mapping of complex data structures onto parallel machines. Also, efficient data routing techniques have been presented to handle the irregular data communication patterns employed in the sequential techniques for object recognition. In this chapter, several other image understanding tasks having computation and communication characteristics similar to those of object recognition are studied. The techniques developed in Chapter 3 are further extended to derive parallel solutions for these tasks. In particular, parallel solutions for various stereo matching techniques are devised. In most stereo matching techniques, the matching process is similar to the image matching process in object recognition; the various stereo matching algorithms differ with respect to the primitives used for matching and the constraints applied during the matching process. However, the information inferred from the matching process is used differently. In this chapter, data parallel solutions are derived for two well known sequential stereo matching techniques: stereo matching using zero crossings as matching primitives [KA87] and stereo matching using line segments [MN85]. Both techniques use a variant of the relaxation labelling discussed in Chapter 3. Additional details of the work presented in this chapter can be found in [KLP91, KP92, PKP92, KL93]. In the following, a brief introduction to the basics of stereo matching is presented.

4.1 Stereo Matching

Stereo matching (stereopsis) is one of the well known methods for the extraction of depth information.
Two images, left and right, captured at the same time but at different angles, are matched. The major steps involved in the process of stereopsis are preprocessing, establishing correspondence (matching), and recovering depth. Among these steps, matching is the most important and computation-intensive stage. It establishes correspondence among homologous features, that is, features that are projections of the same physical entity in each view.

The imaging geometry of a conventional imaging system involves a pair of cameras with their optical axes mutually parallel and separated by a horizontal distance denoted as the stereo baseline. The cameras have their optical axes perpendicular to the stereo baseline. Since the displacement between the optical centers of the two cameras is purely horizontal, the positions of corresponding points in the two images can differ only in the horizontal component. Figure 4.1 shows the imaging geometry of a stereo pair of cameras. The rays of projection POl and POr define the plane of projection of the 3-D scene point, called the epipolar plane. For a given point Pl in the left image, its corresponding match point Pr in the right image must lie on the line of intersection of the epipolar plane and the image plane. This line is called the epipolar line. The epipolar line in the right image corresponding to a point Pl in the left image defines the search space within which the corresponding match point Pr should lie. Thus, the epipolar constraint is obtained as a result of the imaging geometry of the stereo camera system and helps stereo analysis.

Figure 4.1: Imaging geometry of stereo cameras.

The match points obtained as a result of imposing the epipolar constraint on the local matching search could result in two or more candidate matches.
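Under the parallel-axis geometry of Figure 4.1, matched points differ only horizontally, and the standard triangulation relation recovers depth as z = f·B/d for focal length f, baseline B, and disparity d. A minimal sketch, with illustrative variable names and units:

```python
def depth_from_disparity(xl, xr, focal_length, baseline):
    """Triangulation for the parallel-axis stereo geometry:
    disparity d = xl - xr (purely horizontal displacement of the
    matched points), depth z = focal_length * baseline / d."""
    d = xl - xr
    if d == 0:
        return float("inf")  # zero disparity: point at infinity
    return focal_length * baseline / d
```

The inverse relation between disparity and depth is why a maximum disparity bound (such as dmax below) translates directly into a minimum recoverable depth.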
The disparity, obtained by computing the relative displacement of the matching feature points in the two images, is used to extract the 3-D depth of the scene point that projects onto the two matched points. The techniques in stereo matching are usually classified into three categories: intensity/area based matching, feature based matching, and hierarchical matching [DA89]. A pictorial representation of the sequential approaches in stereo matching and their inter-relationship is provided in Figure 4.2.

Figure 4.2: Stereo matching: known sequential approaches.

4.2 Related Work

In the past, several implementations of various stereo matching algorithms have been derived. These implementations differ in terms of the matching primitives used and the target architectures employed in the parallel solutions. Most of these implementations are machine specific, and the techniques used in such approaches are not extendable to other mesh based/linear array architectures.

Drumheller and Poggio [DP86] have implemented hierarchical stereo matching algorithms on the Connection Machine [Hil85]. They use zero crossings and the sign of convolution as features for matching. Potential matches are saved if they have the same zero-crossing values. Next, local support for each candidate pair is gathered by applying a constraint on disparity continuity. Finally, matches are selected based on the amount of local support gathered for disparity continuity, uniqueness, and an ordering constraint. Recently, Laine and Roman [LR91] have given an incremental algorithm for stereo matching on a theoretical model of SIMD machines. This implementation makes some unrealistic assumptions, using a model closer to a PRAM. It assumes a 2-D array of pipelined processors and a set of memory arrays that may be read and/or updated during each machine cycle.
Model parameters include the number of stages per pipeline, input and output bandwidth, and the stage interconnection network. The algorithm comprises two phases. In phase one, matches are first discarded based on a loose geometric constraint and the ordering of previous matches, if any. Phase two computes the initial probability of each possible match based on the individual evaluation of the match and the classification of its set.

Reinhart and Nevatia, in [RN90], have also provided an implementation of the stereo matching algorithm [MN85] using linear features as primitives. The objectives achieved in the parallel implementation by Reinhart and Nevatia [RN90] are different from the classical objectives of parallelization. The objectives emphasized in their implementation are reducing the system complexity and reducing the programmer burden. The system complexity is defined as the amount of restructuring required to parallelize a sequential algorithm, particularly in terms of control logic. Similarly, the programmer burden is defined in terms of the degree of difficulty in developing and maintaining the parallel algorithm implementation. In their approach, analysis of the algorithm is performed to identify the inherent parallelism. Based on the analysis, components of a parallel processor architecture are specified. That is, rather than mapping the algorithm onto a given or well-established parallel architecture, a matching parallel architecture is defined. This approach may introduce some ease for the programmer and may also reduce some logic but, on the other hand, it may introduce a different architecture for each of the techniques. Also, the structure of the proposed architecture may be irregular in terms of symmetry, hence introducing difficulty in the VLSI realization of such parallel architectures.
Barnard [Bar89] has introduced a new hierarchical stereo system based on the concept of simulated annealing. The parallel implementation has been developed on the Connection Machine.

4.3 Stereo Matching using Zero Crossings

In this section, scalable data parallel algorithms are derived for a stereo matching technique using zero crossing points as matching primitives. The proposed parallel algorithms take O(nm/P) time on a √P × √P processor mesh array, where n is the number of zero crossing points in the left image, m is the number of possible candidate points in the right image for a given zero crossing point, and 1 ≤ P ≤ n. The sequential algorithm takes O(nm²) time, while the faster sequential algorithm developed in this section runs in O(nm) time.

The stereo matching algorithm studied in this section uses two independent measures of similarity, namely the zero crossing pattern and the intensity gradient. The matching process uses a relaxation method based on the continuity of local disparities, the similarity of the zero crossing patterns, and the smoothness of the probability of matching. Following the terminology in [KA87], the matching process associates each zero crossing point in the left image Z_l(x_i, y_i) with a point in the right image Z_r(x_j, y_j). In the parallel axis geometry, the search process for matching is applied only on the left side of the transferred coordinates of the candidate point [KA87]. The search for correspondence is constrained by the epipolar line and is bounded by the maximum disparity value (denoted as dmax in Figure 4.3). The dmax value can be computed to suit the application. In the following section, for the sake of completeness, the matching process is described. The details can be found in [KA87].

Figure 4.3: Finding a candidate match.
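The candidate search of Figure 4.3 follows directly from the two constraints just stated, the epipolar line and the maximum disparity. A minimal sketch, assuming feature points are (x, y) tuples in pixel coordinates:

```python
def candidate_matches(z_left, right_points, d_max):
    """Candidates for a left zero crossing z_left = (x, y): points on the
    same epipolar line (same scanline under parallel-axis geometry) whose
    horizontal displacement lies within the maximum disparity d_max."""
    x, y = z_left
    return [(xr, yr) for (xr, yr) in right_points
            if yr == y and 0 <= x - xr <= d_max]
```

The returned list is exactly the initial label set D_i of the relaxation process described next.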
4.3.1 Sequential Technique

The collection of the locations of all the zero crossing points which do not form a horizontal zero crossing pattern in the left image forms the set of nodes {n_i} for the relaxation process. Each node n_i = (x_i, y_i) has a set of labels D_i, which consists of a set of possible disparity values {d_j}, and each label is associated with a probability P_i(d_j) that the node n_i has disparity value d_j. Initially, the set of labels for each node is established by finding all the candidate points for each zero crossing point in the left image. Once a set of labels is associated with a node, two weighting functions are used to assign the initial probability to each label. One of the weighting functions is based on the similarity of the zero crossing pattern and the other is based on the difference in the gradients of the grey level intensity. These weighting functions w_j1 and w_j2 are computed as follows. Let I_l(x, y) and I_r(x, y) be the image functions at point (x, y) of the left and right images respectively, and let DP_ij be the directional difference between ZP_l(x_i, y_i) and ZP_r(x_j, y_j). Then, for label d_j in D_i, the weight w_j1 is computed from DP_ij, and w_j2 from the difference |G_l(x_i, y_i) - G_r(x_j, y_j)| between the intensity gradients, where the gradient in the left image is

G_l(x_i, y_i) = [I_l(x_{i+1}, y_i) - I_l(x_{i-1}, y_i)] / 2

and G_r is defined analogously; the detailed forms of w_j1 and w_j2 follow [KA87]. The total weight is assigned as follows:

W_j = a · w_j1 + b · w_j2

where a and b are positive constants (a + b = 1) which control the influences of w_j1 and w_j2 on the total weight W_j. Let us suppose that D_i has m labels and the j-th label has weight W_j; then the initial probability that node n_i has disparity value d_j is given by

P_i^0(d_j) = W_j / (W* + Σ_{k=1}^m W_k),   j = 1, ..., m

where W* is the weight of no match, calculated as W* = 1 - max_j(W_j). The initial probabilities assigned to all the labels of D_i are then modified using an iterative updating scheme.
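The initial assignment can be sketched numerically. A minimal sketch, assuming the combined weights W_j have already been computed and lie in [0, 1]:

```python
def initial_probabilities(weights):
    """Initial label probabilities P_i^0(d_j) = W_j / (W* + sum_k W_k),
    where the no-match weight is W* = 1 - max_j W_j."""
    w_star = 1.0 - max(weights)
    denom = w_star + sum(weights)
    return [w / denom for w in weights]
```

Note that the probabilities of the m labels together with the no-match probability W* / denom sum to one, which is what makes the subsequent relaxation well defined.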
Let P_i^k(d_j) be the k-th iterated probability of the point Z_l(x_i, y_i), and let Z_l(x_f, y_f) and Z_l(x_s, y_s) be its first and second connected zero crossing points, respectively. Let P_f^k(d_j) and P_s^k(d_j) be the probabilities at locations (Z_l(x_f, y_f), d_j) and (Z_l(x_s, y_s), d_j), respectively. Then

P_i^{k+1}(d_j) = P_i^k(d_j) + c · F(P_i^k(d_j)) · (P_F^k + P_S^k - I(P_FS))

P_i^{k+1}(d*) = P_i^k(d*)

where d* denotes the no-match label and

P_F^k = max[P_f^k(d_j - 1), P_f^k(d_j), P_f^k(d_j + 1)]

P_S^k = max[P_s^k(d_j - 1), P_s^k(d_j), P_s^k(d_j + 1)]

F(P_i^k(d_j)) = [P_i^k(d_j)]²  if 0 ≤ P_i^k(d_j) < 0.5
F(P_i^k(d_j)) = P_i^k(d_j) · (1 - P_i^k(d_j))  if 0.5 ≤ P_i^k(d_j) ≤ 1

I(P_FS) = 0 if P_F^k + P_S^k ≠ 0, and 1 if P_F^k + P_S^k = 0.

repeat
  change ← 0;
  Compute_Weights(zero-crossings in left and right images);
  Update_Probabilities();
  Vote_for_the_match();
until (change = 0)

Figure 4.4: Sequential algorithm for stereo matching using zero crossings as matching primitives.

The algorithm terminates after a constant number of iterations, and each iteration takes O(nm²) time. Each time unit corresponds to a simple arithmetic/logic operation.

4.3.2 A Fast Sequential Algorithm

The update probability procedure presented in Figure 4.6 takes O(nm²) time. We provide a faster sequential algorithm for the update procedure in Figure 4.8. A careful observation shows that the neighborhood maxima P_F^k and P_S^k can be computed once for all the labels of a node and then reused inside the label loop, so that each iteration of the update runs in O(nm) instead of O(nm²) time.

4.3.3 Partitioned Parallel Implementation

In this section, we present a partitioned parallel implementation of the sequential algorithm shown in Figure 4.8. The parallel algorithm is shown in Figure 4.9. Each PE_ij in the array is assigned to find a match for n/P zero

Compute_Weights(N, M)
  for i = 1 to N do
    Wk[i] ← 0
    for j = 1 to m such that j is a candidate match point of i do
      compute G_l, G_r
      compute w_j1, w_j2, W_j
      if W_j > Max[i][j] then Max[i][j] ← W_j
      Wk[i] ← Wk[i] + W_j
    Next j
    Compute initial probability P_i^0[j]
  Next i
end

Figure 4.5: Algorithm for computing initial weights for candidate matches.
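The two nonlinear components of the update rule, F and I, can be written down directly from their case definitions. A minimal sketch; the function names are illustrative:

```python
def gain(p):
    """F(p): quadratic for p < 0.5, p * (1 - p) for p >= 0.5, so small
    probabilities change slowly and mid-range probabilities change fastest."""
    return p * p if p < 0.5 else p * (1.0 - p)

def no_support(pf, ps):
    """I(P_FS): 1 exactly when neither connected neighbour offers any
    supporting probability (P_F^k + P_S^k = 0), else 0."""
    return 1 if pf + ps == 0 else 0
```

The shape of F means a label near 0 or 1 is already committed and barely moves, while an undecided label near 0.5 is pushed hardest by its neighbours' evidence.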
Update_Probabilities()
  for i = 1 to n do
    for j = 1 to m such that j is a candidate match point of i do
      compute P_F^k and P_S^k
      compute I(P_FS)
      compute F(P_i^k)
      compute P_i^{k+1}
    Next j
  Next i
end

Figure 4.6: Algorithm for updating probabilities of candidate matches.

Vote_for_the_match()
  for i = 1 to N do
    for j = 1 to m such that j is a candidate match point of i do
      if P_i^k < 0.05 then
        discard the pair from the match list
      else if P_i^k > 0.7 then
        accept the pair as matched and remove i from the next iterations.
    Next j
  Next i
end

Figure 4.7: Algorithm for voting for candidate matches.

Faster_Update_Probabilities()
  for i = 1 to n do
    compute P_F^k and P_S^k
  for i = 1 to n do
    for j = 1 to m such that j is a candidate match point of i do
      compute I(P_FS)
      compute F(P_i^k)
      compute P_i^{k+1}
    Next j
  Next i
end

Figure 4.8: A faster algorithm for updating probabilities of candidate matches.

repeat
  change ← 0;
  Parallel_Compute_Weights(zero-crossings in left and right images);
  Parallel_Update_Probabilities();
  Parallel_Vote_for_the_match();
until (change = 0)

Figure 4.9: A parallel algorithm for stereo matching using zero crossings as matching primitives.

crossing points in the left image. The zero crossings, computed in both the left and right images, are stored in snake-like row-major order in the processor array. A PE is marked as a leader if it contains the right-most zero crossing point of a row in the right image. The memory module MM_ij associated with PE_ij contains DP_ij, intensity values for the corresponding zero crossing points in the right and left images, and a list of m candidate matches for each of the n/P points assigned to it. An initialization procedure, Parallel_Compute_Weights(), is executed in each PE to compute the initial probability for each of the candidate points. This can be performed in O(nm/P) time.
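The speed-up of Figure 4.8 comes from hoisting the neighbour maxima P_F^k and P_S^k out of the inner candidate loop. A per-node sketch follows; the constant c and the exact way the terms combine are assumptions patterned on the update rule of [KA87], not a verbatim transcription:

```python
def update_node(p_first, p_second, p_node, c=0.3):
    """One relaxation update for a node's m labels in O(m) time: the
    3-neighbourhood maxima over the two connected zero crossings are
    computed once per label position and shared, instead of being
    recomputed inside a nested O(m^2) scan."""
    def nbr_max(arr, j):
        # max over the label neighbourhood {d_{j-1}, d_j, d_{j+1}}
        return max(arr[max(j - 1, 0):j + 2])
    out = []
    for j, p in enumerate(p_node):
        pf, ps = nbr_max(p_first, j), nbr_max(p_second, j)
        f = p * p if p < 0.5 else p * (1.0 - p)   # gain F(p)
        ind = 1 if pf + ps == 0 else 0            # indicator I(P_FS)
        out.append(p + c * f * (pf + ps - ind))   # assumed combination
    return out
```

Labels with supporting neighbours are reinforced; a label with no neighbour support at all is penalised by the indicator term.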
Next, during each iteration of the procedure Parallel_Update_Probabilities(), the initial probabilities of candidate matches are updated based on the neighboring O(m) points. This procedure can be executed in a pipelined fashion in each PE in O(nm/P) time. The PEs marked as leaders terminate the pipeline process. This claim is supported by the assumption that we are using a parallel axis geometry, so that a zero crossing in the left image has its match, if any, within a distance of dmax (approximately within m points) on the same epipolar line. After each iteration of the update procedure, voting for each candidate point is performed in O(nm/P) time. The algorithm terminates after a constant number of iterations of the update and voting procedures [KA87].

Parallel_Compute_Weights(N, M)
  in parallel for each i in the left image, 0 ≤ i < n do
    Wk[i] ← 0
    compute G_l
  parallel_end
  in parallel for each j in the right image, 0 ≤ j < n do
    compute G_r
    distribute(G_r)
  parallel_end
  in parallel for each i in the left image, 0 ≤ i < n do
    for j = 1 to m such that j is a candidate match point of i do
      compute w_j1
      compute w_j2
      compute W_j
      if W_j > Max[i][j] then Max[i][j] ← W_j
      Wk[i] ← Wk[i] + W_j
    Next j
    Compute initial probability P_i^0[j]
  parallel_end

Figure 4.10: A parallel algorithm for computing initial weights for candidate matches.

Parallel_Update_Probabilities()
  in parallel for each i in the left image, 0 ≤ i < n do
    compute max(P^k(d_j - 1), P^k(d_j), P^k(d_j + 1))
    distribute(max, F, S)
  parallel_end
  in parallel for each i in the left image, 0 ≤ i < n do
    for j = 1 to m such that j is a candidate match point of i do
      compute I(P_FS)
      compute F(P_i^k)
      compute P_i^{k+1}
    Next j
  parallel_end

Figure 4.11: A parallel algorithm for updating probabilities of candidate matches.
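The snake-like row-major storage order used for the zero crossings maps the k-th point to mesh coordinates as follows. A minimal sketch for a mesh with a given number of columns:

```python
def snake_pe(k, cols):
    """PE coordinates of the k-th zero crossing under snake-like row-major
    order: even rows run left-to-right, odd rows right-to-left, so
    consecutive points always occupy adjacent PEs, which is what makes
    the neighbour-based pipeline possible."""
    r, c = divmod(k, cols)
    return (r, c) if r % 2 == 0 else (r, cols - 1 - c)
```

For example, with 4 columns the points 3, 4, 5 land in PEs (0,3), (1,3), (1,2): the row boundary is crossed between physically adjacent PEs.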
Parallel_Vote_for_the_match()
  in parallel for each i in the left image, 0 ≤ i < n do
    for j = 1 to m such that j is a candidate match point of i do
      if P_i^k < 0.05 then
        discard the pair from the match list
      else if P_i^k > 0.7 then
        accept the pair as matched and remove i from the next iterations,
        i.e., mark the PE as inactive.
    Next j
  parallel_end
end

Figure 4.12: A parallel algorithm for voting for candidate matches.

4.4 Stereo Matching using Linear Features

In this section we provide data parallel algorithms for the stereo matching technique using linear features [MN85].

4.4.1 Sequential Technique

One of the well-known methods for stereo matching uses linear features as matching objects. The key advantages of this method are its intrinsic accuracy and its low sensitivity to photometric variations [MN85, MN84]. Stereo matching using linear features is capable of handling more complex scenes (such as those containing repetitive structures). For the sake of completeness, the main ideas of this algorithm are presented below.

Minimum Disparity Algorithm

The technique attempts to match overlapping segments detected along the same epipolar line that have similar contrast and orientation. Following the terminology in [MN85], for each segment a_i in the left image, a match is found in a window w(a_i) defined in the right image. Similarly, for each segment b_j in the right image, a match is found in a window w(b_j) defined in the left image. The shape of the window is a parallelogram. This is shown in Figure 4.13 for a left-to-right match: one side corresponds to a_i, and the other side is a horizontal vector of length 2·dmax, where dmax is the maximum disparity. The number of segments in each window is assumed to be at most n, and both images are assumed to have N segments each.
For each a_i, a set Sp(a_i) of possible matches in the window w(a_i) is defined based on contrast, overlap, and orientation. Similarly, for each b_j, a set Sp(b_j) is defined. To assign unambiguous matches, a set of matches is considered together for each segment in the image. For each possible element j in Sp(a_i), an evaluation function v(i, j) is computed. This function depends on how well the disparities of the other matching pairs in w(b_j) agree with the average disparity of the matching pair. A set of preferred matches Q_t(a_i) is constructed for each i during iteration t, and j is included in the set if the following holds:

∀ b_k ∈ Sp(a_i) such that b_k ↔ b_j: v_t(i, j) ≤ v_t(i, k)   (4.1)

∀ a_h ∈ Sp(b_j) such that a_h ↔ a_i: v_t(i, j) ≤ v_t(h, j)   (4.2)

The relation b_k ↔ b_j is true if b_k overlaps b_j.

Figure 4.13: Determining a stereo window.

v(i, j) is defined as follows:

v_{t+1}(i, j) = (1 / card(b_j)) · Σ_{a_h in w(b_j)} min_{b_k verifies C1(a_h)} λ_ijhk · |d_hk - d_ij|
             + (1 / card(a_i)) · Σ_{b_k in w(a_i)} min_{a_h verifies C2(b_k)} λ_ijhk · |d_hk - d_ij|   (4.3)

In the above equation, t + 1 indicates the iteration number, λ_ijhk = min(overlap(i, j), overlap(h, k)), and card(a_i) is the number of segments in w(a_i). Two segments overlap if, by sliding one of them in a direction parallel to the epipolar line, they would intersect. The amount of overlap is used in weighting a particular match.

The relations C1 and C2 are defined as follows. We say b_k verifies C1(a_h) if:

1. if Q_t(a_h) ≠ ∅, b_k is in Q_t(a_h); else b_k is in Sp(a_h), and
2. either b_k ↔ b_j, or b_k and a_i do not overlap.

Note that conditions 1 and 2 form a conjunction. In order to expose the potential parallelism, the computations performed in the algorithm are shown in Figure 4.14. The procedure i-pref(i, j, QT(a_i))

1.  repeat
2.    change ← 0;
3.    for i = 1 to N do
4.      for each j such that j ∈ w(a_i) do
5.        i-pref(i, j, QT(a_i))
6.      end
7.    end;
8.    for j = 1 to N do
9.
      for each i such that i ∈ w(b_j) do
10.       j-pref(j, i, QT(b_j))
11.     end
12.   end;
13.   for i = 1 to N do
14.     for each j such that j ∈ w(a_i) do
15.       Q-update(i, j)
16.     end
17.   end
18.   for i = 1 to N do Q(a_i) ← Q'(a_i);
19.   for j = 1 to N do Q(b_j) ← Q'(b_j);
20. until (change = 0)

Figure 4.14: Sequential algorithm for stereo matching.

shown in Figure 4.15 is used to find a temporary preferred set QT(a_i) for a_i without confirmation from the other image (i.e., satisfying Equation 4.1 but not Equation 4.2). The two nested loops (lines 4-8 and lines 9-13) correspond to the computations in Equation 4.3.

procedure i-pref(i, j, QT(a_i))
1.  for each h such that h ∈ w(b_j) do
2.    for each k such that b_k verifies C1(a_h) do
3.      min_h ← min(min_h, λ_ijhk · |d_hk - d_ij|);
4.    end;
5.    sum1(i, j) ← sum1(i, j) + min_h;
6.  end;
7.  ave1(i, j) ← sum1(i, j) / card(b_j);
8.  for each k such that k ∈ w(a_i) do
9.    for each h such that a_h verifies C2(b_k) do
10.     min_k ← min(min_k, λ_ijhk · |d_hk - d_ij|);
11.   end;
12.   sum2(i, j) ← sum2(i, j) + min_k;
13. end;
14. ave2(i, j) ← sum2(i, j) / card(a_i);
15. sum(i, j) ← ave1(i, j) + ave2(i, j);
16. case:
17.   sum(i, j) < Min(i):
18.     QT(a_i) ← { j };
19.     Min(i) ← sum(i, j);
20.   sum(i, j) = Min(i):
21.     QT(a_i) ← QT(a_i) ∪ { j };
22. end

Figure 4.15: Finding partially preferred matches for the first image.

On the other hand, the procedure j-pref(j, i, QT(b_j)) is used to find a temporary preferred set QT(b_j) for b_j without confirmation from the other image (i.e., satisfying Equation 4.2 but not Equation 4.1). This procedure is similar to the procedure i-pref(). The third procedure, Q-update(i, j), is then used to combine the results from the two procedures, i-pref and j-pref, to determine the new sets, Q(a_i) and Q(b_j), for a_i and b_j, respectively.
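The core of i-pref is Equation 4.3, and it can be sketched compactly. In the sketch below the data structures are hypothetical illustrations: `w_bj` maps each a_h in w(b_j) to the b_k verifying C1(a_h), `w_ai` maps each b_k in w(a_i) to the a_h verifying C2(b_k), `d` holds average disparities, `lam` the overlap weights, and the window cardinalities are approximated by the sizes of the support maps:

```python
def evaluation(i, j, w_bj, w_ai, d, lam):
    """v_{t+1}(i, j) per Equation 4.3: for each qualifying segment take
    the minimum weighted disparity difference lam[i,j,h,k] * |d[h,k] - d[i,j]|,
    normalise each sum by its window cardinality, and add the two averages."""
    dij = d[(i, j)]
    s1 = sum(min(lam[(i, j, h, k)] * abs(d[(h, k)] - dij) for k in ks)
             for h, ks in w_bj.items())
    s2 = sum(min(lam[(i, j, h, k)] * abs(d[(h, k)] - dij) for h in hs)
             for k, hs in w_ai.items())
    return s1 / max(len(w_bj), 1) + s2 / max(len(w_ai), 1)
```

A small v(i, j) means the other pairs in both windows broadly agree with the disparity d_ij, which is why i-pref keeps the j minimising this value.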
Notice that the execution of the i-pref procedure (or the j-pref procedure) takes O(n²) time, and that of the Q-update procedure takes constant time. It is easy to verify that each repeat iteration takes O(Nn³) time. The complete algorithm terminates after a constant number of iterations [MN85].

procedure Q-update(i, j)
1. if ((j ∈ QT(a_i)) AND (i ∈ QT(b_j))) then
2. begin
3.   change ← 1;
4.   Q'(a_i) ← Q'(a_i) ∪ { j };
5.   Q'(b_j) ← Q'(b_j) ∪ { i };
6. end

Figure 4.16: Updating sets of preferred matches.

4.4.2 Partitioned Parallel Implementations

In this section we present parallel implementations of the minimum disparity algorithm on fixed size linear and mesh arrays. A parallel version of the algorithm is given in Figure 4.17. We provide parallel implementations of the i-pref(), j-pref(), and Q-update() procedures.

1. repeat
2.   change ← 0;
3.   for i = 1 to N do
4.     Parallel-i-pref(i);
5.   for j = 1 to N do
6.     Parallel-j-pref(j);
7.   for i = 1 to N do
8.     Parallel-Q-update(i);
9. until (change = 0)

Figure 4.17: Parallel algorithm for stereo matching.

The difficulty in parallelizing such procedures stems from the irregular data structures employed in their sequential solutions. A naive parallel implementation requires enormous data redundancy or may suffer from large communication overheads. We provide efficient partitioning and mapping of the input data that result in processor-time optimal solutions. The main idea is to parallelize Equation 4.3. This equation (which the procedure i-pref() computes) takes O(n³) time to find a match for each segment. The matching process for each segment can be processed in parallel. However, this requires extensive data redundancy; otherwise, the time spent in exchanging the data among the processors dominates the computation time.
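The mutual-confirmation test of Q-update (Figure 4.16) can be sketched with sets. A minimal sketch; dictionary-of-sets representations are an assumption:

```python
def q_update(i, j, qt_a, qt_b, q_a, q_b):
    """Accept (i, j) as a preferred match only when each side appears in
    the other's temporary preferred set; returns the `change` flag that
    keeps the outer repeat loop of the algorithm running."""
    if j in qt_a.get(i, set()) and i in qt_b.get(j, set()):
        q_a.setdefault(i, set()).add(j)
        q_b.setdefault(j, set()).add(i)
        return True
    return False
```

Because acceptance needs agreement from both i-pref and j-pref, the constant-time test is what enforces the bidirectional consistency of Equations 4.1 and 4.2.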
We partition and map the data onto the processor array such that the communication overhead is minimized. We show an O(n³/P) solution for Equation 4.3 on a P processor linear array. Procedures Parallel-i-pref(i) and Parallel-j-pref(j) determine the partial preferred matches, QT(a_i) and QT(b_j), for a_i and b_j respectively. The third procedure, Parallel-Q-update, is used to determine the new sets, Q(a_i) and Q(b_j), for a_i and b_j, respectively. Figure 4.20 shows the flow of computation and data in the PEs during the execution of procedure Parallel-i-pref(i).

Data Partitioning/Mapping

In stereo matching, the input to the algorithm is a set of N segments from the right image and a set of N segments from the left image. As described in Section 4.4.1, each segment is represented by its length, contrast, and orientation. With each pair (a_i, b_j) such that a_i and b_j overlap and have similar contrast and similar orientation, an average disparity d_ij is associated. In order to find a possible match for each segment a_i in the left image, a window w(a_i) is defined in the right image. Similarly, for each segment b_j in the right image, a window w(b_j) is defined in the left image. Each window is assumed to contain at most n segments. In this section, we present a partitioning and mapping of these windows onto the processor array such that the communication overhead among the processors does not become a bottleneck and, at the same time, the data redundancy is minimized.

As described earlier, for each segment a_i there is a window defined in the other image. Thus, there is a total of N distinct windows, the number of segments in an image. Each window has at most n segments. In order to reduce the data redundancy while mapping these windows onto a linear array of P PEs, the following constraint should be satisfied.

• There should be at most N/P windows mapped onto any PE.
At any time, once the above condition is reached, no further windows should be assigned to that PE. As is clear from lines 5 and 10 of the algorithm given in Figure 4.14, the windows defined in the left image are accessed by the segments in the right image and vice versa. In order to exploit the maximum parallelism during the processing of a segment a_i and its possible match b_j, each window w(a_h), for every segment a_h ∈ w(b_j), should be mapped onto a different PE (assuming n = P). This is required during the execution of lines 2-5 and 9-12 of the Parallel-i-pref() procedure. Otherwise, a PE may spend more than O(n) time in sending/receiving the data, as the size of each window is O(n). Therefore, while processing a segment a_i in the left image, the algorithm in Figure 4.18 ensures that for a given segment a_i, each PE is assigned no more than n/P windows (counter2[] in Figure 4.18 is used to check this condition). Also, no window is assigned to more than one PE. A PE is marked available if it is taking part in the assignment. During the assignment, if a PE has been assigned N/P windows in total, it is marked as not-available and no more windows are assigned to that PE (counter1[] is used to verify this condition). At the beginning of the algorithm, all the PEs are marked available.

The mapping algorithm described in Figure 4.18 guarantees that, while processing any segment for matching, there will be no more than O(n) data communication per potential match. This fact will become clear when the parallel algorithm is explained in Section 4.4.2. The running time of the mapping algorithm is O(Nn log P) on a serial machine.

Partitioned Implementation on a Fixed Size Linear Array

The implementation is performed on the architecture shown in Figure 2.1. The number of processors is P ≤ n, the window size.
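The two-counter assignment just described (and shown in full in Figure 4.18) can be sketched as follows. A minimal sketch; the quota handling uses ceilings so that small examples remain feasible:

```python
from math import ceil

def map_windows(left_segments, window, P, n):
    """Assign each distinct right-image window to exactly one PE such that
    (a) a PE holds at most ceil(N/P) windows in total (counter1) and
    (b) for any one left segment, at most ceil(n/P) of the windows it
    touches sit on the same PE (counter2)."""
    N = len(left_segments)
    counter1 = [0] * P
    placed = {}                      # window id -> PE index
    for a in left_segments:
        counter2 = [0] * P           # reset per left segment
        for b in window[a]:
            if b in placed:          # window already assigned elsewhere
                counter2[placed[b]] += 1
                continue
            pe = next(k for k in range(P)
                      if counter1[k] < ceil(N / P)
                      and counter2[k] < ceil(n / P))
            placed[b] = pe
            counter1[pe] += 1
            counter2[pe] += 1
    return placed
```

The per-segment counter2 reset is what spreads the windows touched by any single a_i across distinct PEs, keeping the per-match communication at O(n).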
Each processor is assumed to have O(Nn/P) memory.

procedure Map-windows-left-image(right-image, left-image)
1.  Initialize:
2.    for k = 0 to P - 1 do
3.      counter1[k], counter2[k] ← 0;
4.    Mark all PEs as available
5.  for each segment a_i in the left image do
6.    for each b_j in the right image such that j ∈ w(a_i) do
7.      if w(b_j) is not assigned then
8.        assign w(b_j) to an available PE, say PE_k, 0 ≤ k ≤ P - 1, such that counter2[k] < n/P;
9.        counter1[k]++, counter2[k]++
10.       if counter1[k] = N/P then
11.         mark PE_k as not-available
12.     else counter2[k]++, assuming w(b_j) is assigned to PE_k
13.   end;
14.   for k = 0 to P - 1 do
15.     counter2[k] ← 0;
16. end;

Figure 4.18: Mapping windows of the left image onto a fixed size linear array.

The parallel algorithm shown in Figure 4.17 is used. In order to give a more detailed picture of the partitioning, Parallel-i-pref(i) is described in Figure 4.19. In the following sections, a|b denotes a mod b.

procedure Parallel-i-pref(i, QT(a_i))
1.  for each j such that j ∈ w(a_i) do
2.    in parallel for each h such that h ∈ w(b_j) do
3.      for each k such that b_k verifies C1(a_h) do
4.        min(i, j, h) ← min(min(i, j, h), λ_ijhk · |d_hk - d_ij|);
5.      end;
6.      sum1(i, j) ← sum1(i, j) + min(i, j, h);
7.    parallel end;
8.    ave1(i, j) ← sum1(i, j) / card(b_j);
9.    in parallel for each k such that k ∈ w(a_i) do
10.     for each h such that a_h verifies C2(b_k) do
11.       min(i, j, k) ← min(min(i, j, k), λ_ijhk · |d_hk - d_ij|);
12.     end;
13.     sum2(i, j) ← sum2(i, j) + min(i, j, k);
14.   parallel end;
15.   ave2(i, j) ← sum2(i, j) / card(a_i);
16.   sum(i, j) ← ave1(i, j) + ave2(i, j);
17.   case:
18.     sum(i, j) < Min(i):
19.       QT(a_i) ← { j };
20.       Min(i) ← sum(i, j);
21.     sum(i, j) = Min(i):
22.       QT(a_i) ← QT(a_i) ∪ { j };
23.   end
24. end

Figure 4.19: Parallel-i-pref(i) algorithm for execution on a fixed size linear array.

In each i-loop with a fixed j, PE_h, 0 ≤ h ≤ n - 1 (or PE_k
for the second part of the i-loop) is dedicated to the computation of min(i, j, h) (or min(i, j, k)). The information required to accomplish the computation in PE_h (or PE_k) includes:

1. if t = 0, then Sp(a_h) (or Sp(b_k)); else Q_t(a_h) (or Q_t(b_k)),
2. d_hk, 0 ≤ k ≤ n - 1 (or d_hk, 0 ≤ h ≤ n - 1), and
3. λ_ijhk, 0 ≤ k ≤ n - 1 (or λ_ijhk, 0 ≤ h ≤ n - 1).

It should be noted that for the computation of min(i, j, h) (min(i, j, k)), all the segments belonging to w(a_h) (w(b_k)) should be in MM_h (MM_k). The data partitioning algorithm presented in Figure 4.18 ensures that this requirement is satisfied.

The main steps of procedure Parallel-i-pref(i) are briefly discussed in the following, with the corresponding execution time indicated within parentheses.

1. ∀h ∈ w(b_j), load d_hk and λ_ijhk: O(n) data to PE_h, O(n²) data in all. (O(n²/P))
2. PE_0 broadcasts d_ij to all the PEs, 0 ≤ i, j ≤ n - 1. (O(n))
3. Perform the min operation over all k to determine min(i, j, h) in PE_h. (O((n/P) · n))
4. Along the linear array, i.e. ∀h, all min(i, j, h) values are summed up and saved in PE_0. (O((n/P) · P))
5. Compute the average in PE_0. (O(1))
6. ∀k ∈ w(a_i), load d_hk and λ_ijhk: O(n) data to PE_k, O(n²) data in all. (O(n²/P))
7. ∀k ∈ w(a_i), perform the min_h operation to determine min(i, j, k) in PE_k. (O((n/P) · n))
8. Along the linear array, i.e. ∀k, all min(i, j, k) values are summed up and saved in PE_0. (O((n/P) · P))
9. Take the average of the sum obtained in step 8 and add it to the average obtained in step 5. (O(1))
10. ∀j, find the minimum of all the values obtained in step 9, which gives QT(a_i), and store it back in the memory. (O((n/P) · n))

Figure 4.20 provides a pictorial representation of the execution of the procedure Parallel-i-pref(i). Similar steps can be designed for the procedure Parallel-j-pref(j).
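Steps 4 and 8 above, summing the local minima along the array into PE_0, amount to a neighbour-to-neighbour accumulation. A minimal sketch of the communication pattern only, with each list element modelling one PE's local value:

```python
def sum_to_pe0(local_mins):
    """Accumulate partial sums leftwards along a linear array: after
    P - 1 neighbour transfers the running sum sits at PE_0, so the
    reduction costs O(P) communication steps."""
    vals = list(local_mins)
    for k in range(len(vals) - 1, 0, -1):
        vals[k - 1] += vals[k]   # PE_k sends its partial sum to PE_{k-1}
    return vals[0]
```

On real hardware these transfers for successive j values can be pipelined, which is where the O((n/P) · P) figure for a batch of reductions comes from.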
For each i, PE_0 is assigned to store the value QT(a_i) obtained at the end of the procedure. It can be easily verified that for each i the procedure Parallel-i-pref(i) runs in O(n³/P) time. All the resulting QT(a_i)'s and QT(b_j)'s can then be combined in constant time by using the procedure Parallel-Q-update, obtained by parallelizing the sequential version Q-update in Figure 4.16. Therefore, each iteration takes O(Nn³/P) time, which represents optimal speedup.

Partitioned Implementation on a Fixed Size Mesh Array

Based on the algorithm shown in Figure 4.17, a parallel implementation on a P × P mesh array is developed. The procedure Parallel-i-pref(i) is presented in Figure 4.21. A similar procedure for Parallel-j-pref(j) can be developed. In the following, the floor function (⌊ ⌋) is assumed for all indices of the form x/P. In each i-loop, PE_{h/P, j/P}, 1 ≤ h, j ≤ n (or PE_{k/P, j/P} for the second part of the i-loop), is used for the computation of min(i, j, h) (or min(i, j, k)). The information required to accomplish the computation in PE_{h/P, j/P} (or PE_{k/P, j/P}) includes:

1. if t = 0, then Sp(a_h) (or Sp(b_k)); else Q_t(a_h) (or Q_t(b_k)),
2. d_hk, 1 ≤ k ≤ n (or d_hk, 1 ≤ h ≤ n), and
3. λ_ijhk, 1 ≤ k ≤ n (or λ_ijhk, 1 ≤ h ≤ n).

The main steps of procedure Parallel-i-pref(i) are briefly discussed in the following, with the corresponding execution time indicated within parentheses.

1. ∀j ∈ w(a_i), load d_hk and λ_ijhk: O(n) data to PE_{h/P, j/P}, O(n³) data in all. (O(n³/P²))
2. PE_{1,1} broadcasts d_ij to all the processors in the array. (O((n/P) · P))
3. Perform the min operation over all k to determine min(i, j, h) in PE_{h/P, j/P}. (O((n²/P²) · n))
Figure 4.20: Execution of Parallel-i-pref() on a fixed size linear array (for the left image: load data, broadcast d_ij, per-PE minima, sum and average; then the same for the right image, followed by storing QT(a_i) to external memory).

procedure Parallel-i-pref(i, QT(a_i))
1.  in parallel for each h, j such that h ∈ w(b_j) and j ∈ w(a_i) do
2.    for each k such that b_k verifies C1(a_h) do
3.      min(i, j, h) ← min(min(i, j, h), λ_ijhk · |d_hk - d_ij|);
4.    end;
5.    sum1(i, j) ← sum1(i, j) + min(i, j, h);
6.  parallel end;
7.  ave1(i, j) ← sum1(i, j) / card(b_j);
8.  in parallel for each k, j such that k ∈ w(a_i) and i ∈ w(b_j) do
9.    for each h such that a_h verifies C2(b_k) do
10.     min(i, j, k) ← min(min(i, j, k), λ_ijhk · |d_hk - d_ij|);
11.   end;
12.   sum2(i, j) ← sum2(i, j) + min(i, j, k);
13. parallel end;
14. ave2(i, j) ← sum2(i, j) / card(a_i);
15. sum(i, j) ← ave1(i, j) + ave2(i, j);
16. case:
17.   sum(i, j) < Min(i):
18.     QT(a_i) ← { j };
19.     Min(i) ← sum(i, j);
20.   sum(i, j) = Min(i):
21.     QT(a_i) ← QT(a_i) ∪ { j };
22. end

Figure 4.21: Parallel-i-pref(i) on a mesh array.

procedure Parallel-Q-update(i, j)
1. if ((j ∈ QT(a_i)) AND (i ∈ QT(b_j))) then
2. begin
3.   change ← 1;
4.   Q'(a_i) ← Q'(a_i) ∪ { j };
5.   Q'(b_j) ← Q'(b_j) ∪ { i };
6. end

Figure 4.22: Parallel-Q-update for updating sets of preferred matches.

4. Along each column of processors, i.e. ∀j, all the min(i, j, h) values are summed up and saved in the last processor PE_{P, j/P}. (O((n²/P²) · P))
5. In each PE_{P, j/P}, compute the average. (O(1))
6. ∀j ∈ w(a_i), load d_hk and λ_ijhk: O(n) data to PE_{k/P, j/P}, O(n³) data in total. (O(n³/P²))
7. Perform the min operation over all h to determine min(i, j, k) in PE_{k/P, j/P}. (O((n²/P²) · n))
8. Along each column of processors, i.e. ∀j, all the min(i, j, k) values are summed up and saved in the last processor PE_{P, j/P}. (O((n²/P²) · P))
9.
In each PE_{P, j/P}, ∀j ∈ w(a_i), compute the average of the sum obtained in step 8 and add it to the average obtained in step 5. (O(1))
10. Along the last row of processors, find the minimum of all the values obtained in step 9, and store the corresponding b_j back in the memory, which gives QT(a_i). (O((n/P) · P))

Figure 4.23 provides a pictorial representation of the execution of the procedure Parallel-i-pref(i). Similar steps can be designed for the procedure Parallel-j-pref(j).

It can be easily verified that for each a_i the procedure Parallel-i-pref(i) runs in O(n³/P²) time, with each time unit corresponding to a simple arithmetic/logic operation. The resulting QT(a_i)'s and QT(b_j)'s can then be combined in constant time by using the procedure Parallel-Q-update given in Figure 4.22. Therefore, each iteration takes O(Nn³/P²) time.

Figure 4.23: Data flow for procedure Parallel-i-pref(i) (for the left image: load data, broadcast d_ij, per-PE minima, column sums, averages; then the same for the right image, followed by computing QT(a_i)).

Chapter 5

Parallel Implementations

This chapter presents implementation results of the various algorithms developed in the previous chapters. The implementations are carried out on the Connection Machine CM-5 and on the MasPar MP-1.
These implementations are carried out after carefully studying the characteristics of the underlying architectures of these machines, i.e. the fat tree for the CM-5 and the mesh array for the MP-1. Various experiments were conducted to fine-tune the partitioning and mapping strategies to suit the communication and computation capabilities of these machines. Based on these experiments, data parallel algorithms are designed to efficiently utilize the architectural and programming features. This experimentation has assisted in achieving uniform distribution of work load in the machines during the execution of the algorithms, leading to fast and scalable implementations.

For geometric hashing, earlier implementations claim 700 to 1300 msec for one probe of the recognition phase, assuming 200 feature points in the scene, on an 8K processor CM-2. Our implementations run on a P processor machine, such that 1 < P < S, where S is the number of feature points in the scene. The results show that one probe for a scene consisting of 1024 feature points takes less than 50 msec on a 1K processor MP-1 and less than 10 msec on a 256 processor CM-5. The model database used in the implementations contains 1024 models, and each model is represented using 16 feature points. The implementations developed in this chapter require a number of processors independent of the size of the model database and are scalable with the machine size. Results of concurrent processing of multiple probes of the recognition phase are also reported. Also, based on the implementation experiences and the algorithms presented in Section 3.3, the implementation results are extrapolated for the structural indexing recognition phase. Additional details of the work presented in this chapter can be found in [KKW93, KKPW93, KP93].

In Section 5.1, a brief introduction to the CM-5 and MP-1 architectures is presented.
A discussion of the computation and communication characteristics of the machines is carried out in Section 5.2. Based on the experiments, various mapping and partitioning algorithms are derived in Section 5.3.1. Finally, performance results for geometric hashing and structural indexing are presented in Sections 5.3 and 5.4 respectively.

5.1 Parallel Machines for Implementations

In this section, we first describe the underlying models of the machines used in our implementations and then present the data mapping and partitioning strategies employed on these machines.

5.1.1 The Connection Machine CM-5

A Connection Machine Model CM-5 system contains between 32 and 16,384 processing nodes. Each node is a 32 MHz SPARC processor with up to 32 Mbytes of local memory. A 64 bit floating point vector processing unit is optional with each node. Each processing node is a general purpose computer that can fetch and interpret its own instruction stream. System administration tasks and serial user tasks are performed by control processors. Input and output is performed via high-bandwidth I/O interfaces.

The processing nodes, control processors, and I/O interfaces are interconnected by three networks: a data network, a control network, and a diagnostic network. Figure 5.1 shows a diagram of the CM-5 organization. The data network provides high performance point-to-point data communications between system components. The control network provides cooperative operations, including broadcast, synchronization, and scans (parallel prefix and suffix). The diagnostic network allows back-door access to system hardware to test system integrity and to detect and isolate system errors. The system operates as one or more user partitions. Each partition consists of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks.
Throughout this chapter, the size of the processor array refers to the number of PEs in a partition.

Figure 5.1: The organization of the Connection Machine CM-5.

From a programmer's perspective, the CM-5 system can be considered as comprising a control processor, a collection of processing nodes, and facilities for interprocessor communication (see Fig. 5.2). Each node is a general-purpose SISD processor capable of executing code written in C, Fortran, or in assembly language. Additional details of the CM-5 can be found in [Cor91].

5.1.2 The MasPar MP-1

The MP-1 is a massively parallel SIMD computer system with up to 16K Processing Elements. The system consists of a high performance Unix workstation as the Front End (FE) and a Data Parallel Unit (DPU). The DPU consists of PEs, each with up to 64 Kbytes of memory and 192 bytes of register space. All PEs execute instructions broadcast by an Array Control Unit (ACU) in lock step. PEs have indirect addressing capability, and can be selectively disabled.

Figure 5.2: A virtual CM-5 organization from a programmer's perspective.

Each PE is connected to its eight neighbors via the Xnet for local communication. Besides the Xnet, the MP-1 has a global router network which provides direct point-to-point global communication. The router network is implemented by a three-stage crossbar switch. It provides an equal distance (constant latency) point-to-point communication between any two PEs. A third network, called the global or-tree, is used to move data from the individual processors to the ACU. This network can be used to perform global operations on data in the entire array, such as global maximum, prefix sum, global OR, etc.
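As a concrete picture of how such global operations combine per-PE values, the following is a minimal serial sketch of a binary reduction tree; the tree shape and operators are illustrative only, not MasPar's actual hardware behavior:

```python
def tree_reduce(values, op):
    """Combine per-PE values pairwise, level by level, the way a
    reduction tree (such as the MP-1 global or-tree) conceptually does."""
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Global maximum and global OR over a 16-PE array of sample values
pe_values = [3, 7, 2, 9, 4, 1, 8, 5, 0, 6, 2, 9, 3, 1, 7, 4]
global_max = tree_reduce(pe_values, max)                   # 9
global_or = tree_reduce([0, 0, 1, 0], lambda a, b: a | b)  # 1
```

A prefix sum (scan) follows the same tree with an additional downward pass, which is why such networks complete global operations in time proportional to the tree height rather than the array size.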
A block diagram of the MP-1 system is shown in Fig. 5.3. MasPar currently supports the MasPar Programming Language (MPL) and MasPar Fortran. MPL is MasPar's lowest level programming language and is based on ANSI C. MasPar Fortran is based on Fortran 90. We have used MPL in our implementations.

5.2 Experimenting with CM-5 and MP-1

Before implementing the parallel algorithms on the CM-5 and MP-1, we experimented with the machines to explore salient communication and computation features of the underlying architectures.

Figure 5.3: The MasPar MP-1 system block diagram (courtesy MasPar Computer Corporation).

The machines used in the implementations differ with respect to the mode of operation, the processing speed of the PEs, and the interprocessor communication bandwidth.

In general, the execution time of an application is affected by two components of the underlying architecture: the computing power of the processing nodes/processing elements and the interprocessor communication bandwidth. In the following, we briefly discuss these experiences and also provide motivation for the partitioning and mapping strategies employed in our implementations.

In the CM-5, each processing element (PE) is a SPARC processor operating at 32 MHz. The architecture inherently is an MIMD machine with pseudo synchronization barriers which make this machine operate in SPMD mode. The processing elements have much higher computing power compared with other commercially available SIMD/MIMD machines. The high computing power of each PE and the relatively expensive communication among processors motivate partitioning the data such that the algorithms exhibit less communication among PEs at the cost of redundant computation within each PE. However, a balance between these two is needed to attain speed-ups.
In most of the high level vision tasks, such as object recognition, complex search methods are employed. For example, in the geometric hashing algorithm, computing the winner (model, basis) pair requires finding a maximum over as many as Mn^2 elements, where M = 1024 and n = 16. Similarly, in structural indexing based recognition, the consistency check step involves complex integer arithmetic operations on all the hypotheses and on all the models present in the scene. In order to analyze the raw computing power of the machines, a large array of integer elements was uniformly distributed over all the processors and comparison operations performed on the elements within each processor. Performance of a processing element, in this case, ranges from 0.5 MIPS to 4 MIPS depending on the number of PEs available with the machine. These results are tabulated in Table 5.1.

The floating-point computing power of the CM-5 can be estimated from the computation of the hash function, both in geometric hashing and in structural indexing. It takes 33.3 msec to encode 1K scene points on a single node of the CM-5. One encoding operation consists of 7 to 10 floating-point operations; therefore, on the average a floating-point operation takes 3-4 µsec. These results show that each PE can provide 0.33 MFLOPS using CMMD version 1.3. Vector units are available only if high level languages such as CM Fortran or C* are used for programming. On the other hand, using CM Fortran and C* does not allow front-end users to use the low level macros for interprocessor communication available with CMMD.

The interprocessor communication in the CM-5 is achieved through the data network and through the control network. In order to analyze the effectiveness of these networks for our implementations, several packets of different sizes were routed through the data and control networks.
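The per-FLOP estimate above is simple arithmetic; a quick sketch that reproduces it (taking "1K" as 1024 points, our assumption):

```python
# 33.3 msec to encode 1K scene points, with 7-10 floating-point
# operations per encoding, implies the per-FLOP time quoted above.
total_ms = 33.3
points = 1024
for flops_per_encoding in (7, 10):
    us_per_flop = total_ms * 1000.0 / (points * flops_per_encoding)
    print(f"{flops_per_encoding} FLOPs/encoding -> {us_per_flop:.2f} usec/FLOP")
```

which lands in the 3-5 µsec range, consistent with the 0.33 MFLOPS figure (1 / 0.33 MFLOPS is about 3 µsec per operation).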
The data network is based on the fat tree model [Lei85b]. As indicated in [Cor91], communication through the data network is pipelined. It appears that the pipeline start-up time is approximately 0.30 msec. An approximate formula for the communication time is constructed as follows:

    Communication time = startup time + size of data × data rate    (5.1)

where the startup time and data rate are estimated to be 0.30 msec and 0.5 µsec/integer respectively. The timing results shown in Table 5.2 are measured by sending packets of various sizes from one PE to another PE through the data network.

The CM-5 control network is a complete binary tree with all PEs on the leaves and is used for global computations, such as prefix sum, global maximum, etc.

    Data Size        Computing Time   Computing Rate
    (unit=integer)   (in msec)        (µsec/integer)
    512              0.94             1.83
    1K               1.83             1.83
    2K               3.64             1.82
    4K               7.98             1.99
    8K               15.82            1.97

Table 5.1: Timing analysis for integer computations on a 256 processor CM-5.

    Data Size        Total Comm.     Pipeline Start-Up   Data Transmission   Data Rate
    (unit=integer)   Time (msec)     Time (msec)         Time (msec)         (µsec/integer)
    32               0.34            0.30                0.04                -
    64               0.35            0.30                0.05                -
    128              0.38            0.30                0.08                -
    256              0.43            0.30                0.13                1.0
    512              0.56            0.30                0.26                0.5
    1K               0.80            0.30                0.50                0.5
    2K               1.30            0.30                1.00                0.5

Table 5.2: Timing analysis for point-to-point data communication on a 256 processor CM-5.

    Data Broadcasting Time (in msec)
    Data Size (unit=integer)    32    64    128   256   512   1024  2048  4096
    Machine size = 128 PEs      0.1   0.2   0.4   0.8   1.5   2.9   5.8   12.5
    Machine size = 256 PEs      0.1   0.2   0.4   0.8   1.5   2.9   5.8   12.5
    Data Rate (µsec/integer)    3.1   3.1   3.1   3.1   2.9   2.8   2.7   3.0

Table 5.3: Timing analysis for data broadcasting on the control network.
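Equation 5.1 with the estimated constants can be checked against the measured totals in Table 5.2; a quick sketch (treating 1K and 2K as 1024 and 2048 integers):

```python
def cm5_point_to_point_ms(n_integers, startup_ms=0.30, rate_us_per_int=0.5):
    """Equation 5.1: pipelined data-network transfer time model."""
    return startup_ms + n_integers * rate_us_per_int / 1000.0

# (packet size, measured total from Table 5.2)
for n, measured_ms in [(256, 0.43), (512, 0.56), (1024, 0.80), (2048, 1.30)]:
    predicted_ms = cm5_point_to_point_ms(n)
    print(f"{n:5d} integers: predicted {predicted_ms:.2f} ms, measured {measured_ms:.2f} ms")
```

The linear model matches the measurements to within a few hundredths of a millisecond, confirming that the start-up cost dominates for small packets.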
Table 5.3 shows the timing results for broadcasting data from the control processor to all processing nodes. It appears that there is no pipeline start-up time required in this network. The average data rate is estimated to be 3 µsec/integer, which is 6 times slower than that on the data network. Additional details on the experimental performance of the CM-5 can be found in [PCF92, BRF92].

In the MP-1, a fine grain massively parallel mesh based SIMD machine, each PE is a 4 bit processor. Interprocessor communication is supported through two communication networks: 1) the Xnet for regular communication and 2) the router for random communication. Since the communication pattern in most of the search techniques employed in object recognition is irregular, the router network provides superior performance over the Xnet [Pre93]. It is also observed that the ratio of unit floating-point computation time over unit communication time (through the Xnet or through a contention free router) is approximately 1. Additional floating point processor boards can be attached for fast floating point operations. However, the machine we have access to did not have these boards. This suggests carefully partitioning the data such that both the computation and communication capabilities of the architecture are fully utilized.

    Operation Type                                       Time (in µsec)
    adding integer numbers (short type)                  10
    adding integer numbers (long type)                   13
    adding floating-point numbers (float type)           29
    adding floating-point numbers (double type)          45
    multiplying integer numbers (short type)             11
    multiplying integer numbers (long type)              15
    multiplying floating-point numbers (float type)      37
    multiplying floating-point numbers (double type)     74

Table 5.4: Timing analysis for basic arithmetic operations on a 1K processor MP-1.
    Communication Type                                                     Time (in µsec)
    send 32 bit message to next PE via xnet construct                      17
    send 32 bit message to next PE via router construct (no contention)    32
    send 32 bit message to next PE via router construct (16 contentions)   143

Table 5.5: Timing analysis for basic communication operations on a 1K processor MP-1.

Unfortunately, the communication pattern in most of the recognition algorithms is irregular. Therefore, most of the time, we have to use the router network rather than the Xnet. The router construct has the nice feature that it takes the same amount of time regardless of the distance between the communicating PEs. The router is a 3-stage circuit switched network that contains one path per 16 PEs (i.e., per cluster). Each non-overlapping square matrix of 16 PEs in the PE array is a separate cluster. The control processor can communicate with all PE clusters, but with only one PE per cluster at one time. To access more than one PE in a cluster, accesses to that cluster are serialized. Thus, if every PE in the array tries to access PEs in one cluster, there will be 1,024 contentions on a 1K MP-1. In the ideal case, in a single communication access, there is a minimum of 16 contentions in the router network, provided all the PEs are active.

The performance of a communication access using the router can be observed by noting the value of the MasPar Programming Language (MPL) variable __routerCount. It provides the number of contentions that occurred during a communication access. A comprehensive study of MP-1 communication operations has been conducted in [Pre93].

5.3 Geometric Hashing

First, we present the data mapping and partitioning strategies employed on these machines. We also outline algorithms for computing the hash key, for voting, and for computing the winner pair.
Based on these, experimental results on the CM-5 and MP-1 are reported. Throughout the implementation, we will assume S ≪ Mn^3. For example, a typical scenario is S = 1K, M = 1K, and n = 16. We define a probe as the execution of the recognition phase corresponding to one chosen basis.

Three procedures, Compute_Keys(), Vote(), and Compute_Winner(), shown in Figures 5.4, 5.5, and 5.6 respectively, correspond to the steps described in the Parallel_Probe() procedure shown in Figure 3.12. The Compute_Keys() procedure computes the transformed coordinates of the scene points and quantizes them according to a hash function f(). The transformed and quantized coordinates are stored in NEWFP[]. We use the same hash function as in [RH92]. This hash function distributes the data uniformly over all the hash bins. The transformed coordinates are then used as keys to access the data in the hash table. The Vote() procedure routes the keys to their corresponding hash locations stored in PE g(key). The function g() defines the mapping of the hash table entries onto the processor array. The locations in the hash table accessed during voting are stored in the CANDID[] array. This array is used in computing the final winner. The size of this array is much smaller than the size of the hash table stored in each PE.

Compute_Keys()
    In parallel for all PE_i, 1 ≤ i ≤ P, do
        for r = 1 to S/P
            - Compute the transformed coordinate of the feature point FP[r] relative to the basis, and store it in SNEW[r].
            - Quantize SNEW[r] using hash function f().
        next r
    parallel-end
end

Figure 5.4: A parallel procedure to compute the coordinates of feature points of the scene.

Next, the Compute_Winner() procedure determines the (model, basis) pair receiving the maximum number of votes. The winning pair is then sent to the control processor to perform the final verification.
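To make the key computation of Figure 5.4 concrete, the following is a serial sketch for rigid transformations. The basis frame convention (midpoint origin, x-axis along the basis pair) and the grid quantization are our illustrative stand-ins; the actual implementation uses the hash function of [RH92]:

```python
import math

def basis_coords(p, b1, b2):
    """Coordinates of point p in the frame of basis pair (b1, b2):
    origin at the midpoint, x-axis along b2 - b1 (rigid-invariant)."""
    ox, oy = (b1[0] + b2[0]) / 2.0, (b1[1] + b2[1]) / 2.0
    ax, ay = b2[0] - b1[0], b2[1] - b1[1]
    norm = math.hypot(ax, ay)
    ux, uy = ax / norm, ay / norm      # unit vector along the basis
    dx, dy = p[0] - ox, p[1] - oy
    # project onto the basis axis and its perpendicular
    return (dx * ux + dy * uy, -dx * uy + dy * ux)

def quantize(uv, bin_size=0.25):
    """Illustrative stand-in for f(): snap to a grid cell."""
    return (round(uv[0] / bin_size), round(uv[1] / bin_size))

scene = [(1.0, 1.0), (2.0, 3.0), (4.0, 0.5)]
b1, b2 = scene[0], scene[1]
keys = [quantize(basis_coords(p, b1, b2)) for p in scene]
```

The keys are invariant under rotation and translation of the whole scene, which is exactly why votes from a rigidly transformed instance of a model accumulate in the same bins.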
Vote(FPNEW, P)
    In parallel, each PE_i, 0 ≤ i ≤ P − 1, do
        k := 0;
        for j = 1 to S/P
            - Send a vote (additive write) to the processor and location specified by g(NEWFP[j]).
            - If a vote is received, copy the contents of the corresponding hash-table entry into the array CANDID[k++].
        next j
    parallel-end
end

Figure 5.5: A parallel procedure to vote for the possible presence of a model in the scene.

Compute_Winner()
    /* Each PE is assigned distinct model-basis pairs to compute the number of votes cast for them. */
    In parallel, in each PE_i, 0 ≤ i ≤ P − 1,
        Send every element of the CANDID[] array to the PE assigned for computing the total number of votes for that element.
    In parallel, in each PE_i, 0 ≤ i ≤ P − 1,
        Count the total number of votes for each distinct (model, basis) pair received and store it in VCOUNT[model, basis].
    In parallel, in each PE_i, 0 ≤ i ≤ P − 1,
        Compute the local maximum of the VCOUNT[] array and store it in local_max.
    Compute the maximum of local_max over the entire processor array.
    /* The (model, basis) pair with the maximum number of votes is the matched model in the scene. */
end

Figure 5.6: A parallel procedure to compute the winning (model, basis) pair.

5.3.1 Partitioning and Mapping

Based on the above described subtasks of the Parallel_Probe() procedure and the experiments carried out on the target machines, several data mapping and partitioning strategies are developed, which affect the overall execution time of the recognition phase. In the following we present four algorithms which we have experimented with. These algorithms differ with respect to the partitioning and mapping of the hash table onto the processor array. Various strategies are employed to take into account practical considerations, such as the available memory in each PE, processor speed, and I/O speed of the machines.
In Algorithm A and Algorithm B, we assume that each processor is assigned distinct hash table locations. In Algorithm C, each sub-array of processors is assigned a complete copy of the hash table. Each processor in a sub-array of size s^2, where 1 ≤ s ≤ √P, has distinct entries of the hash table. The case of a large number of processors is considered next. Algorithm D performs concurrent processing of multiple probes of the recognition phase. The array is divided into disjoint sets of S processors. Each set of PEs processes a probe using a basis (a different basis for each set).

Algorithm A:
• Execute the Compute_Keys() procedure serially in the control processor and broadcast the encoded points (keys) to each processor in the processor array.
• Each processor scans through all the keys and accumulates votes for the keys which correspond to hash table locations stored in its local memory.
• Execute the Compute_Winner() procedure.

Algorithm B:
• The control processor broadcasts S/P scene points to each processor along with a basis pair.
• Execute the Compute_Keys() procedure.
• Execute the Sort_Keys() procedure.
• Execute the Vote() procedure.
• Execute the Compute_Winner() procedure.

Algorithm C: In this algorithm, we assume multiple copies of the hash table are stored in the processor array.
• The control processor broadcasts S/P scene points to each processor along with a basis pair.
• Execute the Compute_Keys() procedure.
• Execute the Sort_Keys() procedure.
• Execute the Vote() procedure such that the data search is bounded within its sub-array.
• Execute the Compute_Winner() procedure.

Algorithm D: In this algorithm, we assume that the number of PEs is larger than the number of feature points in the scene.
• The control processor broadcasts S/P scene points to each PE.
• The control processor broadcasts a basis pair to all PEs in each subarray of size S.
• Execute the Compute_Keys() procedure.
• Execute the Sort_Keys() procedure.
• Execute the Vote() procedure.
• Execute the Compute_Winner() procedure.

5.3.2 Experiments and Summary of Performance Results

We have used a synthesized model database containing 1024 models, each model consisting of 16 randomly generated points. These points are generated according to a Gaussian distribution of zero mean and unit standard deviation. The models are allowed to undergo only rigid transformation. However, results from other transformations do not affect the performance of the parallel algorithm. Similarly, scene points are synthesized using a normal distribution. We apply the equalization techniques given in [RH91] to the transformed coordinates, i.e., for each transformed point (u, v) the following hash function is applied:

    f(u, v) = (1 − e^(−3σ^2), atan2(v, u))

The above hash function uniformly distributes the data over the hash space such that the average hash bin length is constant. We assume a database of 1024 models with 16 points in each model. This gives a hash table size of 4M entries. Each entry may consist of several (model, basis) pairs. We have experimented with various data granularities in the hash table, comprising average bin lengths of 1, 4, 8, 16 and 32. These granularities can be chosen according to the local memory available within each PE. We have executed these algorithms on various sizes of CM-5 and MP-1, both in terms of the number of PEs in the array and the local memory available within each PE. Contrary to the results reported in [RH91], we claim that for a single probe of the recognition phase, machine sizes larger than S would deteriorate the time performance. This is due to the fact that interconnection networks with larger diameter take more time to perform global operations.
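Setting the parallel distribution issues aside, the vote-and-winner logic of Figures 5.5 and 5.6 reduces to the following serial sketch (the table contents here are toy values, not the real database):

```python
from collections import defaultdict

def probe(hash_table, scene_keys):
    """One recognition probe: every scene key votes for each
    (model, basis) entry in its hash bin; the pair with the most
    votes wins and would then go to final verification."""
    votes = defaultdict(int)
    for key in scene_keys:
        for model_basis in hash_table.get(key, ()):
            votes[model_basis] += 1
    if not votes:
        return None
    return max(votes.items(), key=lambda kv: kv[1])

toy_table = {
    (0, 1): [("m1", "b0"), ("m2", "b3")],
    (2, 2): [("m1", "b0")],
    (5, 1): [("m2", "b3")],
}
winner = probe(toy_table, [(0, 1), (2, 2), (9, 9)])
# winner -> (("m1", "b0"), 2): two of the three scene keys support m1/b0
```

In the parallel algorithms above, the inner loop becomes the routed votes of Vote(), and the final max becomes the local/global maximum steps of Compute_Winner().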
We also show, in Algorithm D, that larger size machines can be used for concurrent processing of multiple probes of the recognition phase. In the following, we tabulate our results for partitioning algorithms A, B, C, and D. Raw timing data are included in Appendix A. Table 5.6 presents execution times of various subtasks using the partitioning algorithms described in the previous section.

    Partitioning  Machine  Encoding      Hash Bin  Voting  Computing Local  Computing Global  Total
    Algorithm     Type     Scene Points  Access            Max of Votes     Max of Votes      Time
    A             CM-5     33.3          5.45      1.04    1.83             0.29              41.45
    B             CM-5     0.33          1.96      2.27    1.83             0.29              6.68
    C             CM-5     0.33          1.33      1.96    1.83             0.29              5.74
    B             MP-1     2.51          18.53     24.10   3.36             0.055             48.76
    C             MP-1     2.51          6.78      20.02   3.36             0.055             32.72

Table 5.6: Execution times (in msec) of various subtasks in a probe using different partitioning algorithms for a scene consisting of 1024 feature points on a 256 processor CM-5 and a 1K MP-1.

Algorithm A addresses the congestion and contention problem by computing the keys in the control processor/array control unit at the cost of redundant processing in each processor in the array. As shown in Table 5.6, for Algorithm A, the computation time in the control processor becomes the dominating factor in the overall execution time. We did not execute Algorithm A on the MP-1 because of the large computation time and insufficient memory available within the control unit. We were unable to execute Algorithm B on various sizes of MP-1, as the smallest size MP-1 consists of 1K processors. The performance of Algorithms A, B, and C on various sizes of CM-5 and MP-1 is shown in Table 5.7.

Larger MP-1s have been used for concurrent processing of multiple probes (Algorithm D) and the results are reported in Table 5.8. Several interesting observations on the interplay between various components of the MP-1 architecture can be made from this table.
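The trade-off in Table 5.8 can be summarized as throughput: a batch of concurrent probes takes longer, but probes completed per second still grow. A quick computation from the table's totals:

```python
# (concurrent probes, total batch time in msec) from Table 5.8
runs = [(1, 48.76), (2, 64.59), (4, 101.20), (8, 115.62)]
for probes, total_ms in runs:
    throughput = probes / (total_ms / 1000.0)   # probes per second
    print(f"{probes} probe(s) on a {probes}K MP-1: {throughput:.1f} probes/sec")
```

Going from one probe on a 1K MP-1 to eight concurrent probes on an 8K MP-1 raises throughput from roughly 20 to roughly 69 probes per second, even though each batch takes longer.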
As the number of PEs increases with the number of concurrent probes, it affects various components of the execution time. For example, a larger machine size means a larger diameter, implying more time for global operations. On the other hand, a larger machine size reduces the load on each processor, hence less time is spent on local operations.

Figures 5.7 and 5.9 show the performance of Algorithms B and C. The bin access time and voting time reduce linearly as the number of copies of the hash table in the processor array increases. The hash bin access time refers to the time taken to access hash bins corresponding to feature points in the scene. The voting time corresponds to routing the information within each voted bin, the (model, basis) pairs, to compute the local maximum for each pair.

    Machine Size/Type   Algorithm A (in msec)   Algorithm B (in msec)   Algorithm C (in msec)
    32/CM-5             78.07                   35.38                   25.48
    64/CM-5             61.04                   19.75                   16.02
    128/CM-5            50.02                   10.77                   8.96
    256/CM-5            41.45                   6.68                    5.74
    512/CM-5            51.74                   4.41                    3.8
    1K/MP-1             XX                      48.76                   32.72

Table 5.7: Execution times (in msec) of various algorithms on a scene consisting of 1024 feature points.

    Number of  Machine  Encoding      Hash Bin  Voting  Computing Local  Computing Global  Total
    Probes     Size     Scene Points  Access            Max of Votes     Max of Votes      Time
    1          1K       2.51          18.53     24.10   3.36             0.055             48.76
    2          2K       2.51          23.16     30.60   8.16             0.165             64.59
    4          4K       2.51          40.36     49.86   8.16             0.314             101.20
    8          8K       2.51          54.62     49.72   8.16             0.608             115.62

Table 5.8: Execution times (in msec) of Algorithm C on a scene consisting of 1024 feature points with concurrent processing of multiple probes on various sizes of MP-1. Average bin size is 8.

In Figs. 5.8 and 5.10, we simulate worst-case and semiworst-case scenarios. For the worst-case, we assume that all the scene points hash to locations stored in
a single PE, and in the semiworst-case, all the keys hash to locations stored in a small subset of PEs. The results show the performance of the various partitioning strategies adopted in Algorithms B and C. In the case of hash bin access, the access time decreases linearly with the increase in the number of hash table copies resident in the processor array. On the other hand, beyond a certain number of copies of the hash table, the voting time starts increasing (see Fig. 5.10). This is due to the increased network traffic generated by a larger number of copies.

In Table 5.9, we compare our results with those reported in [BM88, RH91]. We assume no hash table folding, symmetries, or partial histogramming on the hash table data. Our serial implementation shows that one probe of the recognition phase takes about 13.4 seconds on a SUN SPARC2 operating at 25 MHz with 32 Mbytes of on board RAM.

    Methods                 # of Models         Machine     # of Scene  Total Time
                            (16 points/model)   Size/Type   Points
    Our Method (A)          1024                256/CM-5    1024        42.76 msec
    Our Method (B)          1024                256/CM-5    1024        6.68 msec
    Our Method (C)          1024                256/CM-5    1024        5.74 msec
    Our Method (B)          1024                1K/MP-1     1024        53.37 msec
    Our Method (C)          1024                1K/MP-1     1024        38.89 msec
    Our Method (B)          1024                1K/MP-1     200         49.50 msec
    Our Method (B)          1024                256/CM-5    200         4.50 msec
    Hummel et al. [RH91]    1024                8K/CM-2     200         800 msec
    Medioni et al. [BM88]   X                   8K/CM-2     X           2.0-3.0 sec

Table 5.9: Comparison with previous implementations. (Each processor in the CM-5, CM-2, and MP-1 operates at 32 MHz, 7 MHz, and 12.5 MHz respectively.)

5.4 Structural Indexing

In structural indexing, the number of super segments in a model ranges from 10's to 100's.
Therefore the size of the model database for an average of 1024 models would be in the range of 10K to 100K entries, which is much smaller than the size of the hash table for the same number of models in geometric hashing. This gives an advantage to the structural indexing technique over geometric hashing in the amount of memory required per PE to store the hash table. Also, it indicates fewer potential congestion problems while accessing the hash table bins. Throughout the implementation, we will assume S < Ms, where S is the number of super segments in the scene, M is the number of models in the database, and s is the number of super segments used to represent a model. For example, a typical scenario is S = 1K, M = 1K, and s = 10.

Figure 5.7: Hash bin access time vs number of hash table copies for Algorithm B and Algorithm C on a 512 PE CM-5.

In order to analyze the time taken by one probe of the recognition phase on the MP-1 and CM-5, we define the following terms:

    t_c    = time to compute a code and a hash table key for a given super segment in each PE.
    t_r    = time to route the P data elements to their corresponding locations in the processor array.
    t_ctab = time to construct the correspondence table of models and hypotheses.
    t_ch   = time to check the consistency of two hypotheses inside a PE.
    l_b    = maximum length of a hash table bin in any PE.

It is easy to show that the time for the recognition phase on a P processor machine can be expressed as:
    Computation time   = (S/P × t_c) + (S × (S/P) × t_ch)
    Communication time = (S/P × t_ctab) + ((S/P) × l_b × t_r)
    Total time         = Computation time + Communication time

The recognition phase on a serial machine takes O(S × l_b) time. Note that t_r, t_ctab, and t_ch depend on the diameter of the interconnection network and the routing strategy employed. The initialization costs, such as loading the scene points to the processor array, loading hash table locations to the processor array, and initialization of memory locations used inside each PE, are not considered in computing the total time.

Figure 5.8: Hash bin access time vs number of hash table copies for Algorithm B and Algorithm C simulating the worst-case scenario on a 512 PE CM-5.

Figure 5.9: Voting time vs number of hash table copies for Algorithm B and Algorithm C on a 512 PE CM-5.

5.4.1 Partitioning and Mapping

As the data structures in structural indexing, i.e. tables, are similar to those in geometric hashing, the various partitioning and mapping strategies developed in Section 5.3.1 are applied here to implement the recognition phase on the MP-1 and CM-5.

5.4.2 Experiments and Summary of Performance Results

We assume a model database containing 1024 models, each model consisting of 50 super segments. This gives a hash table size of 50K entries. Each entry may consist of several (model, super segment) pairs. Each super segment is assumed to be of cardinality 5, i.e. it consists of 5 segments. The models are allowed to undergo only rigid transformation. However, results from other transformations do not affect the performance of the parallel algorithm. Similarly, a scene is assumed to have 1K super segments.
It is also assumed that for each super segment in the scene, the angles between the segments of the super segment and its eccentricity value are available. We assume a hash function f() which maps the data uniformly over the entire hash space and provides constant bin length entries in the hash table. The average number of floating point operations executed on each super segment to compute the hash table key is O(cardinality). The time to check if two hypotheses are consistent depends upon the distances, angles, and orientations of model super segments and scene super segments [Ste92]. We assume that on the average 30 floating point operations are carried out for each consistency check.

Figure 5.10: Voting time vs. number of hash table copies for Algorithm B and Algorithm C simulating a worst-case scenario on a 512 PE CM-5.

As shown in Figure 3.27 in Chapter 3, the data to check consistency between various hypotheses has been nicely partitioned onto adjacent PEs. Therefore, for implementations on MP-1 the Xnet macros are used to move data to the destination PEs. On CM-5, we use the point-to-point communication macros available with CMMD to route the data.

Figure 5.11: Total execution time vs. machine size for various algorithms.

Based on the experiments conducted on the MP-1 and CM-5 (Section 5.2)
and also the results of actual implementations carried out for geometric hashing, extrapolated execution times of various partitioning algorithms on a 256-processor CM-5 and a 1024-processor MP-1 are presented. Apparently, the execution time taken by the recognition phase in structural indexing is larger than that of one probe of the geometric hashing recognition phase. However, in geometric hashing, in the worst case it may become necessary to execute O(S²) probes.

Partitioning Algorithm | Machine Type | Encoding Scene SS | Hash Bin Access | Constructing Corres. Table | Checking Consistency | Total Time
A | CM-5 | 15.36 | 5.45  | 7.8   | 23.02 | 51.63
B | CM-5 | 0.15  | 1.96  | 7.8   | 23.02 | 32.93
B | MP-1 | 1.25  | 18.53 | 16.83 | 60.32 | 96.93

Table 5.10: Execution times (in msec) of various subtasks in the recognition phase using different partitioning algorithms for a scene consisting of 1024 super segments on a 256 PE CM-5 and a 1K PE MP-1.

Machine Size/Type | Algorithm A (in msec) | Algorithm B (in msec)
32/CM-5  | 84.10 | 60.23
64/CM-5  | 73.04 | 53.18
128/CM-5 | 61.00 | 41.10
256/CM-5 | 51.63 | 32.93
512/CM-5 | 31.30 | 22.09
1K/MP-1  | 96.93 | XX

Table 5.11: Execution times (in msec) of various algorithms on a scene consisting of 1024 super segments.

Method | Machine Size/Type | Total Time
Geometric Hashing (A)   | 256/CM-5 | 42.76 msec
Geometric Hashing (B)   | 256/CM-5 | 6.68 msec
Structural Indexing (A) | 256/CM-5 | 51.63 msec
Structural Indexing (B) | 256/CM-5 | 31.30 msec
Geometric Hashing (B)   | 1K/MP-1  | 53.37 msec
Structural Indexing (B) | 1K/MP-1  | 96.93 msec

Table 5.12: Comparison with geometric hashing implementations.

5.5 Scalability of Implementations

Currently, several notions of scalability are available in the literature to analyze the performance of parallel algorithms [GK91, LP93, Hwa93]. However, most of these approaches do not discuss scalability in terms of implementation.
We define an implementation to be scalable if the same code can be used on various machine sizes and achieves proportional speed-ups. In this chapter, we have developed a library of scalable programs for object recognition. The various modules that have been developed can be executed on various sizes of MP-1 and CM-5 without modification. From an implementation point of view, the time for one probe of the recognition phase on a P-processor machine can be expressed as:

Computation time = (S/P) × tc + tlmax
Communication time = (S/P) × tr + ((S/P) × lb) × tr + tgmax
Total time = Computation time + Communication time

where:

tc = time to compute a transformed coordinate and a hash table key for a given feature point in each PE.
tr = time to route and vote P hash keys to their corresponding locations in the hash table.
tlmax = time to find the local maximum of data within each PE.
tgmax = data routing time to find the global maximum over the entire processor array.
lb = maximum length of a hash table bin in any PE.

The components of the above metrics scale linearly with the machine size. The terms tc, tr, and tlmax depend upon the load on each processor. We have devised techniques which ensure uniform distribution of data among all the processors available in a machine. Similarly, the time to perform the global maximum, tgmax, depends upon the machine size. Also, for machines larger than the problem size, we have used several instantiations of the same code for concurrent processing of multiple probes. As shown in the results presented in this chapter, each module of our implementation scales linearly with the machine size.

5.6 Comparison: MP-1 vs CM-5

During the implementation of the geometric hashing algorithm, we experimented with various architectural and programming aspects of MP-1 and CM-5.
Usually, SIMD machines employ fine-grained massive parallelism and computationally less powerful processing elements. On MP-1, we could access machines with up to 16K processors; however, each processor has a 4-bit ALU. It takes 2.51 msec to encode a scene point, where the encoding process comprises approximately 7 floating point operations and 5 integer arithmetic operations. On the other hand, SPMD (synchronized MIMD) machines employ coarse-grain parallelism with powerful processors as processing nodes; on CM-5 it takes 0.0825 msec to encode a scene point. However, communication-intensive subtasks perform poorly on CM-5. For example, computing the maximum of data elements, one element per processor, takes 0.32 msec on a 512-processor CM-5, while it takes 0.055 msec on a 1024-processor MP-1. If the communication pattern is irregular, such as in the voting process, the performance of SIMD machines degrades drastically. Such computation and communication characteristics suggest the use of SIMD and SPMD machines in applications with varied characteristics. Traditionally, SIMD machines are known to be well-suited for low-level vision operations. However, MP-1, with its additional router network, is also attractive for applications with moderate computational needs and regular global communication patterns; several heuristic techniques in high-level vision fall into this category. SPMD machines, such as CM-5, are suitable for applications with high computational needs and moderate global communication requirements. In addition, in the absence of efficient data partitioning and routing techniques, the performance of such machines degrades for applications with local neighborhood communication requirements. The memory available with each processor also affects the usage of the underlying architecture.
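The per-operation timings quoted above can be combined into a rough back-of-the-envelope comparison. The timing constants below are taken from the text (encode: 2.51 msec on MP-1 vs. 0.0825 msec on CM-5; global maximum: 0.055 msec on MP-1 vs. 0.32 msec on CM-5), but the workload mixes are hypothetical illustrations, not experiments from this dissertation.

```python
# Rough comparison of MP-1 and CM-5 using the per-operation times
# quoted in the text. The workload mixes below are invented examples.

TIMINGS_MS = {
    "MP-1": {"encode": 2.51,   "global_max": 0.055},
    "CM-5": {"encode": 0.0825, "global_max": 0.32},
}

def workload_time(machine, n_encodes, n_maxima):
    """Total time (msec) for a mix of encoding and global-maximum steps."""
    t = TIMINGS_MS[machine]
    return n_encodes * t["encode"] + n_maxima * t["global_max"]

# A computation-heavy mix favors the CM-5 ...
print(workload_time("CM-5", n_encodes=100, n_maxima=10))  # ≈ 11.45 msec
print(workload_time("MP-1", n_encodes=100, n_maxima=10))  # ≈ 251.55 msec
# ... while a purely communication-bound mix favors the MP-1.
print(workload_time("CM-5", n_encodes=0, n_maxima=100))   # ≈ 32.0 msec
print(workload_time("MP-1", n_encodes=0, n_maxima=100))   # ≈ 5.5 msec
```

This mirrors the qualitative conclusion of the section: which machine wins depends on the ratio of computation to global communication in the task.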
Traditionally, due to limitations of VLSI and cost considerations, the memory available within each processor is relatively smaller in SIMD machines than in SPMD machines. Limited memory can affect the performance of applications which require storage of a large volume of on-line data.

Part III

The Surmise

Chapter 6

Conclusions and Future Work

Parallelism for object recognition, and for image understanding in general, has been the theme of the research presented in this dissertation. The dissertation has addressed the design and analysis of fast parallel solutions for image understanding tasks. Fast and scalable data parallel algorithms and implementations for a set of well-known object recognition techniques are developed. It is a widely accepted fact that parallel processing is the solution to obtaining real-time response for vision problems. However, due to the wide variety of sequential techniques and their computational characteristics, a brute-force parallelization may lead to no speed-up at all. Complex data structures employed in the solutions, the symbolic nature of the computations, and irregular data-dependent communication patterns need efficient data partitioning and mapping strategies. In this chapter, the contributions of the dissertation are identified and some plausible directions for future research are outlined.

6.1 Contribution of this Dissertation

The contributions of this dissertation are twofold: the design of fast parallel solutions for object recognition, and their well-studied parallel implementations on state-of-the-art commercially available parallel machines. In particular, the problem of 2-dimensional object recognition is studied and scalable data parallel algorithms are designed. The proposed parallel techniques are further extended to devise processor-time optimal parallel solutions for other image understanding tasks.
The implementations of several proposed algorithms are carried out on the Connection Machine CM-5 and on the MasPar MP-1. Section 1.2 provides a detailed summary of the research contributions made in this dissertation.

6.2 Directions for Future Research

Parallelism for image understanding tasks has gained significant interest in the recent past. We have addressed key problems in parallelizing several well-known techniques. A plethora of high-level vision algorithms is available which needs to be studied from a parallel processing perspective. Our work can act as a first step in this direction. Although the parallel techniques developed in this dissertation can be extended to derive parallel solutions for other image understanding tasks, a thorough study is required to understand the inherent parallelism in these tasks. Various future research avenues emerge from this dissertation:

• Sequential techniques for tasks in low-level vision offer a common framework for the design of parallel solutions for these tasks: parallel techniques designed to execute one task can be easily extended to execute another. The question is, is it possible to find such a common framework for parallel solutions for high-level vision (image understanding) tasks? At this point in time, with the diversity of sequential techniques available in this area, it is difficult to come up with an affirmative answer. However, with further detailed investigation of the sequential techniques, it may be possible to design libraries of scalable algorithms for a set of generic techniques, together with efficient data routing and mapping algorithms. These libraries can be used to develop parallel solutions for new as well as several existing approaches.

• So far the study of parallelism for vision has investigated parallel solutions for individual tasks.
For an integrated vision system, providing a parallel solution to the complete system requires integrating parallel solutions for the individual components. This integration does not seem easy, though. It requires the study of the overheads involved in combining various solutions, and fine-tuning the existing parallel algorithms such that the overheads incurred due to data conversion between tasks are minimized.

• From a computational point of view, several types of parallel computations are embedded in the components of an integrated vision system. Due to this heterogeneity in embedded parallelism, heterogeneous parallel architectures are becoming a natural choice for solving such tasks. The proposed architectures for vision consist of several levels; typically, the lower levels operate in SIMD mode and the higher levels operate in MIMD mode. Recently, a new computational paradigm has been introduced to cope with the heterogeneous computational requirements of most integrated vision tasks [KSPW92]. This paradigm defines an environment, called the Heterogeneous SuperComputing (HSC) environment, comprising several autonomous parallel architectures integrated using a high-bandwidth, low-latency network. The use of such a paradigm for real-time solutions to vision problems offers many challenges. The issues which need to be studied include HSC-based image understanding environments, mapping and partitioning of parallel algorithms onto the various parallel machines present in HSC, general purpose machine-independent program libraries, modification of existing parallel solutions to suit the HSC paradigm, and the design and development of new solutions. In addition, the HSC paradigm is asynchronous in nature; the asynchrony stems from the autonomous nature of the individual machines employed.
Therefore, for efficient utilization of such a paradigm, asynchronous parallel algorithms for vision tasks need to be developed.

• Study of parallelism for graphics processing is another possible extension of the work presented in this dissertation. Computationally, graphics processing is similar to computer vision, but theoretically it is the opposite. In computer vision a raw image is given and descriptions are the desired output; in graphics processing, descriptors are given and an image defined by the descriptors is desired. The processing starts from high-level descriptors, and the embedded parallelism increases from one level to the next. The various partitioning, mapping, and routing techniques developed for vision are focused on combining information towards the end of processing, whereas in graphics processing the information spreads towards the end of the processing. A study of the parallel techniques developed in this dissertation is needed to investigate their utilization in graphics processing.

Appendix A

Additional Performance Results

For readers interested in additional details of the implementation results, we append several tables comprising data on execution times of various partitioning algorithms on various machine sizes. We have experimented with various data granularities, comprising average bin lengths of 1, 2, 4, 8, 16, and 32. These granularities can be chosen according to the local memory available within each PE. The first column in these tables depicts the data granularity chosen for the hash table; the second column represents the average bin length corresponding to that data granularity. We have used a synthesized model database, containing 1024 models, each model consisting of 16 randomly generated points. These points are generated according to a Gaussian distribution of zero mean and unit standard deviation.
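The synthetic database described above can be reproduced with a short sketch. The function name, the seed, and the use of Python's standard random generator are assumptions made for illustration; only the database shape (1024 models of 16 Gaussian 2-D points) comes from the text.

```python
import random

def make_model_database(n_models=1024, points_per_model=16, seed=0):
    """Synthesize a model database: each model is a list of 2-D points
    drawn from a Gaussian with zero mean and unit standard deviation."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return [
        [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
         for _ in range(points_per_model)]
        for _ in range(n_models)
    ]

db = make_model_database()
print(len(db), len(db[0]))  # 1024 models of 16 points each
```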
The models are allowed to undergo only rigid transformation; however, results from other transformations do not affect the performance of the parallel algorithm. Similarly, scene points are synthesized using a normal distribution. Unless specified, assume a scene consisting of 1024 feature points.

Hash Table Size | Avg Bin Length | Encoding inside CP | Distribute Scene Points | Hash Bin Access | Voting | Max of Votes | Total Time
4M/1  | 1  | 33.3 | 3.16 | 1.85 | 1.04  | 2.10 | 41.45
4M/2  | 2  | 33.3 | 3.16 | 1.85 | 1.62  | 2.11 | 42.04
4M/4  | 4  | 33.3 | 3.16 | 1.85 | 2.78  | 2.11 | 43.20
4M/8  | 8  | 33.3 | 3.16 | 1.85 | 5.72  | 2.11 | 46.14
4M/16 | 16 | 33.3 | 3.16 | 1.85 | 9.37  | 2.11 | 49.79
4M/32 | 32 | 33.3 | 3.16 | 1.85 | 17.39 | 2.11 | 57.81

Table A.1: Execution times (in msec) of Algorithm A on a 256 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 1.56 | 7.90 | 9.79   | 15.84 | 0.29 | 35.38
4M/2  | 2  | 1.60 | 7.42 | 13.26  | 15.84 | 0.29 | 38.41
4M/4  | 4  | 1.59 | 7.07 | 22.26  | 15.85 | 0.29 | 47.06
4M/8  | 8  | 1.63 | 6.85 | 45.71  | 15.83 | 0.29 | 70.31
4M/16 | 16 | 1.66 | 6.42 | 83.19  | 15.82 | 0.29 | 107.38
4M/32 | 32 | 1.66 | 6.07 | 155.93 | 15.75 | 0.29 | 179.70

Table A.2: Execution times (in msec) of Algorithm B on a 32 processor CM-5.
Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 0.82 | 4.57 | 6.09  | 7.98 | 0.29 | 19.75
4M/2  | 2  | 0.86 | 4.76 | 7.67  | 7.98 | 0.29 | 21.56
4M/4  | 4  | 0.87 | 4.14 | 12.55 | 7.97 | 0.29 | 25.82
4M/8  | 8  | 0.85 | 4.24 | 27.68 | 7.98 | 0.29 | 41.04
4M/16 | 16 | 0.86 | 3.78 | 49.27 | 7.97 | 0.29 | 62.17
4M/32 | 32 | 0.86 | 3.49 | 86.07 | 7.96 | 0.29 | 98.67

Table A.3: Execution times (in msec) of Algorithm B on a 64 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 0.50 | 3.02 | 3.32  | 3.64 | 0.29 | 10.77
4M/2  | 2  | 0.52 | 2.67 | 4.68  | 3.64 | 0.29 | 11.80
4M/4  | 4  | 0.52 | 2.57 | 8.45  | 3.65 | 0.29 | 15.48
4M/8  | 8  | 0.52 | 2.50 | 16.02 | 3.64 | 0.29 | 22.97
4M/16 | 16 | 0.53 | 2.31 | 26.24 | 3.63 | 0.29 | 33.00

Table A.4: Execution times (in msec) of Algorithm B on a 128 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 0.33 | 1.96 | 2.27  | 1.83 | 0.29 | 6.68
4M/2  | 2  | 0.35 | 1.78 | 2.98  | 1.82 | 0.29 | 7.22
4M/4  | 4  | 0.34 | 1.62 | 5.14  | 1.83 | 0.29 | 9.22
4M/8  | 8  | 0.34 | 1.78 | 8.47  | 1.83 | 0.29 | 12.71
4M/16 | 16 | 0.34 | 1.78 | 15.88 | 1.83 | 0.29 | 20.12
4M/32 | 32 | 0.34 | 1.32 | 29.00 | 1.83 | 0.29 | 32.78

Table A.5: Execution times (in msec) of Algorithm B on a 256 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 0.25 | 1.25 | 1.69  | 0.93 | 0.29 | 4.41
4M/2  | 2  | 0.25 | 1.14 | 1.87  | 0.94 | 0.29 | 4.49
4M/4  | 4  | 0.25 | 1.10 | 3.37  | 0.93 | 0.29 | 5.94
4M/8  | 8  | 0.25 | 1.06 | 5.52  | 0.93 | 0.29 | 8.05
4M/16 | 16 | 0.25 | 0.98 | 10.74 | 0.94 | 0.30 | 13.21
4M/32 | 32 | 0.25 | 1.02 | 20.70 | 0.93 | 0.27 | 23.17

Table A.6: Execution times (in msec) of Algorithm B on a 512 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 0.22 | 1.14 | 1.04  | 1.82 | 0.28 | 4.50
4M/2  | 2  | 0.22 | 0.92 | 1.62  | 1.82 | 0.29 | 4.87
4M/4  | 4  | 0.22 | 0.85 | 2.78  | 1.82 | 0.29 | 5.96
4M/8  | 8  | 0.22 | 0.90 | 5.72  | 1.82 | 0.29 | 8.98
4M/16 | 16 | 0.22 | 0.82 | 9.37  | 1.82 | 0.29 | 12.52
4M/32 | 32 | 0.22 | 0.86 | 17.39 | 1.82 | 0.29 | 20.58

Table A.7: Execution times (in msec) of Algorithm B corresponding to various data granularities of the hash table, on a scene consisting of 256 feature points, on a 256 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 2.51 | 18.40 | 3.14  | 8.17 | 0.055 | 32.3
4M/2  | 2  | 2.51 | 18.45 | 6.10  | 8.17 | 0.055 | 35.30
4M/4  | 4  | 2.51 | 27.65 | 12.18 | 8.17 | 0.055 | 50.58
4M/8  | 8  | 2.52 | 18.53 | 24.10 | 8.17 | 0.055 | 53.37
4M/16 | 16 | 2.51 | 18.57 | 47.92 | 8.17 | 0.055 | 77.23
4M/32 | 32 | 2.51 | 18.50 | 95.89 | 8.17 | 0.055 | 125.13

Table A.8: Execution times (in msec) of Algorithm A corresponding to various data granularities of the hash table, on a scene consisting of 1024 feature points, on a 1K processor MP-1.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1  | 1  | 2.54 | 20.84 | 3.12  | 2.072 | 0.054 | 28.62
4M/2  | 2  | 2.54 | 20.84 | 6.08  | 2.072 | 0.054 | 31.58
4M/4  | 4  | 2.54 | 20.84 | 12.04 | 2.072 | 0.054 | 37.54
4M/8  | 8  | 2.54 | 20.84 | 23.99 | 2.072 | 0.054 | 49.50
4M/16 | 16 | 2.54 | 20.84 | 47.63 | 2.072 | 0.054 | 73.36
4M/32 | 32 | 2.54 | 20.84 | 95.41 | 2.072 | 0.054 | 120.91

Table A.9: Execution times (in msec) of Algorithm A corresponding to various data granularities of the hash table, on a scene consisting of 256 feature points, on a 1K processor MP-1.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1 | 1 | 0.25 | 6.75 | 68.84 | 0.93 | 0.29 | 70.42
4M/2 | 2 | 0.25 | 7.15 | 82.20 | 0.94 | 0.29 | 90.83
4M/4 | 4 | 0.25 | 6.66 | 84.24 | 0.93 | 0.29 | 92.37
4M/8 | 8 | 0.25 | 6.73 | 99.81 | 0.93 | 0.29 | 108.01

Table A.10: Execution times (in msec) of Algorithm C on worst-case data with 16 copies of the hash table on a 512 processor CM-5.
Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1 | 1 | 0.25 | 13.13 | 31.70 | 0.93 | 0.29 | 46.30
4M/2 | 2 | 0.25 | 13.15 | 34.66 | 0.94 | 0.29 | 49.29
4M/4 | 4 | 0.25 | 12.87 | 57.30 | 0.93 | 0.29 | 71.64
4M/8 | 8 | 0.25 | 13.06 | 89.03 | 0.93 | 0.29 | 103.56

Table A.11: Execution times (in msec) of Algorithm C on worst-case data with 8 copies of the hash table on a 512 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1 | 1 | 0.25 | 1.36 | 34.16 | 0.93 | 0.29 | 36.99
4M/2 | 2 | 0.25 | 1.41 | 38.82 | 0.94 | 0.29 | 41.71
4M/4 | 4 | 0.25 | 1.31 | 34.17 | 0.93 | 0.29 | 36.95
4M/8 | 8 | 0.25 | 1.23 | 38.70 | 0.93 | 0.29 | 41.40

Table A.12: Execution times (in msec) of Algorithm C on semiworst-case data with 16 copies of the hash table on a 512 processor CM-5.

Hash Table Size | Avg Bin Length | Encoding Scene Points | Hash Bin Access | Voting | Local Max of Votes | Global Max of Votes | Total Time
4M/1 | 1 | 0.25 | 1.58 | 14.79 | 0.93 | 0.29 | 17.84
4M/2 | 2 | 0.25 | 1.71 | 15.42 | 0.94 | 0.29 | 18.61
4M/4 | 4 | 0.25 | 1.68 | 17.68 | 0.93 | 0.29 | 20.83
4M/8 | 8 | 0.25 | 1.74 | 45.82 | 0.93 | 0.29 | 49.03

Table A.13: Execution times (in msec) of Algorithm C on semiworst-case data with 8 copies of the hash table on a 512 processor CM-5.

Bibliography

[AAG+86] M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, K. Sarocky, and J. A. Webb. Warp architecture and implementation. In 13th Annual International Symposium on Computer Architecture, June 1986.
[AF86] N. Ayache and O. Faugeras. HYPER: A new approach for the recognition and positioning of two dimensional objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):44-54, 1986.
[AP91] H. Alnuweri and V. K. Prasanna. Optimal geometric algorithms for digitized images on fixed size linear array. Distributed Computing, 5:55-65, 1991.
[Bai85] H. S. Baird. Model-Based Image Matching using Location. MIT Press, Cambridge, MA, 1985.
[Bar89] S. T. Barnard. Stochastic stereo matching on the Connection Machine. In DARPA Image Understanding Workshop, pages 1021-1031, Palo Alto, CA, May 1989.
[BB81] H. H. Baker and T. O. Binford. Depth from edge and intensity based stereo. In International Joint Conference on Artificial Intelligence, pages 631-636, Vancouver, Canada, August 1981.
[BB82] D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[BC82] R. C. Bolles and R. A. Cain. Recognizing and locating partially visible objects: The local-feature-focus method. International Journal of Robotics Research, 1(3):57-82, 1982.
[BM88] O. Bourdon and G. Medioni. Object recognition using geometric hashing on the Connection Machine. In International Conference on Pattern Recognition, pages 596-600, 1988.
IE E E Trans actions on Pattern Analysis and M achine Intelligence, 8(l):44-54, 1986. t H. Alnuweri and V. K. Prasanna. O ptim al geom etric algorithm s j for digitized images on fixed size linear array. Distributed Comput- j ing, 5:55-65, 1991. ' j H. S. Baird. Model-Based Image M atching using Location. M IT Press, Cam bridge, MA, 1985. S. T. B arnard. Stochastic stereo m atching on th e Connection M a chine. In D A RP A Image Understanding Workshop, pages 1021- 1031, Palo Alto, CA, May 1989. H. H. Baker and T. O. Binford. D epth from edge and intensity based stereo. In International Joint Conference on Artificial In telligence,, pages 631-636, Vancouver, Canada, A ugust 1981. j I D. H. B allard and C. M. Brown. Com puter Vision. Prentice-H all, \ Englewood Cliffs, N J, 1982. j ) R. C. Bolles and R. A. Cain. Recognizing and locating partially . visible objects: T he local-feature-focus m ethod. International Journal o f Robotics Research, l(3):57-82, 1982. j O. Bourdon and G. Medioni. O bject recognition using geom etric j hashing on th e Connection M achine. In International Conference ' on Pattern Recognition, pages 596-600, 1988. i 132 I [BRF92] [Bro81] [Cas8 8 ] [CH87] [CLM78] [CM91] [Cor91] [CP90] [DA89] [DP 8 6 ] [DRLR89] Zeki Bozkus, Sanjay Ranka, and Geoffrey Fox. Benchm arking the CM-5 m ulticom puter. In Frontiers ’ 92: The Fourth Sym posium on the Frontiers o f M assively Parallel C om putation, pages 100— 107, Me Lean, V irginia, O ctober 1992. R. Brooks. Symbolic reasoning among 3-dim ensional m odel and 2-dim ensional images. Artificial Intelligence, pages 17:285-349, 1981. T. A. Cass. A robust parallel im plem entation of 2d model-based recognition. In IE E E International Conference on Com puter Vi sion and P attern Recognition, pages 879-884, A nn A rbor, MI, 1988. Paul R. Cooper and Susan C. Hollbach. Parallel recognition of of objects com prised of pure structures. In Proc. o f the D ARPA Image Understanding W orkshop, pages 381-391, 1987. C. 
Clark, A. Luk, and C. McNary. Feature based scene analy sis and m odel m atching. In N A T O Advanced Study Institute of Pattern Recognition and Signal Processing, Paris, 1978. ? i A. Califano and R. M ohan. M ultidim ensional indexing for recog- ‘ nizing visual shapes. In IE E E International Conference on Com- 1 puter Vision and Pattern Recognition, pages 28-34, M aui, Hawaii, j 1991. I I Thinking M achines C orporation. CM-5: Technical summary. j Technical report, Thinking M achines C orporation, 1991. ' A. N. Choudhary and J. H. Patel. Parallel Architectures and Algo rithm s fo r Integrated Vision System s. Kluwer A cadem ic Publish ers, Norwell, MA, 1990. U. R. D hond and J. K. Aggarwal. S tructure from stereo: A review. IE E E Transactions on System s, M an, and Cybernetics, 19(8):1489-1510, November 1989. M. D rum m heller and T. Poggio. On parallel stereo. In Interna tional Conference on Robotics and A utom ation, pages 1439-1448, April 1986. M. Dhome, M. Richetin, J. T. Lapreste, and G. Rives. D eterm i nation of the a ttitu d e of 3 -d objects from single prospective view. IE E E Transactions on Pattern Analysis and M achine Intelligence, pages 11(12):1265— 178, 1989. | i 133 j [et.92] [E tt8 8 ] [FH 8 6 ] [Fis8 6 ] [FMN89] [Fou87] [GH90] [GK91] 1 t [GLP84] [GP87] [Gri90] [Hil85] Charles E. Leiserson et.al. T he network architecture of th e Con nection M achine CM-5. Technical report, Thinking M achines Cor poration, 1992. G. J. E ttinger. Large hierarchical object recognition using libraries of param eterized model sub-parts. In IE E E International Confer ence on Com puter Vision and Pattern Recognition, pages 32-41, 1988. A nita M. Flynn and John G. Harris. Recognition algorithm s for th e Connection M achine. Technical report, C om puter Science Lab., Mass. In stitu te of Technology, 1986. A. I. Fisher. Scan line array processor for im age com putations. In International Conference on Com puter Architecture, 1986. T. J. Fan, G. M edioni, and Nevatia. 
Recognizing 3-d objects using surface descriptions. IE E E Transactions on Pattern Analysis and M achine Intelligence, PAM I-11(11):1140-1157, November 1989. T. J. Fountain. Processor Arrays: Architectures and Applications. Academ ic Press, London, 1987. W . E. L. Grimson and D. P. H uttenlocher. On the verification of hypothesized m atches in model-based recognition. In F irst Europe Conference on Com puter Vision, pages 489-498, 1990. A nshul G upta and V ipin K um ar. A nalyzing scalability of par- j allel algorithm s and architectures. Technical report, D epartm ent of C om puter Science, Uiversity of M innesota, M inneapolis, MN I 55455,1991. j P. C. G aston and T. Lozano-Perez. Tactile recognition and local- j ization using object models: T he case of polyhedra on a plane. I IE E E Transactions on Pattern Analysis and M achine Intelligence, 6:257-265, 1984. ! W . E. L. Grimson and T. L. Perez. Localizing overlapping parts by searching the interpretation trees. IE E E Transactions on Pattern Analysis and M achine Intelligence, 9(4):469-482, 1987. W . E. L. Grimson. Object Recognition by Computer. M IT Press, Cam bridge, MA, 1990. ; ! D. Hillis. The Connection Machine. M IT Press, Cam bridge, MA, : 1985. ; i 134 ■ [HU87] D. P. H uttenlocher and S. Ullman. O bject recognition using align m ent. In IE E E F irst International Conference on Com puter Vi sion, pages 102-111, 1987. [HU8 8 ] D. P. H uttenlocher and S. Ullman. Recognizing solid objects by alignm ent. In D A R P A Image Understanding Workshop, pages 1114-1124, 1988. [Hwa85] K. Hwang. Com puter Architecture and Parallel Processing. Mc- Graw Hill, New York, NY, 1985. [Hwa93] K. Hwang. Advanced Com puter Architecture with Parallel Pro gramming. M cGraw Hill, N J, prelim inay edition edition, 1993. [IK8 8 ] [J92] [KA87] [KJ8 6 ] J. Illingw orth and J. K ittler. A survey of th e hough transform . Com puter Vision, Graphics, and Image Processing, pages 44:87- 116, 1988. Joseph Jaa. 
A n Introduction to Parallel Algorithms. Wesley, 1992. Addison- Y. C. Kim and J. K. Aggarwal. Positioning 3-d objects using stereo images. IE E E Transactions on Robotics and A utom ation, RA-3(4):361-373, Aug 1987. T. F. Knoll and R.C. Jain. Recognizing partially visible objects using feature indexed hypotheses. IE E E Transactions on Robotics and Autom ation, 2:3-13, 1986. [KKP93] A. K hokhar, H -J. Kim, and V. Prasanna. Scalable geom etric hash ing on M asPar MP-1. In Proc. International Conference on Com puter Vision and Pattern Recognition, New York, NY, 1993. [KKPW93] A. K hokhar, H-J. Kim, V. Prasanna, and C-L. Wang. Scalable d ata parallel im plem entations of object recognition using geom et ric hashing. Technical report, D epartm ent of E E System s, Univer sity of Souther California, Los Angeles, CA, February 1993. [KKW93] Ashfaq K hokhar, Hyoung J. Kim, and Cho-Li Wang. Im age under standing on M asPar MP-1 and on th e Connection m achine CM-5: Experim ents and perform ance comparisons. In Parallel System Fair at 7th IE E E International Parallel Processing Sym posium , N ewport Beach, CA, April 1993. 135 [KL93] [KLP91] [KP92] [KP93] [KR87] [KSL85] [KSPW92] [KSSS8 6 ] [Kum91] [Lei82] [Lei85a] A. K hokhar and W . Lin. Stereo and image m atching on fixed size linear arrays. In International Parallel Processing Sym posium , N ew port Beach, CA, 1993. A. K hokhar, W. Lin, and V. K. Prasanna. Stereo and im age m atching on fixed size m esh arrays. In International Conference on Com puter Architectures and M achine Perception, Paris, France, D ecem ber 1991. A. K hokhar and V. K. Prasanna. Parallel algorithm s for stereo and im age m atching. In D A R P A Image Understanding W orkshop, pages 1057-1062, San Diego, CA, January 1992. A. K hokhar and V. K. P rasanna. Parallel object recognition us ing stru ctu ral indexing. In International Conference on Com puter Architectures and M achine Perception, 1993. subm itted to. V. K. P rasanna K um ar and C. S. 
Raghavendra. A rray processor w ith m ultiple buses. Journal o f Parallel and Distributed Com put ing, 4:173-190, 1987. E. W . K ent, 0 . Sheier, and R. Lumia. Pipe: Pipeline image pro cessing engine. Journal o f Parallel and Distributed Computing, 2:50-78, 1985. Ashfaq K hokhar, M oham m ad E. Shaaban, V iktor K. Prasanna, and Cho-Li Wang. Heterogeneous supercom puting (HSC): P rob lems and issues. In Workshop on Heterogeneous Processing at 6th IE E E International Parallel Processing Sym posium , Beverley Hill, CA, M arch 1992. A. Kalvin, E. Schonberg, J. T. Schwartz, and M. Sharir. Two di m ensional model-based, boundary m atching using footprints. In ternational Journal o f Robotics Research, 5(4):38-55, 1986. V. K. P rasanna K um ar. Parallel Algorithms and Architectures fo r Image Understanding. Academic Press, 1991. E dited Volume. F. T. Leighton. Parallel com putations on m esh of trees. Techni cal report, C om puter Science Lab., Mass. In stitu te of Technology, 1982. F. T. Leighton. Tight bounds on th e com plexity of parallel sorting. IE E E Transactions on Com puters, C-34:344-354, 1985. i 136 | | [Lei 85b] [LK90] [Low87] [LP91] [LP93] [LR91] [LW8 8 ] [MG87] [MN84] [MN85] [MST85] Charles E. Leiserson. FA T-TR EES:universal networks for hard ware efficient superom puting. In International Conference on P ar allel Processing, pages 393-402, 1985. W . Lin and V. K. P rasanna Kum ar. Efficient histogram m ing on hypercub SIMD m achines. Com puter Vision, Graphics, and Image Processing, 49(1): 104-120, January 1990. D. G. Lowe. Three-dim ensional object recognition from single two-dim ensional images. Artificial Intelligence, pages 31:335-395, 1987. W . Lin and V. K. Prasanna. Parallel architectures and algorithm s for discrete relaxation technique. In IE E E International Confer ence on Com puter Vision and Pattern Recognition, 1991. Cho-Chin Lin and V iktor K. Prasanna. Using extensity as scal ability m easure. 
Technical report, D epartm ent of EE-System s, U niversity of Southern California, Los Anegeles, CA 90089-2562, i 1993. j ! A. F. Laine and G. C. Rom an. A parallel algorithm for increm en tal stereo m atching on SIMD machines. IE E E Transactions on Robotics and Autom ation, 7(1):123-134, February 1991. i i Y. Lam dan and H. J. Wolfson. Geom etric hashing: A general and ; efficient m odel based recognition scheme. In International Confer- \ ence on Com puter Vision, pages 218-249, Tem pa, FL, Decem ber ; 1988. J. M. M arberg and E. Gafni. Sorting in constant num ber of row and column phases on a mesh. Algorithmica, pages 603-612, 1987. ! I G. M edioni and Nevatia. M atching images using linear feature. IE E E Transactions on Pattern Analysis and M achine Intelligence, j PAM I-6(6):675-685, 1984. G. M edioni and Nevatia. Segments-based stereo m atching. Com puter Vision, Graphics, and Image Processing, 31(1):2— 18, July 1985. ! i L. M assone, G. Sandini, and V. Tagliasco. From -invariant topolog ical m apping strategy for 2d shape recognition. Com puter Vision, ; Graphics, and Image Processing, pages 30:169-188, 1985. ; 137 [NMB83] [NS81] [PCF92] [PKP92] [Pre93] [Pri82] [Rei91] [RH91] [RH92] [RHZ76] [RK87] D. N ath, S. N. M aheshwari, and P. C. B hat. Efficient VLSI n e t works for parallel processing based on orthogonal trees. IE E E Transactions on Com puters, c-32(6), June 1983. D. Nassimi and S. Sahni. D ata broadcasting in SIMD com puters. IE E E Transactions on Com puters, C-30:101-107, 1981. Ravi Ponnusamy, Alok Choudhary, and Geoffrey Fox. Com m u nication overhead on CM-5: An experim ental perform ance evau- lation. In Frontiers ’ 92:The Fourth Sym posium on the Frontiers \ o f M assively Parallel Com putation, pages 100-107, Me Lean, Vir- I ginia, O ctober 1992. j i P. N. Pani, A. K hokhar, and V. K. Prasanna. Parallel algorithm s for stereo based on discrete relaxation techniques. 
In International Conference on Pattern Recognition, A m sterdam , Holland, August 1992. Lutz Prechelt. M easurem ents of M asPar MP-1261A com m uni cation operations. Technical R eport D X X /93, In stitu t fur Pro- ; gram m strukturen und D atenorganisation, U niversitat K arlsruhe, ; Postfach 6980, D-7500 K arlsruhe, Germany, 1993. j K. Price. Symbolic m atching of images and scene models. In IE E E I Workshop on Com puter Vision, pages 105-112, 1982. C. R einhart. Specifying Parallel Processor Architectures fo r High Level Com puter Vision Algorithms. PhD thesis, In stitu te of Robotics and Intelligent Systems, School of Engineering, Univer- i sity of Southern California, Los Angeles 90089-0273, 1991. I. Rogoutsos and R. Hummel. Im plem entation of geom etric hash- j ing on the Connection M achine. In IE E E CAD-Based Vision Workshop, pages 76-84, M aui, HI, 1991. I. Rogoutsos and R. Hummel. Massively parallel m odel m atching: G eom etric hashing on the Connection M achine. Computer, pages \ 33-42, February 1992. I I A. Rosenfeld, R. A. Hummel, and S. W. Zucker. Scene labelling by | relaxation operation. IE E E Transactions on System s, Man, and j Cybernetics, SMC-6:420-423, June 1976. ; I D. Reisis and V. K. P rasanna K um ar. VLSI arrays w ith recon- ! figurable busses. In International Conference on Supercomputing, A thens, Greece, June 1987. 138 : [RN90] [RPAK 8 8 ] [SH81] [Skl78] [SM90] [Ste92] [T F F 8 8 ] [W H RR 8 8 ] [WMF81] [Wol90] C. R einhart and R. Nevatia. Efficient parallel processing in high level vision. In D A RP A Image Understanding Workshop, pages 829-839, Septem ber 1990. . j A. P. Reeves, R. J. Prokop, S. E. Andrews, and F. P. Kuhl. Three- ! dim ensional shape analysis using m om ents and fourier descriptors. IE E E Transactions on Pattern Analysis and M achine Intelligence, pages 10(6):937-943, 1988. L. Shapiro and R. Haralick. S tructural description and inexact m atching. 
IE E E Transactions on Pattern Analysis and M achine Intelligence, PAMI-3, 1981. J. Sklansky. On th e hough technique for curve detection. IE E E Transactions on Computers, pages 27:923-926, 1978. F. Stein and G. M edioni. Efficient two dim ensional object recog nition. In International Conference on Pattern Recognition, pages 13-17, Ann Arbor, MI, June 1990. F. Stein. Structural Indexing fo r Object Recognition. PhD thesis, In stitu te of Robotics and Intelligent Systems, School of Engineer ing, University of Southern California, Los Angeles 90089-0273, 1992. L. W. Tucker, C. R. Feynm an, and D. M. Fritzsche. O bject recog nition using the Connection M achine. In IE E E International Con ference on Com puter Vision and Pattern Recognition, pages 871— 878, 1988. i I i C. Weems, A. Hanson, E. Risem an, and A. Rosenfeld. An inte- i grated image understanding benchm ark: Recognition of a 2 1/2 ; d mobile. In IE E E International Conference on Com puter Vision j and Pattern Recognition, pages 871-878, Ann A rbor, MI, June ' 1988. ; T. P. W allace, O. R. M itchell, and K. Fukunaga. Three- • dim ensional shape analysis using local shape descriptors. IE E E I Transactions on Pattern Analysis and M achine Intelligence, pages . 3:310-323, 1981. i H. J. Wolfson. M odel based object recognition by geom etric hash ing. In F irst Europe Conference on Com puter Vision, pages 526- 536, 1990. i 139 | [WZ88] I H. W echsler and G. L. Zim m erm an. 2 -d invariant object recogni tion using distributed associative memory. IE E E Transactions on Pattern Analysis and M achine Intelligence, pages 10(6):811-821, 1988. 140