COMPUTATIONAL MODELING AND UTILIZATION OF ATTENTION, SURPRISE AND ATTENTION GATING by Terrell Nathan Mundhenk A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2009 Copyright 2009 Terrell Nathan Mundhenk ii Epigraph “I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living; It's a way of looking at life through the wrong end of a telescope. Which is what I do, and that enables you to laugh at life's realities.” Dr Seuss iii Dedication For my parents Terry and Ann iv Acknowledgements This really is the trickiest part to write because I want to thank so many people for so many things. First off I would like to thank my sister Amy (Zon) who thought enough of me to whip out a copy of the Schizophrenia paper I wrote with Michael Arbib when she visited that research neurologist at the Cleveland Clinic. I thought that was funny, but it made me feel as if I was doing something interesting. Then there are my closest friends Paul Gunton, Brant Heflin, Mike Olson and Tim Olson who exhibited confidence in my ability to actually complete this silly thing. I would also like to thank Kate Svyatets for sticking with me through my thesis and mood swings. Life is meaningless without friends and loved ones, so I am in your debt. I would also like to acknowledge the excellent scientists I worked closely with over the years, without whom my research would not have been possible. Firstly this includes Michael Arbib who has essentially been a co-thesis advisor to me. I can hardly communicate the enormous amount of things which I learned from him over the years. His honesty and integrity were of great value and, I knew when he said something, he really meant it. Next, I would like to thank Kirstie Bellman and Chris Landauer at the Aerospace Corporation. They gave me more of an idea of what real engineering is about than just about anyone. I also want to give a great big thanks to Wolfgang Einhäuser my co-author on several publications. I don’t think I’ve ever worked with anyone so totally on the ball. I also need to mention Ernest Hall who was my mentor during my undergraduate years and source of encouragement during my graduate years. I don’t think I would be in a research field if it wasn’t for him. v I also must extend my deepest gratitude to the many faculty members who have provided excellent feedback and conversation over the years of my graduate education. I cannot mention every teacher and mentor who touched my life over the past few years, because there were so many. However I would like to extend special thanks to: Irving Biederman, Christof Koch, Christoph von der Malsburg, Bartlett Mel and Stefan Schaal. I would also like to acknowledge many of the students and post-doctoral researchers I have collaborated with or were in general very helpful in assisting me with my research through direct assistance or discussion. They are: Jeff Begley, James ‘Jimmy’ Bonaiuto, Mihail Bota, Vincent J. Chen, Aaron D'Souza, Nitin Dhavale, Lior Elazary, Jacob Everist, Doug Garrett, Larry Kite, Hendrik Makaliwe, Salvador Marmol, Thomas Moulard, Pradeep Natarajan, Vidhya Navalpakkam, Jan Peters, Rob Peters, Eric Pichon, Jack Rininger, Christian Siagian, and Rand Voorhies. If I forgot to mention anyone, I’m really quite sorry. 
Lastly, but far from leastly, I would like to thank my Thesis Advisor Laurent Itti for all his help, input and encouragement he has provided over the years. After my first year at USC, I was getting kind of board being out of the research game. At the time, I was taking robotics from Stefan Schaal. I talked to him about who was doing interesting research in computer vision and he suggested I talk to a promising new faculty member. I took him up on his advice which turned out to be an excellent decision. iLab was new and only had a few students back then, now it so vibrant and full of life with so many projects. I will surely miss Laurent and iLab and I am certain I will look back on these days with great positive satisfaction. vi Table of Contents Epigraph ii Dedication iii Acknowledgements iv List of Tables x List of Figures xi Abbreviations xv Abstract xviii Preface xix About this thesis xix Graduate works not included in this thesis xx Other works of interest not included in this thesis xxi Don’t read the whole thesis xxii Chapter 1: A Brief Introduction to Vision and Attention 1 1.1 What Does our Brain Want to Look For? 5 1.2 How Does our Brain Search for What it Wants? 9 1.2.1 What’s a Feature? 9 1.2.2 How do we Integrate These Features? 12 1.2.3 Beyond the Basic Saliency Model 17 1.3 The Current State of Attention and Other Models 18 1.3.1 Top-Down Models 19 1.3.2 Other Contemporary Models of Saliency 20 1.3.3 The Surprise Model 22 Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency Using Vision Feature Analysis Toolkit (VFAT) 23 2.1.1 Vision, Tracking and Prior Information 23 2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors 26 2.1.4 The iRoom and Meta-prior Information 30 2.2 Saliency, Feature Classification and the Complex Tracker 31 2.2.1 Complex Feature Tracker Components 34 2.3 The Simple Feature Based Tracker 59 2.4 Linking the Simple and Complex Tracker 61 2.5 Results 64 2.6 Discussion 66 2.6.1 Noticing 66 vii 2.6.2 Mixed Experts 66 2.6.3 System Limitations and Future Work 67 Chapter 3: Contour Integration and Visual Saliency 69 3.1 Computation 75 3.2 The model 77 3.2.1 Features 77 3.2.2 The Process 83 3.2.3 Kernel 87 3.2.4 Pseudo-Convolution 91 3.3 Experiments 97 3.3.1 Local element enhancement 98 3.3.2 Non-local Element Enhancement 103 3.3.3 Sensitivity to Non-contour Elements 112 3.3.4 Real World Image Testing 118 3.4 Discussion 122 3.4.1 Extending Dopamine to Temporal Contours via TD (dimensions) 125 3.4.2 Explaining Visual Neural Synchronization with Fast Plasticity 126 3.4.3 Contours + Junctions, Opening a New Dimension on Visual Cortex 127 3.4.4 Model Limitations 128 3.5 Conclusion 129 Chapter 4: Using an Automatic Computation of an Image’s Surprise to Predicts Performance of Observers on a Natural Image Detection Task 130 4.1.1 Overview of Attention and Target Detection 131 4.1.2 Surprise and Attention Capture 134 4.2 Methods 136 4.2.1 Surprise in Brief 136 4.2.2 Using Surprise to Extract Image Statistics from Sequences 139 4.3 Results 144 4.4 A Neural Network Model to Predict RSVP Performance 152 4.4.1 Data Collection 153 4.4.2 Surprise Analysis 154 4.4.3 Training Using a Neural Network 154 4.4.4 Validation and Results of the Neural Network Performance 156 4.5 Discussion 164 4.5.1 Patterns of the Two-Stage Model 165 4.5.2 Information Necessity, Attention Gating and Biological Relevance 169 4.5.3 Generalization of Results 173 4.5.4 Comparison with Previous RSVP Model Prediction Work 173 4.5.5 Network Performance 174 4.5.6 
Applications of the Surprise System 175 4.6 Conclusion 176 viii Chapter 5: Modeling of Attentional Gating using Statistical Surprise 177 5.1 From Surprise to Attentional Gating 180 5.2 Methods 183 5.2.1 Paradigm 183 5.2.2 Computational Methods 184 5.3 Results 188 5.3.1 Relation of Results to Previous Studies Which Showed Causal Links between Surprise and Target Detection 193 5.4 Discussion 196 5.4.1 Variability of the Attention Gate Size Fits within the Paradigm 196 5.4.2 The Attention Gate may Account for Some Split Attention Effects 198 5.4.3 Unifying Episodic Attention Gate Models with Saliency Maps 199 Chapter 6: A Comparison of Surprise Methods and Models Using the Metric of Attention Gate (MAG) 201 6.1 The MAG Method for Comparison of Different Models 201 6.1.1 Fishers Linear Discriminant and Fitness 203 6.1.2 Data Sets Used 206 6.2 Comparison of Opponent Color Spaces using MAG 207 6.2.1 iLab RGBY 210 6.2.2 CIE Lab 211 6.2.3 iLab H2SV2 214 6.2.4 MAG Comparison of Color Spaces 214 6.3 Addition of Junction Feature Channels 216 6.4 Comparison of Different Statistical Models 217 6.5: Checking the Problem with Beta 219 6.5.1 Asymptotic Behavior of β 219 6.5.2 What Happens if We Fix the β Hyperparameter to a Constant Value? 221 6.5 Method Performance Conclusion 226 References 228 Appendices 245 Appendix A: Contour Integration Model Parameters 245 Appendix B: Mathematical Details on Surprise 246 Appendix C: Kullback-Liebler Divergences of Selected Probability Distributions 253 C.1 Conceptual Notes on the KL Distance 253 C.2 KL of the Gaussian Probability Distribution 255 C.3 KL of the Gamma Probability Distribution 255 C.4 KL of the Joint Gamma-Gaussian or Gamma-Gamma Distribution 258 Appendix D: Junction Channel Computation and Source 262 D.1 Junction Channel Source Code 264 Appendix E: RGBY and CIE Lab Color Conversion 267 E.1 RGBY Color Conversion 267 ix E.2 CIE Lab Color Conversion 268 Appendix F: HSV Color Conversion Source 273 F.1 RGB to HSV Transformation 274 F.1.1 HSV Transformation C / C++ Code 275 F.2 HSV to RGB Transformation: 278 F.2.1 RGB Transformation C/C++ Code 279 Appendix G: H2SV Color Conversion Source 281 G.1 HSV to H2SV Transformation 282 G.1.1 HSV to H2SV1 Variant 282 G.1.2 HSV to H2SV2 Variant 283 G.2 H2SV to HSV Simple Transformation 284 G.2.1 H2SV1 to HSV Simple 284 G.2.2 H2SV2 to HSV Simple 284 G.3 H2SV to HSV Robust Transformation 285 G.3.1 General Computations: 285 G.3.2 C / C++ Code for Robust Transformation 286 Appendix H: Selected Figure Graphing Commands for Mathematica 288 x List of Tables Table 2.1: Variance accounted for in ICA/PCA. 50 Table 3.1: Table of probabilities of results at random. 107 Table 3.2: Types of features found salient by CINNIC. 119 Table 4.1: M-W feature significance per type. 145 Table 6.1: MAG scores for color spaces. 213 Table 6.2: MAG scores for junction filters. 215 Table 6.3: MAG scores for statistical models. 217 Table 6.4: MAG scores for different values of beta. 223 xi List of Figures Figure 1.1: Examples of retinotopic maps of the visual cortex. 2 Figure 1.2: What does the brain find visually interesting? 4 Figure 1.3: Why the brain looks for so many types of features. 6 Figure 1.4: The increasing complexity of the visual system. 7 Figure 1.5: Examples of basic feature detectors. 11 Figure 1.6: Generations of feature based attention models. 13 Figure 1.7: Orientation features and Gabor pyramid example with Ashes. 15 Figure 1.8: Butterfly regions and contour integration example. 
16 Figure 1.9: Examples of top-down models of attention. 19 Figure 2.1: Bayesian priors and Meta Priors spectrum. 26 Figure 2.2: From features to ICA to clustering. 32 Figure 2.3: The VFAT architecture graph. 35 Figure 2.4: General saliency model graph. 36 Figure 2.5: Junction detection from INVT features with ICA. 38 Figure 2.6: Feature clustering example shown with node climbing. 40 Figure 2.7: Examples of feature clustering on different data points. 42 Figure 2.8: NPclassify compared with K-means. 43 Figure 2.9: Example of similarity by statistical overlap. 45 Figure 2.10: Example of feature output following ICA/PCA. 51 Figure 2.11: ICA inversion and color features. 53 Figure 2.12: Example of image feature clustering. 54 xii Figure 2.13: NPclassify compared quantitatively with K-means. 57 Figure 2.14: Features clustered during tracking. 58 Figure 2.15: The simple feature tracker. 59 Figure 2.16: The complex tracker handing off to simple trackers. 62 Figure 2.17: Screen shot of the VFAT based tracker. 63 Figure 3.1: The Braun Make Snake contour. 70 Figure 3.2: The basics of contour alignment and processing. 78 Figure 3.3: Neuron priming diagram. 80 Figure 3.4: Neuron group suppression in theory. 82 Figure 3.5: The basics of the CINNIC alignment and processing. 84 Figure 3.6: Hypercolumns and pseudo-convolution. 91 Figure 3.7: Breakdown of the CINNIC process. 95 Figure 3.8: CINNIC multiple scales and averaging. 96 Figure 3.9: 2AFC simulation for the Polat Sagi display. 99 Figure 3.10: Fit of CINNIC to observer AM. 101 Figure 3.11: Interaction of element size and enhancement. 103 Figure 3.12: CINNIC working on Make Snake contours. 105 Figure 3.13: Performance of CINNIC on Make Snake contours. 106 Figure 3.14: The subjective perception of contours and element separation. 108 Figure 3.15: Accounting for performance of CINNIC with kernel size. 110 Figure 3.16: CINNIC sensitivity to junctions. 113 Figure 3.17: Explaining sensitivity of junctions by CINNIC. 115 Figure 3.18: CINNIC sensitivity to salient locations and face features. 120 xiii Figure 3.19: CINNIC and fast plasticity. 127 Figure 4.1: Overview of the surprise system. 138 Figure 4.2: The surprise map over sequence frames. 141 Figure 4.3: Peaks of surprise seem predictive. 144 Figure 4.4: Mean surprise and visual features. 148 Figure 4.5: Standard deviation of surprise and visual features. 150 Figure 4.6: Spatial location of max surprise and visual features. 151 Figure 4.7: The surprise prediction system. 155 Figure 4.8: How surprise prediction was analyzed. 158 Figure 4.9: Performance of surprise prediction. 162 Figure 4.10: Theoretical aspects of surprise prediction. 171 Figure 5.1: Surprise peaks at flankers for hard targets. 179 Figure 5.2: Attention gating and the contents of working memory. 180 Figure 5.3: From RSVP to attention gate computation. 182 Figure 5.4: Computation of the attention gate. 186 Figure 5.5: Computing the overlap ratio. 189 Figure 5.6: Surprise attention gate quantitative results. 191 Figure 5.7: Subjective results on Transportation Targets. 192 Figure 5.8: Subjective results on Animal Targets. 193 Figure 5.9: Explaining past results for Easy-to-Hard. 195 Figure 5.10: Attention gating and detecting multiple targets. 199 Figure 6.1: Which of the two models is better or worse? 202 Figure 6.2: Pretty fisher information graph 205 xiv Figure 6.3: The MAG, an overview. 207 Figure 6.4: A general color space overview. 208 Figure 6.5: RGBY Color space example. 210 Figure 6.6: CIE Lab color space example. 
211 Figure 6.7: H2SV2 color space example. 212 Figure 6.8: MAG and color space results. 213 Figure 6.9: MAG and junction filter results. 215 Figure 6.10: MAG and statistical model results. 217 Figure 6.11: The asymptotic behavior of beta. 220 Figure 6.12: MAG performance for different values of beta. 223 Figure B.1: Different views on the Gamma PDF. 247 Figure B.2: Surprise in Wows! 248 Figure B.3: The DoG Filter. 251 Figure C.1: From a PDF to the integrated KL region. 254 Figure C.2: The Joint gamma-gamma KL. 257 Figure D.1: The junction filter. 262 Figure E.1: CIE 1931 XYZ color space. 269 Figure E.2: Map of the CIE Lab gamut space. 270 Figure F.1: HSV color space. 273 Figure G.1: H2SV color space. 281 xv Abbreviations AI Artificial Intelligence AIP Anterior Interparietal Sulcus AMD Advanced Micro Devices BPNN Back Propagation Neural Network CIE International Commission on Illumination CINNIC Carefully Implemented Neural Network for Integrating Contours CRT Cathode Ray Tube (monitor) DoG Difference of Gaussian EPSP Excitatory Post Synaptic Potential EQ Equation ERF Error Function ERFC Complementary Error Function fMRI Functional (Nuclear) Magnetic Resonance Imaging FS Fast Spiking GABA Gamma Aminobutyric Acid GB Gigabyte (1 billion bytes) GCC GNU C++ Compiler GIMP GNU Image Manipulation Program GNU GNU's Not Unix [sic] (An open source, free software consortium) GPL GNU General Public License xvi HSV Hue/Saturation/Value H2SV HSV Variant with two hue components H2SV2 H2SV with Red/Green Blue/Yellow opponents Hz Hertz (cycles per second) ICA Independent Component Analysis INVT iLab Neuromorphic Vision Toolkit IPSP Inhibitory Post Synaptic Potential IT Inferior Temporal Cortex KL Kullback-Liebler Divergence (sometimes called the KL distance) Lab CIE Lab Color (Luminance with two opponents, a Red/Green b Blue/Yellow) MAG Metric of Attention Gate MHz Megahertz (1,000,000 cycles per second) ms Milliseconds (1/1000 of a second) O Worst Case Asymptotic Complexity (called the big “O” notation) OpenCV Open Computer Vision (Intel Toolkit) PCA Principal Component Analysis PDF Probability Distribution Function PFC Pre-Frontal Cortex POMDP Partially Observable Markov Decision Process RAM Random Access Memory RGB Red, Green and Blue Color xvii RGBY Red/Green and Blue/Yellow Color RMSE Root Mean Squared Error RSVP Rapid Serial Vision Presentation SMA Supplementary Motor Area SQRT Square Root T Terrell TD Temporal Difference V1 Primary Visual Cortex V2 – V5 Regions of Extrastriate Cortex VFAT Vision Feature Analysis Toolkit WTA Winner Take All xviii Abstract What draws in human attention and can we create computational models of it which work the same way? Here we explore this question with several attentional models and applications of them. They are each designed to address a missing fundamental function of attention from the original saliency model designed by Itti and Koch. These include temporal based attention and attention from non-classical feature interactions. Additionally, attention is utilized in an applied setting for the purposes of video tracking. Attention for non-classical feature interactions is handled by a model called CINNIC. It faithfully implements a model of contour integration in visual cortex. It is able to integrate illusory contours of unconnected elements such that the contours “pop-out” as they are supposed to and matches in behavior the performance of human observers. Temporal attention is discussed in the context of an implementation and extensions to a model of surprise. 
We show that surprise predicts well subject performance on natural image Rapid Serial Vision Presentation (RSVP) and gives us a good idea of how an attention gate works in the human visual cortex. The attention gate derived from surprise also gives us a good idea of how visual information is passed to further processing in later stages of the human brain. It is also discussed how to extend the model of surprise using a Metric of Attention Gating (MAG) as a baseline for model performance. This allows us to find different model components and parameters which better explain the attentional blink in RSVP. xix Preface About this thesis This thesis is about the computational modeling of visual attention and surprise. The aspects that will be covered in this work include: • Utilization of the computation of attention in engineering. • Extensions to the computational model of attention and surprise. • Explaining human visual attention and cognition from simulation using computational models. This work is integrative and based on the philosophy that computer vision is aided by better understanding of the human brain and it’s already developed exquisite mechanisms for dealing with the visual world as we know it. At the same time, development of biologically inspired computer vision techniques, when done correctly, yields insight into the theoretical workings of the human brain. Thus, the integration of engineering, neuroscience and cognitive science gives rise to useful synergy. The second chapter covers the utilization of saliency as an engineering topic. This is an example of applying what we have learned from the human brain towards an engineering goal pursued with real world applications in mind. It is somewhat more applied and as a result, many components are not biologically motivated. The reader should keep in mind that project goals placed constraints on what can be done. In this xx case, a real time system able to process images very quickly was needed. Additionally, the project as is typical for engineering endeavors required “deliverables”. Chapters three and six cover methods for extending or changing the way in which surprise is computed. In the case of the former, a model of contour integration is created and examined. This allowed the creation of an extension to the basic saliency model for non-local interactions. Its primary contribution however turned out to be gainful knowledge of the human visual mechanisms involved. The fourth and fifth chapters deal with temporal dimensions of attention using surprise. The goals are to test and extend the model to see if predictions can be made of observer performance. Thus, it is suggested that a better fit model, which is improved in its ability to predict human performance, is closer to the actual mechanisms which the human brain uses. This also has reciprocal engineering applications since it can be used to help determine what humans will attend to in a dynamic scene. Graduate works not included in this thesis I have tried to keep all work included in this document constrained to the topic of visual attention and to work with salient results. As such, much of the work I have done in pursuit of my doctorate is not included. 
These works include, but are not limited to (in chronological order): • The Beobot Project (Mundhenk, Ackerman, Chung, Dhavale, Hudson, Hirata, Pichon, Shi, Tsui & Itti, 2003a) • Schizophrenia and the Mirror Neuron System (Arbib & Mundhenk, 2005) • Estimation of missing data in acoustic samples (Mundhenk, 2005) xxi • Surprise Reduction and Control (Mundhenk & Itti, 2006) • Three Dimensional Saliency (Mundhenk & Itti, 2007) Of interest in particular is the work on Schizophrenia and Mirror Neuron system which has been cited 45 times according to Google scholar. Also of interest is the Beobot project paper which was the most downloaded paper from iLab for three years straight, and it is still in the top five downloads to this day. Other works of interest not included in this thesis Also not included is the large amount of educational materials created and posted online. These include: • http://www.cool-ai.com –AI homeworks, projects and lecture notes for usage in AI courses. • Wikipedia and Wikiversity – contributions including: o http://en.wikiversity.org/wiki/Learning_and_Neural_Networks - Created self-guided teaching page on Neural Networks. o http://en.wikipedia.org/wiki/Cicadidae - Contributed Wikipedia featured picture of the day and written content. o http://en.wikipedia.org/wiki/Gamma_distribution - contributed graphics and corrections. o http://en.wikipedia.org/wiki/Kullback-Leibler_divergence - contributed graphics and corrections. o http://en.wikipedia.org/wiki/Methods_of_computing_square_roots - Added algorithms and analysis. xxii • http://www.cinnic.org/CINNIC-primer.htm – Contour Integration Primer. • http://www.nerd-cam.com/how-to/ - Detailed Instructions on how to build your own robotic camera. Don’t read the whole thesis This thesis uses the standard “stapled papers” framework. While each chapter has been integrated into a coherent work, they each will stand on their own. As a result, the reader is advised to get what they want and get out. That is, go ahead and read a chapter which interests you, but don’t bother to read other parts. However, there tends to be more information here than in the authors papers cited. As such, this thesis may be of use in getting some of the model details not covered in the authors published materials due to space constraints in peer reviewed journals. Have fun T. Nathan Mundhenk 1 Chapter 1: A Brief Introduction to Vision and Attention You got to work today without running over any pedestrians. How did you do that? To be sure this is a good thing. You can pick up items without even thinking about it; you can thumb through a magazine until you get to a favorite advertisement; You can tell a shoe from a phone and you can tell if that giant grizzly bear is in fact gunning for your ice cream cone. You do all sorts of things like this every day and frequently they seem utterly simple. To be certain, sometimes you cannot find your keys to save your life, but even while searching for them, you don’t bang into the furniture in your apartment, at least to too much. How did you do this? I ask, because like just about every person on earth, I’m not totally sure. OK, true, you’ll be glad to know I have some ideas. However, the pages that follow will only scratch the surface of how human beings such as ourselves view the world. To this day, much of human vision still remains a mystery. However, many things about human vision are well established. For instance, we do in fact see from our eyes and the information from them does travel to our brain. 
The brain itself is where what we see is processed, and it turns out that its job is not merely to cool our blood as Aristotle believed it to be. However, there is a place between seeing and understanding which resides within the human brain itself, and how it takes the items in the world and places them into your mind is a complicated story. In this work, we will focus on an important part of this process, the notion of selection and attention. The idea, as it were, is that not everything presented to our eyes makes its way from the retina in the eyes to the seat of consciousness. Instead, it seems that most of what we perceive is just a fraction of what we could. The brain is picky, and it only selects some things to present to us, but many other things simply fade from being.

Figure 1.1: Retinotopy has been demonstrated repeatedly over the years in the visual cortex. Thus, its existence is well founded. An early example is given by Gordon Holmes, who studied brain injuries in soldiers after the First World War (Holmes, 1917, Holmes, 1945) and traced visual deficits to specific injury sites in visual cortex. Then, with primate (Macaque) cortex experiments using Deoxyglucose (Tootell, Silverman, Switkes & De Valois, 1982), it was shown that a pattern activated a region of visual cortex with the same shape. However, this method was limited due to the fact that the animal had to be sacrificed immediately after viewing the pattern in order to reveal and photograph the pattern on the cortex. Later, in 2003, with fMRI using sophisticated moving target displays (Dougherty, Koch, Brewer, Fischer, Modersitzki & Wandell, 2003), regions in the human brain were shown to correspond to locations in the visual cortex in much the same way. However, fMRI allows observation in healthy human volunteers, which is a distinct advantage since more advanced experiments such as those involving motion can be conducted.

What then does the brain do to select the things it wants to see? One could suppose that a magic elf sits in a black box in the brain with a black magic marker looking at photos of the world sent to it by the eyes. The elf inspects each photo and decides if it's something it believes you should see. Otherwise it marks it with an 'x', which means that another magic elf should throw the image away. The idea of magic elves as a brain process is intriguing; however, the evidence does not bear it out. Then again, the brain is in some sense a black box. Thus, while we do not think that magic elves are the basis for cognition, we still must make inferences about the brain's basic workings from a variety of frequently indirect evidence. For instance, we can probe the brain of other primates. In figure 1.1 it is shown that we know that the visual cortex receives information from the eyes in retinotopic coordinates. We know this from experiments on primates where briefly flashed visual patterns caused a similar pattern to form on the visual cortex (Inouye, 1909, Holmes, 1917, Holmes, 1945, Tootell et al., 1982). Does the same thing happen in the human brain? The general consensus is yes; many pieces of visual information from the eye line up on the back of the brain somewhat like a movie projecting onto a screen. Newer studies with functional magnetic resonance imaging (fMRI) on humans reinforce this idea (Horton & Hoyt, 1991, Hadjikhani, Liu, Dale, Cavanagh & Tootell, 1998, Dougherty et al., 2003, Whitney, Goltz, Thomas, Gati, Menon & Goodale, 2003). Still, the evidence is indirect.
No one has seen the movie on the back of the brain, but fortunately, the evidence is satisfying. Retinotopy in the visual cortex is an example of something which is well founded even if the evidence is sometimes indirect. However, do we have such a good notion about how the brain selects what it wants to see from input coming from the eyes? It turns out sort of, but not completely. However, this is not without good reason. What 4 captures ones attention is quite complex (Shiffrin & Schneider, 1977, Treisman & Gormican, 1988, Mack & Rock, 1998). So for instance, things which are colorful tend to get through the brain much easier than things which are dull. This is for instance why stop signs are red and not gray. This is also why poisonous snakes or monarch butterflies (which are also poisonous) have such vivid colors. Interestingly, it is not just the colors which attract our attention it is how the colors interact. For instance, something which is blue attracts more attention when it is next to something yellow while something red tends to get more attention when it is next to something green. So it’s not just the color of something that makes it more salient, it’s how the colors interact as opponents. Figure 1.2: What does the brain find visually interesting? There are many things (from left to right). Good continuation of objects which form a larger percept is interesting. Conspicuous colors, particularly the opponents red/green and blue/yellow stand out. Objects with unique features and special asymmetries (Treisman & Souther, 1985) compared with surrounding objects can stand out. Also motion is a very important cue. Ok, seems pretty simple, but that was just one piece of a rather gigantic puzzle. Just a sampling of what is visually interesting is shown in figure 1.2. It turns out that edges, bright patches, things which are moving and things which are lined up like cars in traffic and … well many things all can attract your attention and control what it is that your brain deems interesting. Still it gets even more complex, your brain itself can decide to change the game and shift the priority on certain visual features. As an example, if you 5 are looking for a red checker, your brain could decide to turn up the juice on the red color channel. That is, your brain can from the top-down change the importance of visual items making some things which were less interesting more interesting and vice versa (Shiffrin & Schneider, 1977, Wolfe, 1994a, Navalpakkam & Itti, 2007, Olivers & Meeter, 2008). So just on the front, we can see that the notion of visual attention and what gets from the retina in the eyes to the seat of thought is quite complex. It involves a great deal of things which interact in rather complex and puzzling ways. However, as mentioned we do know many things, and we are discovering new properties every day. Hopefully this work will help to illuminate some of the processes by which the visual world can pass through brain into the realm of thought. 1.1 What Does our Brain Want to Look For? Imagine that the world was not in color. Further, imagine that all you could see was the outlines of the stuff that makes up the world. You would still need to move around without tipping over chairs and be able to eat and recognize food. What then would draw your attention? You can still tell how to identify many things. After all, it is the world of lines which makes up the Sunday comic strips. 
You might not be able to tell something’s apart which you could back in our colorful world, but for the most part you could tell a table from a chair or an apple from a snake. In this case, what would your brain look for? 6 Figure 1.3: Why does the brain look for so many different types of features? It depends on what it needs to find. Some images are defined by lines, others by colors and some by the arrangement of objects. All of the images shown are interpretable even though typical natural scene information of one type or another seems lacking. Shown from Top Left: Atari’s Superman, Picaso’s La Femme Qui Pleure, Gary Larson’s Far Side; Bottom Left: Liquid Television’s Stick Figure Theater, Van Gogh’s Starry Night Over the Rhone. In basic terms, what your brain wants to look for is information. Figure 1.3 shows several different scenes which one can interpret even though the information is presented very differently with typical information components such as color, lines or texture missing. As will be reviewed later, images are comprised of features, which are the essential bits of information for an image. These can include all of the above as well as more complex features such as junctions and motion. Not all features are necessary for object identification. A typical example is that people were able to enjoy television before it was in color. 7 Figure 1.4: (Left) Features gain increasing complexity and their responses become more and more task dependent. Additionally, visual information is sent down multiple pathways for different kinds of processing (Fagg & Arbib, 1998). Here the task of grasping a mug will prime features related to a mug top-down. These features in turn will be processed in different ways depending on whether we are trying to identify the mug (Ventral: What) or if we are trying to understand its affordances (Dorsal: How) (Ingle, Schneider, Trevathen & Held, 1967). How the brain splits visual information in this way and then reintegrates it, is still not completely understood.1 (Right) The connection diagram of Felleman and Van Essen (Felleman & Van Essen, 1991) of the primate visual cortex demonstrates that elegant models such as the one by Fagg and Arbib still only scratch the surface of the full complexity of the workings of the brain. In addition to the essential bits of an image which are important, what the brain wants to see is also based on the task at hand. Figure 1.4 illustrates a model for the task of grasping an object (Fagg & Arbib, 1998). Initially the object to be grasped must be spotted. If a person has some idea of what they are looking for, then they can attempt to try and focus their attention towards something that matches the expected features of the object. For instance, if the object to be grasped is a red mug, then the initial search for it should bias one to look for red and round things. Such a bias becomes even more 1 This is a reconceptualiztion of the original Fagg & Arbib figure which appears in: [44] Fellous, J.-M., & Arbib, M.A. (2005). Who Needs Emotions? The Brain Meets the Robot. Oxford: Oxford University Press. 8 important in a cluttered scene where many simple salient items may be a distraction. Otherwise, finding a red mug in a plain white room would be more simple. Once the object has been spotted, appropriate features must be further extracted such as geometric land marks (Biederman, 1987, Biederman & Cooper, 1991). So the brain will need to find essential characteristics of the object for the task. 
In this case, we want to grasp or pick up the object. If a portrait of Weird Al Yankovic is painted on the side of the mug, it might grab our attention, but it is unimportant for the task of acquiring the mug. Instead, we should ignore the portrait and just scan the geometry. The task might be entirely different if we had another action we wanted to execute. For instance, if someone asks us whose face is on the mug, we would want to scan for face like features and perhaps ignore the geometric properties completely. In the mug example, we can imagine that many other factors might come into play. For instance the scene might change unexpectedly. As an example, our clumsy relative might have knocked over the mug. This sudden change in the scene would come as a surprise and should initiate a change in attention priorities. If the coffee is flowing towards my notebook computer I should notice that as soon as possible. Then I should perhaps cancel my grasping action and search for paper towels or maybe make a grasp for my computer. The brain also sometimes has very little choice in what it looks for. Some things are highly salient such as a stop sign or an attractive person. It can be hard to override the innate bottom-up search system at times. Thus, many things are attended to fairly quickly and automatically. This is a rather important trait, a rock hurling towards you at great speed demands your attention more than a cup of coffee. As such, we can see that what 9 the brain wants to see also depends on automatic bottom-up systems which can preempt current task demands. 1.2 How Does our Brain Search for What it Wants? 1.2.1 What’s a Feature? What the brain wants to see is based on what is useful for it to see. Early on, after the invention of photography in the 19th century, many artists began to rethink what it was that they were doing. Up until then, artists created the essence of photographs with a paint brush, but since a machine could do the same thing faster and cheaper, direct photographic style artistry seemed like it would become archaic. This helped to bring about the Impressionist style of art. What is notable to our discussion is that artists began to experiment with imagery where fundamental features of a painting could be altered, but the scene could still be interpreted. As structure and form of art was changed and experimented with, it became more obvious that the brain did not need a direct photograph of a scene in order to understand it. Instead, it merely needed some form of structure which resembled the original scene. Partially as a result of this new way of looking at the world, early 20th century cognitive scientists began to think about how objects and components of an image could be linked together to create something which the brain could understand. Both Structuralists such as William James (James, 1890) and in particular Gestalt psychologists such as Max Wertheimer (Wertheimer, 1923) and Kurt Koffka (Koffka, 1935) began to think about how the brain can take in parts of a scene and assemble them 10 into something the brain understands. They believed that perception was a sum of the parts, but at the time they lacked the scientific abilities to prove their ideas. That the visual world was composed of parts which the brain assembles had been proposed. However, what these parts looked like or what form they took was far from certain. Several theories came forward over the years to refine what kind of parts the brain uses to create the whole. 
A popular term for the elementary parts of an image was features. Several scientists in the 1950’s such as Gibson (Gibson, 1950), Barlow (Barlow, 1961) and Attneave (Attneave, 1954) began to note that prior information about shapes, line and textures could be collected and used to interpret abstracted scenes statistically. In particular, Fred Attneave proposed that much of the visual world is redundant and unnecessary for the task of recognition. A cat for instance could be represented as points (or perhaps better as junctions) which are connected by the brain to form the perception of a cat. Under this assumption, a large mass of visual information presented to the retinal, for instance all the parts of the image which are not junctions are extraneous. Partially as a result of such assertions, several theories were put forward claiming that there should be a bottleneck in attention (Broadbent, 1958, Deutsch & Deutsch, 1963). As such, the picture of the visual world was still hazy, but several theories were now giving an idea of how the brain sees the world and what it wants to find. First, the brain compiles images from parts to create a whole. Second, features of an image as simple as points, lines, textures or junctions scattered about a scene may be sufficient in order for the brain to understand an image, but that there may be limits on how much the brain can process at one time. However, several questions remained. First, what kind of features is 11 the brain looking for and second how does the brain look for and process these features keeping in mind that it has some limitations on capacity? Figure 1.5: (Left) Early visual processing by the brain looks for simple features. For instance the retinal begins by sorting out color opponents such as red/green and blue/yellow (Kuffler, 1953, Meyer, 1976). While the understanding of the center surround mechanism is somewhat recent, knowledge of the arrangement of color opponents is very old and its theory can be traced at least as far back as to the German physiologist Ewald Hering in 1872 (Hering, 1872) but was first described physiologically in the goldfish (Daw, 1967). We can simulate these mechanisms using the filters shown. Here we see DoG (Difference of Gaussian) filters which give the On Center / Off Surround response (von Békésy, 1967, Henn & Grüsser, 1968) for colors (Luschow & Nothdurft, 1993). (Right) Later, the visual cortex utilizes hyper columns (Hubel & Wiesel, 1974) to find lines in an image. We can use wavelets like the one on the right to give a response to lines in an image (Leventhal, 1991). The type of wavelets used are typically called Gabor wavelets in honor of the Hungarian engineer Gábor Dénes (Dennis Gabor). (Bottom) The bottom row shows a cross section of the filters on the top. The answers to these questions began to congeal with the development of improved psychometric instrumentation in the 1960s that could better time and control the reaction of human subjects with a wide variety of stimulus. [For instance see (Sperling, 1960, Raab, 1963, Sperling, 1965, Weisstein & Haber, 1965)]. This was accompanied by improved psychophysical instrumentation capable of direct 12 measurement of neural activity in animals [For instance (Daw, 1968, Henn & Grüsser, 1968)]. By the 1970’s combined with the seminal work by David Hubel and Torsten Wiesel (Hubel & Weisel, 1977) we were starting to get a pretty good idea of what kind of elementary features the brain is looking for. 
In figure 1.5 we see some of the features which we knew the brain to be sensitive to by the mid 1970’s. The brain has literal detectors for lines and color opponents such as red/green and blue/yellow. It should be noted however, that this is still the beginning of the story. We knew that there was a set of simple features which the visual cortex would pick up on, but there was no idea how these features could be assembled into larger objects. Additionally, were there more features or was this the full basis set? 1.2.2 How do we Integrate These Features? By the 1970’s two important concepts were beginning to emerge. One was the notion of focused attention. That is, if Attneave and his contemporaries are correct, the brain might be wise to only spend time processing parts of a scene and not the whole thing. Second, features such as lines and colors integrate and bind in the brain. For instance, it had been known since the 1930’s that the brain can bind colors and words. John Stroop (Stroop, 1935) showed that by flashing a word such as “blue” but coloring it red tended to trip up and slow down observers when asked to name it. Would such a mechanism also apply at the level of feature integration? 13 Figure 1.6: Three generations of models of feature based attention are shown in succession. Starting with Treisman, Gelade & Gormican (Treisman & Gelade, 1980, Treisman & Gormican, 1988)2 it was hypothesized that the way visual features such as lines and colors integrate in parallel controls the serial components of attention. This model itself is a refinement of earlier theories of attention, for instance Shiffrin and Schneiders theory of automatic and controlled attention (Shiffrin & Schneider, 1977) and the pre-attentive and focal attention model of Neisser (Neisser, 1967). Later Koch and Ullman (Koch & Ullman, 1985) expanded this with the notion of having a saliency map which controls the spotlight of attention with a winner-take-all network. Following this, it was made into a fully functional computational model by Itti and Koch (Itti & Koch, 2001b). Several theoretical constructs were advanced and lead to increasing understanding on the question of attention (Figure 1.6). It was discovered that attention seems to be focal and that only parts of an image actually reach what many people would call consciousness. In 1967, this hypothesis was put forward by Ulric Neisser (Neisser, 1967) who suggested that there was a pre-attentive phase to visual processing when features were gathered together in parallel, but that later the features combined and were inspected serially by focal attention. This was further expanded by Richard Shiffrin and Walter Schneider (Shiffrin & Schneider, 1977) who saw a second dimension to attention. They suggested that some parts of attention are automatic and some parts are controlled. That 2 This drawing is from Treisman and Gormican 1988. It is based on the feature integration theory given in Treisman and Gelade 1980. However, Treisman and Souther 1985 gives a very similar figure. 14 is, some features in an image grab our attention automatically and almost reflexively. However, we are also consciously able control some things which we attend to. This is what is now thought of in broader terms as bottom-up and top-down attention. In 1980, Anne Treisman and Gerry Gelade further refined these ideas into a Feature Integration theory of attention (Treisman & Gelade, 1980). 
There idea was that the parallel computation of Neisser could be split into different features which could be processed separately in the pre-attentive stage and then brought together. Thus, the brain would compute its interest in colors, lines and intensities at the same time and that it is the sum integration of different features which determines the locus of attention. That is, attention is driven simultaneously be each type of feature, but the conjunction or independent dominance of a feature can draw in attention. However, the question was left open as to how the features could combine to create a master map of attention. A possible answer was given by Christof Koch and Shimon Ullman (Koch & Ullman, 1985) who gave the idea that the brain maintained a saliency map for the visual world and that a max selector processes (Didday, 1976, Amari & Arbib, 1977) would refine the saliency map so that only a single location in the visual field would stick out. This allowed for many things in the world to be salient at the same time, but suggested that the most salient item of all is that one which the brain will attend to. The theories of attention put forward by Treisman et al as well as Koch and Ullman gained further support over the next decade due to a variety of experimental results [For examples see (Nothdurft, 1991b, Nothdurft, 1991a, Nothdurft, 1992, Luschow & Nothdurft, 1993)]. In 1998 Laurent Itti, Christof Koch and Ernst Niebur further refined the model of Koch and Ullman and created a comprehensive 15 computational model that allowed direct testing of it (Itti, Koch & Niebur, 1998). It also included a comprehensive set of feature detectors as well as a Gaussian/Laplacian pyramid to detect features at many different scales (Figure 1.7). Figure 1.7: Gabor wavelet filters give a response to lines in an image. One way to do this is to create four or more wavelet filters each with its own directional orientation (Itti et al., 1998). On the left this can be seen as filters sensitive to lines are 0, 45, 90 and 135 degrees. On the right is an image which has been convolved by the filters at 0 and 90 degrees and the lines that were extracted by the filters. Since lines have different sizes we can convolve each image at a different scale to increase our chances of discovering lines of different widths (Tanimoto & Pavlidis, 1975, Burt & Adelson, 1983, Greenspan, Belongie, Goodman, Perona, Rakshit & Anderson, 1994)3. The essential gain was that the computer could be treated like a brain in a box. If the model of Koch and Ullman was correct, then a comprehensive computational model should have parity with the behavior of humans. Initial results showed that the saliency 3 The cats name is Ashes. 16 Figure 1.8: (Top Row) Features that the brain is looking for get increasingly complex. This happens frequently when simpler features are combined to create new ones (Field, Hayes & Hess, 1993, Kovács & Julesz, 1993, Polat & Sagi, 1994, Gilbert, Das, Ito, Kapadia & Westheimer, 1996, Li, 1998, Mundhenk & Itti, 2005). For instance, line fragments which Gabor filters pick up on can then be connected in a corresponding zone which completes contours. The butterfly pattern on the left will complete a contour when line fragments lie in the green zone and are aligned. This can be seen on the right where three co-linearly aligned fragments enhance each other to give a larger response. The graph is somewhat crude, but the point is that the more elements that are aligned, the stronger the response. 
(Bottom Row) The elements aligned into a circle on the left are much more salient than random elements (Kovács & Julesz, 1993, Braun, 1999). They should produce an activation pattern like the one on the right (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005). This is discussed at length in chapter 3.

The essential gain was that the computer could be treated like a brain in a box. If the model of Koch and Ullman was correct, then a comprehensive computational model should have parity with the behavior of humans. Initial results showed that the saliency model behaved in a manner that was expected (Itti & Koch, 2001b). The computational saliency model was able to detect many odd-man-out features, search asymmetries and conditions for pop-out that would be expected of human observers. Additionally, the model could be augmented to include top-down attentional effects (Itti, 2000) by adjusting feature weights in a manner similar to the mechanism proposed 25 years earlier for directed attention by Shiffrin and Schneider (Shiffrin & Schneider, 1977). Thus, for instance, when looking for a red Coke can, it is almost a simple matter to weight the red feature more during search.

1.2.3 Beyond the Basic Saliency Model

The original saliency model of Itti and Koch lacked three components. The first was the interaction of non-local features. Thus, as can be seen in figure 1.8, contours and line segments which extend past the classic receptive fields of the basic feature detectors have been found to be salient (Kovács & Julesz, 1993, Polat & Sagi, 1993b, Gilbert et al., 1996, Braun, 1999, Geisler, Perry, Super & Gallogly, 2001). The second missing element was temporal attention. This is itself comprised of three components which may or may not be independent of each other: motion, change and masking. Things which are in motion tend to draw our attention. However, simple changes such as the appearance or disappearance of an element in a video can draw our attention as well (Mack & Rock, 1998). The third element of temporal attention, masking, has been studied quite extensively (Breitmeyer & Öğmen, 2006). It is where something at one instant in a sequence of images is blocked from perception by something spatially proximal that comes before or after it. It includes both backwards and forwards masking, the attentional blink (Raymond, Shapiro & Arnell, 1992) and both automatic and controlled mechanisms (Sperling & Weichselgartner, 1995, Olivers & Meeter, 2008). Further, the temporal components of attention are hypothesized to be comprised of more than one processing stage (Chun & Potter, 1995). The third missing component, top-down attention, has been partially implemented since the original model was incepted (Itti, 2000, Navalpakkam & Itti, 2005). However, a complete model of top-down attention is probably many years away since it requires construction of the "top" component, which may include consciousness itself.

A non-local extension to the saliency model was eventually provided by T. Nathan Mundhenk (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005) and was extensively tested. This is covered in chapter 3. The extensions to temporal saliency are covered in chapters 2, 4, 5 and 6. They include extension by the addition of a motion channel in chapter 2 (Mundhenk, Landauer, Bellman, Arbib & Itti, 2004b, Mundhenk, Navalpakkam, Makaliwe, Vasudevan & Itti, 2004c, Mundhenk, Everist, Landauer, Itti & Bellman, 2005a) and extension by the usage of Bayesian Surprise in chapters 4, 5 and 6 (Itti & Baldi, 2005, Einhäuser, Mundhenk, Baldi, Koch & Itti, 2007b, Mundhenk, Einhäuser & Itti, 2009).
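Before turning to more recent models, it may help to make the basic saliency pipeline of section 1.2.2 concrete. The following is a minimal, hypothetical C++ sketch of its last two stages: combining per-feature maps, optionally re-weighted top-down (for example, boosting the red channel when searching for a red Coke can), into a single saliency map, and then selecting the winner. The type and function names are invented for illustration only; the actual iLab Neuromorphic Vision Toolkit additionally normalizes each map before combination and implements the winner-take-all as a dynamical neural network rather than a simple argmax.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // One feature (conspicuity) map, e.g. red/green, blue/yellow, intensity,
    // or one Gabor orientation, stored row-major at a common resolution.
    struct FeatureMap {
      std::size_t width, height;
      std::vector<float> data;   // size = width * height
      float weight;              // top-down gain for this channel
    };

    // Sum the weighted feature maps into a single saliency map.
    std::vector<float> combineMaps(const std::vector<FeatureMap>& maps) {
      std::vector<float> saliency(maps.front().width * maps.front().height, 0.0f);
      for (const FeatureMap& m : maps)
        for (std::size_t i = 0; i < saliency.size(); ++i)
          saliency[i] += m.weight * m.data[i];
      return saliency;
    }

    // Winner-take-all reduced to its simplest form: return the (x, y)
    // coordinates of the single most salient location.
    std::pair<std::size_t, std::size_t>
    winnerTakeAll(const std::vector<float>& saliency, std::size_t width) {
      std::size_t best = 0;
      for (std::size_t i = 1; i < saliency.size(); ++i)
        if (saliency[i] > saliency[best]) best = i;
      return std::make_pair(best % width, best / width);
    }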
1.3 The Current State of Attention and Other Models

Many contemporary models of attention are designed to address one or more of the shortcomings of the original saliency model discussed in the last section, while many others are attempts at general improvements or are different models altogether.

1.3.1 Top-Down Models

Modeling the factors of top-down versus bottom-up attention goes back a long way. As can be seen in figure 1.9, an early model was provided by Shiffrin and Schneider, but that model lacked a good notion of feature integration as well as an attentional map. Jeremy Wolfe (Wolfe, 1994a) provided a good synthesis of the model of Shiffrin and Schneider with the model of Koch and Ullman. Thus, the effects of top-down control were merged with a feature integration attention model which also included an attention map. However, this is an example of a static scene top-down model. That is, prior knowledge is integrated as a top-down mechanism, but not necessarily online. Current extensions of this model include the integration of task influence (Navalpakkam & Itti, 2005) as well as an explanation of feature tuning (Navalpakkam & Itti, 2007).

Figure 1.9: (Left) An early example of an attention model with top-down guided search activation is the attention model of Shiffrin and Schneider (Shiffrin & Schneider, 1977). Here automatic parallel processing layers that compute attention can be controlled by a more serialized attention director. (Right) The model by Wolfe (Wolfe, 1994a) is conceptually a synthesis of Shiffrin & Schneider with Koch and Ullman (Koch & Ullman, 1985). That is, it has added feature integration and a saliency map.

Many other models which integrate top-down attention are concerned with online handling of features as well as task demands. Sperling et al. (Sperling, Reeves, Blaser, Lu & Weichselgartner, 2001) have provided one such model with a gamma-shaped window function of attention. Task is treated as a spatial cue to certain locations, allowing a "Quantal" discrete attention window to be opened at that location for a certain amount of time. It also includes bottom-up attention using the original term "automatic" attention. However, as with the model of Wolfe, it has not been nearly as completely implemented as the Itti and Koch model. One might consider it a partial implementation in comparison. A recent and important contribution to the modeling of top-down attention is provided by Olivers and Meeter. This is known as the Boost and Bounce theory of attention (Olivers & Meeter, 2008). In many ways it is an extension of Sperling et al., but it has more explicit handling of features as well as an improved description of the interaction of frontal cortical mechanisms with visual cortical processing. Again, however, the implementation is very computationally limited.

1.3.2 Other Contemporary Models of Saliency

Currently there are a variety of other attention models in existence. Some are variants of the model of Itti and Koch (Frintrop, 2006, Itti & Baldi, 2006, Gao, Mahadevan & Vasconcelos, 2008) while others are more unique (Cave, 1999, Li, 2002, Bruce & Tsotsos, 2006). The model by Simone Frintrop is known as VOCUS. Its goal is to use models of saliency to improve computer vision search. It implements top-down task improvements in a manner similar to Itti and Koch, but adds a top-down excitation/inhibition mechanism. It also uses the CIE Lab (McLaren, 1976) color space for color opponents and implements a form of 3D saliency for laser range finders.
Dashan Gao et al. (Gao et al., 2008) have implemented an interesting variation on Itti and Koch, which is to change the treatment of center surround interactions. The center surround response is termed "discriminant" center surround because it is formed from the strength of a linear discriminant. The more crisply the center of a location can be discriminated from its surround, the stronger the response given at that location. However, this is a mechanism very similar to the way the model of Surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) computes spatial attention.

The model by Bruce and Tsotsos (Bruce & Tsotsos, 2006) is an information maximization model. It works by taking in a series of images and forming a basis set of features. The basis set is then used to convolve an image. The response to each basis feature competes against the basis features from all other patches. Thus, if a basis feature gives a unique response at an image location, it is considered salient. The most notable difference between this model and Itti and Koch is the derivation of basis features from prior images, similar to Olshausen and Field (Olshausen & Field, 1996). However, the rectification using a neural network may compute competition in a way which is not sufficiently different from a WTA competition, though it may arguably be more biologically plausible.

The model by Li is much more distinct. Li's model (Li, 2002) is strongly model theoretic and somewhat neglects the task of image processing. However, it is claimed that it can provide saliency pre-attentively without the use of separate feature saliency maps. Thus, the model should compute a singular saliency by combining feature responses at the same time. This may be a more plausible method for computing saliency, but it is unclear if it functionally gains much over other models of saliency.

1.3.3 The Surprise Model

There are two notable trends in saliency models. One is the emergence of information theoretic constructs and the other is the continued divergence between static saliency models and dynamic models of attention. With the recent exception of Gao (Gao et al., 2008), attention models were either static feature based models or dynamic, but primarily theoretical, models (Sperling et al., 2001). The introduction of Surprise based attention (Itti & Baldi, 2005, Itti & Baldi, 2006) created for the first time a statistically sound and dynamic model of attention. In chapter 4, we will introduce surprise based attention and show that it does an excellent job of taking into account dynamic attentional effects seen in rapid serial vision experiments. This is then shown to give a good framework for a short term attention gate mechanism in chapter 5. In short, the new framework has some similarities to Bruce and Tsotsos in that prior images are used to create beliefs about new images. However, surprise computes these beliefs online. This means that it does not need to be trained or to have strong prior information about feature prevalence. Instead, the sequence itself provides the needed information. The extensive testing and validation in chapters 4-6 also demonstrate firmly that it explains many temporal attention effects. Additionally, we postulate that we have gained further insight into the attentional window into the brain.
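Since surprise is the backbone of chapters 4 through 6, it is worth stating the core quantity here in compact form (see Appendix B and Appendix C for the full derivations). Surprise is measured as the Kullback-Leibler divergence between the observer's posterior and prior beliefs over models M after seeing data D. The Gaussian case is given below only as a worked illustration, because its KL divergence has a particularly transparent form; the feature channels tested later rely on Gamma-family distributions (Appendix C):

    S(D) = \mathrm{KL}\!\left( P(M \mid D) \,\|\, P(M) \right)
         = \int_{\mathcal{M}} P(M \mid D)\, \log \frac{P(M \mid D)}{P(M)}\, dM

    \mathrm{KL}\!\left( \mathcal{N}(\mu_1,\sigma_1^2) \,\|\, \mathcal{N}(\mu_0,\sigma_0^2) \right)
         = \ln\frac{\sigma_0}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_0)^2}{2\sigma_0^2} - \frac{1}{2}

When the variances are equal, this reduces to (\mu_1-\mu_0)^2 / (2\sigma^2): data that move the belief mean far relative to its uncertainty are highly surprising, while data that merely confirm the current belief generate almost none. This is the property that lets surprise respond to change over an image sequence rather than to static feature strength alone.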
23 Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency Using Vision Feature Analysis Toolkit (VFAT)4 In a prior project, we developed a multi agent system for noticing and tracking different visual targets in a room. This was known as the iRoom project. Several aspects of this system included both individual noticing and acquisition of unknown targets as well as sharing that information with other tracking agents (Mundhenk et al., 2003a, Mundhenk, Dhavale, Marmol, Calleja, Navalpakkam, Bellman, Landauer, Arbib & Itti, 2003b). This chapter is primarily concerned with a combined tracker that uses the saliency of targets to notice them. It then classifies them without strong prior knowledge (priors) of their visual feature, and passes that information about the targets to a tracker, which conversely requires prior information about features in order to track them. This combination of trackers allows us to find unknown, but interesting objects in a scene and classify them well enough to track them. Additionally, information gathered can be placed into a signature about objects being tracked and shared with other camera agents. The signature that can be passed is helpful for many reasons since it can bias other agents towards a shared target as well as help in creating task dependant tracking. 2.1.1 Vision, Tracking and Prior Information For most target acquisition and tracking purposes, prior information about the targets features is needed in order for the tracker to perform its task. For instance, a basic color tracker that tracks objects based on color needs to know a priori what the color of 4 For more information see also: http://ilab.usc.edu/wiki/index.php/VFAT_Tech_Doc 24 the target that it wishes to track is. If one is going to track a flying grape fruit, then one would set a tracker with a certain color of yellow and some threshold about which the color can vary. In general, many newer trackers use statistical information about an objects features which allows one to define seemingly more natural boundaries for what features one would expect to find on a target (Lowe, 1999, Mundhenk et al., 2004b, Mundhenk et al., 2004c, Mundhenk et al., 2005a, Siagian & Itti, 2007). However, in order to deploy such a tracker, one needs to find the features, which describe the object before tracking it. This creates two interesting problems. The first problem is that the set of training examples may be insufficient to describe the real world domain of an object. That is, the trainer leaves out examples from training data, which may hold important information about certain variants of an object. We might think for instance from our flying grapefruit tracking example, that of the fruits that fly by, oranges never do. As a result, we would unknowingly let our tracker have some leeway and track grapefruit that might even be orange in appearance. It might however turn out that we were wrong. At some point, an orange flies by and our tracker tracks it the same as a flying grapefruit. This can happen for several reasons, the first is that we had never observed an orange fly by and as such didn’t realize that indeed, they can fly by. Another reason is that the world changed. When we set up the tracker, only grapefruits could fly by. However, the thing that makes them fly, now acts on oranges, which may be an accidental change, for instance if an orange tree begins to grow in our flying grapefruit orchard. 
However, it might also be the case that someone has decided to start throwing oranges in front of our tracker. As such, the domain of trackable objects can change either accidentally or 25 intentionally. In such a case, our tracker may now erroneously tracks flying oranges as flying grapefruit. As can be seen from our first example, our tracker might fail if someone tries to fool it. Someone starts throwing oranges in front of our tracker, or perhaps they might wrap our grapefruits in a red wrapper so that our tracker thinks they are apples. If we are selling our flying grapefruits and our tracker is supposed to make sure each one makes it to a shipping crate, it would fail if someone sneaks them by as another fruit. As such, once a dishonest person learns what our tracker is looking for, it becomes much easier to fool. This is seen in the real world in security applications, such as Spam filtering, where many email security companies have to update information on what constitutes Spam on a regular bases to deal with spammers who learn simple ways around the filters. It should be expected that the same problem would go for any other security related application including a vision-based tracker. In the case of our flying grapefruit tracker, its function may not be explicitly security related, but as a device related to accounting, it is prone to tampering. What is needed then for vision based tracking is the ability to be able to define its own priors. It has been proposed that gestalt rules of continuity and motion allow visual information to be learned without necessarily needing prior information about what features individual objects possess (Von der Malsberg, 1981, Prodöhl, Würtz & von der Malsberg, 2003, Mundhenk et al., 2004b, Mundhenk & Itti, 2005). That is, the human visual system does not necessarily know what it is looking for, but it knows how to learn how to look. This itself constitutes a kind of prior information which one might consider meta-prior information. That is, information about what structure or meta-model is 26 needed to gather prior information, such as Bayesian information, is itself a type of prior information. Using meta-prior information, an artificial agent might learn on its own how to form groups that can be used to create statistical relationships and build new prior information about what it wishes to track. Thus, abstractly speaking, meta-priors are concerned with learning about how to learn. 2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors Figure 2.1: It is interesting to note how different AI solutions require different amounts of prior information in order to function. Additionally, it seems that the more prior information a solution requires the more certainty it has in its results, but the more biased it becomes towards those results. Thus, we can place solutions along a spectrum based on the prior information required. Popular solutions such as Back Propagation Neural Networks and Support Vector Machines seem to fall in the middle of the spectrum in essence making them green machines and earning them the reputation of being the 2nd best solution for every problem. We propose that meta-priors are part of a spectrum of knowledge acquisition and understanding. At one end of the spectrum, are the rigid rules of logic and induction from which decisions are drawn with great certainty, but with which unknown variables must be sparse enough to make those reasonable decisions (figure 2.1). 
In the middle we place 27 more traditional statistical methods, which either require what we will define as strong meta-priors in order to work or require Bayesian priors. We place the statistical machines in the middle, since they allow for error and random elements as part of probabilities and do not need to know everything about a target. Instead, they need to understand the variance of information and draw decisions about what should be expected. Typically, this is gifted to a statistical learner in the form of a kernel or graph. Alternatively, the meta-prior does not make an inference about knowledge itself, but instead is used to understand its construction. From this, we then state, that meta-priors can lead to Bayesian priors, which can then lead to logical inductive priors. From meta-priors we have the greatest flexibility about our understanding of the world and in general terms, the least amount of bias; whereas on the other end of the spectrum, logical inductive priors have the least flexibility, but have the greatest certainty. An ideal agent should be able to reason about its knowledge along this spectrum. If a probability becomes very strong, then it can become a logical rule. However, if a logical rule fails, then one should reason about the probability of it doing so. Additionally, new things may occur which have yet unknown statistical properties. As such, the meta-priors can be used to promote raw data into a statistical framework or to re-reason about a statistical framework, which now seems invalid. Using certain kinds of meta-prior information, many Bayesian systems are able to find groupings which can serve as prior information to other programs which are unable to do so themselves. However, most Bayesian models work from meta-priors that require a variety of strong meta-priors. For instance, the most common requirement is that the number of object or feature classes must be specified. This can be seen in expectation 28 maximization, K-means and back-propagation neural networks, which need to have a set size for how many classes exist in the space they inspect. The number of classes thus, becomes a strong and rather inflexible meta-prior for these methods. Additionally, other strong meta-priors may include space size, data distribution types and the choice of kernel. The interesting thing about meta-priors is that they can be flexible or rigid. For instance, specifying you have several classes that are fit by a Gaussian distribution is semi-flexible in that you have some leeway in the covariance of your data, but the distribution of the data should be uni-modal and have a generally elliptical shape. An example of more rigid meta-priors would be specifying a priori the number of classes you believe you will have. So for instance, going back to our grapefruit example, if you believe your data to be Gaussian, you suspect that flying grapefruit have a mean color with some variance in that color. You can make a more rigid assumption that you will only see three classes such as, flying grapefruit, oranges and apples. All of these are of course part of the design process, but as mentioned they are prone to their own special problems. Ideally, an intelligent agent that wishes to reason about the world should have the ability to reason with flexible weak meta-priors but then use those to define Bayesian like priors. Here we define weak meta-priors as having flexible parameters that can automatically adjust to different situations. 
So for instance, we might set up a computer vision system and describe for it the statistical features of grapefruit, oranges and apples. However, the system should be able to define new classes from observation either by noticing that a mass of objects (or points) seem to be able to form their own category (Rosenblatt, 1962, Dempster, Laird & Rubin, 1977, Boser, Guyon & Vapnik, 1992, Jain, 29 Murty & Flynn, 1999, Müller, Mika, Rätsch, Tsuda & Schölkopf, 2001, Mundhenk et al., 2004b, Mundhenk et al., 2005a) or through violation of expectation and surprise (Itti & Baldi, 2005, Itti & Baldi, 2006). An example of the first is that if we cluster data points that describe objects, and if a new object appears such as a kiwi, a new constellation of points will emerge. An example of the second is that if we expect an apple to fly by, but see an orange, it suggests something interesting is going on. It might be that new fruit have entered our domain. In the first case, our learning is inductive, while in the second case it is more deductive. We thus define weak meta-priors to be situationally independent. That is, the meta-prior information can vary depending on the situation and the data. Ideally, information within the data itself is what drives this flexibility. So for instance, when selecting what is the most salient object in a scene, we might select a yellow ball. However, a moving gray ball may be more salient if presented at the same time as the yellow ball. Thus, the selection feature for what is most salient is not constantly a color, but can also be motion. So it is the interplay of these features, which can promote the saliency of one object over the other (Treisman & Gelade, 1980). Yet another example is that the number of classes is not defined a priori as a strong meta-prior, but instead, variance between features causes them to coalesce into classes. So as an abstract example, the number of planets in a solar system is not pre-determined. Instead, the interplay of physical forces between matter will eventually build a certain number of planets. Thus, the physical forces of nature are abstractly a weak meta-prior for what kind of planets will emerge, and how many will be formed. 30 2.1.4 The iRoom and Meta-prior Information Here we now review a vision system for following and tracking objects and people in a room or other spaces that can process at the level of weak meta-priors, Bayesian priors and even logical inductive priors. From this, we then need artificial experts, which can use weak meta-priors to process information into more precise statistical and Bayesian form information. Additionally, once we know things with a degree of certainty, it is optimal to create rules for how the system should behave. That is, we input visual information looking for new information from weak meta-priors, which can be used to augment a vision system that uses Bayesian information. Eventually strong Bayesian information can be used to create logical rules. We will describe this process in greater detail in the following pages but give a brief description here. Using a biological model of visual saliency from the iLab Neuromorphic Vision Toolkit (INVT) we find what is interesting in a visual scene. We then use it to extract visual features from salient locations (Itti & Koch, 2001b) and group them into classes using a non-parametric and highly flexible weak-meta prior classifier NPclassify (Mundhenk et al., 2004b, Mundhenk et al., 2005a). 
This creates initial information about a scene: for instance how many classes of objects seem present in a scene, where they are and what general features they contain. We then track objects using this statistically priorless tracker but gain advantage by taking the information from this tracker and handing it to a simple tracker, which uses statistical adaptation to track a target with greater effectiveness. In essence, it takes in initial information and then computes its own statistical information from a framework using weak meta-prior information. That 31 statistical information is then used as a statistical prior in another simpler and faster tracker. 2.2 Saliency, Feature Classification and the Complex Tracker There were several components used in the tracking system in iRoom. As mentioned, these started by needing less meta-prior information and then gathering information that allows the tracking of targets by more robust trackers that require more information about the target. The first step is to notice the target. This is done using visual saliency. Here very basic gestalt rules about the uniqueness of features in a scene are used to promote objects as more or less salient (Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b). This is done by competing image feature locations against each other. A weak image feature that is not very unique will tend to be suppressed by other image features, while strong image features that are different will tend to pop out as it receives less inhibition. In general, the saliency model acts as a kind of max selector over competing image features. The result from this stage is a saliency map that tells us how salient each pixel in an image is. Once the saliency of locations in an image can be computed, we can extract information about the features at those locations. This is done using a Monte Carlo like selection that treats the saliency map as a statistical map for these purposes. The more salient a location in an image is, the more likely we are to select a feature from that location. In the current working version we select about 600 feature locations from each frame of video. Each of the feature locations contains information about the image such as color, texture and motion information. These are combined together and used to 32 Figure 2.2: The complex feature tracker is a composite of several solutions. It first uses INVT visual saliency to notice objects of interest in a scene. Independent Component Analysis and Principle Component Analysis (Jollife, 1986, Bell & Sejnowski, 1995, Hyvärinen, 1999) are used to reduce dimensions and condition the information from features extracted at salient locations. These are fed to a non-parametric clustering based classification algorithm called NPclassify, which identifies the feature classes in each image. The feature classes are used as signatures that allow the complex tracker to compare objects across frames and additionally share that information with other trackers such as the simple tracker discussed later. The signatures are also invariant to many view point effects. As such they can be shared with cameras and agents with different points of view. classify each of the 600 features into distinct classes. For this we use the non-parametric classifier NPclassify mentioned above. This classifier classifies each feature location without needing to know a priori the number of object feature classes or how many samples should fall into each class. 
It forms classes by weighting each feature vector from each feature location by its distance to every other point. It then can link each feature location to another, which is the closest feature location that has a higher weight. This causes points to link to more central points. Where a central point links to another cluster it is not a member of, we tend to find that the link is comparatively rather long. 33 We can use this to cut links, thus, creating many classes. In essence, feature vectors from the image are grouped based on value proximity. As an example, two pixels that are close to each other in an image and are both blue would have a greater tendency to be grouped together than two pixels in an image that are far apart and are blue and yellow. Once we have established what classes exist and which feature locations belong to them, we can statistically analyze them to determine prior information that will be useful to any tracker, which requires statistical prior information in order to track a target. Thus, we create a signature for each class that describes the mean values for each feature type as well as the standard deviation within that class. Additionally, since spatial locations play a part in weighting feature vectors during clustering, feature vectors that are classified in the same class tend to lie near each other. Thus, the signature can contain the spatial location of the class as well. Figure 2.2 shows the flow from saliency to feature classification and signature creation. The signatures we derive from the feature properties of each class exist to serve two purposes. The first is that it allows this complex tracker to build its own prior awareness. When it classifies the next frame of video, it can try and match each of the new objects it classifies as being the same object in the last frame. Thus, it is not just a classifier, but it can track objects on its own for short periods. Further, we can use information about targets to bias the classification process between frames. So for instance, we would expect that the second frame of video in a sequence should find objects which are similar to the first frame. As such, each classified object in any given frame, biases the search in the next frame, by weighting the classifier towards finding objects of those types. 34 While this seems very complex, signature creation is fairly quick, saliency computation is done in real time on eight 733 MHz Pentium III computers in a Beowulf cluster. The rest of the code runs in under 60 ms on an Opteron 150 based computer. This means we can do weak meta-prior classification and extraction of signatures at around > 15 frames per second. 2.2.1 Complex Feature Tracker Components 2.2.1.1 Visual Saliency The first stage of processing is finding which locations in an image are most salient. This is done using the saliency program created by (Itti & Koch, 2001b), which works by looking for certain types of uniqueness in an image (Figure 2.3). This simulates the processing in visual cortex that the human brain performs in looking for locations in an image, which are most salient. For instance, a red coke can placed among green foliage would be highly salient since it contrasts red against green. In essence, each pixel in an image can be analyzed and assigned a saliency value. From this a saliency map can be created. The saliency map simply tells us the saliency of each pixel in an image. 
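Concretely, the way this map is used for feature selection (described above, and again in the next subsection) can be sketched in a few lines: the saliency map is normalized into a probability distribution and feature locations are drawn from it. This is an illustrative NumPy sketch, not the toolkit's C++ implementation, and it omits the biasing from other modules.

```python
import numpy as np

def sample_salient_locations(salmap, n_samples=600, rng=None):
    """Treat the saliency map as a probability distribution over pixels, so
    highly salient locations are drawn far more often than weak ones."""
    rng = np.random.default_rng() if rng is None else rng
    sal = np.asarray(salmap, dtype=float)
    p = np.clip(sal, 0.0, None).ravel()
    p /= p.sum()                                    # normalize to probabilities
    flat = rng.choice(p.size, size=n_samples, p=p)  # saliency-weighted draw
    rows, cols = np.unravel_index(flat, sal.shape)
    return np.stack([rows, cols], axis=1)           # (n_samples, 2) pixel coords
```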
2.2.1.2 Monte Carlo Selection
The saliency map is taken and treated as a statistical map for the purpose of Monte Carlo selection. The currently used method will extract a specified number of features from an image. Highly salient locations in an image have a much higher probability of being selected than regions of low saliency. Additionally, biases from other modules may cause certain locations to be picked over consecutive frames from a video. For instance, if the properties of a feature vector indicate it is very useful, then it makes sense to select from a proximal location in the next frame. Thus, the saliency map combines with posterior analysis to select locations in an image which are of greatest interest.

Figure 2.3: The complete VFAT tracker is a conglomeration of different modules that select features from an image, mix them into more complex features and then try to classify those features without strong meta-priors for what kind of features it should be looking for.

2.2.1.3 Mixing Modules
2.2.1.3.1 Junction and End Stop Extraction

Figure 2.4: Saliency is comprised of several channels which process an image at a variety of different scales and then combine those results into a saliency map.

During the computation of visual saliency, orientation filtered maps are created. These are the responses of the image to Gabor wavelet filters, and they indicate edges in the image. Since each filter is tuned to a single preferred orientation, a response from a filter indicates an edge that is pointed in the direction of preference. The responses from the filters are stored in individual feature maps. One can think of a feature map as simply an image which is brightest where the filter produces its highest response. Since the feature maps are computed as part of the saliency code, re-using them can be advantageous from an efficiency standpoint. From this we create feature maps to find visual junctions and end-stops in an image by mixing the orientation maps (Figure 2.4). We believe such new complex feature maps can also tell us about the texture at image locations, which can help give us the gist of objects to be tracked.

The junction and end stop maps are computed as follows. Note that this is a different computation than the one used in appendix D and chapter 5 in the attention gate model. At some common point i,j on the orientation maps P, the filter responses from the orientation filters are combined. Here the response to an orientation in one orientation map $p_{ij}$ is subtracted from an orthogonal map's orientation filter output $p_{ij}^{orth}$ and divided by a normalizer n, which is the max value for the numerator. For instance, one orientation map that is selective for 0 degree angles is subtracted from another map selective for 90 degree angles. This yields the lineyness of a location in an image, because where orthogonal maps overlap in their response is at the junctions of lines.

(2.1) $a_{ij}^{k} = \dfrac{p_{ij} - p_{ij}^{orth}}{n}; \quad k \in \{1,2\}$

We then compute a term (2.2) which is the orthogonal filter responses summed. This is nothing more than the sum of the responses in two orthogonal orientation maps.

(2.2) $b_{ij}^{k} = \dfrac{p_{ij} + p_{ij}^{orth}}{n}; \quad k \in \{1,2\}$

Figure 2.5: The three images on the right are the results of the complex junction channel after ICA/PCA processing from the original image on the left. As can be seen, it does a reasonable job of finding both junctions and end stops.

The individual line maps are combined as:

(2.3) $\alpha_{ij} = \dfrac{a_{ij}^{1} + a_{ij}^{2}}{n}$

This gives the total lineyness for all orientations.
We then do a similar thing for our total response maps:

(2.4) $\beta_{ij} = \dfrac{b_{ij}^{1} - b_{ij}^{2}}{n}$

The final junction map γ is then computed by subtracting the lineyness term from the total output of the orientation filters:

(2.5) $\gamma_{ij} = \alpha_{ij} - \beta_{ij}$

Since the junction map is computed by adding and subtracting orientation maps which have already been computed during the saliency computation phase, we gain efficiency we would not have had if we were forced to convolve a whole new map with a kernel filter. Thus, this junction filter is fairly efficient since it does not require any further convolution to compute. Figure 2.5 shows the output, and it can be seen that it is effective at finding junctions and end-stops.

2.2.1.3.2 ICA/PCA
We decrease the dimensionality of each feature vector by using a combination of Independent Component Analysis (ICA) (Bell & Sejnowski, 1995) and Principal Component Analysis (PCA) (Jollife, 1986). This is done using FastICA (Hyvärinen, 1999) to create ICA un-mixing matrices offline. The procedure for training this is to extract a large number of features from a large number of random images. We generally use one to two hundred images and 300 points from each image using the Monte Carlo selection process just described. FastICA first determines the PCA reduction matrix and then determines the matrix that maximizes the mutual information using ICA. Un-mixing matrices are computed for each type of feature across scales. So as an example, the red-green opponent channel is computed at different scales, usually six. PCA/ICA will produce a reduced set of two opponent maps from the six original scale maps (this is described in detail later and can be seen in figure 2.7). Using ICA with PCA helps to ensure that we not only reduce the dimension of our data set, but that the information sets are fairly unique. From the current data, we reduce the total number of dimensions over all channels from 72 to 14, which is a substantial efficiency gain, especially given the fact that some modules have complexity O(d²) for d feature channels (dimensions).

Figure 2.6: NPclassify works by (A) first taking in a set of points (feature vectors); (B) each point is then assigned a density which is the inverse of the distance to all other points; (C) points are then linked by connecting a point to the nearest point which has a higher density; (D) very long links (edges) are cut if they are, for instance, statistically longer than most other links. This creates separate classes.

2.2.1.4 Classification Modules
2.2.1.4.1 Classification of Features with NPclassify
Features are initially classified using a custom non-parametric clustering algorithm called NPclassify. (This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/NPclassify2.C/.H; a description and additional information beyond what is discussed here can be found at http://www.nerd-cam.com/cluster-results/.) The idea behind the design of NPclassify is to create a clustering mechanism which has soft parameters that are learned and are used to classify features. We define here soft parameters as values which define the shape of a meta-prior. This might be thought of as being analogous to a learning rate parameter or a Bayesian hyperparameter. For instance, if we wanted to determine at which point to cut off a dataset and decided on two standard deviations from the mean, two standard deviations would be a soft parameter, since the actual cut-off distance depends on the dataset.
NPclassify (Figure 2.2, 2.6 and 2.7) (Mundhenk et al., 2004b, Mundhenk et al., 2005a) works by using a kernel to find the density at every sample point. The currently used kernel does this by computing the inverse of the sum of the Euclidian distances from each point to all other points. After density has been computed, the sample points are linked together. This is done by linking each point to the closest point which has a higher density. This creates a path of edges which ascends acyclically along the points to the point in the data set which has the highest density of all. Classes are created by figuring out which links need to be cut. For instance, if a link between two sample points is much longer than most links, it suggests a leap from one statistical mode to another. This then may be a good place to cut and create two separate classes. Additionally, classes should be separated based upon the number of members the new class will have. After classes have been created, they can be further separated by using interclass statistics. The advantage to using NPclassify is that we are not required to have a prior number of classes or any prior information about the spatial or sample sizes of each class.

Figure 2.7: On the left are samples of feature points with the class boundaries NPclassify has discovered. Some of the classes have large amounts of noise while others are cramped together rather than being separated by distance. On the right are the links NPclassify drew in order to create the clusters. Red links are ones which are too long and were clipped by the algorithm to create new classes.

Instead, the modal distribution of the dataset combined with learned notions of feature connectedness determines whether a class should be created. So long as there is some general statistical homogeneity between training and testing datasets, we should expect good performance for clustering based classification. The training results are discussed later in the section on training results.

Figure 2.8: The results using NPclassify are shown next to the same results for k-means on some sham data. The derived clusters are shown with the Gaussian eigenmatrix bars (derived using the eigenmatrix estimation in section 2.2.1.4.2). In general, NPclassify creates more reliable clusters, particularly in the presence of noise. Additionally, it does so without needing to know a priori how many classes one has.

As such, we do have a few meta-priors still present. The first is a basic kernel parameter for density. In this case, the Euclidian distance factor makes few assumptions about the distribution other than that related features should clump together. The second meta-prior is learned as a hyperparameter for a good cutoff. This can be derived using practically any gradient optimization technique. So it is notable that NPclassify is not without some type of prior, but the assumptions on the data are quite relaxed and only require that related feature samples be close to each other in feature space. An example of NPclassify working on somewhat arbitrary data points can be seen in figure 2.8.
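A minimal sketch of the clustering idea just described may be useful. It is not the toolkit's NPclassify implementation; in particular, the simple mean-plus-k-standard-deviations cutoff below stands in for the learned soft parameters discussed above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def npclassify_sketch(points, cutoff_sigma=2.0):
    """Cluster samples without specifying the number of classes, in the
    spirit described above: density, links to denser neighbors, cut long links."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = cdist(points, points)
    density = 1.0 / (d.sum(axis=1) + 1e-12)   # inverse of summed distances

    # Link every sample to the nearest sample that has a higher density.
    parent = np.arange(n)
    link_len = np.full(n, -1.0)               # -1 marks the densest (root) samples
    for i in range(n):
        higher = np.where(density > density[i])[0]
        if higher.size:
            j = higher[np.argmin(d[i, higher])]
            parent[i] = j
            link_len[i] = d[i, j]

    # Cut links that are statistically long (a stand-in for the learned cutoff).
    real = link_len[link_len >= 0]
    cutoff = real.mean() + cutoff_sigma * real.std()
    too_long = link_len > cutoff
    parent[too_long] = np.where(too_long)[0]

    # Each sample's class is the root it reaches by following surviving links.
    labels = np.empty(n, dtype=int)
    for i in range(n):
        j = i
        while parent[j] != j:
            j = parent[j]
        labels[i] = j
    return labels
```

Because links always point toward higher density, the link structure is acyclic, and cutting a link simply splits off a new class rooted at a local density peak.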
2.2.1.4.2 Gaussian Generalization and Approximation
In order to store classes for future processing it is important to generalize them. Gaussian ellipsoids are used since their memory usage for any class is O(d²) for d dimensions. Since d is fairly low for us, this is an acceptable complexity. Additionally, by using Gaussians we gain the power of Bayesian inference when trying to match feature classes to each other. However, the downside is that computing the eigenmatrix necessary for Gaussian fitting scales minimally as d³ for dimensions and s² for the number of samples; that is, it is O(d³ + s²). This is due to the fact that computing such elements using the pseudo-inverse method (or QR decomposition) involves matrix inversion and multiplication. In order to avoid such large complexity we have implemented an approximation technique that scales minimally as d² for dimensions and s for the number of samples, i.e. O(sd²). (This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/covEstimate.C/.H.) This means that a net savings happens if the number of samples is much larger than the number of dimensions. So for instance, if there are more than 100 samples and only 10 dimensions, this will produce a savings over traditional methods.

Figure 2.9: After NPclassify has grouped feature samples together they can be fit with Gaussian distributions. This helps to determine the probability that some new feature vector belongs to a given class, or that two classes computed in consecutive frames using NPclassify are probably the same class. If the distributions overlap greatly, as in the left figure, then the two classes are probably the same class.

The approximation method works by using orthogonal rotations to center and remove covariance from the data. By recording the process, we can then compute the probability of data points by translating and transforming them in the same way to align with the data set. What we want to be able to do is to tell the probability of data points belonging to some class, as well as being able to tell if two classes derived in consecutive frames are probably the same class (see figure 2.9).

The first step is to center the data about the origin. This is done by computing the mean and then subtracting that number from each feature vector. Next we compute approximate eigenvectors by trying to find the average vector from the origin to all feature vector coordinates. So for the k-th feature vector, we first compute the ratio between its distances l from the origin along dimensions j and i. This yields the ratio $r_{ijk}$. That is, after aligning the feature vector with the origin, we take the ratio of two features in the same vector (we will do this for all possible feature pairs in the vector).

(2.6) $r_{ijk} = \dfrac{l_{jk}}{l_{ik}}$

Next we find the Euclidian distance $u_{ijk}$ from the origin along dimensions j and i.

(2.7) $u_{ijk} = \sqrt{l_{ik}^{2} - l_{jk}^{2}}$

By summing the ratio of $r_{ijk}$ and $u_{ijk}$ over all k feature vectors, we obtain a mean ratio that describes the approximated eigenvector along the dimensions i and j.

(2.8) $m_{ij} = \sum_{k=0}^{k} \dfrac{r_{ijk}}{u_{ijk}}$

A normalizer is computed as the sum of all the distances for all samples k.

(2.9) $n_{ij} = \sum_{k=0}^{k} u_{ijk}$

Next we determine the actual angle of the approximated eigenvector along the dimensions i and j.

(2.10) $\theta_{ij} = \tan^{-1}\!\left(\dfrac{m_{ij}}{n_{ij}}\right)$

Once we have that, we can rotate the data set along that dimension and measure the length of the ellipsoid using a basic sum of squares operation. Thus, we compute $\rho_{ik}$ and $\rho_{jk}$, which is the data set rotated by $\theta_{ij}$. Here ξ is the position of the k-th feature vector along the i dimension and ψ is its position along the j dimension. What we are doing here is rotating covariance out along each dimension so that we can measure the length of the eigenvalue.
Thus, we iterate over all data points k, along all dimensions i, and along the i+1 dimensions j, summing up σ as we go. We only sum j for i+1 since we only need to use one triangle of the eigenvector matrix, as it is symmetric along the diagonal.

(2.11) $i + 1 \leq j$

(2.12) $\rho_{ik} = \xi \cdot \cos(\theta_{ij}) + \psi \cdot \sin(\theta_{ij})$

(2.13) $\rho_{jk} = -\xi \cdot \sin(\theta_{ij}) + \psi \cdot \cos(\theta_{ij})$

What we have done is figure out how much we need to rotate the set of feature vectors in order to align the least squares slope with the axis. Once this is done, we can rotate the data set and remove covariance. Since the mean is zero, because we translated the data set by the mean to the origin, the variance for the sum of squares is computed simply as:

(2.14) $s_{iij} = \dfrac{\sum_{k=0}^{k} \rho_{ik}^{2}}{n}$

(2.15) $s_{jji} = \dfrac{\sum_{k=0}^{k} \rho_{jk}^{2}}{n}$

Each sum of squares is used to find the eigenvalue estimate by computing Euclidian distances. That is, by determining the travel distance of each eigenvector during rotation and combining that number with the computed sum of squares, we can determine an estimate of the eigenvalue from triangulation. The conditional here is used because $\sigma_{ii}$ is computed more than once with different values for $\theta_{ij}$. Thus, $\sigma_{ii}$ is the sum of all the products of $\theta_{ij}$ and $s_{iij}$.

(2.16) $\sigma_{ii} = \begin{cases} s_{iij} & \text{iff } \sigma_{ii} = 0 \\ \sigma_{ii} + \left(s_{iij} \cdot \cos(\theta_{ij}) - \sigma_{ii}\right)^{2} & \text{otherwise} \end{cases}$

(2.17) $\sigma_{jj} = \begin{cases} s_{jji} & \text{iff } \sigma_{jj} = 0 \\ \sigma_{jj} + \left(s_{jji} \cdot \cos(\theta_{ij}) - \sigma_{jj}\right)^{2} & \text{otherwise} \end{cases}$

The end result is a non-standard eigenmatrix which can be used to compute the probability that a point lies in a Gaussian region. We do this by performing the same procedure on any new feature vector. That is, we take any new feature vector and replay the computed translation and rotations to align it with the covariance-neutral eigenmatrix approximation. The probability for the feature vector is then computed independently along each dimension, thus eliminating further matrix multiplication during the probability computation. To summarize, by translating and rotating the feature set, we have removed covariance so we can compute probabilities assuming dimensions do not interact. In essence, this removes the need for complex matrix operations. While the complexity is high, it is one order lower than the standard matrix operations, as was mentioned earlier. Examples of fits created using this method can be seen in figure 2.7, where NPclassify has created classes and the eigenmatrix is estimated for the ones created.

2.2.1.4.3 Feature Contiguity, Biasing and Memory
Once features have been classified we want to use them to perform various tasks. These include target tracking, target identification and feature biasing. Thus, from a measurement of features at time t, we would like to know if a collection of features at time t+1 is the same, and as such either the same object or a member of the same object. By using Bayesian methods we can link classes of features in one frame of a video to classes in the next frame by tying a class to another which is its closest probabilistic match. Additionally, we use the probability to bias how the non-parametric classifier and saliency work over consecutive frames. For NPclassify we add a sink into the density computation. That is, we create a single point whose location is the mean of a class, with the mass of the entire class. Think of this as dropping a small black hole into a galaxy that represents the mass of the other class. By inserting this into the NPclassify computation, we skew the density computation towards the prior statistics from the last iteration.
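A small sketch of this sink idea, in the style of the clustering sketch above: the previous frame's class means are injected as mass-weighted points into the density computation. The function name and the exact weighting are assumptions for illustration, not the toolkit's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sink_biased_density(points, prior_means, prior_masses):
    """Density of this frame's samples, pulled toward last frame's classes."""
    points = np.asarray(points, dtype=float)
    # Ordinary part: summed distance from each sample to all other samples.
    d_pts = cdist(points, points).sum(axis=1)
    # Sink part: each previous-frame class contributes one point at its mean,
    # weighted by the mass (sample count) of the whole class.
    sinks = np.atleast_2d(np.asarray(prior_means, dtype=float))
    masses = np.asarray(prior_masses, dtype=float)
    d_sinks = (cdist(points, sinks) * masses).sum(axis=1)
    # Samples near last frame's class means keep a relatively high density,
    # which biases this frame's clusters toward the previous statistics.
    return 1.0 / (d_pts + d_sinks + 1e-12)
```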
This creates a Kalman filter like effect that smoothes the computation of classes between frames. This is a reasonable action since the change in features from one frame to the next should be somewhat negligible. 2.2.1.5 Complex Feature Tracker Methods and Results 2.2.1.5.1 Complexity and Speed One of the primary goals of VFAT is that it should be able to run in real time. This means that each module should run for no more than about 30 ms. Since we are using a Beowulf cluster, we can chain together modules such that even if we have several 50 steps that take 30 ms each, by running them on different machines we can create a vision pipeline whereby a module finishes a job and hands the results to another machine in a Beowulf cluster that is running the next process step. In time trials the modules run within real time speeds. Using a Pentium 4 2.4 GHz Mobile Processor with 1 GB of RAM, each module of VFAT runs at or less than 30 ms. The longest running module is the NPclassify feature classifier. If given only 300 features it runs in 23 ms, for 600 features it tends to take as long as 45 ms. On a newer system it should be expected to run much faster. 2.2.1.5.2 Training for Classification Table 2.1: Following PCA the amount of variance accounted for was computed for each type of feature channel. Each channel started with six scales (dimensions). For many channels, 90% of variance is accounted for after a reduction to two dimensions. For all others, no more than three dimensions are needed to account for 90% of variance. Two modules in VFAT need to be trained prior to usage. These include ICA/PCA and NPclassify. Training for both has been designed to be as simple as possible in order to maintain the ease of use goal of the iRoom project. Additionally and fortunately, training of both modules is relatively quick with ICA/PCA taking less than a minute using the FastICA algorithm under Matlab and NPclassify taking around two hours using 51 gradient descent training. Since we only need to ever train once, this is not a prohibitive amount of time. 2.2.1.5.3 Training ICA/PCA Figure 2.10: The various conspicuity maps of the feature channels from the saliency model are shown here ICA/PCA reduced. Training was completed by using 145 randomly selected natural images from a wide range of different image topics. Images were obtained as part of generic public domain CD-ROM photo packages, which had the images sorted by topic. This enabled us to ensure that the range of natural images used in training had a high enough variety to prevent bias towards one type of scene or another. For each image, 300 features were extracted using the Monte Carlo / Visual saliency method described earlier. In all this 52 gave us 43,500 features to train ICA/PCA on. The results are shown on table 2.1. For most channels, a reduction to two channels from six still allowed for over 90% of variance to be accounted for. However, directional channels that measure direction of motion and orientation of lines in an image needed three dimensions to still account for more than 90% of all variance. Assuming that the data is relatively linear and a good candidate for PCA reduction, this suggests that we can effectively reduce the number of dimensions to less than half while still retaining most of the information obtained from feature vectors. Visual inspection of ICA/PCA results seems to show the kind of output one would expect (Figure 2.10 and 2.11). 
For instance, when two channels are created from six, they are partially a negation to each other. On the red/green channel, one of the outputs seems to show a preference for red. However, the other channel does not necessarily show an anti-preference for red. This may suggest that preferences for colors may also depend on the scales of the images. That is, since what makes the six input images to each channel different is the scale at which they are processed, scale is the most likely other form of information processed by ICA/PCA. This might mean for instance that the two channels of mutual information contain information about scaling. We might guess that of the two outputs from the red/green channel, one might be a measure of small red and the other of large green things. If this is the case it makes sense since in nature, red objects tend to be small (berries, nasty animals, etc.) while green things tend to be much more encompassing (trees, meadows, ponds). 53 Figure 2.11: From the original image we see the results of ICA/PCA on the red/green and blue/yellow channels. As can be seen some parts of the outputs are negations of each other which makes sense since ICA maximizes mutual information. However, close examination shows they are not negatives. It is possible that scale information applies as a second input type and prevents obvious negation. 2.2.1.5.4 Training NPclassify To hone the clustering method we use basic gradient decent with sequential quadratic programming using the method described by (Powell, 1978). This was done offline using the Matlab Optimization Toolbox. For this study, error was defined as the number of classes found versus how many it was expected to find (see Figure 2.12). Thus, we presented the clustering algorithm with 80 natural training images. Each image 54 Figure 2.12: In this image there are basically three objects. NPclassify has found two (colors represent the class of the location). This is used as the error to train it. So for 80 images it should find x number of objects. The closer it gets to this number, the better. Notice that the points are clustered in certain places. This is due to the saliency/Monte Carlo method used for feature selection. had a certain number of objects in it. For instance an image with a ball and a wheel in it would be said to have two objects. The clustering algorithm would state how many classes it thought it found. If it found three classes in an image with two objects then the error was one. The error was computed as average error from the training images. The training program was allowed to adjust any of several hard or soft parameters for NPclassify during the optimization. The training data was comprised of eight base objects of varying complexity such as balls and a wheel on the simple side or a mini tripod and web cam on the more 55 complex side. Objects were placed on a plain white table in different configurations. Images contained different numbers of objects as well. For instance some images contained only one object at a time, while other contained all eight. A separate set of validation images was also created. These consisted of a set of eight different objects with a different lighting created by altering the f-stop on the camera. Thus, the training images were taken with an f-stop of 60 while the 83 validation images were taken with an f-stop of 30. Additionally, the angle and distance of view point is not the same between the training and validation sets. 
The validation images were not used until after optimal parameters were obtained by the training images. Then the exact same parameters were used for the validation phase. Our first test was to examine if we could at the very least segment images such that the program could tell which objects were different from each other. For this test spatial interaction was taken into account. We did this by adding in spatial coordinates as two more features in the feature vectors with the new set of 14 ICA/PCA reduced feature vectors. The sum total of spatial features were weighted about the same as the sum total of non-spatial features. As such, the membership of an object in one segmented class or the other was based half by its location in space and half by its base feature vector composition. Reliability was measured by counting the number of times objects were classified as single objects, the number of times separate objects were merged as one object and the number of time a single object was split into two unique objects. Additionally, there was a fourth category for when objects were split into more than three objects. This was small and contained only four instances. 56 The results were generally promising in that based upon simple feature vectors alone, the program was able to segment objects correctly with no splits or merges in 125 out of the 223 objects it attempted to segment. In 40 instances an object was split into two objects. Additionally 54 objects were merged as one object. While on the surface these numbers might seem discouraging there are several important factors to take into account. The first is that the program was segmenting based solely on simple features vectors with a spatial cue. As such it could frequently merge one shiny black object into another shiny black object. In 62 % of the cases of a merger, it was obvious that the merged objects were very similar with respect to features. 2.2.1.5.5 NPclassify v. K-Means NPclassify was also tested on its general ability to classify feature clusters. In this case it was compared with K-means. However, since K-means requires the number of classes to be specified a priori, this was provided to it. So in essence, the K-means experiment had the advantage of knowing how many classes it would need to group, while NPclassify did not. The basic comparison test was similar to the test presented in the previous section. In this case, several Gaussian like clusters were created of arbitrary 2 dimensional features. They had between 1 and 10 classes in each data set. 50 of the sets were clean with no noise such that all feature vectors belonged explicitly to a ground truth class. However, in 50 other sets, small amounts of random noise were added. The comparison metric for K-means and NPclassify was how often classes were either split or merged 57 when they should have not been. The mean error for both conditions is shown below in figure 2.13. It should be noted that while K-means may be sensitive to noise in data, it is used here since it is well known and can serve as a good base line for any clustering algorithm. Figure 2.13: NPclassify is compared with K-Means for several data sets. The error in classification for different sets is the same if there is little noise in the data. However, after injecting some noise, NPclassify performs superior. The general conclusion is that compared with K-means, NPclassify is superior particularly when there is noise in the data. 
This is not particularly surprising since as a spanning tree style algorithm, NPclassify can ignore non proximal data points much more easily. That is, K-means is forced to weigh in all data points and really has no innate ability to determine that an outlying data point should be thrown away. However, NPclassify will detect the jump in distance to an outlier or noise point from the central density of the real class. 58 2.2.1.5.6 Contiguity Figure 2.14: Tracking from frame 299 to frame 300 the shirt on the man is tracked along with the head without prior knowledge of what is to be tracked. It should be noted that that while the dots are drawn in during simulation, the ellipses are drawn in by hand for help in illustration in gray scale printing. Contiguity has been tested but not fully analyzed (Figure 2.14). Tracking in video uses parameters for NPclassify obtained in section 2.2.1.5.4. Thus, the understanding of how to track over consecutive frames is based on the computers subjective understanding of good continuity for features. In general, classes of features can be tracked for 15 to 30 frames before the program loses track of the object. This is not an impressive result in and of itself. However, several factors should be noted. First is that each object that VFAT is tracking is done so without priors for what the features of each should be. Thus, the program is tracking an object without having been told to either track that object or what the object its tracking should be like. The tracking is free form and in general without feature based priors. The major limiter for the contiguity of tracking is that an object may lose saliency as a scene evolves. As such an object if it becomes too low in saliency will have far fewer features selected for processing from it, which destroys the track of an object with the current feature qualities. However, as will be noted in the next 59 section, this is not a problem since this tracker is used to hand off trackable objects to a simple tracker which fixates much better on objects to be tracked. 2.3 The Simple Feature Based Tracker8 Figure 2.15: The Simple tracker works by taking in initial channel values such as ideal colors. These are used to threshold an image and segment it into many candidate blobs. This is done by connecting pixels along scan lines that are within the color threshold. The scan lines are then linked which completes a contiguous object into a blob. The blobs can be weeded out if they are for instance too small or too far from where the target last appeared. Remaining blobs can then be merged back and analyzed. Finding the center of mass of the left over blobs gives us the target location. By finding the average color values in the blob, we can define a new adapted color for the next image frame. Thus, the threshold color values can move with the object. Once a signature is extracted using the complex tracker described in the previous section, it can be feed to a faster and simpler tracking device. We use a multi channel 8 For more information see also: http://ilab.usc.edu/wiki/index.php/Color_Tracker_How_To 60 tracker, which uses color thresholding to find candidate pixels and then links them together. This allows it to not only color threshold an image, but to segregate blobs and analyze them separately. So for instance, if it is tracking a yellow target, if another yellow target appears, it can distinguish between the two. 
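A minimal sketch of this front end of the simple tracker: threshold every channel around the tracked values and link the surviving pixels into candidate blobs. Connected-component labeling is used here in place of the scan-line linking described above, and all parameter names are assumptions; the adaptation of the thresholds themselves is described next.

```python
import numpy as np
from scipy import ndimage

def candidate_blobs(channels, mean, std, alpha=2.0, min_pixels=50):
    """Illustrative sketch: keep pixels that pass the threshold test on every
    channel, then group them into contiguous candidate blobs."""
    mask = np.ones(channels[0].shape, dtype=bool)
    for ch, m, s in zip(channels, mean, std):
        mask &= (ch >= m - alpha * s) & (ch <= m + alpha * s)
    labels, n_blobs = ndimage.label(mask)      # connected components as blobs
    blobs = []
    for b in range(1, n_blobs + 1):
        ys, xs = np.where(labels == b)
        if ys.size >= min_pixels:              # weed out blobs that are too small
            blobs.append({"centroid": (ys.mean(), xs.mean()), "n_pixels": ys.size})
    return blobs
```

Further weeding, such as rejecting blobs that are too far from the target's last position, can be applied to the returned list before the surviving blobs are merged and analyzed.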
Additionally, the tracker computes color adaptation, as well as adaptation over any other channel it is analyzing. We compute, for instance, a new average channel value $\bar{c}$ (2.18) as the sum of all pixel values $c_{p}$ in this channel over all N pixels p in tracked 'OK' blobs (as seen in figure 2.15), from the current frame t back to some past frame t′. In basic terms, this is just the average channel value for all the trackable pixels in several consecutive past frames. Additionally we compute σ, which is just the basic standard deviation over the same pixels.

(2.18) $\bar{c} = \dfrac{\sum_{i=t'}^{t} \sum_{p=0}^{N_{i}} c_{ip}}{\sum_{i=t'}^{t} N_{i}} \quad \text{and} \quad \sigma = \sqrt{\dfrac{\sum_{i=t'}^{t} \sum_{p=0}^{N_{i}} \left(c_{ip} - \bar{c}\right)^{2}}{\sum_{i=t'}^{t} N_{i} - 1}}$

Currently, we set a new pixel as being a candidate for tracking if, for all channels with pixel value $c_{p}$:

(2.19) $\bar{c} - \alpha \cdot \sigma \leq c_{p} \leq \bar{c} + \alpha \cdot \sigma$

Thus, a pixel is thresholded and selected as a candidate if, on each channel, it falls within the boundary given by the mean value computed from eq. (2.18) plus or minus the product of the standard deviation and a constant α. Forgetting is accomplished in the adaptation by simply windowing the sampling interval.

This method allows the tracker to track a target even if its color changes due to changes in lighting. It should be noted that the simple tracker can track other features in addition to color, so long as one can create a channel for it. That is, an RGB image can be separated into three channels, which are each gray-scale images. In this case, we create one for red, one for green and one for blue. We can also create images that are, for instance, the responses of edge orientation filters or motion filters. These can be added as extra channels in the simple tracker in the same manner. However, to preserve luminance invariance we use the H2SV color scheme described in appendix G. This is an augmentation of HSV color space that solves for the singularity at red by converting hue into Cartesian coordinates.

In addition to the basic vision functional components of the simple tracker, its code design is also important. The tracker is object oriented, which makes it easy to create multiple independent instances of the simple tracker. That is, we can easily run several simple trackers on the same computer, each tracking different objects from the same video feed. The computational work for each tracker is fairly low, and four independent trackers can simultaneously process 30 frames per second on an AMD Athlon 2000 XP processor based machine. This makes it ideal for the task of tracking multiple targets at the same time.

2.4 Linking the Simple and Complex Tracker
In order for the simple tracker and the complex tracker to work together, they have to be able to share information about a target. As such, the complex tracker must be able to extract information about objects that is useful to the simple tracker (Figure 2.16). Additionally, linking the simple tracker with the complex tracker creates an interesting problem with resource allocation. This is because each simple tracker we instantiate tracks one target at a time, while the complex tracker has no such limit. A limited number of simple trackers can be created, and there must be some way to manage how they are allocated to a task based on information from the complex tracker.

Figure 2.16: The simple and complex trackers are linked by using the complex tracker to notice and classify features. The complex tracker then places information about the feature classes into object feature class signatures.
The complex tracker uses these signatures to keep track of objects over several frames or to bias the way in which it classifies objects. The signatures are also handed to simple trackers, which track the objects with greater proficiency. Here we see two balls have been noticed and signatures have been extracted and used to assign each ball to its own tracker. The smaller target boxes on the floor show that the simple tracker was handed an object (the floor), which it does not like and is not tracking. Thus, the simple tracker has its own discriminability as was mentioned in section 2.3 and figure 2.15. We address the first problem by making sure both trackers work with similar feature sets. So for example, the complex tracker when it runs will examine the H2SV color of all the classes it creates. It then computes the mean color values for each class. This mean color value along with the standard deviation of the color can then be handed to the simple tracker, which uses it as the statistical prior color information for the object it should track. 63 Figure 2.17: This is a screen grab from a run of the combined tracker. The lower left two images show the complex tracker noticing objects, classifying and tracking them. The signature is handed to the simple tracker, which is doing the active tracking in the upper left window. The combined tracker notices the man entering the room and tracks him without a priori knowledge of how he or the room looks. Once he walks off the right side, the tracker registers a loss of track and stops tracking. The bars on the right side show the adapted actively tracked colors from the simple tracker in H2SV color. The lower right image shows that many blobs can fit the color thresholds in the simple tracker, but most are selected out for reasons such as expected size, shape and position. The second issue of resource allocation is addressed less easily. However, there are simple rules for keeping resource allocation under control. First, don’t assign a simple tracker to track an object that overlaps with a target another simple tracker is tracking in the same camera image. Thus, don’t waste resources by tracking the same target with two or more trackers. Additionally, since the trackers are adaptive we can find that two trackers were assigned to the same target, but we didn’t know this earlier. For instance, if 64 accidentally one simple tracker is set to track the bottom of a ball and one the top of the ball, after a few iterations of adaptation, both trackers will envelop the whole ball. It is thus advantageous to check for overlap later. If we find this happening, we can dismiss one of the simple trackers as redundant. Additionally, our finite resources mean we do not assign every unique class from the complex tracker to a simple tracker. Instead, we try and quantify how interesting a target is. For instance, potential targets for the simple tracker may be more interesting if they are moving, have a reasonable mass or have been tracked by the complex tracker for a long enough period of time. 2.5 Results On the test videos used, the system described seems to work very well. A video of a man entering and leaving a room (Figure 2.17) was shown five times to the combined complex and simple tracker. In each run, the man was noticed within a few frames of entering the cameras view. This was done without prior knowledge of how the target should appear and without prior knowledge of the room’s appearance. 
2.5 Results

On the test videos used, the system described seems to work very well. A video of a man entering and leaving a room (Figure 2.17) was shown five times to the combined complex and simple tracker. In each run, the man was noticed within a few frames of entering the camera's view. This was done without prior knowledge of how the target should appear and without prior knowledge of the room's appearance. The features were extracted and a simple tracker was automatically assigned to track the man, which it did until he left the room, at which point the simple tracker registered a loss of track. Interestingly, the tracker extracted a uniform color over both the man's shirt and his skin. It was thus able, on several occasions, to track the man as both his shirt and his skin: even though the shirt was burgundy and the skin reddish, the combined tracker found a statistical distribution in H2SV color that encompassed both while remaining distinct from the colors of the rest of the room.

The tracker was also tested on a video in which a blue ball and a yellow ball swing on tethers in front of the camera. In five out of five runs, both balls are noticed and their features extracted. Each ball is tracked as a separate entity, with the program assigning each its own simple tracker. Each ball is tracked until it leaves the frame, at which point its simple tracker registers a loss of track. The balls even bounce against each other, which shows that the tracker can readily discriminate between objects even when they are touching or overlapping.

In both videos, objects are tracked without the program knowing the features of the objects a priori. Instead, saliency is used to notice possible targets, and the complex tracker is used to classify them into classes. Target properties are then handed to the simple trackers as automatically generated prior information about the targets to be tracked. Additionally, the simple tracker registers a loss of track when a target leaves the field of view, so we can notice not only when a new target enters the scene but also when it leaves.

The tracking was also aided by the use of H2SV color. Prior to using the H2SV color scheme, the purple shirt the man is wearing was split into two objects, since the color of many of its pixels bordered on, and even crossed into, the red part of the hue spectrum; standard HSV therefore created a bi-modal distribution for hue. The use of H2SV allowed us to track the purple shirt as well as objects that are reddish in hue, such as skin. H2SV color also works for tracking objects in the center of the spectrum, which we verified by tracking objects that are green, yellow and blue.

In addition to tracking with a static camera, the same experiment was run with a moving camera. This is a much harder problem, since the common method of eigen background subtraction cannot be used to separate new things in a scene from the original scene. Again the tracker was able to track a human target without prior knowledge of his features, even as the camera moved. This is a distinct advantage of our tracker and illustrates the benefit of using saliency to extract and bind features, since saliency can compensate for global motion.
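Since H2SV color plays a large role in these results, the idea behind it deserves a brief illustration. Hue in HSV is an angle, so reddish hues just below 360 degrees and just above 0 degrees are numerically far apart even though they are perceptually adjacent, which is what split the shirt into a bi-modal hue distribution. Representing hue as two Cartesian components removes that wrap-around. The sketch below shows only this general idea with assumed names (HueXY, hueToCartesian); the exact H2SV1 and H2SV2 transforms used by the tracker are the ones given in appendix G.

#include <cmath>

// Two Cartesian hue components, so that hues near 0 and near 360 degrees map
// to nearby points instead of to opposite ends of the hue axis.
struct HueXY { double h1; double h2; };

inline HueXY hueToCartesian(double hueDegrees) {
  const double kPi = 3.14159265358979323846;
  const double rad = hueDegrees * kPi / 180.0;
  return { std::cos(rad), std::sin(rad) };
}

// Recover the hue angle in [0, 360) degrees from the two components.
inline double cartesianToHue(const HueXY& h) {
  const double kPi = 3.14159265358979323846;
  double deg = std::atan2(h.h2, h.h1) * 180.0 / kPi;
  return deg < 0.0 ? deg + 360.0 : deg;
}

With this representation, the adapted mean and standard deviation of eq. (2.18) can be computed on h1 and h2 like any other channel, so a cluster of reddish and purplish pixels stays unimodal instead of splitting in two as it does in plain HSV.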
2.6 Discussion

2.6.1 Noticing

The most notable and important aspect of the current work is that we are able to track objects or people without knowing a priori what they will look like, and we are able to do so quickly enough for real-time applications. Thus, we can notice, classify and track a target fairly quickly. This has useful applications in many areas, security in particular, because we track something based on how interesting it is rather than on a complete prior understanding of its features. Potentially, we can then track any object or person even if they change their appearance. Additionally, since we extract a signature that describes a target in a viewpoint-invariant way, this information can be used to share target information with other agents.

2.6.2 Mixed Experts

We also believe we are demonstrating a better paradigm for the construction of intelligent agents, one that uses a variety of experts to accomplish the task. The idea is to use a variety of solutions that work on flexible, weak meta-prior information, and then to use their output as information for a program that is more biased. This is founded on the idea that there is no perfect tool for all tasks and that computer vision comprises many tasks, such as identification, tracking and noticing. To accomplish the complex task of noticing and tracking objects or people, it may be best to use many different types of solutions and have them interact. Additionally, by mixing experts in this way, no single expert needs to be perfect at its job. If the experts have some ability to monitor one another, then when one expert makes a mistake it can potentially be corrected by another. It should be noted that this tends to follow a biological approach, in which the human brain may be made up of interacting expert regions, each dependent on other expert regions in order to complete a variety of tasks.

Another important point about the mixed experts paradigm is that, while it may make intuitive sense, new difficulties arise as the system becomes more abstractly complex. For example, if one works only with support vector machines, one has the advantage of a generally well-understood mathematical framework; it is easier to understand a solution's convergence, complexity and stability when the system is relatively homogeneous. When one mixes experts, particularly experts that behave very differently, the likelihood of the system doing something unexpected or even catastrophic tends to increase. Thus, when one designs an intelligent agent with mixed experts, system complexity should be managed carefully.

2.6.3 System Limitations and Future Work

The system described has its own set of limitations. The work up to this point has concentrated on being able to notice and track objects in a scene quickly and in real time. However, its identification abilities are still somewhat limited. It does not contain a memory with which it can store and identify old targets over the long term. Such an ability is in the works, and it should be aided by the tracking system's ability to narrow the area of the image that needs to be inspected, which should increase the speed of visual recognition.

Chapter 3: Contour Integration and Visual Saliency

In the visual world there are many things which we can see, but certain features, sets of features and other image properties tend to draw our visual attention more strongly towards them. A very simple example is a stop sign, in which the red color and the angular features of an octagon combine with the strong word "stop" to create something that we would hopefully not miss if we came upon it. This propensity of some visual features to attract attention defines, in part, the phenomenon of visual saliency. Here we assert, as others have (James, 1890, Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b), that saliency is drawn from a variety of factors.
At the lowest levels, color opponents, unique orientations and luminance contrasts create the effect of visual pop-out (Treisman & Gelade, 1980, Wolfe, O'Neill & Bennett, 1998). Importantly, these studies have highlighted the role of competitive interactions in determining saliency: a single stop sign against a natural scene backdrop is usually highly salient, but the saliency of that same stop sign, and its ability to draw attention, is strongly reduced when many similar signs surround it. At the highest levels it has been proposed
Dashan Gao et al (Gao et al., 2008) have implemented an interesting variation on Itti and Koch which is to change the treatment of center surround interactions. The center surround response is termed “Discriminant” center surround because it forms a center surround response based on the strength of a linear discriminant. The more crisp the discrimination of the center of a location is when compared with its surround, the stronger a response is given at that location. However, this is a mechanism very similar to the way the model of Surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) computes spatial attention. The model by Bruce and Tsotsos (Bruce & Tsotsos, 2006) is an information maximization model. It works by taking in a series of images and forming a bases set of features. The bases set is then used to convolve an image. The response to each basis feature is competed against the basis features from all other patches. Thus, if a basis feature gives a unique response at an image location, it is considered salient. The most notable difference with this model compared with Itti and Koch is the derivation of basis features from prior images similar to Olshausen and Field (Olshausen & Field, 1996). However, the rectification using a neural network may compute competition in a way which is not sufficiently different from a WTA competition, but it may be arguably more biologically plausible. The model by Li is much more different. Li’s model (Li, 2002) is strongly model theoretic and somewhat neglects the task of image processing. However, it is claimed that it can provide saliency pre-attentively without the use of separate feature saliency maps. 22 Thus, the model should compute a singular saliency from combining features responses at the same time. This may be a more plausible method for computing saliency, but it is unclear if it functionally gains much over other models of saliency. 1.3.3 The Surprise Model There are two notable trends with saliency models. One is the emergence of information theoretic constructs and the other is the continued divergence between static saliency models and dynamic models of attention. With the recent exception of Gao (Gao et al., 2008) attention models were either static feature based models or dynamic, but primarily theoretical models (Sperling et al., 2001). The introduction of Surprise based attention (Itti & Baldi, 2005, Itti & Baldi, 2006) created for the first time a statistically sound and dynamic model of attention. In chapter 4, we will introduce surprise based attention and show that it does an excellent job of taking into account dynamic attentional effects seen in rapid serial vision experiments. This is then shown to give a good framework for a short term attention gate mechanism in chapter 5. In short, the new framework has some similarities to Bruce and Tsostos in that prior images are used to create belief about new images. However, surprise computes these beliefs online. This means that it does not need to be trained or have strong prior information about feature prevalence. Instead the sequence provides the needed information. The extensive testing and validation in chapters 4-6 also demonstrate firmly that it explains many temporal attention effects. Additionally, we postulate that we have gained further insight into the attentional window into the brain. 
23 Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency Using Vision Feature Analysis Toolkit (VFAT)4 In a prior project, we developed a multi agent system for noticing and tracking different visual targets in a room. This was known as the iRoom project. Several aspects of this system included both individual noticing and acquisition of unknown targets as well as sharing that information with other tracking agents (Mundhenk et al., 2003a, Mundhenk, Dhavale, Marmol, Calleja, Navalpakkam, Bellman, Landauer, Arbib & Itti, 2003b). This chapter is primarily concerned with a combined tracker that uses the saliency of targets to notice them. It then classifies them without strong prior knowledge (priors) of their visual feature, and passes that information about the targets to a tracker, which conversely requires prior information about features in order to track them. This combination of trackers allows us to find unknown, but interesting objects in a scene and classify them well enough to track them. Additionally, information gathered can be placed into a signature about objects being tracked and shared with other camera agents. The signature that can be passed is helpful for many reasons since it can bias other agents towards a shared target as well as help in creating task dependant tracking. 2.1.1 Vision, Tracking and Prior Information For most target acquisition and tracking purposes, prior information about the targets features is needed in order for the tracker to perform its task. For instance, a basic color tracker that tracks objects based on color needs to know a priori what the color of 4 For more information see also: http://ilab.usc.edu/wiki/index.php/VFAT_Tech_Doc 24 the target that it wishes to track is. If one is going to track a flying grape fruit, then one would set a tracker with a certain color of yellow and some threshold about which the color can vary. In general, many newer trackers use statistical information about an objects features which allows one to define seemingly more natural boundaries for what features one would expect to find on a target (Lowe, 1999, Mundhenk et al., 2004b, Mundhenk et al., 2004c, Mundhenk et al., 2005a, Siagian & Itti, 2007). However, in order to deploy such a tracker, one needs to find the features, which describe the object before tracking it. This creates two interesting problems. The first problem is that the set of training examples may be insufficient to describe the real world domain of an object. That is, the trainer leaves out examples from training data, which may hold important information about certain variants of an object. We might think for instance from our flying grapefruit tracking example, that of the fruits that fly by, oranges never do. As a result, we would unknowingly let our tracker have some leeway and track grapefruit that might even be orange in appearance. It might however turn out that we were wrong. At some point, an orange flies by and our tracker tracks it the same as a flying grapefruit. This can happen for several reasons, the first is that we had never observed an orange fly by and as such didn’t realize that indeed, they can fly by. Another reason is that the world changed. When we set up the tracker, only grapefruits could fly by. However, the thing that makes them fly, now acts on oranges, which may be an accidental change, for instance if an orange tree begins to grow in our flying grapefruit orchard. 
However, it might also be the case that someone has decided to start throwing oranges in front of our tracker. As such, the domain of trackable objects can change either accidentally or 25 intentionally. In such a case, our tracker may now erroneously tracks flying oranges as flying grapefruit. As can be seen from our first example, our tracker might fail if someone tries to fool it. Someone starts throwing oranges in front of our tracker, or perhaps they might wrap our grapefruits in a red wrapper so that our tracker thinks they are apples. If we are selling our flying grapefruits and our tracker is supposed to make sure each one makes it to a shipping crate, it would fail if someone sneaks them by as another fruit. As such, once a dishonest person learns what our tracker is looking for, it becomes much easier to fool. This is seen in the real world in security applications, such as Spam filtering, where many email security companies have to update information on what constitutes Spam on a regular bases to deal with spammers who learn simple ways around the filters. It should be expected that the same problem would go for any other security related application including a vision-based tracker. In the case of our flying grapefruit tracker, its function may not be explicitly security related, but as a device related to accounting, it is prone to tampering. What is needed then for vision based tracking is the ability to be able to define its own priors. It has been proposed that gestalt rules of continuity and motion allow visual information to be learned without necessarily needing prior information about what features individual objects possess (Von der Malsberg, 1981, Prodöhl, Würtz & von der Malsberg, 2003, Mundhenk et al., 2004b, Mundhenk & Itti, 2005). That is, the human visual system does not necessarily know what it is looking for, but it knows how to learn how to look. This itself constitutes a kind of prior information which one might consider meta-prior information. That is, information about what structure or meta-model is 26 needed to gather prior information, such as Bayesian information, is itself a type of prior information. Using meta-prior information, an artificial agent might learn on its own how to form groups that can be used to create statistical relationships and build new prior information about what it wishes to track. Thus, abstractly speaking, meta-priors are concerned with learning about how to learn. 2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors Figure 2.1: It is interesting to note how different AI solutions require different amounts of prior information in order to function. Additionally, it seems that the more prior information a solution requires the more certainty it has in its results, but the more biased it becomes towards those results. Thus, we can place solutions along a spectrum based on the prior information required. Popular solutions such as Back Propagation Neural Networks and Support Vector Machines seem to fall in the middle of the spectrum in essence making them green machines and earning them the reputation of being the 2nd best solution for every problem. We propose that meta-priors are part of a spectrum of knowledge acquisition and understanding. At one end of the spectrum, are the rigid rules of logic and induction from which decisions are drawn with great certainty, but with which unknown variables must be sparse enough to make those reasonable decisions (figure 2.1). 
In the middle we place 27 more traditional statistical methods, which either require what we will define as strong meta-priors in order to work or require Bayesian priors. We place the statistical machines in the middle, since they allow for error and random elements as part of probabilities and do not need to know everything about a target. Instead, they need to understand the variance of information and draw decisions about what should be expected. Typically, this is gifted to a statistical learner in the form of a kernel or graph. Alternatively, the meta-prior does not make an inference about knowledge itself, but instead is used to understand its construction. From this, we then state, that meta-priors can lead to Bayesian priors, which can then lead to logical inductive priors. From meta-priors we have the greatest flexibility about our understanding of the world and in general terms, the least amount of bias; whereas on the other end of the spectrum, logical inductive priors have the least flexibility, but have the greatest certainty. An ideal agent should be able to reason about its knowledge along this spectrum. If a probability becomes very strong, then it can become a logical rule. However, if a logical rule fails, then one should reason about the probability of it doing so. Additionally, new things may occur which have yet unknown statistical properties. As such, the meta-priors can be used to promote raw data into a statistical framework or to re-reason about a statistical framework, which now seems invalid. Using certain kinds of meta-prior information, many Bayesian systems are able to find groupings which can serve as prior information to other programs which are unable to do so themselves. However, most Bayesian models work from meta-priors that require a variety of strong meta-priors. For instance, the most common requirement is that the number of object or feature classes must be specified. This can be seen in expectation 28 maximization, K-means and back-propagation neural networks, which need to have a set size for how many classes exist in the space they inspect. The number of classes thus, becomes a strong and rather inflexible meta-prior for these methods. Additionally, other strong meta-priors may include space size, data distribution types and the choice of kernel. The interesting thing about meta-priors is that they can be flexible or rigid. For instance, specifying you have several classes that are fit by a Gaussian distribution is semi-flexible in that you have some leeway in the covariance of your data, but the distribution of the data should be uni-modal and have a generally elliptical shape. An example of more rigid meta-priors would be specifying a priori the number of classes you believe you will have. So for instance, going back to our grapefruit example, if you believe your data to be Gaussian, you suspect that flying grapefruit have a mean color with some variance in that color. You can make a more rigid assumption that you will only see three classes such as, flying grapefruit, oranges and apples. All of these are of course part of the design process, but as mentioned they are prone to their own special problems. Ideally, an intelligent agent that wishes to reason about the world should have the ability to reason with flexible weak meta-priors but then use those to define Bayesian like priors. Here we define weak meta-priors as having flexible parameters that can automatically adjust to different situations. 
So for instance, we might set up a computer vision system and describe for it the statistical features of grapefruit, oranges and apples. However, the system should be able to define new classes from observation either by noticing that a mass of objects (or points) seem to be able to form their own category (Rosenblatt, 1962, Dempster, Laird & Rubin, 1977, Boser, Guyon & Vapnik, 1992, Jain, 29 Murty & Flynn, 1999, Müller, Mika, Rätsch, Tsuda & Schölkopf, 2001, Mundhenk et al., 2004b, Mundhenk et al., 2005a) or through violation of expectation and surprise (Itti & Baldi, 2005, Itti & Baldi, 2006). An example of the first is that if we cluster data points that describe objects, and if a new object appears such as a kiwi, a new constellation of points will emerge. An example of the second is that if we expect an apple to fly by, but see an orange, it suggests something interesting is going on. It might be that new fruit have entered our domain. In the first case, our learning is inductive, while in the second case it is more deductive. We thus define weak meta-priors to be situationally independent. That is, the meta-prior information can vary depending on the situation and the data. Ideally, information within the data itself is what drives this flexibility. So for instance, when selecting what is the most salient object in a scene, we might select a yellow ball. However, a moving gray ball may be more salient if presented at the same time as the yellow ball. Thus, the selection feature for what is most salient is not constantly a color, but can also be motion. So it is the interplay of these features, which can promote the saliency of one object over the other (Treisman & Gelade, 1980). Yet another example is that the number of classes is not defined a priori as a strong meta-prior, but instead, variance between features causes them to coalesce into classes. So as an abstract example, the number of planets in a solar system is not pre-determined. Instead, the interplay of physical forces between matter will eventually build a certain number of planets. Thus, the physical forces of nature are abstractly a weak meta-prior for what kind of planets will emerge, and how many will be formed. 30 2.1.4 The iRoom and Meta-prior Information Here we now review a vision system for following and tracking objects and people in a room or other spaces that can process at the level of weak meta-priors, Bayesian priors and even logical inductive priors. From this, we then need artificial experts, which can use weak meta-priors to process information into more precise statistical and Bayesian form information. Additionally, once we know things with a degree of certainty, it is optimal to create rules for how the system should behave. That is, we input visual information looking for new information from weak meta-priors, which can be used to augment a vision system that uses Bayesian information. Eventually strong Bayesian information can be used to create logical rules. We will describe this process in greater detail in the following pages but give a brief description here. Using a biological model of visual saliency from the iLab Neuromorphic Vision Toolkit (INVT) we find what is interesting in a visual scene. We then use it to extract visual features from salient locations (Itti & Koch, 2001b) and group them into classes using a non-parametric and highly flexible weak-meta prior classifier NPclassify (Mundhenk et al., 2004b, Mundhenk et al., 2005a). 
This creates initial information about a scene: for instance how many classes of objects seem present in a scene, where they are and what general features they contain. We then track objects using this statistically priorless tracker but gain advantage by taking the information from this tracker and handing it to a simple tracker, which uses statistical adaptation to track a target with greater effectiveness. In essence, it takes in initial information and then computes its own statistical information from a framework using weak meta-prior information. That 31 statistical information is then used as a statistical prior in another simpler and faster tracker. 2.2 Saliency, Feature Classification and the Complex Tracker There were several components used in the tracking system in iRoom. As mentioned, these started by needing less meta-prior information and then gathering information that allows the tracking of targets by more robust trackers that require more information about the target. The first step is to notice the target. This is done using visual saliency. Here very basic gestalt rules about the uniqueness of features in a scene are used to promote objects as more or less salient (Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b). This is done by competing image feature locations against each other. A weak image feature that is not very unique will tend to be suppressed by other image features, while strong image features that are different will tend to pop out as it receives less inhibition. In general, the saliency model acts as a kind of max selector over competing image features. The result from this stage is a saliency map that tells us how salient each pixel in an image is. Once the saliency of locations in an image can be computed, we can extract information about the features at those locations. This is done using a Monte Carlo like selection that treats the saliency map as a statistical map for these purposes. The more salient a location in an image is, the more likely we are to select a feature from that location. In the current working version we select about 600 feature locations from each frame of video. Each of the feature locations contains information about the image such as color, texture and motion information. These are combined together and used to 32 Figure 2.2: The complex feature tracker is a composite of several solutions. It first uses INVT visual saliency to notice objects of interest in a scene. Independent Component Analysis and Principle Component Analysis (Jollife, 1986, Bell & Sejnowski, 1995, Hyvärinen, 1999) are used to reduce dimensions and condition the information from features extracted at salient locations. These are fed to a non-parametric clustering based classification algorithm called NPclassify, which identifies the feature classes in each image. The feature classes are used as signatures that allow the complex tracker to compare objects across frames and additionally share that information with other trackers such as the simple tracker discussed later. The signatures are also invariant to many view point effects. As such they can be shared with cameras and agents with different points of view. classify each of the 600 features into distinct classes. For this we use the non-parametric classifier NPclassify mentioned above. This classifier classifies each feature location without needing to know a priori the number of object feature classes or how many samples should fall into each class. 
It forms classes by weighting each feature vector from each feature location by its distance to every other point. It then can link each feature location to another, which is the closest feature location that has a higher weight. This causes points to link to more central points. Where a central point links to another cluster it is not a member of, we tend to find that the link is comparatively rather long. 33 We can use this to cut links, thus, creating many classes. In essence, feature vectors from the image are grouped based on value proximity. As an example, two pixels that are close to each other in an image and are both blue would have a greater tendency to be grouped together than two pixels in an image that are far apart and are blue and yellow. Once we have established what classes exist and which feature locations belong to them, we can statistically analyze them to determine prior information that will be useful to any tracker, which requires statistical prior information in order to track a target. Thus, we create a signature for each class that describes the mean values for each feature type as well as the standard deviation within that class. Additionally, since spatial locations play a part in weighting feature vectors during clustering, feature vectors that are classified in the same class tend to lie near each other. Thus, the signature can contain the spatial location of the class as well. Figure 2.2 shows the flow from saliency to feature classification and signature creation. The signatures we derive from the feature properties of each class exist to serve two purposes. The first is that it allows this complex tracker to build its own prior awareness. When it classifies the next frame of video, it can try and match each of the new objects it classifies as being the same object in the last frame. Thus, it is not just a classifier, but it can track objects on its own for short periods. Further, we can use information about targets to bias the classification process between frames. So for instance, we would expect that the second frame of video in a sequence should find objects which are similar to the first frame. As such, each classified object in any given frame, biases the search in the next frame, by weighting the classifier towards finding objects of those types. 34 While this seems very complex, signature creation is fairly quick, saliency computation is done in real time on eight 733 MHz Pentium III computers in a Beowulf cluster. The rest of the code runs in under 60 ms on an Opteron 150 based computer. This means we can do weak meta-prior classification and extraction of signatures at around > 15 frames per second. 2.2.1 Complex Feature Tracker Components 2.2.1.1 Visual Saliency The first stage of processing is finding which locations in an image are most salient. This is done using the saliency program created by (Itti & Koch, 2001b), which works by looking for certain types of uniqueness in an image (Figure 2.3). This simulates the processing in visual cortex that the human brain performs in looking for locations in an image, which are most salient. For instance, a red coke can placed among green foliage would be highly salient since it contrasts red against green. In essence, each pixel in an image can be analyzed and assigned a saliency value. From this a saliency map can be created. The saliency map simply tells us the saliency of each pixel in an image. 
2.2.1.2 Monte Carlo Selection The saliency map is taken and treated as a statistical map for the purpose of Monte Carlo selection. The currently used method will extract a specified number of features from an image. Highly salient locations in an image have a much higher probability of being selected than regions of low saliency. Additionally, biases from other modules may cause certain locations to be picked over consecutive frames from a video. For instance, if properties of a feature vector indicate it is very useful, then it makes sense 35 to select from a proximal location in the next frame. Thus, the saliency map combines with posterior analysis to select locations in an image which are of greatest interest. Figure 2.3: The complete VFAT tracker is a conglomeration of different modules that select features from an image, mix them into more complex features and then tries to classify those features without strong meta-priors for what kind of features it should be looking for. 36 2.2.1.3 Mixing Modules 2.2.1.3.1 Junction and End Stop Extraction Figure 2.4: Saliency is comprised of several channels which process an image at a variety of different scales and then combine those results into a saliency map. During the computation of visual saliency, orientation filtered maps are created. These are the responses of the image to Gabor wavelet filters. These indicate edges in the image. Since each filter is tuned to a single preferred orientation, a response from a filter indicates an edge that is pointed in the direction of preference. The responses from the filters are stored in individual feature maps. One can think of a feature map as simply an image which is brightest where the filter produces its highest response. Since the feature 37 maps are computed as part of the saliency code, re-using them can be advantageous from an efficiency standpoint. From this we create feature maps to find visual junctions and end-stops in an image by mixing the orientation maps (Figure 2.4). We believe such new complex feature maps can also tell us about the texture at image locations which can help give us the gist of objects to be tracked. The junction and end stop maps are computed as follows. Note that this is a different computation then the one used in appendix D and chapter 5 in the attention gate model. At some common point i,j on the orientation maps P the filter responses from the orientation filters are combined. Here the response to an orientation in one orientation map ij p is subtracted from an orthogonal map’s orientation filter output orth ij p and divided by a normalizer n which is the max value for the numerator. For instance, one orientation map that is selective for 0 degree angles is subtracted from another map selective for 90 degree angles. This yields the lineyness of a location in an image because where orthogonal maps overlap in their response is at the junctions of lines. (2.1) ; {1,2} orth k ij ij ij p p a k n − = ∈ We then compute a term (2.2) which is the orthogonal filter responses summed. This is nothing more than the sum of the responses in two orthogonal orientation maps. 38 Figure 2.5: The three images on the right are the results of the complex junction channel after ICA/PCA processing from the original image on the left. As can be seen it does a reasonable job of finding both junctions and end stops. (2.2) ; {1, 2} orth k ij ij ij p p b k n + = ∈ The individual line maps are combined as: (2.3) 1 2 ij ij ij a a n α + = This gives the total lineyness for all orientations. 
We then do a similar thing for our total response maps: (2.4) 1 2 ij ij ij b b n β − = The final junction map γ is then computed by subtracting the lineyness term from the total output of the orientation filters: (2.5) ij ij ij γ =α − β 39 Since the junction map is computed by adding and subtracting orientation maps which have already been computed during the saliency computation phase, we gain efficiency we wouldn’t have had if we were forced to convolve a whole new map by a kernel filter. Thus, this junction filter is fairly efficient since it does not require any further convolution to compute. Figure 2.5 shows the output and it can be seen that it is effective at finding junctions and end-stops. 2.2.1.3.2 ICA/PCA We decrease the dimensionality of each feature vector by using a combination of Independent Component Analysis (ICA) (Bell & Sejnowski, 1995) and Principle Component Analysis (PCA) (Jollife, 1986). This is done using FastICA (Hyvärinen, 1999) to create ICA un-mixing matrices offline. The procedure for training this is to extract a large number of features from a large number of random images. We generally use one to two hundred images and 300 points from each image using the Monte Carlo selection processes just described. FastICA first determines the PCA reduction matrix and then determines the matrix that maximizes the mutual information using ICA. Unmixing matrices are computed for each type of feature across scales. So as an example, the red-green opponent channel is computed at different scales, usually six. PCA/ICA will produce a reduced set of two opponent maps from the six original scale maps (This is described in detail later and can be seen in figure 2.7). Using ICA with PCA helps to ensure that we not only reduce the dimension of our data set, but that the information sets are fairly unique. From the current data, we reduce the total number of dimensions with all channels from 72 to 14 which is a substantial efficiency gain 40 especially given the fact that some modules have complexity O(d2) for d number of feature channels (dimensions). Figure 2.6: NPclassify works by (A) first taking in a set of points (feature vectors) (B) then each point is assigned a density which is the inverse of the distance to all other points (C) Points are then linked by connecting a point to the nearest point which has a higher density (D) Very long links (edges) are cut if they are for instance statistically longer than most other links. This creates separate classes. 2.2.1.4 Classification Modules 2.2.1.4.1 Classification of Features with NPclassify5 Features are initially classified using a custom non-parametric clustering algorithm called NPclassify6. The idea behind the design of NPclassify is to create a 5 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/NPclassify2.C/.H 6 A description and additional information on top of what will be discussed can be found at: http://www.nerd-cam.com/cluster-results/. 41 clustering mechanism which has soft parameters that are learned and are used to classify features. We define here soft parameters as values which define the shape of a meta-prior. This might be thought of as being analogous to a learning rate parameter or a Bayesian hyperparameter. For instance, if we wanted to determine at which point to cut off a dataset and decided on two standard deviations from the mean, two standard deviations would be a soft parameter since the actual cut off distance depends on the dataset. 
NPclassify (Figure 2.2, 2.6 and 2.7) (Mundhenk et al., 2004b, Mundhenk et al., 2005a) works by using a kernel to find the density at every sample point. The currently used kernel does this by computing the inverse of the sum of the Euclidian distance from each point to all other points. After density has been computed the sample points are linked together. This is done by linking each point to the closest point which has a higher density. This creates a path of edges which ascends acyclically along the points to the point in the data set which has the highest density of all. Classes are created by figuring out which links need to be cut. For instance, if a link between two sample points is much longer than most links, it suggests a leap from one statistical mode to another. This then may be a good place to cut and create two separate classes. Additionally, classes should be separated based upon the number of members the new class will have. After classes have been created, they can be further separated by using interclass statistics. The advantage to using NPclassify is that we are not required to have a prior number of classes or any prior information about the spatial or sample sizes of each class. 42 Figure 2.7: On the left are samples of features points with the class boundaries NPclassify has discovered. Some of the classes have large amounts of noise while others are cramped together rather than being separated by distance. On the right are the links NPclassify drew in order to create the clusters. Red links are ones which are too long and were clipped by the algorithm to create new classes. 43 Instead, the modal distribution of the dataset combined with learned notions of feature connectedness determine whether a class should be created. So long as there is some general statistical homogeneity between training and testing datasets we should expect good performance for clustering based classification. The training results are discussed later in the section on training results. Figure 2.8: The results using NPclassify are shown next to the same results for k-means on some sham data. The derived clusters are shown with the Gaussian eignenmatrix bars (derived using the eigenmatix estimation in section 2.2.1.4.2). In general, NPclassify creates more reliable clusters particularly in the presence of noise. Additionally, it does so without needing to know a priori how many classes one has. As such, we do have a few meta-priors still present. The first is a basic kernel parameter for density. In this case, the Euclidian distance factor makes few assumptions 44 about the distribution other than that related features should clump together. The second meta-prior is learned as a hyperparameter for a good cutoff. This can be derived using practically any gradient optimization technique. So it is notable, that NPclassify is not without some type of prior, but the assumptions on the data is quite relaxed and only assumes that related feature samples will be close to each other in feature space. An example of NPclassify working on somewhat arbitrary data points can be seen in figure 2.8. 2.2.1.4.2 Gaussian Generalization and Approximation7 In order to store classes for future processing it is important to generalize them. Gaussian ellipsoids are used since their memory usage for any class is O(d2) for d number of dimensions for a given class. Since d is fairly low for us, this is an acceptable complexity. 
Additionally, by using Gaussians we gain the power of Bayesian inference when trying to match feature classes to each other. However, the down side is that computing the eigen matrix necessary for Gaussian fitting scales minimally as d3 for dimensions and s2 for the number of samples. That is, it is O(d3 + s2). This is due to the fact that computing such elements using the pseudo inverse method (or QR decomposition) involves matrix inversion and multiplication. In order to avoid such large complexity we have implemented an approximation technique that scales minimally as d2 for dimensions and s for the number of samples - O(sd2). This means that a net savings happens if the number of samples is much larger than the number of dimensions. So for 7 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/covEstimate.C/.H 45 instance, if there are more than 100 samples and only 10 dimensions, this will produce a savings over traditional methods. Figure 2.9: After NPclassify has grouped feature samples together they can be fit with Gaussian distributions. This helps to determine the probability that some new feature vector belongs to a given class or that two classes compute in consecutive frames using NPclassify are probably the same class. If the distributions overlap greatly as on the left figure, then two classes are probably the same class. The approximation method works by using orthogonal rotations to center and remove covariance from the data. By recording the processes, we can then compute the probability on data points by translating and transforming them in the same way to align with the data set. What we want to be able to do is to tell the probability of data points belonging to some class as well as being able to tell if two classes derived in consecutive frames are probably the same class (see figure 2.9) The first step is to center the data about the origin. This is done by computing the mean and then subtracting that number from each feature vector. Next we compute approximate eigenvectors by trying to find the average vector from the origin to all 46 feature vector coordinates. So for k th feature vector, we first compute the ratio between its distance l from the origin along dimensions j and i. This yields the ratio rijk. That is, after aligning the feature vector with the origin, we take the ratio of two features in the same vector (we will do this for all possible feature pairs in the vector). (2.6) jk ijk ik l r l = Next we find the Euclidian distance uijk from the origin along dimensions j and i. (2.7) 2 2 uijk = lik −ljk By Summing the ratio of rijk and uijk for all k feature vectors, we obtain a mean ratio that describes the approximated eigenvector along the dimensions i and j. (2.8) 0 k ijk ij k ijk r m = u = Σ A normalizer is computed as the sum of all the distances for all samples k. (2.9) 0 k ij ijk k n u = = Σ Next we determine the actual angle of the approximated eigenvector along the dimensions i and j. (2.10) tan 1 ij ij ij m n θ − ⎛ ⎞ = ⎜⎜ ⎟⎟ ⎝ ⎠ 47 Once we have that, we can rotate the data set along that dimension and measure the length of the ellipsoid using a basic sum of squares operation. Thus, we compute ρik and ρjk which is the data set rotated by θij. Here ξ is the positions of kth feature vector along the i dimension and ψ is the position of the feature vector along the j dimension. What we are doing here is rotating covariance out along each dimension so that we can measure the length of the eigenvalue. 
Thus, we iterate over all data points k and along all dimensions i and along i+1 dimensions j summing up σ as we go. We only sum j for i+1 since we only need to use one triangle of the eigenvector matrix since it is symmetric along the diagonal. (2.11) i +1 ≤ j (2.12) cos( ) sin( ) ik ij ij ρ =ξ ⋅ θ +ψ ⋅ θ (2.13) sin( ) cos( ) jk ij ij ρ = −ξ ⋅ θ +ψ ⋅ θ What we have done is figure out how much we need to rotate the set of feature vectors in order to align the least squares slope with the axis. Once this is done, we can rotate the data set and remove covariance. Since the mean is zero because we translated the data set by the mean to the origin, variance for the sum of squares is computed simply as: (2.14) 2 0 k ik iij k s n ρ = = Σ (2.15) 2 0 k jk jji k s n ρ = = Σ 48 Each sum of squares is used to find the eigenvalue estimate by computing Euclidian distances. That is, by determining the travel distance of each eigenvector during rotation and combining that number with the computed sum of squares we can determine an estimate of the eigenvalue from triangulation. The conditional here is used because σii is computed more than once with different values for θij. Thus, σii is the sum of all the products of θij and siij. (2.16) ( )2 iff = 0 cos( ) otherwise iij ii ii ii iij ij s s σ σ σ θ ⎧⎪ = ⎨ + ⋅ − ⎪⎩ (2.17) ( )2 iff = 0 cos( ) otherwise jji jj jj jj jji ij s s σ σ σ θ ⎧⎪ = ⎨ + ⋅ − ⎪⎩ The end result is a non-standard eigenmatrix which can be used to compute the probability that a point lies in a Gaussian region. We do this by performing the same procedure on any new feature vector. That is, we take any new feature vector and replay the computed translation and rotations to align it with covariance neutral eigenmatrix approximation. Probability for the feature vector is then computed independently along each dimension thus eliminating further matrix multiplication during the probability computation. To summarize, by translating and rotating the feature set, we have removed covariance so we can compute probabilities assuming dimensions do not interact. In essence this removes the need for complex matrix operations. While the complexity is high, it is one order lower than the standard matrix operations as was mentioned earlier. 49 Examples of fits created using this method can be seen in figure 2.7 where NPclassify has created classes and the eigenmatrix is estimated for the ones created. 2.2.1.4.3 Feature Contiguity, Biasing and Memory Once features have been classified we want to use them to perform various tasks. These include target tracking, target identification and feature biasing. Thus from a measurement of features from time t, we would like to know if a collection of features at time t+1 is the same, and as such either the same object or a member of the same object. By using Bayesian methods we can link classes of features in one frame of a video to classes in the next frame by tying a class to another which is its closest probabilistic match. Additionally, we use the probability to bias how the non-parametric classifier and saliency work over consecutive frames. For NPclassify we add a sink into the density computation. That is, we create a single point whose location is the mean of a class with the mass of the entire class. Think of this as dropping a small black hole in a galaxy that represents the mass of the other class. By inserting this into the NPclassify computation, we skew the density computation towards the prior statistics in the last iteration. 
This creates a Kalman filter like effect that smoothes the computation of classes between frames. This is a reasonable action since the change in features from one frame to the next should be somewhat negligible. 2.2.1.5 Complex Feature Tracker Methods and Results 2.2.1.5.1 Complexity and Speed One of the primary goals of VFAT is that it should be able to run in real time. This means that each module should run for no more than about 30 ms. Since we are using a Beowulf cluster, we can chain together modules such that even if we have several 50 steps that take 30 ms each, by running them on different machines we can create a vision pipeline whereby a module finishes a job and hands the results to another machine in a Beowulf cluster that is running the next process step. In time trials the modules run within real time speeds. Using a Pentium 4 2.4 GHz Mobile Processor with 1 GB of RAM, each module of VFAT runs at or less than 30 ms. The longest running module is the NPclassify feature classifier. If given only 300 features it runs in 23 ms, for 600 features it tends to take as long as 45 ms. On a newer system it should be expected to run much faster. 2.2.1.5.2 Training for Classification Table 2.1: Following PCA the amount of variance accounted for was computed for each type of feature channel. Each channel started with six scales (dimensions). For many channels, 90% of variance is accounted for after a reduction to two dimensions. For all others, no more than three dimensions are needed to account for 90% of variance. Two modules in VFAT need to be trained prior to usage. These include ICA/PCA and NPclassify. Training for both has been designed to be as simple as possible in order to maintain the ease of use goal of the iRoom project. Additionally and fortunately, training of both modules is relatively quick with ICA/PCA taking less than a minute using the FastICA algorithm under Matlab and NPclassify taking around two hours using 51 gradient descent training. Since we only need to ever train once, this is not a prohibitive amount of time. 2.2.1.5.3 Training ICA/PCA Figure 2.10: The various conspicuity maps of the feature channels from the saliency model are shown here ICA/PCA reduced. Training was completed by using 145 randomly selected natural images from a wide range of different image topics. Images were obtained as part of generic public domain CD-ROM photo packages, which had the images sorted by topic. This enabled us to ensure that the range of natural images used in training had a high enough variety to prevent bias towards one type of scene or another. For each image, 300 features were extracted using the Monte Carlo / Visual saliency method described earlier. In all this 52 gave us 43,500 features to train ICA/PCA on. The results are shown on table 2.1. For most channels, a reduction to two channels from six still allowed for over 90% of variance to be accounted for. However, directional channels that measure direction of motion and orientation of lines in an image needed three dimensions to still account for more than 90% of all variance. Assuming that the data is relatively linear and a good candidate for PCA reduction, this suggests that we can effectively reduce the number of dimensions to less than half while still retaining most of the information obtained from feature vectors. Visual inspection of ICA/PCA results seems to show the kind of output one would expect (Figure 2.10 and 2.11). 
For instance, when two channels are created from six, they are partially a negation to each other. On the red/green channel, one of the outputs seems to show a preference for red. However, the other channel does not necessarily show an anti-preference for red. This may suggest that preferences for colors may also depend on the scales of the images. That is, since what makes the six input images to each channel different is the scale at which they are processed, scale is the most likely other form of information processed by ICA/PCA. This might mean for instance that the two channels of mutual information contain information about scaling. We might guess that of the two outputs from the red/green channel, one might be a measure of small red and the other of large green things. If this is the case it makes sense since in nature, red objects tend to be small (berries, nasty animals, etc.) while green things tend to be much more encompassing (trees, meadows, ponds). 53 Figure 2.11: From the original image we see the results of ICA/PCA on the red/green and blue/yellow channels. As can be seen some parts of the outputs are negations of each other which makes sense since ICA maximizes mutual information. However, close examination shows they are not negatives. It is possible that scale information applies as a second input type and prevents obvious negation. 2.2.1.5.4 Training NPclassify To hone the clustering method we use basic gradient decent with sequential quadratic programming using the method described by (Powell, 1978). This was done offline using the Matlab Optimization Toolbox. For this study, error was defined as the number of classes found versus how many it was expected to find (see Figure 2.12). Thus, we presented the clustering algorithm with 80 natural training images. Each image 54 Figure 2.12: In this image there are basically three objects. NPclassify has found two (colors represent the class of the location). This is used as the error to train it. So for 80 images it should find x number of objects. The closer it gets to this number, the better. Notice that the points are clustered in certain places. This is due to the saliency/Monte Carlo method used for feature selection. had a certain number of objects in it. For instance an image with a ball and a wheel in it would be said to have two objects. The clustering algorithm would state how many classes it thought it found. If it found three classes in an image with two objects then the error was one. The error was computed as average error from the training images. The training program was allowed to adjust any of several hard or soft parameters for NPclassify during the optimization. The training data was comprised of eight base objects of varying complexity such as balls and a wheel on the simple side or a mini tripod and web cam on the more 55 complex side. Objects were placed on a plain white table in different configurations. Images contained different numbers of objects as well. For instance some images contained only one object at a time, while other contained all eight. A separate set of validation images was also created. These consisted of a set of eight different objects with a different lighting created by altering the f-stop on the camera. Thus, the training images were taken with an f-stop of 60 while the 83 validation images were taken with an f-stop of 30. Additionally, the angle and distance of view point is not the same between the training and validation sets. 
The validation images were not used until after optimal parameters were obtained by the training images. Then the exact same parameters were used for the validation phase. Our first test was to examine if we could at the very least segment images such that the program could tell which objects were different from each other. For this test spatial interaction was taken into account. We did this by adding in spatial coordinates as two more features in the feature vectors with the new set of 14 ICA/PCA reduced feature vectors. The sum total of spatial features were weighted about the same as the sum total of non-spatial features. As such, the membership of an object in one segmented class or the other was based half by its location in space and half by its base feature vector composition. Reliability was measured by counting the number of times objects were classified as single objects, the number of times separate objects were merged as one object and the number of time a single object was split into two unique objects. Additionally, there was a fourth category for when objects were split into more than three objects. This was small and contained only four instances. 56 The results were generally promising in that based upon simple feature vectors alone, the program was able to segment objects correctly with no splits or merges in 125 out of the 223 objects it attempted to segment. In 40 instances an object was split into two objects. Additionally 54 objects were merged as one object. While on the surface these numbers might seem discouraging there are several important factors to take into account. The first is that the program was segmenting based solely on simple features vectors with a spatial cue. As such it could frequently merge one shiny black object into another shiny black object. In 62 % of the cases of a merger, it was obvious that the merged objects were very similar with respect to features. 2.2.1.5.5 NPclassify v. K-Means NPclassify was also tested on its general ability to classify feature clusters. In this case it was compared with K-means. However, since K-means requires the number of classes to be specified a priori, this was provided to it. So in essence, the K-means experiment had the advantage of knowing how many classes it would need to group, while NPclassify did not. The basic comparison test was similar to the test presented in the previous section. In this case, several Gaussian like clusters were created of arbitrary 2 dimensional features. They had between 1 and 10 classes in each data set. 50 of the sets were clean with no noise such that all feature vectors belonged explicitly to a ground truth class. However, in 50 other sets, small amounts of random noise were added. The comparison metric for K-means and NPclassify was how often classes were either split or merged 57 when they should have not been. The mean error for both conditions is shown below in figure 2.13. It should be noted that while K-means may be sensitive to noise in data, it is used here since it is well known and can serve as a good base line for any clustering algorithm. Figure 2.13: NPclassify is compared with K-Means for several data sets. The error in classification for different sets is the same if there is little noise in the data. However, after injecting some noise, NPclassify performs superior. The general conclusion is that compared with K-means, NPclassify is superior particularly when there is noise in the data. 
2.2.1.5.5 NPclassify v. K-Means

NPclassify was also tested on its general ability to classify feature clusters. In this case it was compared with K-means. However, since K-means requires the number of classes to be specified a priori, this was provided to it. So in essence, the K-means experiment had the advantage of knowing how many classes it would need to group, while NPclassify did not.

The basic comparison test was similar to the test presented in the previous section. Several Gaussian-like clusters of arbitrary two-dimensional features were created, with between 1 and 10 classes in each data set. Fifty of the sets were clean, with no noise, such that all feature vectors belonged explicitly to a ground-truth class. In 50 other sets, small amounts of random noise were added. The comparison metric for K-means and NPclassify was how often classes were either split or merged when they should not have been. The mean error for both conditions is shown below in Figure 2.13. It should be noted that while K-means may be sensitive to noise in data, it is used here since it is well known and can serve as a good baseline for any clustering algorithm.

Figure 2.13: NPclassify is compared with K-means for several data sets. The error in classification for the different sets is the same if there is little noise in the data. However, after injecting some noise, NPclassify performs better.

The general conclusion is that compared with K-means, NPclassify is superior, particularly when there is noise in the data. This is not particularly surprising since, as a spanning-tree style algorithm, NPclassify can ignore non-proximal data points much more easily. That is, K-means is forced to weigh in all data points and has no innate ability to determine that an outlying data point should be thrown away. NPclassify, on the other hand, will detect the jump in distance from the central density of the real class to an outlier or noise point.

2.2.1.5.6 Contiguity

Figure 2.14: Tracking from frame 299 to frame 300, the shirt on the man is tracked along with the head without prior knowledge of what is to be tracked. It should be noted that while the dots are drawn in during simulation, the ellipses are drawn in by hand to help with illustration in gray-scale printing.

Contiguity has been tested but not fully analyzed (Figure 2.14). Tracking in video uses the parameters for NPclassify obtained in section 2.2.1.5.4. Thus, the understanding of how to track over consecutive frames is based on the computer's subjective understanding of good continuity for features. In general, classes of features can be tracked for 15 to 30 frames before the program loses track of the object. This is not an impressive result in and of itself. However, several factors should be noted. First, each object is tracked by VFAT without priors for what its features should be. Thus, the program is tracking an object without having been told either to track that object or what the object it is tracking should look like. The tracking is free form and in general without feature-based priors. The major limiter on the contiguity of tracking is that an object may lose saliency as a scene evolves. If an object becomes too low in saliency, far fewer features will be selected from it for processing, which destroys the track of the object under its current feature qualities. However, as will be noted in the next section, this is not a problem, since this tracker is used to hand off trackable objects to a simple tracker that fixates much better on the objects to be tracked.

2.3 The Simple Feature Based Tracker

Figure 2.15: The simple tracker works by taking in initial channel values such as ideal colors. These are used to threshold an image and segment it into many candidate blobs. This is done by connecting pixels along scan lines that are within the color threshold. The scan lines are then linked, which completes a contiguous object into a blob. The blobs can be weeded out if they are, for instance, too small or too far from where the target last appeared. Remaining blobs can then be merged back and analyzed. Finding the center of mass of the leftover blobs gives us the target location. By finding the average color values in the blob, we can define a new adapted color for the next image frame. Thus, the threshold color values can move with the object.

Once a signature is extracted using the complex tracker described in the previous section, it can be fed to a faster and simpler tracking device (for more information see also: http://ilab.usc.edu/wiki/index.php/Color_Tracker_How_To). We use a multi-channel tracker, which uses color thresholding to find candidate pixels and then links them together. This allows it not only to color-threshold an image, but to segregate blobs and analyze them separately. So, for instance, if it is tracking a yellow target and another yellow target appears, it can distinguish between the two.
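The following sketch illustrates the scan-line linking described in Figure 2.15: thresholded candidate pixels are grouped into runs along each row, and runs on adjacent rows whose ranges overlap are merged into blobs. The structures and names are hypothetical, and the real tracker additionally weeds blobs by size and position and adapts its thresholds, as described below.

```cpp
#include <cstddef>
#include <vector>

// One horizontal run of candidate pixels on a single scan line.
struct Run
{
    int row, xStart, xEnd;   // inclusive pixel range on this row
    int blobId;              // which blob this run was merged into
};

// Link candidate pixels into blobs: first form runs along each scan line,
// then merge runs on adjacent rows whose x-ranges overlap.
std::vector<Run> linkScanLines(const std::vector<std::vector<bool>>& candidate)
{
    std::vector<Run> runs;

    // 1. Form runs of consecutive candidate pixels on each row.
    for (int y = 0; y < static_cast<int>(candidate.size()); ++y)
    {
        const std::vector<bool>& row = candidate[y];
        int x = 0;
        while (x < static_cast<int>(row.size()))
        {
            if (!row[x]) { ++x; continue; }
            Run r{y, x, x, -1};
            while (x + 1 < static_cast<int>(row.size()) && row[x + 1]) ++x;
            r.xEnd = x;
            runs.push_back(r);
            ++x;
        }
    }

    // 2. Assign blob ids, merging a run into the blob of any overlapping run
    //    on the previous row. (A full implementation would use union-find to
    //    handle U-shaped blobs; this sketch keeps only the first match.)
    int nextBlob = 0;
    for (std::size_t i = 0; i < runs.size(); ++i)
    {
        for (std::size_t j = 0; j < i; ++j)
        {
            if (runs[j].row == runs[i].row - 1 &&
                runs[j].xStart <= runs[i].xEnd &&
                runs[i].xStart <= runs[j].xEnd)
            {
                runs[i].blobId = runs[j].blobId;
                break;
            }
        }
        if (runs[i].blobId < 0) runs[i].blobId = nextBlob++;
    }
    return runs;
}
```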
Additionally, the tracker computes color adaptation, as well as adaptation over any other channel it is analyzing. For instance, we compute a new average channel value \bar{c} (Eq. 2.18) from the pixel values c_{ip} of this channel over all N_i pixels p in tracked 'OK' blobs (as seen in Figure 2.15) in each frame i, from some past frame t' up to the current frame t. In basic terms, this is just the average channel value over all the trackable pixels in several consecutive past frames. Additionally, we compute \sigma, which is just the basic standard deviation over the same pixels.

(2.18)
\bar{c} = \frac{\sum_{i=t'}^{t} \sum_{p=0}^{N_i} c_{ip}}{\sum_{i=t'}^{t} N_i}
\qquad \text{and} \qquad
\sigma = \sqrt{\frac{\sum_{i=t'}^{t} \sum_{p=0}^{N_i} \left( c_{ip} - \bar{c} \right)^2}{\left( \sum_{i=t'}^{t} N_i \right) - 1}}

Currently, we set a new pixel as a candidate for tracking if, for every channel with pixel value c_p:

(2.19)
\bar{c} - \alpha \cdot \sigma \;\le\; c_p \;\le\; \bar{c} + \alpha \cdot \sigma

Thus, a pixel is thresholded and selected as a candidate if, on each channel, it falls within the mean value computed from Eq. (2.18) plus or minus the product of the standard deviation and a constant \alpha. Forgetting is accomplished in the adaptation by simply windowing the sampling interval.
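A minimal sketch of the adaptation in Eqs. (2.18) and (2.19): a sliding window of recent frames supplies the tracked pixel values for one channel, from which a windowed mean and standard deviation are computed, and a new pixel is accepted as a candidate only if its value lies within the mean plus or minus \alpha standard deviations. Class and variable names are illustrative rather than the actual tracker code.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Adaptive threshold for one channel, as in Eqs. (2.18) and (2.19): the mean
// and standard deviation are computed over the pixel values of tracked 'OK'
// blobs in the last few frames, and forgetting is achieved by windowing the
// sampling interval.
class AdaptiveChannel
{
public:
    AdaptiveChannel(std::size_t windowFrames, double alpha)
        : maxFrames_(windowFrames), alpha_(alpha) {}

    // Add the channel values of all tracked pixels from the newest frame,
    // dropping the oldest frame once the window is full.
    void addFrame(const std::vector<double>& pixelValues)
    {
        frames_.push_back(pixelValues);
        if (frames_.size() > maxFrames_) frames_.pop_front();
        recompute();
    }

    // Eq. (2.19): a pixel is a candidate if it lies within mean +/- alpha*sigma.
    bool isCandidate(double value) const
    {
        return value >= mean_ - alpha_ * sigma_ &&
               value <= mean_ + alpha_ * sigma_;
    }

private:
    void recompute()
    {
        double sum = 0.0;
        std::size_t n = 0;
        for (const auto& f : frames_)
            for (double v : f) { sum += v; ++n; }
        mean_ = (n > 0) ? sum / n : 0.0;

        double ss = 0.0;
        for (const auto& f : frames_)
            for (double v : f) ss += (v - mean_) * (v - mean_);
        sigma_ = (n > 1) ? std::sqrt(ss / (n - 1)) : 0.0;
    }

    std::deque<std::vector<double>> frames_;  // one entry per past frame
    std::size_t maxFrames_;
    double alpha_;
    double mean_ = 0.0;
    double sigma_ = 0.0;
};
```

A pixel would then be selected only if this test passes on every channel being tracked, for instance on each of the H2SV channels described below.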
This method allows the tracker to track a target even if its color changes due to changes in lighting. It should be noted that the simple tracker can track features other than color, so long as one can create a channel for them. That is, an RGB image can be separated into three channels, each of which is a gray-scale image: one for red, one for green and one for blue. We can also create images that are, for instance, the responses of edge orientation filters or motion filters. These can be added as extra channels in the simple tracker in the same manner. However, to preserve luminance invariance we use the H2SV color scheme described in appendix G. This is an augmentation of HSV color space that resolves the singularity at red by converting hue into Cartesian coordinates.

In addition to the basic vision functional components of the simple tracker, its code design is also important. The tracker is object oriented, which makes it easy to create multiple independent instances of the simple tracker. That is, we can easily run several simple trackers on the same computer, each tracking a different object from the same video feed. The computational work for each tracker is fairly low, and four independent trackers can simultaneously process 30 frames per second on an AMD Athlon 2000 XP based machine. This makes it ideal for the task of tracking multiple targets at the same time.

2.4 Linking the Simple and Complex Tracker

In order for the simple tracker and the complex tracker to work together, they have to be able to share information about a target. As such, the complex tracker must be able to extract information about objects that is useful to the simple tracker (Figure 2.16). Additionally, linking the simple tracker with the complex tracker creates an interesting problem of resource allocation. This is because each simple tracker we instantiate tracks one target at a time, while the complex tracker has no such limit. Only a limited number of simple trackers can be created, and there must be some way to manage how they are allocated to a task based on information from the complex tracker.

Figure 2.16: The simple and complex trackers are linked by using the complex tracker to notice and classify features. The complex tracker then places information about the feature classes into object feature class signatures. The complex tracker uses these signatures to keep track of objects over several frames or to bias the way in which it classifies objects. The signatures are also handed to simple trackers, which track the objects with greater proficiency. Here we see that two balls have been noticed and signatures have been extracted and used to assign each ball to its own tracker. The smaller target boxes on the floor show that the simple tracker was handed an object (the floor) which it does not like and is not tracking. Thus, the simple tracker has its own discriminability, as was mentioned in section 2.3 and Figure 2.15.

We address the first problem by making sure both trackers work with similar feature sets. For example, when the complex tracker runs, it examines the H2SV color of all the classes it creates and computes the mean color values for each class. This mean color value, along with the standard deviation of the color, can then be handed to the simple tracker, which uses it as the statistical prior color information for the object it should track.

Figure 2.17: This is a screen grab from a run of the combined tracker. The lower left two images show the complex tracker noticing objects, classifying them and tracking them. The signature is handed to the simple tracker, which is doing the active tracking in the upper left window. The combined tracker notices the man entering the room and tracks him without a priori knowledge of how he or the room looks. Once he walks off the right side, the tracker registers a loss of track and stops tracking. The bars on the right side show the adapted, actively tracked colors from the simple tracker in H2SV color. The lower right image shows that many blobs can fit the color thresholds in the simple tracker, but most are selected out for reasons such as expected size, shape and position.

The second issue, resource allocation, is addressed less easily. However, there are simple rules for keeping resource allocation under control. First, don't assign a simple tracker to track an object that overlaps with a target another simple tracker is already tracking in the same camera image; that is, don't waste resources by tracking the same target with two or more trackers. Additionally, since the trackers are adaptive, we may find only later that two trackers were assigned to the same target. For instance, if one simple tracker is accidentally set to track the bottom of a ball and another the top, after a few iterations of adaptation both trackers will envelop the whole ball. It is thus advantageous to also check for overlap later on; if we find this happening, we can dismiss one of the simple trackers as redundant. Additionally, our finite resources mean we do not assign every unique class from the complex tracker to a simple tracker. Instead, we try to quantify how interesting a target is. For instance, potential targets for the simple tracker may be more interesting if they are moving, have a reasonable mass or have been tracked by the complex tracker for a long enough period of time.
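The kind of signature exchanged between the two trackers, and the overlap rule used to avoid redundant simple trackers, might be sketched as follows. The structures and field names are assumptions for illustration only; the actual signature may carry additional statistics.

```cpp
#include <cstddef>
#include <vector>

// Per-channel color statistics extracted by the complex tracker for one
// feature class, used by a simple tracker as its prior for thresholding.
struct ChannelStat
{
    double mean;    // mean value of the class on this channel (e.g. one H2SV channel)
    double stdDev;  // standard deviation of that channel
};

// Signature for one candidate target plus the bounding box where the
// complex tracker last saw it.
struct TargetSignature
{
    std::vector<ChannelStat> channels;   // one entry per color channel
    double x0, y0, x1, y1;               // bounding box in image coordinates
};

// Two trackers that enclose overlapping regions are probably following the
// same physical target; one of them can be dismissed as redundant.
bool boxesOverlap(const TargetSignature& a, const TargetSignature& b)
{
    return a.x0 <= b.x1 && b.x0 <= a.x1 &&
           a.y0 <= b.y1 && b.y0 <= a.y1;
}

// Keep only signatures that do not overlap a target already being tracked,
// up to the number of simple trackers we can afford to run.
std::vector<TargetSignature>
assignTrackers(const std::vector<TargetSignature>& candidates,
               std::size_t maxTrackers)
{
    std::vector<TargetSignature> assigned;
    for (const TargetSignature& c : candidates)
    {
        if (assigned.size() >= maxTrackers) break;
        bool redundant = false;
        for (const TargetSignature& t : assigned)
            if (boxesOverlap(c, t)) { redundant = true; break; }
        if (!redundant) assigned.push_back(c);
    }
    return assigned;
}
```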
2.5 Results

On the test videos used, the system described seems to work very well. A video of a man entering and leaving a room (Figure 2.17) was shown five times to the combined complex and simple tracker. In each run, the man was noticed within a few frames of entering the camera's view. This was done without prior knowledge of how the target should appear and without prior knowledge of the room's appearance. The features were extracted and a simple tracker was automatically assigned to track the man, which it did until he left the room, at which point the simple tracker registered a loss of track. Interestingly, the tracker extracted a uniform color over both the man's shirt and his skin. It was thus able, on several occasions, to track the man as both his shirt and his skin. Even though the shirt was burgundy and the skin reddish, the combined tracker was able to find a statistical distribution for H2SV color that encompassed the color of both as distinct from the color of objects in the rest of the room.

The tracker was also tested on a video in which a blue and a yellow ball both swing on a tether in front of the camera. In five out of five video runs, both balls are noticed and their features extracted. Each ball is tracked as a separate entity by being assigned its own simple tracker by the program. Each ball is tracked until it leaves the frame, at which point the simple trackers register a loss of track. The balls even bounce against each other, which demonstrates that the tracker can readily discriminate between objects even when they are touching or overlapping.

In both videos, objects are tracked without the program knowing the features of the objects to be tracked a priori. Instead, saliency is used to notice different possible targets and the complex tracker is used to classify possible targets into classes. This is then used to hand target properties to the simple trackers as automatically generated prior information about the targets to be tracked. Additionally, the simple tracker registers a loss of track when the target leaves the field of view. This allows us to notice not only when a new target enters our field of view, but also when it leaves.

The tracking was also aided by the use of H2SV color. Prior to using the H2SV color scheme, the purple shirt the man is wearing was split into two objects, since the color of many of its pixels bordered on and even crossed into the red part of the hue spectrum. Thus, standard HSV created a bi-modal distribution for hue. The use of H2SV allowed us to track the purple shirt as well as objects that are reddish in hue, such as skin. H2SV color also works for tracking objects in the center of the spectrum, which we observed by tracking objects that are green, yellow and blue.
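The exact H2SV transformation is given in appendix G; as a rough sketch of the underlying idea only, one can replace the angular hue of HSV with two Cartesian components so that hue no longer wraps around at red. The code below is an assumed, simplified version of such a mapping, not the precise formulation used in this work.

```cpp
#include <cmath>

// Plain HSV triple, with hue given as an angle in degrees [0, 360).
struct HSV  { double h, s, v; };

// H2SV-style representation: the angular hue is replaced by two Cartesian
// components so that reddish hues near 0 and near 360 degrees map to nearby
// points rather than to opposite ends of the hue axis. This is a sketch of
// the general idea; the exact H2SV definition is given in appendix G.
struct H2SV { double h1, h2, s, v; };

H2SV toH2SV(const HSV& in)
{
    const double kPi = 3.14159265358979323846;
    const double rad = in.h * kPi / 180.0;
    return { std::cos(rad),   // hue component along the red axis
             std::sin(rad),   // hue component orthogonal to the red axis
             in.s, in.v };
}
```

Represented this way, hue values that straddle the 0/360 degree boundary form a single compact cluster in (h1, h2), so a single mean and standard deviation per channel, as in Eq. (2.18), can describe a reddish or purple target.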
In addition to tracking with a static camera, the same experiment was performed with a moving camera. This is much less trivial, since the common method of eigen-background subtraction cannot be used to distinguish new things in a scene from the original scene. Again the tracker was able to track a human target without prior knowledge of his features, even as the camera moved. This is a distinct advantage of our tracker and illustrates the benefit of using saliency to extract and bind features, since it can compensate for global motion.

2.6 Discussion

2.6.1 Noticing

The most notable and important aspect of the current work is that we are able to track objects or people without knowing what they will look like a priori, and we are able to do so quickly enough for real-time applications. Thus, we can notice, classify and track a target fairly quickly. This has useful applications in many areas, and in particular security, because we track something based on how interesting it is and not based on a complete prior understanding of its features. Potentially, we can then track any object or person even if they change their appearance. Additionally, since we extract a signature that describes a target in a viewpoint-invariant way, this information can be used to share target information with other agents.

2.6.2 Mixed Experts

Additionally, we believe we are demonstrating a better paradigm for the construction of intelligent agents, one that uses a variety of experts to accomplish the task. The idea is to use a variety of solutions that work on flexible, weak meta-prior information, and then use their output as information for a program that is more biased. This is founded on the idea that there is no perfect tool for all tasks and that computer vision comprises many tasks, such as identification, tracking and noticing. To accomplish the complex task of noticing and tracking objects or people, it may be optimal to utilize many different types of solutions and have them interact. Additionally, by mixing experts in this way, no one expert needs to be perfect at its job. If the experts have some ability to monitor one another, then when one expert makes a mistake, it can possibly be corrected by another expert. It should be noted that this tends to follow a biological approach, in which the human brain may be made up of interacting experts, each interdependent with other expert regions in order to complete a variety of tasks.

Another important item to note in the mixed experts paradigm is that while such an approach may make more intuitive sense, new difficulties arise as the system becomes more abstractly complex. As an example, if one works only with support vector machines, then one has the advantage of a generally well-understood mathematical framework. It is easier to understand a solution's convergence, complexity and stability if the system is relatively homogeneous. When one mixes experts, particularly if the experts act very differently, the likelihood of the system doing something unexpected or even catastrophic tends to increase. Thus, when one designs an intelligent agent with mixed experts, system complexity should be managed carefully.

2.6.3 System Limitations and Future Work

The system described has its own set of limitations. The work up to this point has concentrated on being able to notice and track objects in a scene quickly and in real time. However, its identification abilities are still somewhat limited. It does not contain a memory that would let it store and identify old targets over the long term. Such an ability is in the works, however, and should be aided by the capacity of the tracking system to narrow the area of the image that needs to be inspected, which should increase the speed of visual recognition.

Chapter 3: Contour Integration and Visual Saliency

In the visual world there are many things which we can see, but certain features, sets of features and other image properties tend to draw our visual attention towards them more strongly. A very simple example is a stop sign, in which the red color and the angular features of an octagon combine with the strong word "stop" to create something that, hopefully, we would not miss if we came upon it. Such a propensity of some visual features to attract attention defines in part the phenomenon of visual saliency. Here we assert, as others have (James, 1890; Treisman & Gelade, 1980; Koch & Ullman, 1985; Itti & Koch, 2001b), that saliency is drawn from a variety of factors.
At the lowest levels, color opponents, unique orientations and luminance contrasts create the effect of visual pop-out (Treisman & Gelade, 1980; Wolfe, O'Neill & Bennett, 1998). Importantly, these studies have highlighted the role of competitive interactions in determining saliency; hence, a single stop sign against a natural scene backdrop is usually highly salient, but the saliency of that same stop sign, and its ability to draw attention, is strongly reduced when many similar signs surround it. At the highest levels it has been proposed