Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Computational modeling and utilization of attention, surprise and attention gating
/
Computational modeling and utilization of attention, surprise and attention gating
(USC Thesis Other)
Computational modeling and utilization of attention, surprise and attention gating
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
COMPUTATIONAL MODELING AND UTILIZATION OF ATTENTION, SURPRISE AND ATTENTION GATING by Terrell Nathan Mundhenk A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2009 Copyright 2009 Terrell Nathan Mundhenk ii Epigraph “I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living; It's a way of looking at life through the wrong end of a telescope. Which is what I do, and that enables you to laugh at life's realities.” Dr Seuss iii Dedication For my parents Terry and Ann iv Acknowledgements This really is the trickiest part to write because I want to thank so many people for so many things. First off I would like to thank my sister Amy (Zon) who thought enough of me to whip out a copy of the Schizophrenia paper I wrote with Michael Arbib when she visited that research neurologist at the Cleveland Clinic. I thought that was funny, but it made me feel as if I was doing something interesting. Then there are my closest friends Paul Gunton, Brant Heflin, Mike Olson and Tim Olson who exhibited confidence in my ability to actually complete this silly thing. I would also like to thank Kate Svyatets for sticking with me through my thesis and mood swings. Life is meaningless without friends and loved ones, so I am in your debt. I would also like to acknowledge the excellent scientists I worked closely with over the years, without whom my research would not have been possible. Firstly this includes Michael Arbib who has essentially been a co-thesis advisor to me. I can hardly communicate the enormous amount of things which I learned from him over the years. His honesty and integrity were of great value and, I knew when he said something, he really meant it. Next, I would like to thank Kirstie Bellman and Chris Landauer at the Aerospace Corporation. They gave me more of an idea of what real engineering is about than just about anyone. I also want to give a great big thanks to Wolfgang Einhäuser my co-author on several publications. I don’t think I’ve ever worked with anyone so totally on the ball. I also need to mention Ernest Hall who was my mentor during my undergraduate years and source of encouragement during my graduate years. I don’t think I would be in a research field if it wasn’t for him. v I also must extend my deepest gratitude to the many faculty members who have provided excellent feedback and conversation over the years of my graduate education. I cannot mention every teacher and mentor who touched my life over the past few years, because there were so many. However I would like to extend special thanks to: Irving Biederman, Christof Koch, Christoph von der Malsburg, Bartlett Mel and Stefan Schaal. I would also like to acknowledge many of the students and post-doctoral researchers I have collaborated with or were in general very helpful in assisting me with my research through direct assistance or discussion. They are: Jeff Begley, James ‘Jimmy’ Bonaiuto, Mihail Bota, Vincent J. Chen, Aaron D'Souza, Nitin Dhavale, Lior Elazary, Jacob Everist, Doug Garrett, Larry Kite, Hendrik Makaliwe, Salvador Marmol, Thomas Moulard, Pradeep Natarajan, Vidhya Navalpakkam, Jan Peters, Rob Peters, Eric Pichon, Jack Rininger, Christian Siagian, and Rand Voorhies. If I forgot to mention anyone, I’m really quite sorry. 
Lastly, but far from leastly, I would like to thank my Thesis Advisor Laurent Itti for all his help, input and encouragement he has provided over the years. After my first year at USC, I was getting kind of board being out of the research game. At the time, I was taking robotics from Stefan Schaal. I talked to him about who was doing interesting research in computer vision and he suggested I talk to a promising new faculty member. I took him up on his advice which turned out to be an excellent decision. iLab was new and only had a few students back then, now it so vibrant and full of life with so many projects. I will surely miss Laurent and iLab and I am certain I will look back on these days with great positive satisfaction. vi Table of Contents Epigraph ii Dedication iii Acknowledgements iv List of Tables x List of Figures xi Abbreviations xv Abstract xviii Preface xix About this thesis xix Graduate works not included in this thesis xx Other works of interest not included in this thesis xxi Don’t read the whole thesis xxii Chapter 1: A Brief Introduction to Vision and Attention 1 1.1 What Does our Brain Want to Look For? 5 1.2 How Does our Brain Search for What it Wants? 9 1.2.1 What’s a Feature? 9 1.2.2 How do we Integrate These Features? 12 1.2.3 Beyond the Basic Saliency Model 17 1.3 The Current State of Attention and Other Models 18 1.3.1 Top-Down Models 19 1.3.2 Other Contemporary Models of Saliency 20 1.3.3 The Surprise Model 22 Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency Using Vision Feature Analysis Toolkit (VFAT) 23 2.1.1 Vision, Tracking and Prior Information 23 2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors 26 2.1.4 The iRoom and Meta-prior Information 30 2.2 Saliency, Feature Classification and the Complex Tracker 31 2.2.1 Complex Feature Tracker Components 34 2.3 The Simple Feature Based Tracker 59 2.4 Linking the Simple and Complex Tracker 61 2.5 Results 64 2.6 Discussion 66 2.6.1 Noticing 66 vii 2.6.2 Mixed Experts 66 2.6.3 System Limitations and Future Work 67 Chapter 3: Contour Integration and Visual Saliency 69 3.1 Computation 75 3.2 The model 77 3.2.1 Features 77 3.2.2 The Process 83 3.2.3 Kernel 87 3.2.4 Pseudo-Convolution 91 3.3 Experiments 97 3.3.1 Local element enhancement 98 3.3.2 Non-local Element Enhancement 103 3.3.3 Sensitivity to Non-contour Elements 112 3.3.4 Real World Image Testing 118 3.4 Discussion 122 3.4.1 Extending Dopamine to Temporal Contours via TD (dimensions) 125 3.4.2 Explaining Visual Neural Synchronization with Fast Plasticity 126 3.4.3 Contours + Junctions, Opening a New Dimension on Visual Cortex 127 3.4.4 Model Limitations 128 3.5 Conclusion 129 Chapter 4: Using an Automatic Computation of an Image’s Surprise to Predicts Performance of Observers on a Natural Image Detection Task 130 4.1.1 Overview of Attention and Target Detection 131 4.1.2 Surprise and Attention Capture 134 4.2 Methods 136 4.2.1 Surprise in Brief 136 4.2.2 Using Surprise to Extract Image Statistics from Sequences 139 4.3 Results 144 4.4 A Neural Network Model to Predict RSVP Performance 152 4.4.1 Data Collection 153 4.4.2 Surprise Analysis 154 4.4.3 Training Using a Neural Network 154 4.4.4 Validation and Results of the Neural Network Performance 156 4.5 Discussion 164 4.5.1 Patterns of the Two-Stage Model 165 4.5.2 Information Necessity, Attention Gating and Biological Relevance 169 4.5.3 Generalization of Results 173 4.5.4 Comparison with Previous RSVP Model Prediction Work 173 4.5.5 Network Performance 174 4.5.6 
Applications of the Surprise System 175 4.6 Conclusion 176 viii Chapter 5: Modeling of Attentional Gating using Statistical Surprise 177 5.1 From Surprise to Attentional Gating 180 5.2 Methods 183 5.2.1 Paradigm 183 5.2.2 Computational Methods 184 5.3 Results 188 5.3.1 Relation of Results to Previous Studies Which Showed Causal Links between Surprise and Target Detection 193 5.4 Discussion 196 5.4.1 Variability of the Attention Gate Size Fits within the Paradigm 196 5.4.2 The Attention Gate may Account for Some Split Attention Effects 198 5.4.3 Unifying Episodic Attention Gate Models with Saliency Maps 199 Chapter 6: A Comparison of Surprise Methods and Models Using the Metric of Attention Gate (MAG) 201 6.1 The MAG Method for Comparison of Different Models 201 6.1.1 Fishers Linear Discriminant and Fitness 203 6.1.2 Data Sets Used 206 6.2 Comparison of Opponent Color Spaces using MAG 207 6.2.1 iLab RGBY 210 6.2.2 CIE Lab 211 6.2.3 iLab H2SV2 214 6.2.4 MAG Comparison of Color Spaces 214 6.3 Addition of Junction Feature Channels 216 6.4 Comparison of Different Statistical Models 217 6.5: Checking the Problem with Beta 219 6.5.1 Asymptotic Behavior of β 219 6.5.2 What Happens if We Fix the β Hyperparameter to a Constant Value? 221 6.5 Method Performance Conclusion 226 References 228 Appendices 245 Appendix A: Contour Integration Model Parameters 245 Appendix B: Mathematical Details on Surprise 246 Appendix C: Kullback-Liebler Divergences of Selected Probability Distributions 253 C.1 Conceptual Notes on the KL Distance 253 C.2 KL of the Gaussian Probability Distribution 255 C.3 KL of the Gamma Probability Distribution 255 C.4 KL of the Joint Gamma-Gaussian or Gamma-Gamma Distribution 258 Appendix D: Junction Channel Computation and Source 262 D.1 Junction Channel Source Code 264 Appendix E: RGBY and CIE Lab Color Conversion 267 E.1 RGBY Color Conversion 267 ix E.2 CIE Lab Color Conversion 268 Appendix F: HSV Color Conversion Source 273 F.1 RGB to HSV Transformation 274 F.1.1 HSV Transformation C / C++ Code 275 F.2 HSV to RGB Transformation: 278 F.2.1 RGB Transformation C/C++ Code 279 Appendix G: H2SV Color Conversion Source 281 G.1 HSV to H2SV Transformation 282 G.1.1 HSV to H2SV1 Variant 282 G.1.2 HSV to H2SV2 Variant 283 G.2 H2SV to HSV Simple Transformation 284 G.2.1 H2SV1 to HSV Simple 284 G.2.2 H2SV2 to HSV Simple 284 G.3 H2SV to HSV Robust Transformation 285 G.3.1 General Computations: 285 G.3.2 C / C++ Code for Robust Transformation 286 Appendix H: Selected Figure Graphing Commands for Mathematica 288 x List of Tables Table 2.1: Variance accounted for in ICA/PCA. 50 Table 3.1: Table of probabilities of results at random. 107 Table 3.2: Types of features found salient by CINNIC. 119 Table 4.1: M-W feature significance per type. 145 Table 6.1: MAG scores for color spaces. 213 Table 6.2: MAG scores for junction filters. 215 Table 6.3: MAG scores for statistical models. 217 Table 6.4: MAG scores for different values of beta. 223 xi List of Figures Figure 1.1: Examples of retinotopic maps of the visual cortex. 2 Figure 1.2: What does the brain find visually interesting? 4 Figure 1.3: Why the brain looks for so many types of features. 6 Figure 1.4: The increasing complexity of the visual system. 7 Figure 1.5: Examples of basic feature detectors. 11 Figure 1.6: Generations of feature based attention models. 13 Figure 1.7: Orientation features and Gabor pyramid example with Ashes. 15 Figure 1.8: Butterfly regions and contour integration example. 
16 Figure 1.9: Examples of top-down models of attention. 19 Figure 2.1: Bayesian priors and Meta Priors spectrum. 26 Figure 2.2: From features to ICA to clustering. 32 Figure 2.3: The VFAT architecture graph. 35 Figure 2.4: General saliency model graph. 36 Figure 2.5: Junction detection from INVT features with ICA. 38 Figure 2.6: Feature clustering example shown with node climbing. 40 Figure 2.7: Examples of feature clustering on different data points. 42 Figure 2.8: NPclassify compared with K-means. 43 Figure 2.9: Example of similarity by statistical overlap. 45 Figure 2.10: Example of feature output following ICA/PCA. 51 Figure 2.11: ICA inversion and color features. 53 Figure 2.12: Example of image feature clustering. 54 xii Figure 2.13: NPclassify compared quantitatively with K-means. 57 Figure 2.14: Features clustered during tracking. 58 Figure 2.15: The simple feature tracker. 59 Figure 2.16: The complex tracker handing off to simple trackers. 62 Figure 2.17: Screen shot of the VFAT based tracker. 63 Figure 3.1: The Braun Make Snake contour. 70 Figure 3.2: The basics of contour alignment and processing. 78 Figure 3.3: Neuron priming diagram. 80 Figure 3.4: Neuron group suppression in theory. 82 Figure 3.5: The basics of the CINNIC alignment and processing. 84 Figure 3.6: Hypercolumns and pseudo-convolution. 91 Figure 3.7: Breakdown of the CINNIC process. 95 Figure 3.8: CINNIC multiple scales and averaging. 96 Figure 3.9: 2AFC simulation for the Polat Sagi display. 99 Figure 3.10: Fit of CINNIC to observer AM. 101 Figure 3.11: Interaction of element size and enhancement. 103 Figure 3.12: CINNIC working on Make Snake contours. 105 Figure 3.13: Performance of CINNIC on Make Snake contours. 106 Figure 3.14: The subjective perception of contours and element separation. 108 Figure 3.15: Accounting for performance of CINNIC with kernel size. 110 Figure 3.16: CINNIC sensitivity to junctions. 113 Figure 3.17: Explaining sensitivity of junctions by CINNIC. 115 Figure 3.18: CINNIC sensitivity to salient locations and face features. 120 xiii Figure 3.19: CINNIC and fast plasticity. 127 Figure 4.1: Overview of the surprise system. 138 Figure 4.2: The surprise map over sequence frames. 141 Figure 4.3: Peaks of surprise seem predictive. 144 Figure 4.4: Mean surprise and visual features. 148 Figure 4.5: Standard deviation of surprise and visual features. 150 Figure 4.6: Spatial location of max surprise and visual features. 151 Figure 4.7: The surprise prediction system. 155 Figure 4.8: How surprise prediction was analyzed. 158 Figure 4.9: Performance of surprise prediction. 162 Figure 4.10: Theoretical aspects of surprise prediction. 171 Figure 5.1: Surprise peaks at flankers for hard targets. 179 Figure 5.2: Attention gating and the contents of working memory. 180 Figure 5.3: From RSVP to attention gate computation. 182 Figure 5.4: Computation of the attention gate. 186 Figure 5.5: Computing the overlap ratio. 189 Figure 5.6: Surprise attention gate quantitative results. 191 Figure 5.7: Subjective results on Transportation Targets. 192 Figure 5.8: Subjective results on Animal Targets. 193 Figure 5.9: Explaining past results for Easy-to-Hard. 195 Figure 5.10: Attention gating and detecting multiple targets. 199 Figure 6.1: Which of the two models is better or worse? 202 Figure 6.2: Pretty fisher information graph 205 xiv Figure 6.3: The MAG, an overview. 207 Figure 6.4: A general color space overview. 208 Figure 6.5: RGBY Color space example. 210 Figure 6.6: CIE Lab color space example. 
211 Figure 6.7: H2SV2 color space example. 212 Figure 6.8: MAG and color space results. 213 Figure 6.9: MAG and junction filter results. 215 Figure 6.10: MAG and statistical model results. 217 Figure 6.11: The asymptotic behavior of beta. 220 Figure 6.12: MAG performance for different values of beta. 223 Figure B.1: Different views on the Gamma PDF. 247 Figure B.2: Surprise in Wows! 248 Figure B.3: The DoG Filter. 251 Figure C.1: From a PDF to the integrated KL region. 254 Figure C.2: The Joint gamma-gamma KL. 257 Figure D.1: The junction filter. 262 Figure E.1: CIE 1931 XYZ color space. 269 Figure E.2: Map of the CIE Lab gamut space. 270 Figure F.1: HSV color space. 273 Figure G.1: H2SV color space. 281 xv Abbreviations AI Artificial Intelligence AIP Anterior Interparietal Sulcus AMD Advanced Micro Devices BPNN Back Propagation Neural Network CIE International Commission on Illumination CINNIC Carefully Implemented Neural Network for Integrating Contours CRT Cathode Ray Tube (monitor) DoG Difference of Gaussian EPSP Excitatory Post Synaptic Potential EQ Equation ERF Error Function ERFC Complementary Error Function fMRI Functional (Nuclear) Magnetic Resonance Imaging FS Fast Spiking GABA Gamma Aminobutyric Acid GB Gigabyte (1 billion bytes) GCC GNU C++ Compiler GIMP GNU Image Manipulation Program GNU GNU's Not Unix [sic] (An open source, free software consortium) GPL GNU General Public License xvi HSV Hue/Saturation/Value H2SV HSV Variant with two hue components H2SV2 H2SV with Red/Green Blue/Yellow opponents Hz Hertz (cycles per second) ICA Independent Component Analysis INVT iLab Neuromorphic Vision Toolkit IPSP Inhibitory Post Synaptic Potential IT Inferior Temporal Cortex KL Kullback-Liebler Divergence (sometimes called the KL distance) Lab CIE Lab Color (Luminance with two opponents, a Red/Green b Blue/Yellow) MAG Metric of Attention Gate MHz Megahertz (1,000,000 cycles per second) ms Milliseconds (1/1000 of a second) O Worst Case Asymptotic Complexity (called the big “O” notation) OpenCV Open Computer Vision (Intel Toolkit) PCA Principal Component Analysis PDF Probability Distribution Function PFC Pre-Frontal Cortex POMDP Partially Observable Markov Decision Process RAM Random Access Memory RGB Red, Green and Blue Color xvii RGBY Red/Green and Blue/Yellow Color RMSE Root Mean Squared Error RSVP Rapid Serial Vision Presentation SMA Supplementary Motor Area SQRT Square Root T Terrell TD Temporal Difference V1 Primary Visual Cortex V2 – V5 Regions of Extrastriate Cortex VFAT Vision Feature Analysis Toolkit WTA Winner Take All xviii Abstract What draws in human attention and can we create computational models of it which work the same way? Here we explore this question with several attentional models and applications of them. They are each designed to address a missing fundamental function of attention from the original saliency model designed by Itti and Koch. These include temporal based attention and attention from non-classical feature interactions. Additionally, attention is utilized in an applied setting for the purposes of video tracking. Attention for non-classical feature interactions is handled by a model called CINNIC. It faithfully implements a model of contour integration in visual cortex. It is able to integrate illusory contours of unconnected elements such that the contours “pop-out” as they are supposed to and matches in behavior the performance of human observers. Temporal attention is discussed in the context of an implementation and extensions to a model of surprise. 
We show that surprise predicts well subject performance on natural image Rapid Serial Vision Presentation (RSVP) and gives us a good idea of how an attention gate works in the human visual cortex. The attention gate derived from surprise also gives us a good idea of how visual information is passed to further processing in later stages of the human brain. It is also discussed how to extend the model of surprise using a Metric of Attention Gating (MAG) as a baseline for model performance. This allows us to find different model components and parameters which better explain the attentional blink in RSVP. xix Preface About this thesis This thesis is about the computational modeling of visual attention and surprise. The aspects that will be covered in this work include: • Utilization of the computation of attention in engineering. • Extensions to the computational model of attention and surprise. • Explaining human visual attention and cognition from simulation using computational models. This work is integrative and based on the philosophy that computer vision is aided by better understanding of the human brain and it’s already developed exquisite mechanisms for dealing with the visual world as we know it. At the same time, development of biologically inspired computer vision techniques, when done correctly, yields insight into the theoretical workings of the human brain. Thus, the integration of engineering, neuroscience and cognitive science gives rise to useful synergy. The second chapter covers the utilization of saliency as an engineering topic. This is an example of applying what we have learned from the human brain towards an engineering goal pursued with real world applications in mind. It is somewhat more applied and as a result, many components are not biologically motivated. The reader should keep in mind that project goals placed constraints on what can be done. In this xx case, a real time system able to process images very quickly was needed. Additionally, the project as is typical for engineering endeavors required “deliverables”. Chapters three and six cover methods for extending or changing the way in which surprise is computed. In the case of the former, a model of contour integration is created and examined. This allowed the creation of an extension to the basic saliency model for non-local interactions. Its primary contribution however turned out to be gainful knowledge of the human visual mechanisms involved. The fourth and fifth chapters deal with temporal dimensions of attention using surprise. The goals are to test and extend the model to see if predictions can be made of observer performance. Thus, it is suggested that a better fit model, which is improved in its ability to predict human performance, is closer to the actual mechanisms which the human brain uses. This also has reciprocal engineering applications since it can be used to help determine what humans will attend to in a dynamic scene. Graduate works not included in this thesis I have tried to keep all work included in this document constrained to the topic of visual attention and to work with salient results. As such, much of the work I have done in pursuit of my doctorate is not included. 
These works include, but are not limited to (in chronological order): • The Beobot Project (Mundhenk, Ackerman, Chung, Dhavale, Hudson, Hirata, Pichon, Shi, Tsui & Itti, 2003a) • Schizophrenia and the Mirror Neuron System (Arbib & Mundhenk, 2005) • Estimation of missing data in acoustic samples (Mundhenk, 2005) xxi • Surprise Reduction and Control (Mundhenk & Itti, 2006) • Three Dimensional Saliency (Mundhenk & Itti, 2007) Of interest in particular is the work on Schizophrenia and Mirror Neuron system which has been cited 45 times according to Google scholar. Also of interest is the Beobot project paper which was the most downloaded paper from iLab for three years straight, and it is still in the top five downloads to this day. Other works of interest not included in this thesis Also not included is the large amount of educational materials created and posted online. These include: • http://www.cool-ai.com –AI homeworks, projects and lecture notes for usage in AI courses. • Wikipedia and Wikiversity – contributions including: o http://en.wikiversity.org/wiki/Learning_and_Neural_Networks - Created self-guided teaching page on Neural Networks. o http://en.wikipedia.org/wiki/Cicadidae - Contributed Wikipedia featured picture of the day and written content. o http://en.wikipedia.org/wiki/Gamma_distribution - contributed graphics and corrections. o http://en.wikipedia.org/wiki/Kullback-Leibler_divergence - contributed graphics and corrections. o http://en.wikipedia.org/wiki/Methods_of_computing_square_roots - Added algorithms and analysis. xxii • http://www.cinnic.org/CINNIC-primer.htm – Contour Integration Primer. • http://www.nerd-cam.com/how-to/ - Detailed Instructions on how to build your own robotic camera. Don’t read the whole thesis This thesis uses the standard “stapled papers” framework. While each chapter has been integrated into a coherent work, they each will stand on their own. As a result, the reader is advised to get what they want and get out. That is, go ahead and read a chapter which interests you, but don’t bother to read other parts. However, there tends to be more information here than in the authors papers cited. As such, this thesis may be of use in getting some of the model details not covered in the authors published materials due to space constraints in peer reviewed journals. Have fun T. Nathan Mundhenk 1 Chapter 1: A Brief Introduction to Vision and Attention You got to work today without running over any pedestrians. How did you do that? To be sure this is a good thing. You can pick up items without even thinking about it; you can thumb through a magazine until you get to a favorite advertisement; You can tell a shoe from a phone and you can tell if that giant grizzly bear is in fact gunning for your ice cream cone. You do all sorts of things like this every day and frequently they seem utterly simple. To be certain, sometimes you cannot find your keys to save your life, but even while searching for them, you don’t bang into the furniture in your apartment, at least to too much. How did you do this? I ask, because like just about every person on earth, I’m not totally sure. OK, true, you’ll be glad to know I have some ideas. However, the pages that follow will only scratch the surface of how human beings such as ourselves view the world. To this day, much of human vision still remains a mystery. However, many things about human vision are well established. For instance, we do in fact see from our eyes and the information from them does travel to our brain. 
The brain itself is where what we see is processed and it turns out that its job it not merely to cool our blood as Aristotle believed it to be. However, there is a place between seeing and understanding which resides within human brain itself, and how it takes the items in the world and places them into your mind is a complicated story. In this work, we will focus on an important part of this process, the notion of selection and attention. The idea as it were is that not everything presented to our eyes makes its way from the retina in the eyes to the seat of 2 consciousness. Instead, it seems that most of what we perceive is just a fraction of what we could. The brain is picky, and it only selects some things to present to us, but many other things simply fade from being. Figure 1.1: Retinotopy has been demonstrated repeatedly over the years in the visual cortex. Thus, its existence is well founded. An early example is given by Gordon Holmes who studied brain injuries in solders after the first world war (Holmes, 1917, Holmes, 1945) and traced visual deficits to specific injury cites in visual cortex. Then with primate (Macaque) cortex experiments using Deoxyglucose (Tootell, Silverman, Switkes & De Valois, 1982) it was shown that a pattern activated a region of visual cortex with the same shape. However, this method was limited due to the fact that the animal had to be sacrificed immediately after viewing the pattern in order to reveal and photograph the pattern on the cortex. Later in 2003, with fMRI using sophisticated moving target displays, (Dougherty, Koch, Brewer, Fischer, Modersitzki & Wandell, 2003) regions in the human brain were shown to correspond to locations in the visual cortex in much the same way. However, fMRI allows observation in healthy human volunteers, which is a distinct advantage since more advanced experiments such as those involving motion can be conducted. What then does the brain do to select the things it wants to see? One could suppose that a magic elf sits in a black box in the brain with a black magic marker 3 looking at photos of the world sent to it by the eyes. The elf inspects each photo and decides if it’s something it believes you should see. Otherwise it marks it with an ‘x’ which means that another magic elf should throw the image away. The idea of magic elves as a brain process is intriguing, however the evidence does not bear it out. Then again, the brain is in some sense a black box. Thus, while we do not think that magic elves are the basis for cognition, we still must make inferences about the brains basic workings from a variety of frequently indirect evidence. For instance, we can probe the brain of other primates. In figure 1.1 it is shown that we know that the visual cortex receives information from the eyes in retinotopic coordinates. We know this from experiments on primates where briefly flashed visual patterns caused a similar pattern to form on the visual cortex (Inouye, 1909, Holmes, 1917, Holmes, 1945, Tootell et al., 1982). Does the same thing happen in the human brain? The general consensus is yes, many pieces of visual information from the eye line up on the back of the brain somewhat like a movie projecting onto a screen. Newer studies with functional magnetic resonance imaging (fMRI) on humans reinforces this idea (Horton & Hoyt, 1991, Hadjikhani, Liu, Dale, Cavanagh & Tootell, 1998, Dougherty et al., 2003, Whitney, Goltz, Thomas, Gati, Menon & Goodale, 2003). Still, the evidence is indirect. 
No one has seen the movie on the back of the brain, but fortunately, the evidence is satisfying. Retinotopy in the visual cortex is an example of something which is well founded even if the evidence is sometimes indirect. However, do we have such a good notion about how the brain selects what it wants to see from input coming from the eyes? It turns out sort of, but not completely. However, this is not without good reason. What 4 captures ones attention is quite complex (Shiffrin & Schneider, 1977, Treisman & Gormican, 1988, Mack & Rock, 1998). So for instance, things which are colorful tend to get through the brain much easier than things which are dull. This is for instance why stop signs are red and not gray. This is also why poisonous snakes or monarch butterflies (which are also poisonous) have such vivid colors. Interestingly, it is not just the colors which attract our attention it is how the colors interact. For instance, something which is blue attracts more attention when it is next to something yellow while something red tends to get more attention when it is next to something green. So it’s not just the color of something that makes it more salient, it’s how the colors interact as opponents. Figure 1.2: What does the brain find visually interesting? There are many things (from left to right). Good continuation of objects which form a larger percept is interesting. Conspicuous colors, particularly the opponents red/green and blue/yellow stand out. Objects with unique features and special asymmetries (Treisman & Souther, 1985) compared with surrounding objects can stand out. Also motion is a very important cue. Ok, seems pretty simple, but that was just one piece of a rather gigantic puzzle. Just a sampling of what is visually interesting is shown in figure 1.2. It turns out that edges, bright patches, things which are moving and things which are lined up like cars in traffic and … well many things all can attract your attention and control what it is that your brain deems interesting. Still it gets even more complex, your brain itself can decide to change the game and shift the priority on certain visual features. As an example, if you 5 are looking for a red checker, your brain could decide to turn up the juice on the red color channel. That is, your brain can from the top-down change the importance of visual items making some things which were less interesting more interesting and vice versa (Shiffrin & Schneider, 1977, Wolfe, 1994a, Navalpakkam & Itti, 2007, Olivers & Meeter, 2008). So just on the front, we can see that the notion of visual attention and what gets from the retina in the eyes to the seat of thought is quite complex. It involves a great deal of things which interact in rather complex and puzzling ways. However, as mentioned we do know many things, and we are discovering new properties every day. Hopefully this work will help to illuminate some of the processes by which the visual world can pass through brain into the realm of thought. 1.1 What Does our Brain Want to Look For? Imagine that the world was not in color. Further, imagine that all you could see was the outlines of the stuff that makes up the world. You would still need to move around without tipping over chairs and be able to eat and recognize food. What then would draw your attention? You can still tell how to identify many things. After all, it is the world of lines which makes up the Sunday comic strips. 
You might not be able to tell something’s apart which you could back in our colorful world, but for the most part you could tell a table from a chair or an apple from a snake. In this case, what would your brain look for? 6 Figure 1.3: Why does the brain look for so many different types of features? It depends on what it needs to find. Some images are defined by lines, others by colors and some by the arrangement of objects. All of the images shown are interpretable even though typical natural scene information of one type or another seems lacking. Shown from Top Left: Atari’s Superman, Picaso’s La Femme Qui Pleure, Gary Larson’s Far Side; Bottom Left: Liquid Television’s Stick Figure Theater, Van Gogh’s Starry Night Over the Rhone. In basic terms, what your brain wants to look for is information. Figure 1.3 shows several different scenes which one can interpret even though the information is presented very differently with typical information components such as color, lines or texture missing. As will be reviewed later, images are comprised of features, which are the essential bits of information for an image. These can include all of the above as well as more complex features such as junctions and motion. Not all features are necessary for object identification. A typical example is that people were able to enjoy television before it was in color. 7 Figure 1.4: (Left) Features gain increasing complexity and their responses become more and more task dependent. Additionally, visual information is sent down multiple pathways for different kinds of processing (Fagg & Arbib, 1998). Here the task of grasping a mug will prime features related to a mug top- down. These features in turn will be processed in different ways depending on whether we are trying to identify the mug (Ventral: What) or if we are trying to understand its affordances (Dorsal: How) (Ingle, Schneider, Trevathen & Held, 1967). How the brain splits visual information in this way and then reintegrates it, is still not completely understood. 1 (Right) The connection diagram of Felleman and Van Essen (Felleman & Van Essen, 1991) of the primate visual cortex demonstrates that elegant models such as the one by Fagg and Arbib still only scratch the surface of the full complexity of the workings of the brain. In addition to the essential bits of an image which are important, what the brain wants to see is also based on the task at hand. Figure 1.4 illustrates a model for the task of grasping an object (Fagg & Arbib, 1998). Initially the object to be grasped must be spotted. If a person has some idea of what they are looking for, then they can attempt to try and focus their attention towards something that matches the expected features of the object. For instance, if the object to be grasped is a red mug, then the initial search for it should bias one to look for red and round things. Such a bias becomes even more 1 This is a reconceptualiztion of the original Fagg & Arbib figure which appears in: [44] Fellous, J.-M., & Arbib, M.A. (2005). Who Needs Emotions? The Brain Meets the Robot. Oxford: Oxford University Press. 8 important in a cluttered scene where many simple salient items may be a distraction. Otherwise, finding a red mug in a plain white room would be more simple. Once the object has been spotted, appropriate features must be further extracted such as geometric land marks (Biederman, 1987, Biederman & Cooper, 1991). So the brain will need to find essential characteristics of the object for the task. 
In this case, we want to grasp or pick up the object. If a portrait of Weird Al Yankovic is painted on the side of the mug, it might grab our attention, but it is unimportant for the task of acquiring the mug. Instead, we should ignore the portrait and just scan the geometry. The task might be entirely different if we had another action we wanted to execute. For instance, if someone asks us whose face is on the mug, we would want to scan for face like features and perhaps ignore the geometric properties completely. In the mug example, we can imagine that many other factors might come into play. For instance the scene might change unexpectedly. As an example, our clumsy relative might have knocked over the mug. This sudden change in the scene would come as a surprise and should initiate a change in attention priorities. If the coffee is flowing towards my notebook computer I should notice that as soon as possible. Then I should perhaps cancel my grasping action and search for paper towels or maybe make a grasp for my computer. The brain also sometimes has very little choice in what it looks for. Some things are highly salient such as a stop sign or an attractive person. It can be hard to override the innate bottom-up search system at times. Thus, many things are attended to fairly quickly and automatically. This is a rather important trait, a rock hurling towards you at great speed demands your attention more than a cup of coffee. As such, we can see that what 9 the brain wants to see also depends on automatic bottom-up systems which can preempt current task demands. 1.2 How Does our Brain Search for What it Wants? 1.2.1 What’s a Feature? What the brain wants to see is based on what is useful for it to see. Early on, after the invention of photography in the 19 th century, many artists began to rethink what it was that they were doing. Up until then, artists created the essence of photographs with a paint brush, but since a machine could do the same thing faster and cheaper, direct photographic style artistry seemed like it would become archaic. This helped to bring about the Impressionist style of art. What is notable to our discussion is that artists began to experiment with imagery where fundamental features of a painting could be altered, but the scene could still be interpreted. As structure and form of art was changed and experimented with, it became more obvious that the brain did not need a direct photograph of a scene in order to understand it. Instead, it merely needed some form of structure which resembled the original scene. Partially as a result of this new way of looking at the world, early 20 th century cognitive scientists began to think about how objects and components of an image could be linked together to create something which the brain could understand. Both Structuralists such as William James (James, 1890) and in particular Gestalt psychologists such as Max Wertheimer (Wertheimer, 1923) and Kurt Koffka (Koffka, 1935) began to think about how the brain can take in parts of a scene and assemble them 10 into something the brain understands. They believed that perception was a sum of the parts, but at the time they lacked the scientific abilities to prove their ideas. That the visual world was composed of parts which the brain assembles had been proposed. However, what these parts looked like or what form they took was far from certain. Several theories came forward over the years to refine what kind of parts the brain uses to create the whole. 
A popular term for the elementary parts of an image was features. Several scientists in the 1950’s such as Gibson (Gibson, 1950), Barlow (Barlow, 1961) and Attneave (Attneave, 1954) began to note that prior information about shapes, line and textures could be collected and used to interpret abstracted scenes statistically. In particular, Fred Attneave proposed that much of the visual world is redundant and unnecessary for the task of recognition. A cat for instance could be represented as points (or perhaps better as junctions) which are connected by the brain to form the perception of a cat. Under this assumption, a large mass of visual information presented to the retinal, for instance all the parts of the image which are not junctions are extraneous. Partially as a result of such assertions, several theories were put forward claiming that there should be a bottleneck in attention (Broadbent, 1958, Deutsch & Deutsch, 1963). As such, the picture of the visual world was still hazy, but several theories were now giving an idea of how the brain sees the world and what it wants to find. First, the brain compiles images from parts to create a whole. Second, features of an image as simple as points, lines, textures or junctions scattered about a scene may be sufficient in order for the brain to understand an image, but that there may be limits on how much the brain can process at one time. However, several questions remained. First, what kind of features is 11 the brain looking for and second how does the brain look for and process these features keeping in mind that it has some limitations on capacity? Figure 1.5: (Left) Early visual processing by the brain looks for simple features. For instance the retinal begins by sorting out color opponents such as red/green and blue/yellow (Kuffler, 1953, Meyer, 1976). While the understanding of the center surround mechanism is somewhat recent, knowledge of the arrangement of color opponents is very old and its theory can be traced at least as far back as to the German physiologist Ewald Hering in 1872 (Hering, 1872) but was first described physiologically in the goldfish (Daw, 1967). We can simulate these mechanisms using the filters shown. Here we see DoG (Difference of Gaussian) filters which give the On Center / Off Surround response (von Békésy, 1967, Henn & Grüsser, 1968) for colors (Luschow & Nothdurft, 1993). (Right) Later, the visual cortex utilizes hyper columns (Hubel & Wiesel, 1974) to find lines in an image. We can use wavelets like the one on the right to give a response to lines in an image (Leventhal, 1991). The type of wavelets used are typically called Gabor wavelets in honor of the Hungarian engineer Gábor Dénes (Dennis Gabor). (Bottom) The bottom row shows a cross section of the filters on the top. The answers to these questions began to congeal with the development of improved psychometric instrumentation in the 1960s that could better time and control the reaction of human subjects with a wide variety of stimulus. [For instance see (Sperling, 1960, Raab, 1963, Sperling, 1965, Weisstein & Haber, 1965)]. This was accompanied by improved psychophysical instrumentation capable of direct 12 measurement of neural activity in animals [For instance (Daw, 1968, Henn & Grüsser, 1968)]. By the 1970’s combined with the seminal work by David Hubel and Torsten Wiesel (Hubel & Weisel, 1977) we were starting to get a pretty good idea of what kind of elementary features the brain is looking for. 
In figure 1.5 we see some of the features which we knew the brain to be sensitive to by the mid 1970’s. The brain has literal detectors for lines and color opponents such as red/green and blue/yellow. It should be noted however, that this is still the beginning of the story. We knew that there was a set of simple features which the visual cortex would pick up on, but there was no idea how these features could be assembled into larger objects. Additionally, were there more features or was this the full basis set? 1.2.2 How do we Integrate These Features? By the 1970’s two important concepts were beginning to emerge. One was the notion of focused attention. That is, if Attneave and his contemporaries are correct, the brain might be wise to only spend time processing parts of a scene and not the whole thing. Second, features such as lines and colors integrate and bind in the brain. For instance, it had been known since the 1930’s that the brain can bind colors and words. John Stroop (Stroop, 1935) showed that by flashing a word such as “blue” but coloring it red tended to trip up and slow down observers when asked to name it. Would such a mechanism also apply at the level of feature integration? 13 Figure 1.6: Three generations of models of feature based attention are shown in succession. Starting with Treisman, Gelade & Gormican (Treisman & Gelade, 1980, Treisman & Gormican, 1988) 2 it was hypothesized that the way visual features such as lines and colors integrate in parallel controls the serial components of attention. This model itself is a refinement of earlier theories of attention, for instance Shiffrin and Schneiders theory of automatic and controlled attention (Shiffrin & Schneider, 1977) and the pre-attentive and focal attention model of Neisser (Neisser, 1967). Later Koch and Ullman (Koch & Ullman, 1985) expanded this with the notion of having a saliency map which controls the spotlight of attention with a winner-take-all network. Following this, it was made into a fully functional computational model by Itti and Koch (Itti & Koch, 2001b). Several theoretical constructs were advanced and lead to increasing understanding on the question of attention (Figure 1.6). It was discovered that attention seems to be focal and that only parts of an image actually reach what many people would call consciousness. In 1967, this hypothesis was put forward by Ulric Neisser (Neisser, 1967) who suggested that there was a pre-attentive phase to visual processing when features were gathered together in parallel, but that later the features combined and were inspected serially by focal attention. This was further expanded by Richard Shiffrin and Walter Schneider (Shiffrin & Schneider, 1977) who saw a second dimension to attention. They suggested that some parts of attention are automatic and some parts are controlled. That 2 This drawing is from Treisman and Gormican 1988. It is based on the feature integration theory given in Treisman and Gelade 1980. However, Treisman and Souther 1985 gives a very similar figure. 14 is, some features in an image grab our attention automatically and almost reflexively. However, we are also consciously able control some things which we attend to. This is what is now thought of in broader terms as bottom-up and top-down attention. In 1980, Anne Treisman and Gerry Gelade further refined these ideas into a Feature Integration theory of attention (Treisman & Gelade, 1980). 
There idea was that the parallel computation of Neisser could be split into different features which could be processed separately in the pre-attentive stage and then brought together. Thus, the brain would compute its interest in colors, lines and intensities at the same time and that it is the sum integration of different features which determines the locus of attention. That is, attention is driven simultaneously be each type of feature, but the conjunction or independent dominance of a feature can draw in attention. However, the question was left open as to how the features could combine to create a master map of attention. A possible answer was given by Christof Koch and Shimon Ullman (Koch & Ullman, 1985) who gave the idea that the brain maintained a saliency map for the visual world and that a max selector processes (Didday, 1976, Amari & Arbib, 1977) would refine the saliency map so that only a single location in the visual field would stick out. This allowed for many things in the world to be salient at the same time, but suggested that the most salient item of all is that one which the brain will attend to. The theories of attention put forward by Treisman et al as well as Koch and Ullman gained further support over the next decade due to a variety of experimental results [For examples see (Nothdurft, 1991b, Nothdurft, 1991a, Nothdurft, 1992, Luschow & Nothdurft, 1993)]. In 1998 Laurent Itti, Christof Koch and Ernst Niebur further refined the model of Koch and Ullman and created a comprehensive 15 computational model that allowed direct testing of it (Itti, Koch & Niebur, 1998). It also included a comprehensive set of feature detectors as well as a Gaussian/Laplacian pyramid to detect features at many different scales (Figure 1.7). Figure 1.7: Gabor wavelet filters give a response to lines in an image. One way to do this is to create four or more wavelet filters each with its own directional orientation (Itti et al., 1998). On the left this can be seen as filters sensitive to lines are 0, 45, 90 and 135 degrees. On the right is an image which has been convolved by the filters at 0 and 90 degrees and the lines that were extracted by the filters. Since lines have different sizes we can convolve each image at a different scale to increase our chances of discovering lines of different widths (Tanimoto & Pavlidis, 1975, Burt & Adelson, 1983, Greenspan, Belongie, Goodman, Perona, Rakshit & Anderson, 1994) 3 . The essential gain was that the computer could be treated like a brain in a box. If the model of Koch and Ullman was correct, then a comprehensive computational model should have parity with the behavior of humans. Initial results showed that the saliency 3 The cats name is Ashes. 16 Figure 1.8: (Top Row) Features that the brain is looking for get increasingly complex. This happens frequently when simpler features are combined to create new ones (Field, Hayes & Hess, 1993, Kovács & Julesz, 1993, Polat & Sagi, 1994, Gilbert, Das, Ito, Kapadia & Westheimer, 1996, Li, 1998, Mundhenk & Itti, 2005). For instance, line fragments which Gabor filters pick up on can then be connected in a corresponding zone which completes contours. The butterfly pattern on the left will complete a contour when line fragments lie in the green zone and are aligned. This can be seen on the right where three co- linearly aligned fragments enhance each other to give a larger response. The graph is somewhat crude, but the point is that the more elements that are aligned, the stronger the response. 
(Bottom Row) The elements aligned into a circle on the left are much more salient than random elements (Kovács & Julesz, 1993, Braun, 1999). They should produce an activation pattern like the one on the right (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005).This is discussed at length in chapter 3. 17 model behaved in a manner that was expected (Itti & Koch, 2001b). The computational saliency model was able to detect many odd-man-out features, search asymmetries and conditions for pop-out that would be expected of human observers. Additionally, the model could be augmented to included top-down attentional effects (Itti, 2000) by adjusting features weights in a manner similar to the mechanism proposed 25 years earlier for directed attention by Shiffrin and Schneider (Shiffrin & Schneider, 1977). Thus, for instance, when looking for a red Coke can, it is almost a simple matter to weight the red feature more during search. 1.2.3 Beyond the Basic Saliency Model The original saliency model of Itti and Koch lacked three components. One was the interaction of non-local features. Thus, as can be seen in figure 1.8, contours and line segments which extend past the classic receptive fields of the basic feature detectors have been found to be salient (Kovács & Julesz, 1993, Polat & Sagi, 1993b, Gilbert et al., 1996, Braun, 1999, Geisler, Perry, Super & Gallogly, 2001). The second element missing was temporal attention. This itself is comprised of three components which may or may not be independent of each other. They are motion, change and masking. Thus, things which are in motion tend to draw our attention. However simple changes such as the appearance or disappearance of an element in a video can draw or attention as well (Mack & Rock, 1998). The third element of temporal attention, masking, has been studied quite extensively (Breitmeyer & Ö ğmen, 2006). It is where something at one instance in a sequence of images is blocked from perception by something spatially proximal that comes before or after it. It includes both backwards and forwards masking, 18 the attentional blink (Raymond, Shapiro & Arnell, 1992) and both automatic and controlled mechanisms (Sperling & Weichselgartner, 1995, Olivers & Meeter, 2008). Further, the temporal components of attention are hypothesized to be comprised of more than one processing stage (Chun & Potter, 1995). The third element, top-down attention has been partially implemented since the original model was incepted (Itti, 2000, Navalpakkam & Itti, 2005). However, a complete model of top-down attention is probably many years away since it requires construction of the “top” component which may include consciousness itself. A non-local extension to the saliency model was eventually provided by T Nathan Mundhenk (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005) and was extensively tested. This is covered in chapter 3. The extensions to temporal saliency are covered in chapters 2, 4, 5 and 6. They include extensions by the addition of a motion channel in chapter 2 (Mundhenk, Landauer, Bellman, Arbib & Itti, 2004b, Mundhenk, Navalpakkam, Makaliwe, Vasudevan & Itti, 2004c, Mundhenk, Everist, Landauer, Itti & Bellman, 2005a) and extension by the usage of Bayesian Surprise in chapters 4, 5 and 6 (Itti & Baldi, 2005, Einhäuser, Mundhenk, Baldi, Koch & Itti, 2007b, Mundhenk, Einhäuser & Itti, 2009). 
1.3 The Current State of Attention and Other Models Many contemporary models of attention are designed to address one or more of the shortcomings of the original saliency model discussed in the last section, while many are attempts at general improvements or are different models altogether. 19 1.3.1 Top-Down Models Modeling the factors of top-down v. bottom-up attention goes back very far. As can be seen in figure 1.9 an early model was provided by Shiffrin and Schneider, but that model lacked a good notion of feature integration as well as an attentional map. Jeremy Wolfe (Wolfe, 1994a) provided a good synthesis of the model of Shiffrin and Schneider with the model of Koch and Ullman. Thus, the affects of top-down controll were merged with a feature integration attention model which also included an attention map. However, this is an example of a static scene top-down model. That is, prior knowledge is integrated as a top-down mechanism, but not necessarily online. Current extensions of this model include the integration of task influence (Navalpakkam & Itti, 2005) as well as an explanation of feature tuning (Navalpakkam & Itti, 2007). Figure 1.9: (Left) An early example of an attention model with top-down guided search activation is the attention model of Shiffrin and Schneider (Shiffrin & Schneider, 1977). Here automatic parallel processing layers that compute attention can be controlled by a more serialized attention director. (Right) The model by Wolfe (Wolfe, 1994a) is conceptually a synthesis of Shiffrin & Schneider with Koch and Ullman (Koch & Ullman, 1985). That is, it has added feature integration and a saliency map. 20 Many other models which integrate top-down attention are concerned with online handling of features as well as task demands. Sperling et al (Sperling, Reeves, Blaser, Lu & Weichselgartner, 2001) has provided one such model with a gamma shaped window function of attention. Task it treated as a spatial cue to certain locations allowing a “Quantal” discrete attention window to be opened at that location for a certain amount of time. It also includes bottom-up attention using the original term “automatic” attention. However, like with the model of Wolfe, it has not been nearly as completely implemented as the Itti and Koch model. One might consider it a partial implementation in comparison. A recent and important contribution to the modeling of top-down attention is provided by Olivers and Meeter. This is known as the Boost and Bounce theory of attention (Olivers & Meeter, 2008). In many ways it is an extension of Sperling et al, but it has more explicit handling of features as well as an improved description of the interaction of frontal cortical mechanisms with visual cortical processing. Again, however, the implementation is very computationally limited. 1.3.2 Other Contemporary Models of Saliency Currently there are a variety of other attention models in existence. Some are variants of the model of Itti and Koch (Frintrop, 2006, Itti & Baldi, 2006, Gao, Mahadevan & Vasconcelos, 2008) while others are more unique (Cave, 1999, Li, 2002, Bruce & Tsotsos, 2006). The model by Simone Frintrop is known as VOCUS. Its goal is to use models of saliency to improve computer vision search. It implements top-down task improvements in a manner similar to Itti and Koch, but adds a top-down 21 excitation/inhibition mechanism. It also uses the CIE Lab (McLaren, 1976) color space for color opponents and implements a form of 3D saliency for laser range finders. 
Dashan Gao et al (Gao et al., 2008) have implemented an interesting variation on Itti and Koch which is to change the treatment of center surround interactions. The center surround response is termed “Discriminant” center surround because it forms a center surround response based on the strength of a linear discriminant. The more crisp the discrimination of the center of a location is when compared with its surround, the stronger a response is given at that location. However, this is a mechanism very similar to the way the model of Surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) computes spatial attention. The model by Bruce and Tsotsos (Bruce & Tsotsos, 2006) is an information maximization model. It works by taking in a series of images and forming a bases set of features. The bases set is then used to convolve an image. The response to each basis feature is competed against the basis features from all other patches. Thus, if a basis feature gives a unique response at an image location, it is considered salient. The most notable difference with this model compared with Itti and Koch is the derivation of basis features from prior images similar to Olshausen and Field (Olshausen & Field, 1996). However, the rectification using a neural network may compute competition in a way which is not sufficiently different from a WTA competition, but it may be arguably more biologically plausible. The model by Li is much more different. Li’s model (Li, 2002) is strongly model theoretic and somewhat neglects the task of image processing. However, it is claimed that it can provide saliency pre-attentively without the use of separate feature saliency maps. 22 Thus, the model should compute a singular saliency from combining features responses at the same time. This may be a more plausible method for computing saliency, but it is unclear if it functionally gains much over other models of saliency. 1.3.3 The Surprise Model There are two notable trends with saliency models. One is the emergence of information theoretic constructs and the other is the continued divergence between static saliency models and dynamic models of attention. With the recent exception of Gao (Gao et al., 2008) attention models were either static feature based models or dynamic, but primarily theoretical models (Sperling et al., 2001). The introduction of Surprise based attention (Itti & Baldi, 2005, Itti & Baldi, 2006) created for the first time a statistically sound and dynamic model of attention. In chapter 4, we will introduce surprise based attention and show that it does an excellent job of taking into account dynamic attentional effects seen in rapid serial vision experiments. This is then shown to give a good framework for a short term attention gate mechanism in chapter 5. In short, the new framework has some similarities to Bruce and Tsostos in that prior images are used to create belief about new images. However, surprise computes these beliefs online. This means that it does not need to be trained or have strong prior information about feature prevalence. Instead the sequence provides the needed information. The extensive testing and validation in chapters 4-6 also demonstrate firmly that it explains many temporal attention effects. Additionally, we postulate that we have gained further insight into the attentional window into the brain. 
23 Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency Using Vision Feature Analysis Toolkit (VFAT) 4 In a prior project, we developed a multi agent system for noticing and tracking different visual targets in a room. This was known as the iRoom project. Several aspects of this system included both individual noticing and acquisition of unknown targets as well as sharing that information with other tracking agents (Mundhenk et al., 2003a, Mundhenk, Dhavale, Marmol, Calleja, Navalpakkam, Bellman, Landauer, Arbib & Itti, 2003b). This chapter is primarily concerned with a combined tracker that uses the saliency of targets to notice them. It then classifies them without strong prior knowledge (priors) of their visual feature, and passes that information about the targets to a tracker, which conversely requires prior information about features in order to track them. This combination of trackers allows us to find unknown, but interesting objects in a scene and classify them well enough to track them. Additionally, information gathered can be placed into a signature about objects being tracked and shared with other camera agents. The signature that can be passed is helpful for many reasons since it can bias other agents towards a shared target as well as help in creating task dependant tracking. 2.1.1 Vision, Tracking and Prior Information For most target acquisition and tracking purposes, prior information about the targets features is needed in order for the tracker to perform its task. For instance, a basic color tracker that tracks objects based on color needs to know a priori what the color of 4 For more information see also: http://ilab.usc.edu/wiki/index.php/VFAT_Tech_Doc 24 the target that it wishes to track is. If one is going to track a flying grape fruit, then one would set a tracker with a certain color of yellow and some threshold about which the color can vary. In general, many newer trackers use statistical information about an objects features which allows one to define seemingly more natural boundaries for what features one would expect to find on a target (Lowe, 1999, Mundhenk et al., 2004b, Mundhenk et al., 2004c, Mundhenk et al., 2005a, Siagian & Itti, 2007). However, in order to deploy such a tracker, one needs to find the features, which describe the object before tracking it. This creates two interesting problems. The first problem is that the set of training examples may be insufficient to describe the real world domain of an object. That is, the trainer leaves out examples from training data, which may hold important information about certain variants of an object. We might think for instance from our flying grapefruit tracking example, that of the fruits that fly by, oranges never do. As a result, we would unknowingly let our tracker have some leeway and track grapefruit that might even be orange in appearance. It might however turn out that we were wrong. At some point, an orange flies by and our tracker tracks it the same as a flying grapefruit. This can happen for several reasons, the first is that we had never observed an orange fly by and as such didn’t realize that indeed, they can fly by. Another reason is that the world changed. When we set up the tracker, only grapefruits could fly by. However, the thing that makes them fly, now acts on oranges, which may be an accidental change, for instance if an orange tree begins to grow in our flying grapefruit orchard. 
However, it might also be the case that someone has decided to start throwing oranges in front of our tracker. As such, the domain of trackable objects can change either accidentally or 25 intentionally. In such a case, our tracker may now erroneously tracks flying oranges as flying grapefruit. As can be seen from our first example, our tracker might fail if someone tries to fool it. Someone starts throwing oranges in front of our tracker, or perhaps they might wrap our grapefruits in a red wrapper so that our tracker thinks they are apples. If we are selling our flying grapefruits and our tracker is supposed to make sure each one makes it to a shipping crate, it would fail if someone sneaks them by as another fruit. As such, once a dishonest person learns what our tracker is looking for, it becomes much easier to fool. This is seen in the real world in security applications, such as Spam filtering, where many email security companies have to update information on what constitutes Spam on a regular bases to deal with spammers who learn simple ways around the filters. It should be expected that the same problem would go for any other security related application including a vision-based tracker. In the case of our flying grapefruit tracker, its function may not be explicitly security related, but as a device related to accounting, it is prone to tampering. What is needed then for vision based tracking is the ability to be able to define its own priors. It has been proposed that gestalt rules of continuity and motion allow visual information to be learned without necessarily needing prior information about what features individual objects possess (Von der Malsberg, 1981, Prodöhl, Würtz & von der Malsberg, 2003, Mundhenk et al., 2004b, Mundhenk & Itti, 2005). That is, the human visual system does not necessarily know what it is looking for, but it knows how to learn how to look. This itself constitutes a kind of prior information which one might consider meta-prior information. That is, information about what structure or meta-model is 26 needed to gather prior information, such as Bayesian information, is itself a type of prior information. Using meta-prior information, an artificial agent might learn on its own how to form groups that can be used to create statistical relationships and build new prior information about what it wishes to track. Thus, abstractly speaking, meta-priors are concerned with learning about how to learn. 2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors Figure 2.1: It is interesting to note how different AI solutions require different amounts of prior information in order to function. Additionally, it seems that the more prior information a solution requires the more certainty it has in its results, but the more biased it becomes towards those results. Thus, we can place solutions along a spectrum based on the prior information required. Popular solutions such as Back Propagation Neural Networks and Support Vector Machines seem to fall in the middle of the spectrum in essence making them green machines and earning them the reputation of being the 2 nd best solution for every problem. We propose that meta-priors are part of a spectrum of knowledge acquisition and understanding. At one end of the spectrum, are the rigid rules of logic and induction from which decisions are drawn with great certainty, but with which unknown variables must be sparse enough to make those reasonable decisions (figure 2.1). 
In the middle we place 27 more traditional statistical methods, which either require what we will define as strong meta-priors in order to work or require Bayesian priors. We place the statistical machines in the middle, since they allow for error and random elements as part of probabilities and do not need to know everything about a target. Instead, they need to understand the variance of information and draw decisions about what should be expected. Typically, this is gifted to a statistical learner in the form of a kernel or graph. Alternatively, the meta-prior does not make an inference about knowledge itself, but instead is used to understand its construction. From this, we then state, that meta-priors can lead to Bayesian priors, which can then lead to logical inductive priors. From meta-priors we have the greatest flexibility about our understanding of the world and in general terms, the least amount of bias; whereas on the other end of the spectrum, logical inductive priors have the least flexibility, but have the greatest certainty. An ideal agent should be able to reason about its knowledge along this spectrum. If a probability becomes very strong, then it can become a logical rule. However, if a logical rule fails, then one should reason about the probability of it doing so. Additionally, new things may occur which have yet unknown statistical properties. As such, the meta-priors can be used to promote raw data into a statistical framework or to re-reason about a statistical framework, which now seems invalid. Using certain kinds of meta-prior information, many Bayesian systems are able to find groupings which can serve as prior information to other programs which are unable to do so themselves. However, most Bayesian models work from meta-priors that require a variety of strong meta-priors. For instance, the most common requirement is that the number of object or feature classes must be specified. This can be seen in expectation 28 maximization, K-means and back-propagation neural networks, which need to have a set size for how many classes exist in the space they inspect. The number of classes thus, becomes a strong and rather inflexible meta-prior for these methods. Additionally, other strong meta-priors may include space size, data distribution types and the choice of kernel. The interesting thing about meta-priors is that they can be flexible or rigid. For instance, specifying you have several classes that are fit by a Gaussian distribution is semi-flexible in that you have some leeway in the covariance of your data, but the distribution of the data should be uni-modal and have a generally elliptical shape. An example of more rigid meta-priors would be specifying a priori the number of classes you believe you will have. So for instance, going back to our grapefruit example, if you believe your data to be Gaussian, you suspect that flying grapefruit have a mean color with some variance in that color. You can make a more rigid assumption that you will only see three classes such as, flying grapefruit, oranges and apples. All of these are of course part of the design process, but as mentioned they are prone to their own special problems. Ideally, an intelligent agent that wishes to reason about the world should have the ability to reason with flexible weak meta-priors but then use those to define Bayesian like priors. Here we define weak meta-priors as having flexible parameters that can automatically adjust to different situations. 
So for instance, we might set up a computer vision system and describe for it the statistical features of grapefruit, oranges and apples. However, the system should be able to define new classes from observation either by noticing that a mass of objects (or points) seem to be able to form their own category (Rosenblatt, 1962, Dempster, Laird & Rubin, 1977, Boser, Guyon & Vapnik, 1992, Jain, 29 Murty & Flynn, 1999, Müller, Mika, Rätsch, Tsuda & Schölkopf, 2001, Mundhenk et al., 2004b, Mundhenk et al., 2005a) or through violation of expectation and surprise (Itti & Baldi, 2005, Itti & Baldi, 2006). An example of the first is that if we cluster data points that describe objects, and if a new object appears such as a kiwi, a new constellation of points will emerge. An example of the second is that if we expect an apple to fly by, but see an orange, it suggests something interesting is going on. It might be that new fruit have entered our domain. In the first case, our learning is inductive, while in the second case it is more deductive. We thus define weak meta-priors to be situationally independent. That is, the meta-prior information can vary depending on the situation and the data. Ideally, information within the data itself is what drives this flexibility. So for instance, when selecting what is the most salient object in a scene, we might select a yellow ball. However, a moving gray ball may be more salient if presented at the same time as the yellow ball. Thus, the selection feature for what is most salient is not constantly a color, but can also be motion. So it is the interplay of these features, which can promote the saliency of one object over the other (Treisman & Gelade, 1980). Yet another example is that the number of classes is not defined a priori as a strong meta-prior, but instead, variance between features causes them to coalesce into classes. So as an abstract example, the number of planets in a solar system is not pre-determined. Instead, the interplay of physical forces between matter will eventually build a certain number of planets. Thus, the physical forces of nature are abstractly a weak meta-prior for what kind of planets will emerge, and how many will be formed. 30 2.1.4 The iRoom and Meta-prior Information Here we now review a vision system for following and tracking objects and people in a room or other spaces that can process at the level of weak meta-priors, Bayesian priors and even logical inductive priors. From this, we then need artificial experts, which can use weak meta-priors to process information into more precise statistical and Bayesian form information. Additionally, once we know things with a degree of certainty, it is optimal to create rules for how the system should behave. That is, we input visual information looking for new information from weak meta-priors, which can be used to augment a vision system that uses Bayesian information. Eventually strong Bayesian information can be used to create logical rules. We will describe this process in greater detail in the following pages but give a brief description here. Using a biological model of visual saliency from the iLab Neuromorphic Vision Toolkit (INVT) we find what is interesting in a visual scene. We then use it to extract visual features from salient locations (Itti & Koch, 2001b) and group them into classes using a non-parametric and highly flexible weak-meta prior classifier NPclassify (Mundhenk et al., 2004b, Mundhenk et al., 2005a). 
This creates initial information about a scene: for instance how many classes of objects seem present in a scene, where they are and what general features they contain. We then track objects using this statistically priorless tracker but gain advantage by taking the information from this tracker and handing it to a simple tracker, which uses statistical adaptation to track a target with greater effectiveness. In essence, it takes in initial information and then computes its own statistical information from a framework using weak meta-prior information. That 31 statistical information is then used as a statistical prior in another simpler and faster tracker. 2.2 Saliency, Feature Classification and the Complex Tracker There were several components used in the tracking system in iRoom. As mentioned, these started by needing less meta-prior information and then gathering information that allows the tracking of targets by more robust trackers that require more information about the target. The first step is to notice the target. This is done using visual saliency. Here very basic gestalt rules about the uniqueness of features in a scene are used to promote objects as more or less salient (Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b). This is done by competing image feature locations against each other. A weak image feature that is not very unique will tend to be suppressed by other image features, while strong image features that are different will tend to pop out as it receives less inhibition. In general, the saliency model acts as a kind of max selector over competing image features. The result from this stage is a saliency map that tells us how salient each pixel in an image is. Once the saliency of locations in an image can be computed, we can extract information about the features at those locations. This is done using a Monte Carlo like selection that treats the saliency map as a statistical map for these purposes. The more salient a location in an image is, the more likely we are to select a feature from that location. In the current working version we select about 600 feature locations from each frame of video. Each of the feature locations contains information about the image such as color, texture and motion information. These are combined together and used to 32 Figure 2.2: The complex feature tracker is a composite of several solutions. It first uses INVT visual saliency to notice objects of interest in a scene. Independent Component Analysis and Principle Component Analysis (Jollife, 1986, Bell & Sejnowski, 1995, Hyvärinen, 1999) are used to reduce dimensions and condition the information from features extracted at salient locations. These are fed to a non-parametric clustering based classification algorithm called NPclassify, which identifies the feature classes in each image. The feature classes are used as signatures that allow the complex tracker to compare objects across frames and additionally share that information with other trackers such as the simple tracker discussed later. The signatures are also invariant to many view point effects. As such they can be shared with cameras and agents with different points of view. classify each of the 600 features into distinct classes. For this we use the non-parametric classifier NPclassify mentioned above. This classifier classifies each feature location without needing to know a priori the number of object feature classes or how many samples should fall into each class. 
It forms classes by weighting each feature vector from each feature location by its distance to every other point. It then can link each feature location to another, which is the closest feature location that has a higher weight. This causes points to link to more central points. Where a central point links to another cluster it is not a member of, we tend to find that the link is comparatively rather long. 33 We can use this to cut links, thus, creating many classes. In essence, feature vectors from the image are grouped based on value proximity. As an example, two pixels that are close to each other in an image and are both blue would have a greater tendency to be grouped together than two pixels in an image that are far apart and are blue and yellow. Once we have established what classes exist and which feature locations belong to them, we can statistically analyze them to determine prior information that will be useful to any tracker, which requires statistical prior information in order to track a target. Thus, we create a signature for each class that describes the mean values for each feature type as well as the standard deviation within that class. Additionally, since spatial locations play a part in weighting feature vectors during clustering, feature vectors that are classified in the same class tend to lie near each other. Thus, the signature can contain the spatial location of the class as well. Figure 2.2 shows the flow from saliency to feature classification and signature creation. The signatures we derive from the feature properties of each class exist to serve two purposes. The first is that it allows this complex tracker to build its own prior awareness. When it classifies the next frame of video, it can try and match each of the new objects it classifies as being the same object in the last frame. Thus, it is not just a classifier, but it can track objects on its own for short periods. Further, we can use information about targets to bias the classification process between frames. So for instance, we would expect that the second frame of video in a sequence should find objects which are similar to the first frame. As such, each classified object in any given frame, biases the search in the next frame, by weighting the classifier towards finding objects of those types. 34 While this seems very complex, signature creation is fairly quick, saliency computation is done in real time on eight 733 MHz Pentium III computers in a Beowulf cluster. The rest of the code runs in under 60 ms on an Opteron 150 based computer. This means we can do weak meta-prior classification and extraction of signatures at around > 15 frames per second. 2.2.1 Complex Feature Tracker Components 2.2.1.1 Visual Saliency The first stage of processing is finding which locations in an image are most salient. This is done using the saliency program created by (Itti & Koch, 2001b), which works by looking for certain types of uniqueness in an image (Figure 2.3). This simulates the processing in visual cortex that the human brain performs in looking for locations in an image, which are most salient. For instance, a red coke can placed among green foliage would be highly salient since it contrasts red against green. In essence, each pixel in an image can be analyzed and assigned a saliency value. From this a saliency map can be created. The saliency map simply tells us the saliency of each pixel in an image. 
2.2.1.2 Monte Carlo Selection The saliency map is taken and treated as a statistical map for the purpose of Monte Carlo selection. The currently used method will extract a specified number of features from an image. Highly salient locations in an image have a much higher probability of being selected than regions of low saliency. Additionally, biases from other modules may cause certain locations to be picked over consecutive frames from a video. For instance, if properties of a feature vector indicate it is very useful, then it makes sense 35 to select from a proximal location in the next frame. Thus, the saliency map combines with posterior analysis to select locations in an image which are of greatest interest. Figure 2.3: The complete VFAT tracker is a conglomeration of different modules that select features from an image, mix them into more complex features and then tries to classify those features without strong meta-priors for what kind of features it should be looking for. 36 2.2.1.3 Mixing Modules 2.2.1.3.1 Junction and End Stop Extraction Figure 2.4: Saliency is comprised of several channels which process an image at a variety of different scales and then combine those results into a saliency map. During the computation of visual saliency, orientation filtered maps are created. These are the responses of the image to Gabor wavelet filters. These indicate edges in the image. Since each filter is tuned to a single preferred orientation, a response from a filter indicates an edge that is pointed in the direction of preference. The responses from the filters are stored in individual feature maps. One can think of a feature map as simply an image which is brightest where the filter produces its highest response. Since the feature 37 maps are computed as part of the saliency code, re-using them can be advantageous from an efficiency standpoint. From this we create feature maps to find visual junctions and end-stops in an image by mixing the orientation maps (Figure 2.4). We believe such new complex feature maps can also tell us about the texture at image locations which can help give us the gist of objects to be tracked. The junction and end stop maps are computed as follows. Note that this is a different computation then the one used in appendix D and chapter 5 in the attention gate model. At some common point i,j on the orientation maps P the filter responses from the orientation filters are combined. Here the response to an orientation in one orientation map ij p is subtracted from an orthogonal map’s orientation filter output orth ij p and divided by a normalizer n which is the max value for the numerator. For instance, one orientation map that is selective for 0 degree angles is subtracted from another map selective for 90 degree angles. This yields the lineyness of a location in an image because where orthogonal maps overlap in their response is at the junctions of lines. (2.1) {} ;1,2 orth ij ij k ij pp ak n − =∈ We then compute a term (2.2) which is the orthogonal filter responses summed. This is nothing more than the sum of the responses in two orthogonal orientation maps. 38 Figure 2.5: The three images on the right are the results of the complex junction channel after ICA/PCA processing from the original image on the left. As can be seen it does a reasonable job of finding both junctions and end stops. (2.2) {} ;1,2 orth ij ij k ij pp bk n + =∈ The individual line maps are combined as: (2.3) 12 ij ij ij aa n α + = This gives the total lineyness for all orientations. 
We then do a similar thing for our total response maps: (2.4) 12 ij ij ij bb n β − = The final junction map γ is then computed by subtracting the lineyness term from the total output of the orientation filters: (2.5) ij ij ij γ αβ = − 39 Since the junction map is computed by adding and subtracting orientation maps which have already been computed during the saliency computation phase, we gain efficiency we wouldn’t have had if we were forced to convolve a whole new map by a kernel filter. Thus, this junction filter is fairly efficient since it does not require any further convolution to compute. Figure 2.5 shows the output and it can be seen that it is effective at finding junctions and end-stops. 2.2.1.3.2 ICA/PCA We decrease the dimensionality of each feature vector by using a combination of Independent Component Analysis (ICA) (Bell & Sejnowski, 1995) and Principle Component Analysis (PCA) (Jollife, 1986). This is done using FastICA (Hyvärinen, 1999) to create ICA un-mixing matrices offline. The procedure for training this is to extract a large number of features from a large number of random images. We generally use one to two hundred images and 300 points from each image using the Monte Carlo selection processes just described. FastICA first determines the PCA reduction matrix and then determines the matrix that maximizes the mutual information using ICA. Unmixing matrices are computed for each type of feature across scales. So as an example, the red-green opponent channel is computed at different scales, usually six. PCA/ICA will produce a reduced set of two opponent maps from the six original scale maps (This is described in detail later and can be seen in figure 2.7). Using ICA with PCA helps to ensure that we not only reduce the dimension of our data set, but that the information sets are fairly unique. From the current data, we reduce the total number of dimensions with all channels from 72 to 14 which is a substantial efficiency gain 40 especially given the fact that some modules have complexity O(d 2 ) for d number of feature channels (dimensions). Figure 2.6: NPclassify works by (A) first taking in a set of points (feature vectors) (B) then each point is assigned a density which is the inverse of the distance to all other points (C) Points are then linked by connecting a point to the nearest point which has a higher density (D) Very long links (edges) are cut if they are for instance statistically longer than most other links. This creates separate classes. 2.2.1.4 Classification Modules 2.2.1.4.1 Classification of Features with NPclassify 5 Features are initially classified using a custom non-parametric clustering algorithm called NPclassify 6 . The idea behind the design of NPclassify is to create a 5 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/NPclassify2.C/.H 6 A description and additional information on top of what will be discussed can be found at: http://www.nerd-cam.com/cluster-results/. 41 clustering mechanism which has soft parameters that are learned and are used to classify features. We define here soft parameters as values which define the shape of a meta-prior. This might be thought of as being analogous to a learning rate parameter or a Bayesian hyperparameter. For instance, if we wanted to determine at which point to cut off a dataset and decided on two standard deviations from the mean, two standard deviations would be a soft parameter since the actual cut off distance depends on the dataset. 
NPclassify (Figure 2.2, 2.6 and 2.7) (Mundhenk et al., 2004b, Mundhenk et al., 2005a) works by using a kernel to find the density at every sample point. The currently used kernel does this by computing the inverse of the sum of the Euclidian distance from each point to all other points. After density has been computed the sample points are linked together. This is done by linking each point to the closest point which has a higher density. This creates a path of edges which ascends acyclically along the points to the point in the data set which has the highest density of all. Classes are created by figuring out which links need to be cut. For instance, if a link between two sample points is much longer than most links, it suggests a leap from one statistical mode to another. This then may be a good place to cut and create two separate classes. Additionally, classes should be separated based upon the number of members the new class will have. After classes have been created, they can be further separated by using interclass statistics. The advantage to using NPclassify is that we are not required to have a prior number of classes or any prior information about the spatial or sample sizes of each class. 42 Figure 2.7: On the left are samples of features points with the class boundaries NPclassify has discovered. Some of the classes have large amounts of noise while others are cramped together rather than being separated by distance. On the right are the links NPclassify drew in order to create the clusters. Red links are ones which are too long and were clipped by the algorithm to create new classes. 43 Instead, the modal distribution of the dataset combined with learned notions of feature connectedness determine whether a class should be created. So long as there is some general statistical homogeneity between training and testing datasets we should expect good performance for clustering based classification. The training results are discussed later in the section on training results. Figure 2.8: The results using NPclassify are shown next to the same results for k-means on some sham data. The derived clusters are shown with the Gaussian eignenmatrix bars (derived using the eigenmatix estimation in section 2.2.1.4.2). In general, NPclassify creates more reliable clusters particularly in the presence of noise. Additionally, it does so without needing to know a priori how many classes one has. As such, we do have a few meta-priors still present. The first is a basic kernel parameter for density. In this case, the Euclidian distance factor makes few assumptions 44 about the distribution other than that related features should clump together. The second meta-prior is learned as a hyperparameter for a good cutoff. This can be derived using practically any gradient optimization technique. So it is notable, that NPclassify is not without some type of prior, but the assumptions on the data is quite relaxed and only assumes that related feature samples will be close to each other in feature space. An example of NPclassify working on somewhat arbitrary data points can be seen in figure 2.8. 2.2.1.4.2 Gaussian Generalization and Approximation 7 In order to store classes for future processing it is important to generalize them. Gaussian ellipsoids are used since their memory usage for any class is O(d 2 ) for d number of dimensions for a given class. Since d is fairly low for us, this is an acceptable complexity. 
Additionally, by using Gaussians we gain the power of Bayesian inference when trying to match feature classes to each other. However, the down side is that computing the eigen matrix necessary for Gaussian fitting scales minimally as d 3 for dimensions and s 2 for the number of samples. That is, it is O(d 3 + s 2 ). This is due to the fact that computing such elements using the pseudo inverse method (or QR decomposition) involves matrix inversion and multiplication. In order to avoid such large complexity we have implemented an approximation technique that scales minimally as d 2 for dimensions and s for the number of samples - O(sd 2 ). This means that a net savings happens if the number of samples is much larger than the number of dimensions. So for 7 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/covEstimate.C/.H 45 instance, if there are more than 100 samples and only 10 dimensions, this will produce a savings over traditional methods. Figure 2.9: After NPclassify has grouped feature samples together they can be fit with Gaussian distributions. This helps to determine the probability that some new feature vector belongs to a given class or that two classes compute in consecutive frames using NPclassify are probably the same class. If the distributions overlap greatly as on the left figure, then two classes are probably the same class. The approximation method works by using orthogonal rotations to center and remove covariance from the data. By recording the processes, we can then compute the probability on data points by translating and transforming them in the same way to align with the data set. What we want to be able to do is to tell the probability of data points belonging to some class as well as being able to tell if two classes derived in consecutive frames are probably the same class (see figure 2.9) The first step is to center the data about the origin. This is done by computing the mean and then subtracting that number from each feature vector. Next we compute approximate eigenvectors by trying to find the average vector from the origin to all 46 feature vector coordinates. So for k th feature vector, we first compute the ratio between its distance l from the origin along dimensions j and i. This yields the ratio r ijk . That is, after aligning the feature vector with the origin, we take the ratio of two features in the same vector (we will do this for all possible feature pairs in the vector). (2.6) jk ijk ik l r l = Next we find the Euclidian distance u ijk from the origin along dimensions j and i. (2.7) 22 ijk ik jk ul l = − By Summing the ratio of r ijk and u ijk for all k feature vectors, we obtain a mean ratio that describes the approximated eigenvector along the dimensions i and j. (2.8) 0 k ijk ij k ijk r m u = = ∑ A normalizer is computed as the sum of all the distances for all samples k. (2.9) 0 k ij ijk k nu = = ∑ Next we determine the actual angle of the approximated eigenvector along the dimensions i and j. (2.10) 1 tan ij ij ij m n θ − ⎛⎞ = ⎜⎟ ⎜⎟ ⎝⎠ 47 Once we have that, we can rotate the data set along that dimension and measure the length of the ellipsoid using a basic sum of squares operation. Thus, we compute ρ ik and ρ jk which is the data set rotated by θ ij . Here ξ is the positions of kth feature vector along the i dimension and ψ is the position of the feature vector along the j dimension. What we are doing here is rotating covariance out along each dimension so that we can measure the length of the eigenvalue. 
Thus, we iterate over all data points k and along all dimensions i and along i+1 dimensions j summing up σ as we go. We only sum j for i+1 since we only need to use one triangle of the eigenvector matrix since it is symmetric along the diagonal. (2.11) 1 ij + ≤ (2.12) cos( ) sin( ) ik ij ij ρ ξθ ψ θ =⋅ + ⋅ (2.13) sin( ) cos( ) jk ij ij ρ ξθ ψ θ =− ⋅ + ⋅ What we have done is figure out how much we need to rotate the set of feature vectors in order to align the least squares slope with the axis. Once this is done, we can rotate the data set and remove covariance. Since the mean is zero because we translated the data set by the mean to the origin, variance for the sum of squares is computed simply as: (2.14) 2 0 k ik iij k s n ρ = = ∑ (2.15) 2 0 k jk jji k s n ρ = = ∑ 48 Each sum of squares is used to find the eigenvalue estimate by computing Euclidian distances. That is, by determining the travel distance of each eigenvector during rotation and combining that number with the computed sum of squares we can determine an estimate of the eigenvalue from triangulation. The conditional here is used because σ ii is computed more than once with different values for θ ij . Thus, σ ii is the sum of all the products of θ ij and s iij . (2.16) () 2 iff = 0 cos( ) otherwise iij ii ii ii iij ij s s σ σ σθ ⎧ ⎪ = ⎨ +⋅ − ⎪ ⎩ (2.17) () 2 iff = 0 cos( ) otherwise jji jj jj jj jji ij s s σ σ σθ ⎧ ⎪ = ⎨ +⋅ − ⎪ ⎩ The end result is a non-standard eigenmatrix which can be used to compute the probability that a point lies in a Gaussian region. We do this by performing the same procedure on any new feature vector. That is, we take any new feature vector and replay the computed translation and rotations to align it with covariance neutral eigenmatrix approximation. Probability for the feature vector is then computed independently along each dimension thus eliminating further matrix multiplication during the probability computation. To summarize, by translating and rotating the feature set, we have removed covariance so we can compute probabilities assuming dimensions do not interact. In essence this removes the need for complex matrix operations. While the complexity is high, it is one order lower than the standard matrix operations as was mentioned earlier. 49 Examples of fits created using this method can be seen in figure 2.7 where NPclassify has created classes and the eigenmatrix is estimated for the ones created. 2.2.1.4.3 Feature Contiguity, Biasing and Memory Once features have been classified we want to use them to perform various tasks. These include target tracking, target identification and feature biasing. Thus from a measurement of features from time t, we would like to know if a collection of features at time t+1 is the same, and as such either the same object or a member of the same object. By using Bayesian methods we can link classes of features in one frame of a video to classes in the next frame by tying a class to another which is its closest probabilistic match. Additionally, we use the probability to bias how the non-parametric classifier and saliency work over consecutive frames. For NPclassify we add a sink into the density computation. That is, we create a single point whose location is the mean of a class with the mass of the entire class. Think of this as dropping a small black hole in a galaxy that represents the mass of the other class. By inserting this into the NPclassify computation, we skew the density computation towards the prior statistics in the last iteration. 
This creates a Kalman filter like effect that smoothes the computation of classes between frames. This is a reasonable action since the change in features from one frame to the next should be somewhat negligible. 2.2.1.5 Complex Feature Tracker Methods and Results 2.2.1.5.1 Complexity and Speed One of the primary goals of VFAT is that it should be able to run in real time. This means that each module should run for no more than about 30 ms. Since we are using a Beowulf cluster, we can chain together modules such that even if we have several 50 steps that take 30 ms each, by running them on different machines we can create a vision pipeline whereby a module finishes a job and hands the results to another machine in a Beowulf cluster that is running the next process step. In time trials the modules run within real time speeds. Using a Pentium 4 2.4 GHz Mobile Processor with 1 GB of RAM, each module of VFAT runs at or less than 30 ms. The longest running module is the NPclassify feature classifier. If given only 300 features it runs in 23 ms, for 600 features it tends to take as long as 45 ms. On a newer system it should be expected to run much faster. 2.2.1.5.2 Training for Classification Table 2.1: Following PCA the amount of variance accounted for was computed for each type of feature channel. Each channel started with six scales (dimensions). For many channels, 90% of variance is accounted for after a reduction to two dimensions. For all others, no more than three dimensions are needed to account for 90% of variance. Two modules in VFAT need to be trained prior to usage. These include ICA/PCA and NPclassify. Training for both has been designed to be as simple as possible in order to maintain the ease of use goal of the iRoom project. Additionally and fortunately, training of both modules is relatively quick with ICA/PCA taking less than a minute using the FastICA algorithm under Matlab and NPclassify taking around two hours using 51 gradient descent training. Since we only need to ever train once, this is not a prohibitive amount of time. 2.2.1.5.3 Training ICA/PCA Figure 2.10: The various conspicuity maps of the feature channels from the saliency model are shown here ICA/PCA reduced. Training was completed by using 145 randomly selected natural images from a wide range of different image topics. Images were obtained as part of generic public domain CD-ROM photo packages, which had the images sorted by topic. This enabled us to ensure that the range of natural images used in training had a high enough variety to prevent bias towards one type of scene or another. For each image, 300 features were extracted using the Monte Carlo / Visual saliency method described earlier. In all this 52 gave us 43,500 features to train ICA/PCA on. The results are shown on table 2.1. For most channels, a reduction to two channels from six still allowed for over 90% of variance to be accounted for. However, directional channels that measure direction of motion and orientation of lines in an image needed three dimensions to still account for more than 90% of all variance. Assuming that the data is relatively linear and a good candidate for PCA reduction, this suggests that we can effectively reduce the number of dimensions to less than half while still retaining most of the information obtained from feature vectors. Visual inspection of ICA/PCA results seems to show the kind of output one would expect (Figure 2.10 and 2.11). 
For instance, when two channels are created from six, they are partially a negation to each other. On the red/green channel, one of the outputs seems to show a preference for red. However, the other channel does not necessarily show an anti-preference for red. This may suggest that preferences for colors may also depend on the scales of the images. That is, since what makes the six input images to each channel different is the scale at which they are processed, scale is the most likely other form of information processed by ICA/PCA. This might mean for instance that the two channels of mutual information contain information about scaling. We might guess that of the two outputs from the red/green channel, one might be a measure of small red and the other of large green things. If this is the case it makes sense since in nature, red objects tend to be small (berries, nasty animals, etc.) while green things tend to be much more encompassing (trees, meadows, ponds). 53 Figure 2.11: From the original image we see the results of ICA/PCA on the red/green and blue/yellow channels. As can be seen some parts of the outputs are negations of each other which makes sense since ICA maximizes mutual information. However, close examination shows they are not negatives. It is possible that scale information applies as a second input type and prevents obvious negation. 2.2.1.5.4 Training NPclassify To hone the clustering method we use basic gradient decent with sequential quadratic programming using the method described by (Powell, 1978). This was done offline using the Matlab Optimization Toolbox. For this study, error was defined as the number of classes found versus how many it was expected to find (see Figure 2.12). Thus, we presented the clustering algorithm with 80 natural training images. Each image 54 Figure 2.12: In this image there are basically three objects. NPclassify has found two (colors represent the class of the location). This is used as the error to train it. So for 80 images it should find x number of objects. The closer it gets to this number, the better. Notice that the points are clustered in certain places. This is due to the saliency/Monte Carlo method used for feature selection. had a certain number of objects in it. For instance an image with a ball and a wheel in it would be said to have two objects. The clustering algorithm would state how many classes it thought it found. If it found three classes in an image with two objects then the error was one. The error was computed as average error from the training images. The training program was allowed to adjust any of several hard or soft parameters for NPclassify during the optimization. The training data was comprised of eight base objects of varying complexity such as balls and a wheel on the simple side or a mini tripod and web cam on the more 55 complex side. Objects were placed on a plain white table in different configurations. Images contained different numbers of objects as well. For instance some images contained only one object at a time, while other contained all eight. A separate set of validation images was also created. These consisted of a set of eight different objects with a different lighting created by altering the f-stop on the camera. Thus, the training images were taken with an f-stop of 60 while the 83 validation images were taken with an f-stop of 30. Additionally, the angle and distance of view point is not the same between the training and validation sets. 
The validation images were not used until after optimal parameters were obtained by the training images. Then the exact same parameters were used for the validation phase. Our first test was to examine if we could at the very least segment images such that the program could tell which objects were different from each other. For this test spatial interaction was taken into account. We did this by adding in spatial coordinates as two more features in the feature vectors with the new set of 14 ICA/PCA reduced feature vectors. The sum total of spatial features were weighted about the same as the sum total of non-spatial features. As such, the membership of an object in one segmented class or the other was based half by its location in space and half by its base feature vector composition. Reliability was measured by counting the number of times objects were classified as single objects, the number of times separate objects were merged as one object and the number of time a single object was split into two unique objects. Additionally, there was a fourth category for when objects were split into more than three objects. This was small and contained only four instances. 56 The results were generally promising in that based upon simple feature vectors alone, the program was able to segment objects correctly with no splits or merges in 125 out of the 223 objects it attempted to segment. In 40 instances an object was split into two objects. Additionally 54 objects were merged as one object. While on the surface these numbers might seem discouraging there are several important factors to take into account. The first is that the program was segmenting based solely on simple features vectors with a spatial cue. As such it could frequently merge one shiny black object into another shiny black object. In 62 % of the cases of a merger, it was obvious that the merged objects were very similar with respect to features. 2.2.1.5.5 NPclassify v. K-Means NPclassify was also tested on its general ability to classify feature clusters. In this case it was compared with K-means. However, since K-means requires the number of classes to be specified a priori, this was provided to it. So in essence, the K-means experiment had the advantage of knowing how many classes it would need to group, while NPclassify did not. The basic comparison test was similar to the test presented in the previous section. In this case, several Gaussian like clusters were created of arbitrary 2 dimensional features. They had between 1 and 10 classes in each data set. 50 of the sets were clean with no noise such that all feature vectors belonged explicitly to a ground truth class. However, in 50 other sets, small amounts of random noise were added. The comparison metric for K-means and NPclassify was how often classes were either split or merged 57 when they should have not been. The mean error for both conditions is shown below in figure 2.13. It should be noted that while K-means may be sensitive to noise in data, it is used here since it is well known and can serve as a good base line for any clustering algorithm. Figure 2.13: NPclassify is compared with K-Means for several data sets. The error in classification for different sets is the same if there is little noise in the data. However, after injecting some noise, NPclassify performs superior. The general conclusion is that compared with K-means, NPclassify is superior particularly when there is noise in the data. 
This is not particularly surprising since as a spanning tree style algorithm, NPclassify can ignore non proximal data points much more easily. That is, K-means is forced to weigh in all data points and really has no innate ability to determine that an outlying data point should be thrown away. However, NPclassify will detect the jump in distance to an outlier or noise point from the central density of the real class. 58 2.2.1.5.6 Contiguity Figure 2.14: Tracking from frame 299 to frame 300 the shirt on the man is tracked along with the head without prior knowledge of what is to be tracked. It should be noted that that while the dots are drawn in during simulation, the ellipses are drawn in by hand for help in illustration in gray scale printing. Contiguity has been tested but not fully analyzed (Figure 2.14). Tracking in video uses parameters for NPclassify obtained in section 2.2.1.5.4. Thus, the understanding of how to track over consecutive frames is based on the computers subjective understanding of good continuity for features. In general, classes of features can be tracked for 15 to 30 frames before the program loses track of the object. This is not an impressive result in and of itself. However, several factors should be noted. First is that each object that VFAT is tracking is done so without priors for what the features of each should be. Thus, the program is tracking an object without having been told to either track that object or what the object its tracking should be like. The tracking is free form and in general without feature based priors. The major limiter for the contiguity of tracking is that an object may lose saliency as a scene evolves. As such an object if it becomes too low in saliency will have far fewer features selected for processing from it, which destroys the track of an object with the current feature qualities. However, as will be noted in the next 59 section, this is not a problem since this tracker is used to hand off trackable objects to a simple tracker which fixates much better on objects to be tracked. 2.3 The Simple Feature Based Tracker 8 Figure 2.15: The Simple tracker works by taking in initial channel values such as ideal colors. These are used to threshold an image and segment it into many candidate blobs. This is done by connecting pixels along scan lines that are within the color threshold. The scan lines are then linked which completes a contiguous object into a blob. The blobs can be weeded out if they are for instance too small or too far from where the target last appeared. Remaining blobs can then be merged back and analyzed. Finding the center of mass of the left over blobs gives us the target location. By finding the average color values in the blob, we can define a new adapted color for the next image frame. Thus, the threshold color values can move with the object. Once a signature is extracted using the complex tracker described in the previous section, it can be feed to a faster and simpler tracking device. We use a multi channel 8 For more information see also: http://ilab.usc.edu/wiki/index.php/Color_Tracker_How_To 60 tracker, which uses color thresholding to find candidate pixels and then links them together. This allows it to not only color threshold an image, but to segregate blobs and analyze them separately. So for instance, if it is tracking a yellow target, if another yellow target appears, it can distinguish between the two. 
Additionally, the tracker also computes color adaptation as well as adaptation over any channel it is analyzing. We compute for instance a new average channel value c (2.18) as the sum of all pixel values in this channel p c over all N pixels in tracked ‘OK’ blobs (as seen in figure 2.15) pfrom the current frame t to some past frame t′. In basic terms, this is just the average channel value for all the trackable pixels in several consecutive past frames. Additionally we computeσ , which is just the basic standard deviation over the same pixels. (2.18) () 2 00 and 1 tN tN ip ip it p i t p tt ii it it ccc c NN σ ′′ == == ′′ == − == − ∑∑ ∑∑ ∑∑ Currently, we set a new pixel as being a candidate for tracking if for all channels that have a pixel value p c : (2.19) p ccc ασασ − ⋅≤ ≤ + ⋅ Thus, a pixel is thresholded and selected as a candidate if it falls within the boundary of each channel that is its mean value computed from eq. (2.18) +/- the product of the standard deviation and a constantα . Forgetting is accomplished in the adaptation by simply windowing the sampling interval. 61 This method allows the tracker to track a target even if its color changes due to changes in lighting. It should be noted that the simple tracker can track other features in addition to color so long as one can create a channel for it. That is, an RGB image can be separated into three channels, which are each gray scale images. In this case, we create one for red, one for green and one for blue. We can also create images that are for instance, the responses of edge orientation filters or motion filters. These can be added as extra channels in the simple tracker in the same manner. However, to preserve luminance invariance we use the H2SV color scheme described in appendix G. This is just an augmentation of HSV color space that solves for the singularity at red by converting hue into Cartesian coordinates. In addition to the basic vision functional components of the simple tracker, its code design is also important. The tracker is object oriented which makes it easy to create multiple independent instances of the simple tracker. That is, we can easily run several simple trackers on the same computer each tracking different objects from the same video feed. The computational work for each tracker is fairly low and four independent trackers can simultaneously process 30 frames per second on an AMD Athlon 2000 XP processor based machine. This makes it ideal for the task of tracking multiple targets at the same time. 2.4 Linking the Simple and Complex Tracker In order for the simple tracker and the complex tracker to work together they have to be able to share information about a target. As such the complex tracker must be able to extract information about objects that is useful to the simple tracker (Figure 2.16). 62 Additionally, linking the simple tracker with the complex tracker creates an interesting problem with resource allocation. This is because each simple tracker we instantiate tracks one target at a time while the complex tracker has no such limit. A limited number of simple trackers can be created and there must be some way to manage how they are allocated to a task based on information from the complex tracker. Figure 2.16: The simple and complex trackers are linked by using the complex tracker to notice and classify features. The complex tracker then places information about the feature classes into object feature class signatures. 
The complex tracker uses these signatures to keep track of objects over several frames or to bias the way in which it classifies objects. The signatures are also handed to simple trackers, which track the objects with greater proficiency. Here we see two balls have been noticed and signatures have been extracted and used to assign each ball to its own tracker. The smaller target boxes on the floor show that the simple tracker was handed an object (the floor), which it does not like and is not tracking. Thus, the simple tracker has its own discriminability as was mentioned in section 2.3 and figure 2.15. We address the first problem by making sure both trackers work with similar feature sets. So for example, the complex tracker when it runs will examine the H2SV color of all the classes it creates. It then computes the mean color values for each class. This mean color value along with the standard deviation of the color can then be handed to the simple tracker, which uses it as the statistical prior color information for the object it should track. 63 Figure 2.17: This is a screen grab from a run of the combined tracker. The lower left two images show the complex tracker noticing objects, classifying and tracking them. The signature is handed to the simple tracker, which is doing the active tracking in the upper left window. The combined tracker notices the man entering the room and tracks him without a priori knowledge of how he or the room looks. Once he walks off the right side, the tracker registers a loss of track and stops tracking. The bars on the right side show the adapted actively tracked colors from the simple tracker in H2SV color. The lower right image shows that many blobs can fit the color thresholds in the simple tracker, but most are selected out for reasons such as expected size, shape and position. The second issue of resource allocation is addressed less easily. However, there are simple rules for keeping resource allocation under control. First, don’t assign a simple tracker to track an object that overlaps with a target another simple tracker is tracking in the same camera image. Thus, don’t waste resources by tracking the same target with two or more trackers. Additionally, since the trackers are adaptive we can find that two trackers were assigned to the same target, but we didn’t know this earlier. For instance, if 64 accidentally one simple tracker is set to track the bottom of a ball and one the top of the ball, after a few iterations of adaptation, both trackers will envelop the whole ball. It is thus advantageous to check for overlap later. If we find this happening, we can dismiss one of the simple trackers as redundant. Additionally, our finite resources mean we do not assign every unique class from the complex tracker to a simple tracker. Instead, we try and quantify how interesting a target is. For instance, potential targets for the simple tracker may be more interesting if they are moving, have a reasonable mass or have been tracked by the complex tracker for a long enough period of time. 2.5 Results On the test videos used, the system described seems to work very well. A video of a man entering and leaving a room (Figure 2.17) was shown five times to the combined complex and simple tracker. In each run, the man was noticed within a few frames of entering the cameras view. This was done without prior knowledge of how the target should appear and without prior knowledge of the room’s appearance. 
The features were extracted and a simple tracker was automatically assigned to track the man, which did so, until he left the room, at which point the simple tracker registered a loss of track. Interestingly enough, the tracker extracted a uniform color over both the man’s shirt and his skin. It was thus able to, on several instances, track the man as both his shirt and his skin. Thus, even though the shirt was burgundy and the skin reddish, the combined tracker was able to find a statistical distribution for H2SV color that encompassed the color of both objects as unique from the color of objects in the rest of the room. 65 The tracker was also tested on a video where a blue and yellow ball both swing on a tether in front of the camera. In five out of five video runs, both balls are noticed and their features extracted. Each ball is tracked as a separate entity by being assigned by the program its own simple tracker. Each ball is tracked until it leaves the frame, at which point the simple trackers register a loss of track. The balls even bounce against each other, which demonstrates that the tracker will trivially discriminate between objects even when they are touching or overlapping. In both video instances, objects are tracked without the program knowing the features of the object to be tracked a priori. Instead, saliency is used to notice different possible targets and the complex tracker is used to classify possible targets into classes. This was then used to hand target properties to the simple trackers as automatically generated prior information about the targets to be tracked. Additionally, the simple tracker will register a loss of track when the target leaves the field of view. This allows us to not only notice when a new target enters our field, but also when it leaves. The tracking was also aided by the use of H2SV color. Prior to using the H2SV color scheme, the purple shirt the man is wearing was split as two objects since the color of many of the pixels bordered on and even crossed into the red part of the hue spectrum. Thus, standard HSV created a bi-modal distribution for hue. The usage of H2SV allowed us to now track the purple shirt as well as objects that are reddish in hue, such as skin. H2SV color also works for tracking of objects in the center of the spectrum, which we observed by tracking objects that are green, yellow and blue. In addition to tracking using a static cameral, the same experiment was done using a moving camera. This is much less trivial since the common method of eigen 66 background subtraction cannot be used to determine new things in a scene from the original scene. Again the tracker was able to track a human target without prior knowledge of features even as the camera moved. This is a distinct advantage for our tracker and illustrates the advantage of using saliency to extract and bind features since it can compensate for global motion. 2.6 Discussion 2.6.1 Noticing The most notable and important aspect of the current work is that we are able to track objects or people without knowing what they will look like a priori and we are able to do so quickly enough for real time applications. Thus, we can notice, classify and track a target fairly quickly. This has useful applications in many areas and in particular security. This is because we track something based on how interesting it is and not based on complete prior understanding of its features. Potentially, we can then track any object or person even if they change their appearance. 
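The H2SV scheme discussed above is defined in appendix G, which is not reproduced here. As a rough illustration of the general idea only, the sketch below re-encodes hue as a two-component Cartesian pair so that reddish hues no longer wrap around the ends of the hue axis; the function name and the [0, 1] rescaling are assumptions for this sketch, not the thesis implementation.

```python
import colorsys
import math

def rgb_to_h2sv(r, g, b):
    """Convert an RGB pixel (0..1 floats) to an H2SV-style tuple.

    Hue is re-encoded as two Cartesian components (h1, h2) so that
    reddish hues near the 0/360 degree wrap-around no longer form a
    bimodal distribution; saturation and value are kept as in HSV.
    """
    h, s, v = colorsys.rgb_to_hsv(r, g, b)            # h in [0, 1)
    h1 = 0.5 * (math.cos(2.0 * math.pi * h) + 1.0)    # map cos(hue) to [0, 1]
    h2 = 0.5 * (math.sin(2.0 * math.pi * h) + 1.0)    # map sin(hue) to [0, 1]
    return h1, h2, s, v

# Two reddish pixels that straddle the HSV wrap-around point land close together:
print(rgb_to_h2sv(1.0, 0.05, 0.10))   # hue just below 1.0
print(rgb_to_h2sv(1.0, 0.10, 0.05))   # hue just above 0.0
```

Under an encoding of this kind, colors such as the burgundy shirt and the reddish skin fall into a single, unimodal cluster that per-channel mean and standard-deviation thresholds can capture.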
Additionally, since we extract a signature that describes a target that is viewpoint invariant, this information can be used to share target information with other agents. 2.6.2 Mixed Experts Additionally, we believe we are demonstrating a better paradigm in the construction of intelligent agents, one that uses a variety of experts to accomplish the task. The idea is to use a variety of solutions that work on flexible weak meta-prior information, but then use their output as information for a program that is more biased. 67 This is founded on the idea that there is no perfect tool for all tasks and that computer vision is comprised of many tasks such as identification, tracking and noticing. To accomplish a complex task of noticing and tracking objects or people, it may be most optimal to utilize many different types of solutions and interact them. Additionally, by mixing experts in this way, no one expert necessarily needs to be perfect at its job. If the experts have some ability to monitor one another, then if one expert makes a mistake, it can possibly be corrected by another expert. It should be noted that this tends to follow a biological approach in which the human brain may be made up of interacting experts, all of which are interdependent on other expert regions in order to complete a variety of tasks. Another important item to note in the mixed experts paradigm is that while it may make more intuitive sense to use such an approach, new difficulties arise as our system becomes more abstractly complex. So as an example, if one works with support vector machines only, then one has the advantage of a generally well-understood mathematical framework. It is easier to understand a solutions convergence, complexity and stability in a system if it is relatively homogeneous. When a person mixes experts, particularly if the experts act very differently, the likelihood of the system doing something unexpected or even catastrophic tends to increase. Thus, when one designs an intelligent agent with mixed experts, system complexity should me managed carefully. 2.6.3 System Limitations and Future Work The system described has its own set of limitations. The work up to this point has concentrated on being able to notice and track objects in a scene quickly and in real time. 68 However, its identification abilities are still somewhat limited. It does not contain a memory such that it can store and identify old targets in the long term. However, such an ability is in the works and should be aided by the ability of the tracking system to narrow the area of the image that needs to be inspected which should increase the speed of visual recognition 69 Chapter 3: Contour Integration and Visual Saliency In the visual world there are many things, which we can see, but certain features, sets of features and other image properties tend to more strongly draw our visual attention towards them. A very simple example is a stop sign, in which combinations of red color and angular features of an octagon combine with a strong word “stop” to create something that hopefully we would not miss if we come upon it. Such propensity of some visual features to attract attention defines in part the phenomenon of visual saliency. Here we assert, as others (James, 1890, Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b) that saliency is drawn from a variety of factors. 
At the lowest levels, color opponents, unique orientations and luminance contrasts create the effect of visual pop-out (Treisman & Gelade, 1980, Wolfe, O'Neill & Bennett, 1998). Importantly, these studies have highlighted the role of competitive interactions in determining saliency --- hence, a single stop sign on a natural scene backdrop usually is highly salient, but the saliency of that same stop sign and its ability to draw attention is strongly reduced as many similar signs surround it. At the highest levels it has been proposed that we can prime our visual processes to help guide attention towards what we wish to search for in a visual scene (Wolfe, 1994b, Miniussi, Rao & Nobre, 2002, Navalpakkam & Itti, 2002). Given the organization of visual cortex it has also been proposed that saliency is gathered into a topographic saliency map. This is a landscape of neurons in partnership and competition with each other. For instance, neurons that are most excited have the greatest ability to competitively suppress their neighbors. This creates a winner-take-all 70 phenomenon whereby the strongest and most unique features in an image dominate other Figure 3.1: This is an example of a contour created by Make Snake (Braun, 1999). As can be seen, there appears to be a complete circle. However, the circle is created by unconnected Gabor wavelet elements. The mind connects these elements in a phenomenon known as contour integration. features to become salient. However, in addition to direct uniform center-surround competition, it has been suggested by several studies that saliency is enhanced when a series of elements like the dashed lines on a road are aligned in a collinear fashion 71 (Braun, 1999, Li & Gilbert, 2002, Peters, Gabbiani & Koch, 2003). Such a phenomenon is part of what is known as contour integration. Here, instead of a global inhibition for surround, neurons can selectively enhance other neurons with a similar preference for image features. In this case, neurons will enhance if they have a preference for the same line orientation and are aligned by preference in a collinear or co-circular fashion. Neurons, thus, compete with other neurons selectively, while enhancing the activity of others. In contour integration, bar or Gabor elements (defined as the product of a Gaussian “bell-curve” and a sinusoidal grating) that are collinear, when observed, seem to enhance their ability to “Pop out” in an image that is also filled with other Gabors that are non-aligned noise elements (Field et al., 1993, Kovács & Julesz, 1993, Braun, 1999, Gilbert, Ito, Kapadia & Westheimer, 2000). An example can be seen in figure 3.1, which shows Gabor elements of the same contrast, modulation, amplitude and size aligned into what seems to be an uneven circle. There is no direct physical link between the elements in this image that would give a direct cue as to their connectedness. Instead, the elements seem merely to point towards each other. The brain makes a functional gestalt leap and links these elements into a single unified contour (Wertheimer, 1923, Koffka, 1935). At the same time, the relative salience of the contour objects is elevated in the visual cortex. Thus, our brain reads between the lines as it were and creates the cognitive illusion of continuity even when objects along a contour are not physically connected. At the same time, our mind takes these contour elements and promotes their visual importance thus creating the effect of pop-out. 
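For readers who want to reproduce stimuli like those in figure 3.1, a minimal sketch of a single Gabor element, the product of a Gaussian envelope and a sinusoidal grating, is given below. The default envelope width, and the example size and wave period of 70 and 20 pixels (matching the smaller stimuli described later in section 3.3.2), are illustrative choices rather than the exact parameters of the stimulus-generation software.

```python
import numpy as np

def gabor_patch(size=70, wavelength=20.0, theta=0.0, sigma=None, phase=0.0):
    """Return a square Gabor element: a sinusoidal grating times a Gaussian.

    size        patch width in pixels
    wavelength  period of the sinusoid in pixels (lambda)
    theta       orientation of the grating in radians
    sigma       std. dev. of the Gaussian envelope (default: size / 6)
    phase       phase offset of the sinusoid
    """
    if sigma is None:
        sigma = size / 6.0
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    # Rotate coordinates so the grating runs along the chosen orientation.
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    grating = np.cos(2.0 * np.pi * x_theta / wavelength + phase)
    return envelope * grating

# A 70-pixel element with a 20-pixel wave period, oriented at 45 degrees.
patch = gabor_patch(size=70, wavelength=20.0, theta=np.pi / 4)
print(patch.shape)   # (70, 70)
```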
72 Several factors have been explored as being important to the phenomenon of contour integration. In particular, the properties of the elements in the contours can affect our ability to detect contours in a seemingly nonlinear fashion. For instance, contours can be affected by continuity of colors, phase of Gabors and luminance of aligned foreground elements (Field, Hayes & Hess, 2000, Mullen, Beaudot & McIlhagga, 2000). Similarly, statistics of the background can also affect our perception of contours. For instance, if contour elements have a stronger collinear orientation compared with background elements, that is, they are more aligned; the contour is more visible (Polat & Sagi, 1993b, Polat & Sagi, 1993a, Hess & Field, 1999, Usher, Bonneh, Sagi & Herrmann, 1999). Interestingly, when result data for enhancement of Gabor elements is plotted on a graph, enhancement for collinear elements is “U”-shaped. That is, a string of parallel Gabor elements, aligned like the steps on a ladder also have enhancement abilities but diagonally oriented elements (elements which point in the same direction but are off-set like a staircase) have far less ability to enhance (Polat & Sagi, 1993b, Polat & Sagi, 1993a, Yu & Levi, 2000). Thus, as elements are rotated relative to each other, they have the strongest enhancement if the elements are aligned collinear or directly parallel to each other, but enhancement drops as elements are rotated between being collinear and parallel. In addition to sameness of elements, contours also seem to become enhanced if the arrangement of the elements forms a closed loop (Kovács & Julesz, 1993, Braun, 1999). While there is some disagreement to the amount of pop-out from contour closure it is still nonetheless considered significant. This suggests that neurons sensitive to contour integration may perform some sort of linking to each other in a manner 73 conceptually similar to a closed circuit like loop (Li, 1998, Yen & Fenkel, 1998, Braun, 1999, Prodöhl et al., 2003). That is, neurons that don’t directly touch may propagate effect to each other through their neighbors. Thus, ideally, if we imagine that contour integration is the result of neurons of preferred orientation linking to each other, we might conclude that contour integration may not just involve linking nearest neighbors to each other in a linear one-shot excitement, but may involve continuous reciprocation of neurons such that effects can propagate around a network. Such a notion is supported by current observations that all of the neurons on a contour that are thought to enhance each other in contour integration cannot be directly connected due to the limited reach of visual cortical axons. Thus, neurons in V1 and V2 are limited in the scope of their direct effect onto each other and should not cross the entire visual field. For contour closure effects to occur, especially over long contours, there should be some sort of network propagation (Li, 1998, Yen & Fenkel, 1998, Braun, 1999). Contour integration can also be explored in both local and non-local ways. For instance, single Gabor element flankers and center-surround pedestals demonstrate that elements in a contour can enhance each other with only one flanker neighbor element to each side (Polat & Sagi, 1993b, Polat & Sagi, 1993a, Zenger & Sagi, 1996, Yu & Levi, 2000). However, contours are further enhanced as more elements are added (Braun, 1999, Li & Gilbert, 2002). 
This has become somewhat of a mystery for the reason that elements seem to enhance each other at distances that span beyond the size of the classical receptive field of neurons in the visual cortex (Braun, 1999). Thus, adding to the previous argument, there seems to be some ability for neurons in visual cortex to enhance a contour’s perceptibility at locations represented by neurons that they are not directly 74 connected to. Several theories have been advanced to explain how that can happen, for instance neural synchronization (Yen & Fenkel, 1998, Choe & Miikkulainen, 2004), potential propagation (Li, 1998) and fast plasticity (Braun, 1999, Mundhenk & Itti, 2003). In addition to their saliency effects, it has also been suggested that contours play an important role in object identification. In particular, the ends of contours, frequently referred to as end-stops, and the junctions of contours may hold important data for the geometric interpretation of objects (Biederman, Subramaniam, Bar, Kalocsai & Fiser, 1999, Rubin, 2001). Thus, contour enhancement may not only be important for drawing our attention to the contours qua contours, but to the places at which those contours join with other contours and yield useful geometric information about objects for identification Thus, it may be important for a mechanism that integrates contours for the sake of visual saliency to not only find contours, but to find the junctions at those contours even more salient. From this, we propose that a model of contour integration may do more than just enhance isolated contours. That is because more information is to be obtained from the junctions at contours. From an efficiency standpoint, junctions should also be detected if possible since this would reduce the number of neurons dedicated to the task of contour integration and end-stopping as well as speed up computation through parallel processing of information. This then could reduce redundancy and extra processing steps. 75 3.1 Computation Traditionally, it has been a challenge to model contour integration. Two approaches are generally taken when trying to model contour integration. The first is the biological route (Li, 1998, Yen & Fenkel, 1998, Grigorescu, Petkov & Westenberg, 2003, Mundhenk & Itti, 2003, Ben-Shahar & Zucker, 2004, Choe & Miikkulainen, 2004). In this method, the idea is to create a model of contour integration that explores how the brain may perform such activities. The other route is computational (Shashua & Ullman, 1988, Guy & Medioni, 1993), which is another important approach. However, these models tend to explore possibilities of contour integration computation or attempt to take a direct path to simulate contour integration for engineering applications. Here our approach is both. Our model attempts to explain saliency for contours in a manner that strives to illuminate the mechanisms that the brain uses, while attempting to optimize computation in order to be applied to visual saliency tasks in machine vision. Another important aspect of many contour integration algorithms has been the control of connectivity between computational elements. This is because, as has been mentioned, neurons seem to influence, beyond their own physical range, other neurons evaluating the same contour. This creates a situation where neural groups that process contour integration need to spread effect throughout the network while at the same time controlling the network and preventing it from losing control. 
Some biological approaches have included a global normalization gain control and neural synchronization for this effect (Yen & Fenkel, 1998). We attempt to control our model by taking advantage of the properties of GABAergic interneurons to control local groups of neurons discretely. As we will describe later, the corresponding group that processes 76 contours is broken into smaller local groups. Each local group is managed by its own single GABAergic interneuron, which controls gain by managing activity gradients for the local group it belongs to. Thus, each local group of neurons in the corresponding group has its own inhibitory bandleader to control its gain. The reason for taking this approach over global normalization is that we avoid direct influence between elements in the model that should not have direct interactions due to the limitations of the reach of neurons in visual cortex. Our model will also attempt to explain how contour enhancement can extend beyond the typical receptive field of neurons by utilizing a fast plasticity (Von der Malsberg, 1981, Von der Malsberg, 1987) based on dopaminergic temporal difference like priming effects and pyramidal image size reduction. We will also show our model’s abilities to perform similarly to humans in local enhancement tasks involving collinear aligned elements (Polat & Sagi, 1993b, Polat & Sagi, 1993a) as well as in longer contour tasks with elements that enhance beyond the range of the neurons receptive field. In addition, our model will take into account physiological mechanisms for contour integration by comparing our results to those of psychometric data. By fitting our algorithm to this data we will not only demonstrate the viability of our solution, but also show how we will have created a more complete solution in the process. 77 3.2 The model 3.2.1 Features We have created a model, which we call CINNIC (Carefully Implemented Neural Network for Integrating Contours). Our model simulates the workings of a corresponding group of hyper-columns in visual cortex. We use the term “corresponding” to mean small proximate hyper-column groups, which correspond to the same basic task, for instance, integrating contours for saliency. In essence, it can be thought of as a cube of brain matter. Each neuron in a corresponding group connects to the many neighboring neurons within its reach. Each neuron in the current corresponding group is sensitive to a distinct angle present in an image being observed by the model. That is, certain neurons activate more strongly when they are presented with a 45-degree line in their receptive field while others might be more sensitive to a 30-degree angle line. This means that each neuron in a hyper-column, and thus each neuron in the corresponding group has a preference to distinct angles (Hubel & Weisel, 1977). Contour integration is achieved in principle when neurons that are close and have similar preferred orientations either enhance if they are collinear to each other, or suppress if they are parallel to each other. This is a method used widely (Li, 1998, Yen & Fenkel, 1998, Grigorescu et al., 2003, Mundhenk & Itti, 2003). Figure 3.2 shows an example of these simple rules for enhancement. It should be mentioned that the reason to suppress parallel flanking elements is to preserve the uniqueness of the visual item. For instance, a single line on a blank background should be more salient than a group of parallel lines (Treisman & Gelade, 1980, Itti & Koch, 2001b). 
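As a toy illustration of the enhancement and suppression rules of figure 3.2, the sketch below classifies a pair of oriented elements from the two angles formed between each element's orientation and the line joining their centers. The 20-degree cutoff is an arbitrary assumption made only for illustration and is not a parameter of the model.

```python
import math

def element_relation(pos_a, ori_a, pos_b, ori_b):
    """Classify the pairwise relation sketched in Figure 3.2.

    pos_* are (x, y) element centers; ori_* are orientations in radians.
    alpha and beta are the angles between each element's orientation and
    the line joining the centers: both small means roughly collinear
    (mutual enhancement); similar orientations but large offsets from the
    joining line means a parallel flanker (suppression).
    """
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    link = math.atan2(dy, dx)                      # direction of the joining line

    def offset(angle, reference):
        d = abs(angle - reference) % math.pi       # orientations are 180-deg periodic
        return min(d, math.pi - d)

    alpha = offset(ori_a, link)
    beta = offset(ori_b, link)
    if alpha < math.radians(20) and beta < math.radians(20):
        return "collinear: enhance"
    if offset(ori_a, ori_b) < math.radians(20):
        return "parallel flanker: suppress"
    return "weak or no interaction"

print(element_relation((0, 0), 0.0, (10, 0), 0.0))   # collinear: enhance
print(element_relation((0, 0), 0.0, (0, 10), 0.0))   # parallel flanker: suppress
```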
This can be intuitively thought of by thinking of one thin line drawn on a wall 78 compared with a line on a pin stripe suit. It is easy to imagine that a single line on the wall is more salient and more likely to pop out than a single line amongst several others on the pin stripe suit. Figure 3.2: (A) An image is taken (1) and is split into 12 orientation-filtered images (2), which are sent to their own layers in the corresponding group where Each of the 12 preferred orientations are rotated at 15 degrees (3). After interaction the output is collected at a top-level saliency map (4). (B) Interaction between layers is governed by collinearity. More collinear elements excite each other (α and β angles are small) while less collinear elements suppress each other (α and β are large). (C) Elements like (1) enhance, elements like (2) suppress, and highly parallel elements can enhance, like in (3). An overview of the functioning of the network is as follows. As each neuron in the corresponding group fires, it transmits synaptic current to a neuron at the top of its hyper- column. This top-level neuron is a leaky integrator that stores charge received from neurons in its hyper-column. The way to imagine this is that the top level of leaky integrator neurons map one to one with an input image and creates a saliency map. Thus, 79 an input pixel is connected to several neurons above it in a hyper-column and creates a one-to-one mapping for location between each hyper-column and an image pixel. That is, a hyper-column of neurons, and its leaky integrator neuron on top, map spatially to exactly one pixel in an image, but then connects outwards to surrounding pixels in a center-surround architecture. Each neuron has the ability to enhance its neighbor using fast dopamine-like priming connections. Thus, connectedness among neurons in the corresponding group is enhanced by their ability to prime each other. The reason for this is that it allows activity of neurons to propagate. This gives neurons the ability to extend their influence beyond their own reach to neurons outside their receptive field. For instance, an active neuron primes its neighbor which causes its neighbor to become more active following that priming which in turn causes the neighbor to prime its neighbor and so on. Dopamine- like neurons are used in our model since they are fairly ubiquitous and can prime one another in 50 – 100 ms (Schultz, 2002), which is well within the time span suggested for long-range contour integration of about 250 ms (Braun, 1999). We state this because contour detection performance saturates at 12 Gabor elements. 50-millisecond priming may be the right amount of time for effect to propagate in the network since depending on the exact speed of the network, a 10 or 12 cell network’s effect will have met half way by this point in time. Additionally, this means that our model depends on a Hebbian-like associative priming where neurons that receive input in one epoch of our model enhance their neighbors firing in the next. Figure 3.3 shows a frame-by-frame example of this 80 process. We reason for this method of propagation by observing that this process of Figure 3.3: An important element of the model is a fast plasticity term. In our model we follow the notion of priming via dopamine. (1) A neuron and its neighbor receive input. (2) The neuron on the right sends a signal to the neuron on the left. (3) The left neuron is now primed via dopamine. 
(4) When the neuron on the left receives another input, it is more likely to cross its firing threshold. This allows contour neurons to propagate activity to other contour neurons that are not directly connected. priming has been observed and simulated in the brain, for instance in striatal neurons (Suri, Bargus & Arbib, 2001, Schultz, 2002). Additionally, we should note that we emphasize the term dopamine-like. This is because other systems such as norepinephrine 81 neurons in the locus coeruleus and cholinergic neurons in basal forebrain also exhibit similar behavior (Schultz, 2002), and while fast plasticity has been observed in higher cortical areas such as the prefrontal cortex (Hemple, Hartman, Wang, Turrigiano & Nelson, 2000), and the rat visual cortex (Varela, Sen, Gibson, Fost, Abbott & Neslon, 1997), the time course and underlying mechanisms seem not to be well enough understood at the moment for our simulation. As such, we use the term dopamine-like since it seems that its mechanisms are generalizable enough for our purposes. Our model does not implement explicit temporal synchronization for propagation since it is our observation that evidence for its actions in V1 and V2 seem less certain, and that while some papers suggest explicit temporal synchronization based on their results (Lee & Blake, 2001) as we mention in the discussion, they can also be accounted for by a fast plasticity mechanism. Our argument will then be for such a process based upon its feasibility as well as the fitness of such a mechanism to explain the processes, which are observed in humans. As a last note we wish to point out that we do not object to explicit temporal synchronization at any theoretical level, it is to say, we believe that fast plasticity may better explain contour propagation. Another feature of our model is that it controls runaway gain from over-excitation of the corresponding group. It does this by using suppression of local groups of pyramidal neurons that are in subsections of the whole corresponding group. To accomplish this we hypothesize that medium sized basket type fast spiking (FS) 82 interneurons are stimulated from one or few putative inputs from the top leaky integrator Figure 3.4: This is a conceptual illustration of two suppression groups. Gain in the network is controlled by a Basket GABAergic interneuron like connection scheme. This works by spatially grouping local neurons into groups that are all suppressed by a local interneuron for that group. This creates a gain control, but keeps such control local to within the theoretical spatial range of axonal arbors in V1 and V2 neuron and exhibit strong control over the neurons they efferently connect to. Such neurons have been observed in the brain in many areas, particularly in the pre-frontal cortex (Krimer & Goldman-Rakic, 2001) and Striate Cortex (Pernberg, Jirmann & Eysel, 1998, Shevelev, Jirmann, Sharaev & Eysel, 1998). They need only one or few inputs and can give very strong inhibition. Here, these fast spiking parvalbumin-type interneurons are plausible since they require very few putative inputs in order to create inhibitory post synaptic potentials (IPSP) (Krimer & Goldman-Rakic, 2001). Further, they have been found to modulate pyramidal neuronal activity directly (Gao & Goldman-Rakic, 2003), 83 which are the type of neurons we have constructed our corresponding group from. 
A gradient-based suppression could be attained by having a second slow interneuron inhibit the first interneuron; this may be plausible since interneuron to interneuron connections are well known (Wang, Tegner, Constantinidis & Goldman-Rakic, 2004). If the activity of the first interneuron levels off, the second interneuron will catch up and suppress the first completely. Figure 3.4 shows a representation of this. Since interneutrons can spike at a variety of rates (Bracci, Centonze, Bernardi & Calabresi, 2003), the end result from this mechanism is that local groups of pyramidal neurons are inhibited proportionally to their local groups sum total excitation. 3.2.2 The Process In our computational model, before an image is sent to the corresponding group it must undergo some preprocessing. This takes several steps. The first is to take in a real world image. This can be a digital photograph, or an artificially-created stimulus such as an image of Gabors. The input image is filtered for orientation using Gabor wavelets. This creates several images, in our case 12, that have been filtered for orientation. In this model, 12 orientations are used since it is hypothesized that this is the number of the orientations the brain may use in V1 (Itti, Koch & Braun, 2000). The image is then reduced into 3 different scales of 64x64, 32x32 and 16x16 pixels by using the pyramid 84 Figure 3.5: (A) A Kernel is generated that dictates the base strength of the connections between neurons in the network. Each kernel slice shown represents the interaction between two neurons given their preferred orientations. Red represents inhibition while green represent excitation. (B) If two neurons are parallel in preference but not collinear, then they inhibit each other. (C) Parallel bars excite if they are close to collinear in preference. The three kernels shown (the same as highlighted in (A)) show the interaction if elements are related to each other as shown by the bars. For instance, if two elements are totally co-linear they would use the first kernel. The next kernel would be used if one element is offset by 15 degrees. (D) This is a side view of the 0 degree offset kernel. The kernel has modest 2 nd and 3 rd order polynomial curvature, which can be observed on close inspection. method for image reduction (Burt & Adelson, 1983). This yields 36 processed images, that is, 12 orientations by 3 scales. In the next stage, each scale is processed separately. As such, we have three independent sub-corresponding groups, one for each scale. Each 85 orientation image is sent to a layer in the sub-corresponding group for its scale that is selective for that orientation. For instance, the 90 degree orientation image inputs directly only into the layer that is designated as selective for 90 degree orientations. This creates a sub-corresponding group with a stacked topology where each layer is comprised of neurons sensitive to only one orientation. To reiterate, the structure places neurons directly above each other, which receive direct input from the exact same location in the visual field. Thus, the resulting region can be thought of as a cube of neurons where the i and j dimensions correspond to a specific location in the visual field and the third α dimension corresponds to the preferred orientation of the neuron. To make this cube of brain matter a corresponding group, connections are established between the neurons. Interaction between neurons is created using a hyper-kernel. 
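Before turning to the kernel itself, the preprocessing and data layout just described can be sketched as follows. The oriented filter and the zoom-based rescaling below are simple stand-ins for the Gabor filter bank and the Burt and Adelson pyramid, and all parameter values are illustrative.

```python
import numpy as np
from scipy import ndimage

def oriented_kernel(theta, size=9, wavelength=4.0, sigma=2.0):
    """Small Gabor-like kernel tuned to orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xt = x * np.cos(theta) + y * np.sin(theta)
    k = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xt / wavelength)
    return k - k.mean()          # zero-mean so flat regions give no response

def orientation_pyramid(image, n_orientations=12, scales=(64, 32, 16)):
    """Build the orientation-by-location stacks that feed each sub-corresponding group."""
    stacks = {}
    for s in scales:
        small = ndimage.zoom(image, s / image.shape[0])   # stand-in for the image pyramid
        layers = [np.maximum(ndimage.correlate(small,
                                               oriented_kernel(k * np.pi / n_orientations)),
                             0.0)
                  for k in range(n_orientations)]         # 15-degree orientation steps
        stacks[s] = np.stack(layers)                      # shape (12, s, s)
    return stacks

stacks = orientation_pyramid(np.random.rand(128, 128))
print({s: a.shape for s, a in stacks.items()})   # {64: (12, 64, 64), 32: (12, 32, 32), 16: (12, 16, 16)}
```

Each entry of the returned dictionary is the "cube" of neurons for one sub-corresponding group: the first axis indexes the preferred orientation α and the last two axes index the retinotopic location (i, j).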
Each hyper-kernel describes both the inhibitory and excitatory connections between neurons simultaneously rather than as two separate kernels where one is for inhibition and one is for excitation. This is done to speed up the computation operation and can be done since, if we neglect temporal differences between excitation and inhibition at this level, the summation of inhibition and excitation to another neuron results in a mutually exclusive inhibition or excitation result. That is, the hyper-kernel is the summation of excitation and inhibition kernels. Figure 3.5 shows the “slices” of the kernel we used and how it is used to define how neurons interact with each other by defining the weights of excitation and inhibition. Each hyper-kernel slice has a reach of 12 pixels (i.e. reaching out to a span of 12 neurons) for excitation and 10 for inhibition. It should be noted that this is the same across all scales. When the image is reduced, the kernel will reach across 1.4 degrees of visual angle for 64x64 pixel scaled image, 2.8 degrees for the 32x32 scaled image and 5.6 86 degrees for the 16x16 scaled image. Additionally, while the kernel at the 16x16 pixel scale is large in terms of visual angle, it has a relative lack of acuity since the image has been reduced dramatically. Thus, we still fall within size constraints for neuron reach since the kernel at 16x16 is still the same size. However, the image has shrunk. In all, 144 slices are created for our hyper-kernel. These represent all the possible connections between two neurons in the corresponding group. That is, each neuron is selective for one of 12 orientations and can interact with another neuron, which can be selective for one of 12 orientations. This creates 12x12 possible interactions. The spatial relation for each hyper-kernel is handled within each slice. That is, each slice maps retinotopically. Orientation is thus handled between slices, while translation is handled within slices of the hyper-kernel. It can be seen then, that the hyper-kernel is stacked in the same way as the layers of a corresponding group. Since it has the same topology, it can then pass over and through a corresponding group in much the same way a standard 2D kernel is passed over a standard 2D image. However, the process moves the hyper kernel in a 2 dimensions over the 3D corresponding group (with 4D connections), so in essence, the convolution adds an extra set of dimensions over 2D convolution. This can be thought of as moving a hypercube of 12 spatially overlapping cubes (one for each orientation) simultaneously in a Cartesian manner along two dimensions through a larger box of the same height (which can be thought of as the corresponding group). Each orientation-selective neuron, when stimulated by input from the image and by input from other neurons that excite it, will send synaptic current to a non-orientation selective top layer of leaky integrator neurons at the top of its hyper-column. The top layer of leaky integrator neurons is treated as a saliency map for these purposes. The top 87 layer can reciprocate to control gain of local neurons using suppression from fast spiking interneurons. That is, the activity of the saliency map’s top layer neurons controls the activity of the gain control for the interneurons. Thus, a noisy image is gain controlled locally using the gradient of excitation in a local group controlled by a single interneuron for that group. 
Contours are sharpened and extended using the dopamine-like priming described previously. The outputs from the three differently scaled sub-corresponding groups are merged together using a weighted average. The end effect is a combined saliency map across scales, which is the final output from CINNIC.

3.2.3 Kernel

As mentioned, the hyper-kernel is defined to contain both excitation and inhibition. However, excitation e is defined in the kernel in a slightly different way than inhibition (or suppression) s. As can be seen in figure 3.5, excitation is strongly sensitive to the preferred orientations of the two neurons, while inhibition is mostly sensitive to their spatial relation. That is, excitation is sensitive to the preferred orientation of both neurons in an interaction, while inhibition is only sensitive to the orientation of the operating neuron, so most of its effect comes from the distance between neurons. The excitation term can be seen in equation (3.1). Here a^e_α is a term for the collinear disjunction between this neuron and the other neuron (how much this neuron's preferred orientation points toward the other neuron's); a^e_β conversely describes how much the other neuron points toward this one. The planar Euclidean distance between these neurons is expressed as d^e; this should be thought of as the distance between the hyper-columns the neurons reside in, rather than the direct distance between two neurons in space. The excitation output expression to the kernel is K^e_{αβ}: the excitation expressed by the kernel given the preferred orientation α of the neuron that is operating (this neuron) and the orientation β of the neuron being operated on (the other neuron). In simplest terms, a^e_α and a^e_β describe how much the two neurons point towards each other in a collinear fashion. That is, a^e_α is the angle from the other neuron to this one, and a^e_β is the angle from this neuron to the other, as seen in figure 3.2. Thus, as equation (3.1) shows, the excitation part of the kernel is the average of the two collinearity terms, scaled by the distance term.

(3.1)   K^e_{\alpha\beta} = \frac{a^e_\alpha + a^e_\beta}{2} \cdot d^e

The output angle terms are derived as:

(3.2)   a^e_\alpha = l^e_f \cdot A^e + P^e_2 \cdot (A^e)^2 + P^e_3 \cdot (A^e)^3 + 1

(3.3)   a^e_\beta = l^e_f \cdot B^e + P^e_2 \cdot (B^e)^2 + P^e_3 \cdot (B^e)^3 + 1

The terms P^e_2 and P^e_3 are constants used to curve the kernel's shape with a 3rd-order polynomial. That is, as the preferred-orientation difference increases and the distance d^e between neurons increases, excitation tapers off along a fairly flat, and in this case almost monotonically decreasing, polynomial function. The polynomial is used because of its ability to take on a wide variety of shapes. Additionally, since it is applied radially, it can take on shapes similar to a Gaussian without our having to make that assumption explicitly. A^e and B^e are expressions for how far off collinearity this interaction is; they range from 1 to 0, with 1 meaning the two neurons are collinear and 0 meaning they are non-collinear to a degree that surpasses a threshold. The term l^e_f simply normalizes A^e and B^e to lie within the 0 to 1 range. Normalization is used here to constrain the values used in kernel manufacture so that, initially, values for inhibition fall within the same range as excitation.
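A minimal sketch of the excitation term follows, with equations (3.1) through (3.3) transcribed directly. The polynomial constants P^e_2, P^e_3 and the normalizer l^e_f are placeholders, since their fitted values are not given in this excerpt.

```python
# Hypothetical constants: the polynomial shape terms and the normalizer
# are fitted in the thesis; the values here are placeholders only.
P2_E, P3_E, LF_E = -0.2, 0.05, 1.0

def excitation_weight(A_e, B_e, d_e):
    """Excitation entry of the hyper-kernel, after eqs. (3.1)-(3.3).

    A_e, B_e are the collinearity measures for the two neurons (1 = collinear,
    0 = off-collinear beyond threshold) and d_e is the distance-based term for
    the separation between their hyper-columns.
    """
    a_alpha = LF_E * A_e + P2_E * A_e ** 2 + P3_E * A_e ** 3 + 1.0   # eq. (3.2)
    a_beta = LF_E * B_e + P2_E * B_e ** 2 + P3_E * B_e ** 3 + 1.0    # eq. (3.3)
    return 0.5 * (a_alpha + a_beta) * d_e                            # eq. (3.1)

# Perfectly collinear neighbors at moderate distance vs. a poorly aligned pair.
print(excitation_weight(1.0, 1.0, d_e=0.8))
print(excitation_weight(0.2, 0.1, d_e=0.8))
```

In the full model, evaluating an expression of this form for every pair of preferred orientations and every relative offset is what fills in the 144 kernel slices shown in figure 3.5.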
Inhibition is expressed in simpler terms as:

(3.4)   K^s_{\alpha\beta} = W \cdot \frac{a^s_\alpha + c}{2} \cdot d^s

In this equation the major difference from excitation is c, which is the difference between the preferred angles of the two layers being interacted (recall that inhibition is only sensitive to the operating neuron's orientation α and not the receiving neuron's orientation β). That is, during inhibition it matters less how much the other neuron's preference points at this neuron than how much this neuron's preference points at it. Spatial location is thus more important than strict collinearity for inhibition. The reason for this is that, early on, better results were obtained by removing the a^s_β term between elements and replacing it with c. This also has the effect of making inhibition more purely center-surround in its effects. Just as with excitation, d^s is the distance between this neuron's column and the other neuron's column, and a^s_α is based upon the orientation of the operating neuron. Again note figure 3.5, which shows the general shapes of the kernel. The most obvious result of the difference between excitation and inhibition is that inhibition is strongly symmetric over both principal axes; the shape of its field of influence stays ellipsoidal. W is a constant that applies a gain to the inhibition, making it either stronger or weaker than excitation depending on what value we decide is suitable. Again, a^s_α is expressed as:

(3.5)   a^s_\alpha = (-1) \cdot \left( l^s_f \cdot A^s + P^s_2 \cdot (A^s)^2 + P^s_3 \cdot (A^s)^3 \right) + 1

where again l^s_f is a normalizer and A^s ranges between 1 and 0 depending on the angular offset between this neuron and the other neuron. Similar but orthogonal to excitation, a^s_α is equal to 1 if the operating neuron and the neuron being operated on are parallel but not collinear, and it becomes 0 if the two neurons are orthogonal. An important note about this system is that preferentially orthogonal neurons have no direct influence on each other for either excitation or inhibition, but they do carry indirect influence, as will be discussed later in the context of junction finding. Values for a^s_α and a^e_α are derived so as to be mutually exclusive, causing both excitation and inhibition to go to zero at the same angle. Thus, when K^s_{αβ} and K^e_{αβ} are combined into a single kernel, it is a simple matter of mapping one over the other. This can be thought of as having computed the hill and the valley separately and then bringing the two together. Since the system is discrete, any minor disjoint is not noticed.

3.2.4 Pseudo-Convolution

Figure 3.6: This graph illustrates the way in which neurons interact with neurons in other hyper-columns. By mapping the hyper-kernel K over the neuron α,i,j we can find the base synaptic current generated that should be sent to another neuron at the relative position β,k,l.

The main process of CINNIC lies in the mechanisms of the corresponding group. Interaction in a corresponding group, which defines how collinear-sensitive neurons work, uses a pseudo-convolution. The major difference between CINNIC's hyper-kernel convolution and traditional convolution is that the results of the operation are stored at the other pixel, not at the pixel being operated on. This choice dates from early experiments with other features that were later removed. Equation (3.6) shows the basic pseudo-convolution operation, which is also illustrated in figure 3.6.
Here x_α(i,j) is an orientation-processed image pixel at image location (i,j) in one of the 12 orientation layers α. Each processed image pixel, which is represented as a neuron, is multiplied by the sum of its interactions with the other pixels (neurons) in its receptive field at relative locations (k,l) with respect to the neuron (α,i,j), within a field of size m by n. That is, (k,l) is the location of the other neuron relative to this neuron. The main interaction between this pixel-neuron x_α(i,j) and another pixel-neuron x_β(k,l) in its receptive field is described by their weight from the kernel, K_{αβ}(k−i, l−j), described earlier (where (k−i, l−j) is the corresponding hyper-kernel slice pixel mapped onto the field n by m). An approximation of the dopamine-like fast plasticity term at time t is given by f^t_β(k,l), which is derived in equation (3.9); thus, this neuron x_α(i,j) will dopamine-prime the neuron at location (β,k,l). Further, iff the interaction is inhibitory (the neural activity is computed as less than zero), g^t(k,l) adds suppression from the gain-control group suppression term of x_β(k,l)'s group (eq. (3.7)), taken from the last complete iteration. This represents the GABA-based group suppression mentioned earlier. The interaction term is combined with the base excitation to this neuron times a constant gain A, together with a pass-through term, which is again x_α(i,j). That is, the summed excitation of this neuron includes the input pixel intensity from the orientation image as well as the activity from the other neuron interactions in its corresponding group. The linear output from this neuron is stored in v^t_α(i,j), the total activity for this pixel-neuron after a single pseudo-convolution iteration at time t.

(3.6)   v^t_\alpha(i,j) = x_\alpha(i,j) \cdot \left( A \cdot x_\alpha(i,j) + \sum_{\beta \in [0,11]} \; \sum_{k \in [0,m-1]} \; \sum_{l \in [0,n-1]} x_\beta(k,l) \cdot g^t(k,l) \cdot f^t_\beta(k,l) \cdot K_{\alpha\beta}(k-i,\, l-j) \right)

(3.7)   g^t(k,l) = \begin{cases} g^{t-1}(k,l) & \text{iff } K_{\alpha\beta}(k-i,\, l-j) \le 0 \\ 1 & \text{otherwise} \end{cases}

The resulting potential is sent to an upper level of leaky integrator neurons (eq. (3.8)). This is the neuron that rests at the top of the hyper-column and, along with the other neurons at the top of their respective hyper-columns, forms a saliency map for this scale. A simple leak is approximated here with a constant leak term L, with the sum being placed in V^t(i,j) as a quick but sufficient leaky-integrator approximation, with the downside of not being proportional to potential. In essence, this sums the potential of all 12 neurons in the column that receive input from the same pixel in the image.

(3.8)   V^t(i,j) = \sum_{\alpha \in [0,11]} v^t_\alpha(i,j) - L

Dopamine-like fast plasticity f^t_α(i,j) is approximated as in equation (3.9). A neuron is primed to have a greater weight if it received input v^{t-1}_α(i,j) during the last iteration, in proportion to that input. A constant F controls the gain of this effect. A ceiling is placed on it (eq. (3.10)), which limits the effect to be no less than 1 (no effect) and no greater than 5 (strong effect). The selection of a ceiling of 5 is somewhat arbitrary and based on the observation that it worked well in our early test cases.

(3.9)   f^t_\alpha(i,j) = v^{t-1}_\alpha(i,j) \cdot F

(3.10)   1 \le f^t_\alpha(i,j) \le 5

Group suppression (eq. (3.11)) is based upon the gradient of the increase in excitation for all neurons in a group, and approximates the GABAergic gradient circuit previously described.
That is, all the neurons in this group have their outputs V(p,q) summed, with the finite difference across iterations determining the gradient. A constant gain R is applied, and the constant T is a resistance threshold term that ensures group suppression can only occur once excitation has reached a certain level. N_i and N_j express the exclusive set membership of hyper-columns in this local group, which is 1/8th by 1/8th of the total image size (think of the suppression groups as laid out in a grid). In other words, if the image is 64x64 pixels, a local suppression group is 8x8 pixels in size. This size makes the range of the inhibition roughly the same as that of the kernel and assures even division.

(3.11)   g^t(i,j) = R \cdot \left[ \left[ \sum_{(p,q) \in N_i \times N_j} \left( V^t(p,q) - V^{t-1}(p,q) \right) \right] - T \right] + g^{t-1}(i,j)

(3.11.a)   N_i = \left[\, i - (m/16),\; i + (m/16) - 1 \,\right]

(3.11.b)   N_j = \left[\, j - (m/16),\; j + (m/16) - 1 \,\right]

All of the potential is run through a logistic sigmoid (expressed in general form with a skew term β, eq. (3.12)), which simulates firing rates. The final top-most saliency map for contours at this image scale is then given by equation (3.13).

(3.12)   S(v) = \frac{1}{1 + \exp(-2 \beta v)}

Figure 3.7: CINNIC works in several phases. The first is to take in a real world image. A Gaussian filter is applied that creates 12 orientation selective images. The image is then rescaled using an image pyramid into 3 different scales. The 12 orientation selective images are then pseudo-convolved and the corresponding region is run with dopamine-like fast plasticity and group suppression over several iterations. The three different scales are then brought back together using a weighted average and combined into a contour saliency map.

(3.13)   I^t(i,j) = S\!\left( V^t(i,j) \right)

The final saliency map across all scales is created by taking a weighted average over the scales (sub-corresponding groups), as can be seen in equation (3.14) (Figure 3.7). Here I_u(i,j) is the saliency map for the sub-corresponding group at scale u, while w_u is the weight bias given to that scale (a number from 0 to 1). n_u is the number of scales analyzed (in this case 3), and M(i,j) is the final saliency map derived from across all differently scaled sub-corresponding groups.

(3.14)   M(i,j) = \sum_{u} \frac{I_u(i,j) \cdot w_u}{n_u}

Figure 3.8: The top three images show the results of pseudo-convolution at each of the three scales used. The bottom right image shows the weighted average of the three images. The circles represent what the program feels are the five most salient contour locations. The bottom left image is the input image with the most salient points shown with the red circle on the most salient point and the blue circle on the least salient of the top five.

Thus, M(i,j) represents a saliency map of which parts of the image are most salient based on contour information. If the algorithm is effective, then M(i,j) should have a large value at a location (i,j) corresponding to a contour segment in the input image, and a correspondingly low value where no contour segment, or only a noise segment, lies. The most salient point or points are the pixels of M(i,j) with the highest values (Figure 3.8). Additionally, it should be noted that while the output saliency map shows the contours clearly, since the goal of this work is to simulate visual saliency, the most important component of the output is the set of salient points that draw attention to the contours.

3.3 Experiments

To investigate the validity of our model we followed a multi-tier approach.
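Before turning to those experiments, the single-scale update described in Section 3.2 can be collected into one sketch. Everything below is illustrative: the constants A, L, F, R, T and β, the zero-padding at the borders, the initialization of the priming and suppression maps, and the clamp that keeps the suppression gradient non-negative are simplifying assumptions rather than the thesis implementation.

```python
import numpy as np

def cinnic_iteration(x, K, v_prev, V_prev, g_prev,
                     A=1.0, L=0.05, F=0.1, R=1.0, T=1.0, beta=1.0, group=8):
    """One update of a single-scale sub-corresponding group (eqs. 3.6-3.13).

    x       (12, N, N) orientation-filtered input layers
    K       (12, 12, h, w) hyper-kernel slices (positive = excite, negative = inhibit)
    v_prev  (12, N, N) neuron activity from the previous iteration (drives priming)
    V_prev  (N, N) previous top-layer leaky-integrator map
    g_prev  (N, N) previous group-suppression map (start it at ones)
    """
    n_ori, N, _ = x.shape
    h, w = K.shape[2:]
    pad_i, pad_j = h // 2, w // 2

    f = np.clip(v_prev * F, 1.0, 5.0)                 # eqs. (3.9)-(3.10): fast plasticity
    xp = np.pad(x, ((0, 0), (pad_i, pad_i), (pad_j, pad_j)))
    fp = np.pad(f, ((0, 0), (pad_i, pad_i), (pad_j, pad_j)))
    gp = np.pad(g_prev, ((pad_i, pad_i), (pad_j, pad_j)))

    v = np.zeros_like(x)
    for a in range(n_ori):                            # eq. (3.6): pseudo-convolution
        for i in range(N):
            for j in range(N):
                xs = xp[:, i:i + h, j:j + w]
                fs = fp[:, i:i + h, j:j + w]
                # eq. (3.7): inhibitory kernel entries are scaled by group suppression.
                gs = np.where(K[a] <= 0.0, gp[i:i + h, j:j + w], 1.0)
                interact = np.sum(xs * gs * fs * K[a])
                v[a, i, j] = x[a, i, j] * (A * x[a, i, j] + interact)

    V = v.sum(axis=0) - L                             # eq. (3.8): column sum with leak

    # eq. (3.11): suppression builds with the local activity gradient, per group.
    g = np.zeros_like(g_prev)
    for gi in range(0, N, group):
        for gj in range(0, N, group):
            sl = (slice(gi, gi + group), slice(gj, gj + group))
            grad = np.sum(V[sl] - V_prev[sl])
            g[sl] = R * max(grad - T, 0.0) + g_prev[sl]   # clamp: suppression only grows past T

    I = 1.0 / (1.0 + np.exp(-2.0 * beta * V))         # eqs. (3.12)-(3.13): sigmoid saliency map
    return v, V, g, I

# Minimal usage with random stand-ins for the input and kernel:
x = np.random.rand(12, 32, 32)
K = np.random.randn(12, 12, 11, 11) * 0.01
v, V, g, I = cinnic_iteration(x, K, np.zeros_like(x),
                              np.zeros((32, 32)), np.ones((32, 32)))
print(I.shape)   # (32, 32) contour saliency map for this scale
```

Across scales, the per-scale maps I would then be combined as in eq. (3.14), M = sum over u of w_u * I_u / n_u.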
The idea was that our model should be viable at several levels. First we looked at how our model worked with simple element interactions. For instance, how would our model work on a Gabor patch with two flankers only. In this, we should see saliency enhancement with greater collinear alignment as observed in humans (Polat & Sagi, 1993b, Polat & Sagi, 1993a). Additionally, enhancement should extend beyond a small number of elements. That is, we needed to check if our model worked on chains of Gabor elements. This would validate our model against data that shows that enhancement is formed for collinear Gabor elements, against background noise Gabor elements, along paths extending beyond the receptive field of V1 neurons (Braun, 1999). The third level of validation involved real world images. This was the next logical increment. That is, we 98 first test if our model works on a few simple Gabor elements (simple, local), then we test longer chains of Gabor elements with Gabor noise (simple, non-local), finally we test on natural images (complex, non-local). We should expect to find validity of our model at all three levels if we are to claim that it could be a reasonable approximation to contour integration in humans. Additionally, we also report on results that suggest that the CINNIC model is also sensitive to junctions and end-stops. This is to illustrate the generalization of the CINNIC model as well as demonstrate possible efficiencies in visual cortex for finding junctions with the same or a similar mechanism as used for contour integration. Additionally, a unified mechanism that finds contours and junctions may help explain some psychophysical observations made by others, which we discuss later. 3.3.1 Local element enhancement As has been discussed, contour integration behavior can be seen in cases where only a few Gabor or other directionally specific elements, such as a line segment, flank one element (Polat & Sagi, 1993b, Polat & Sagi, 1993a, Kapadia, Ito, Gilbert & Westheimer, 1995, Gilbert et al., 1996, Kapadia, Westheimer & Gilbert, 2000, Freeman, Driver, Sagi & Zhaoping, 2003) We attempted to replicate work by Polat and Sagi (Polat & Sagi, 1993b, Polat & Sagi, 1993a), which shows that a Gabor element when flanked by one collinear Gabor on either side can be enhanced from this arrangement. That is, the ability to detect the Gabor element in the center is increased or in some cases decreased as two flanking Gabors are altered for distance from the central Gabor. Enhancement changes should also be observed with alterations in contrast/amplitude for the Gabors. The results obtained by Polat and Sagi show that when the flanking elements are moved 99 away from the central Gabor in increments of λ (which is the wavelength size for the Gabor wavelet and is used as the measure for the separation between Gabor elements), at very close distances, flanking Gabors seem to make it harder to detect the central Gabor. Maximal enhancement is obtained when the flanking Gabors are separated from the central Gabor by approximately 2 λ. However, as the flankers are moved even further away, the enhancement effect seems to be completely diminished. This reaches a complete absence of enhancement when separation is about 12 λ. Figure 3.9: The program makes a decision as to which of two images has the target in it. The model estimates this decision by taking the probability of a decision as the Poisson of the output at the target. 
The error is the complementary error function (erfc) of the two distributions for both target and non-target. Target amplitude is changed until the error rate is 25% (75% correct). This response marks the relative enhancement.

Using this experiment as a guide, we optimized the kernel parameters of our model to produce an outcome that resembled the above as closely as possible. This was done by creating Gabor images with flankers at 0, 1, 2, 3, 4 and 12 λ. We created our Gabor images as closely as possible to the ones used in their experiments (Polat & Sagi, 1993b, Polat & Sagi, 1993a). Additionally, the images could have alterations to the amplitude of the target Gabor in the same way they altered their image targets. In their experiments, they found the amplitude of enhancement for a center Gabor element when flanked by two collinear Gabor elements of the same size using a two-alternative forced choice paradigm. That is, they showed two images and forced a participant to choose which one contained the central Gabor and which contained only the flankers with no central element. When the amplitude of the central element yielded a 75% correct rate, that was considered the threshold amplitude of detection for that particular Gabor-separation condition. They then mapped the relative enhancement of the target Gabor in each condition by comparing it with a single stand-alone Gabor with no flankers, which served as the baseline detection threshold.

We achieved a similar result by estimating the error rate using the error function of the Poisson obtained from the output of the target/no-target conditions (Figure 3.9). This method, used previously by our group (Itti et al., 2000) and others, estimates the error from physiological observations, since noise and error in the brain follow a Poisson distribution (approximated by a Gaussian whose mean equals its variance). We model this with equation (3.15), where μ_1 and σ_1^2 are the mean and variance of the first distribution and μ_2 and σ_2^2 are the mean and variance of the second distribution. We could then show, given the output in the target/no-target condition, the probability that the model would pick one image over the other. This method was used because it gives dramatically increased performance over a Monte Carlo simulation for determining error in a two-alternative forced choice paradigm, which was pivotal for training our model, as will be described.

(3.15)   P(error) = \frac{1}{2} \, \mathrm{erfc}\!\left( \frac{\mu_1 - \mu_2}{\sqrt{2 \left( \sigma_1^2 + \sigma_2^2 \right)}} \right)

Figure 3.10: The algorithm was optimized against observer AM. The pre-optimized output has a similar shape, but approaches the performance results from observer AM following optimization of CINNIC using hill climbing. The decision process from the program yields results that are within 2 standard errors (0.05) at its greatest difference found at a separation of 2 λ.

What this means is that we showed our algorithm the target and no-target images. An intensity value was obtained from the saliency map M (eq. (3.14)) at the location corresponding to the target Gabor in the input image. The value from M was then treated as a mean value, with the expected standard deviation of outputs defined by the Poisson distribution. Using an iterative technique, the amplitude of the central Gabor was adjusted with a hill-climbing method with momentum until the correct rate was 75% +/- 1.
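A minimal sketch of this decision model and threshold search is given below. The response function is a toy stand-in for running CINNIC on a stimulus, and the shrinking-step search is a simplified substitute for the hill climbing with momentum described above; only equation (3.15) itself is taken from the text, together with the Poisson assumption that each variance equals its mean.

```python
import math

def p_error(mu_target, mu_blank):
    """Two-alternative forced-choice error under Poisson-like response noise (eq. 3.15).

    Responses are treated as Gaussian with variance equal to the mean, so the
    chance of picking the wrong image depends only on the two mean responses.
    """
    var = mu_target + mu_blank          # sigma1^2 + sigma2^2 with variance = mean
    return 0.5 * math.erfc((mu_target - mu_blank) / math.sqrt(2.0 * var))

def threshold_amplitude(response_fn, mu_blank, target_error=0.25,
                        amp=1.0, step=0.5, iters=200):
    """Crude search for the Gabor amplitude giving 75% correct (25% error).

    response_fn maps a contrast amplitude to the saliency-map value at the
    target location; here it stands in for running the model on the stimulus.
    """
    for _ in range(iters):
        err = p_error(response_fn(amp), mu_blank)
        amp += step if err > target_error else -step   # raise amplitude if too hard
        step *= 0.95                                    # shrink the step each iteration
    return amp

# Toy stand-in response: saliency grows linearly with target amplitude.
amp75 = threshold_amplitude(lambda a: 5.0 + 10.0 * a, mu_blank=5.0)
print(round(amp75, 3))
```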
The amplitude at threshold was then compared with the output from an image with a single unflanked element, to measure relative enhancement just as in the study by Polat and Sagi. Our results were then compared with their results. The error was tallied and used to drive a second custom gradient descent search algorithm whose goal it was to minimize the error between our results and theirs by adjusting kernel parameters. As can be seen in figure 3.10, error was reduced substantially and fit Polat & Sagi’s experimental output for subject AM almost perfectly with a maximum error at less than 2 standard errors off of subject AM’s results, as estimated for this experimental paradigm in Polat & Sagi, (1993b) pg 76 and Polat & Sagi, (1993a) pg 995. These results fare particularly well for our model because not only do they fit the experimental result of Polat and Sagi, but they also have the same eccentric nature of reducing enhancement for Gabors that are particularly close. To illustrate why we observed the result of decreased enhancement at very close distance between Gabors, kernel slices from CINNIC were extracted and interacted with targets of different sizes to measure the enhancement when two targets are moved closer or further away. What we discovered is that with larger targets of approximate size, 4 λ, when compared with the 64x64 scale kernel, had the ability to contact neurons that were in inhibitory regions as well as the excitatory regions. This stimulus is about the same size as the Gabors used in the above study that were about 3.5 λ in size. The decreased enhancement occurred as the elements moved closer to each other. Figure 3.11 shows that as target objects get larger, they begin to have far stronger inhibitory ability at close distances. Thus, for enhancement, given a wedge shaped excitation range, there is an 103 optimal distance for enhancement between two elements, with that distance being closer for smaller Gabors. Also note that enhancement begins to fall off between 2.4λ and 1.6λ. This is where you would expect it to fall given the current reviewed psychophysical data and the outcome of CINNIC. Figure 3.11: As a collinear element draws closer, its receptive field begins to overlap another element’s region of surround inhibition (red). Here the stimulus element sizes may be compared with the kernel at the 64x64 pixel scale, which are 2.396 λ (3 pixels), 4 λ (5 pixels) and 5.597 λ (7 pixels). The separations for elements shown are at 2.4 λ, 1.6 λ and .8 λ. Here we interacted two single elements with a kernel. As elements get larger and closer, it can be seen that enhancement dips. Careful analysis shows that this is due to overlap of elements into inhibition zones, in the surround, as they move closer. Thus, no special kernel, or neural structure is necessary to create inverse enhancement at very close distances between two elements. This explains the dip in enhancement at close distances observed in CINNIC and by Polat and Sagi. 3.3.2 Non-local Element Enhancement Further testing of CINNIC was done using a special program called Make Snake provided and created by Braun (1999). This was used to generate test images in which a salient closed contour is embedded among noise elements. Using these stimuli, we tested 104 under which conditions our algorithm would detect the contour elements as being the most salient image elements. Make Snake creates images like the one presented in figure 3.1. 
The output is several Gabor patches aligned with randomized phase into a circular contour. The circle itself is carefully morphed by the program using energy to flex the joints of an “N-gon” to create a variety of circular potato-like contour shapes. The circles made up of foreground elements are controlled for the number of elements as well as the spacing in λ sinusoidal wavelengths. The elements can also be specified in terms of size and wave period. Background noise Gabors are added randomly and are the same size as foreground elements but may be at different separation distances. They are placed in such a way that they are moved like particles in liquid to a minimum spacing specified by the user. Gabors are added and floated until minimum spacing requirements are satisfied. The end result can also create accidental smaller contours among the noise background elements. Test images were created 1024x1024 pixels in size and corresponded to a simulated total visual angle of 7.37x7.37 degrees. Test images were created using two different Gabor sizes, a small Gabor (70 pixels wide with a 20 pixel Gabor wave period) and a large Gabor (120 pixels wide with a 30 pixel wave period). The background elements were kept at a constant minimum spacing (48 pixels for the smaller Gabors and 72 pixels for the larger Gabors). Spacing for larger foreground Gabors (120 pixel size) was varied between 2 and 3.5 λ in steps of 0.1666 λ. This was constrained since values above 3.5 λ made the contour circle larger than the image frame itself. The smaller Gabors (70 pixel size) had more leeway and could be varied from 1.5 to 6 λ in steps of 105 0.5 λ. For both Gabor sizes, the minimum spacing is set the way it is because below this, the foreground elements begin to overlap. It should be noted that the ratio of foreground separation to the minimum background separation was the same for both large and small Gabor patch conditions given the same λ. That is, the background elements had the same constant λ separation for all images. The smaller Gabors in these tests were the same size in pixels as the Gabors used in the experiment in 3.1. This size corresponded to a visual size of 0.5 degrees. Figure 3.12: Input images created by Make Snake are run through CINNIC. The output saliency map is processed to find the five most salient points. These five points are compared with a mask that represents the position of foreground contour elements. This allows the ground truth for such images to be determined with greater ability since foreground elements are controlled. 106 For each condition, Gabor size and foreground spacing, 100 images were created. An output mask was also created representing where foreground elements were positioned. This was used for later statistical analysis. In all, 2000 images were created and tested. Figure 3.13: The results from processing 2000 images from Make Snake by CINNIC are shown. The sum of all images where the most salient point was on a foreground contour is shown in dark gray for each of the λ separation conditions. In the experiment all images where the second most salient point was on a foreground element but the first was not are labeled 2 nd and are in a lighter shade of gray. In each condition, the general saliency result can be seen by summing the number of images where a foreground element is among the five most salient points found. At separations between 2.4 λ and 3.2 λ foreground and background element separation is about the same. 
At 5.14 λ, elements fall beyond the reach of enhancement defined by the finest-resolution kernel, so we expect a drop-off to begin here. There is a slight pickup in enhancement between 3.2 λ and 5.14 λ, perhaps due to an optimal separation at which elements do not overlap each other's inhibition regions.

Statistical analysis was done by taking the output saliency map from CINNIC, which was always run with identical model parameter settings for all images, and comparing it to the mask. This was done by looking for the most salient points in the combined saliency map M (eq. (3.14)). When a salient point was found, the local region was concealed by a disk to prevent the same element area from being counted twice. Salient points were ranked first, second, third and so on according to their values in the saliency map: the most salient point was ranked first, the second most salient point was ranked second, and so on. Analysis proceeded by finding the highest-ranked salient point in an image that also fell within the foreground element mask; the rank of that point became the rank assigned to the image. For instance, if the most salient point CINNIC found that also corresponded to a real contour element, as indicated by the mask, was the second most salient point overall, that image was given a rank of second. The number of images at each rank was then summed to find out, for instance, how many images had their most salient point lie within the mask (ranked 1st). Figure 3.12 illustrates how the images and the resulting saliency maps looked after processing, compared with an example of the masks used to rank the contour images.

Table 3.1: As λ separation increases between foreground elements, saliency decreases. For the images with smaller Gabors, around 75% of all images with a foreground separation of 1.5 to 5 λ have a foreground element among the top five most salient points. The probability of obtaining such a result at random is far less than 0.005 percent. For images with the larger Gabor elements, almost all images contain a foreground element that is highly salient. Again the probability is very low, suggesting that the null hypothesis should be rejected.

Figure 3.14: The declining performance of CINNIC at increasing λ separation is easy to understand by inspecting the contour images at 1.5, 3.5 and 6.0 λ of foreground separation. Casual observation shows that saliency decreases with larger separation of contour elements; at 6.0 λ the contour elements are almost invisible.

As can be seen in figure 3.13, for the larger Gabors of size 120, the top five most salient points fall on a contour in at least 95 of 100 images for all conditions. For half the conditions, all 100 images have a top-five salient point falling on the contour. Further, we analyzed the probability of obtaining these results at random. This was done by counting the number of pixels inside the mask and the number of pixels outside it, which determines the probability of a salient point falling on the mask at random. Given 100 images and five samples per image, we could then use a Bernoulli binomial probability distribution to ascertain the probability of our results. This was done using equation (3.16), as defined by Hayes, p. 139, eq. 3.10.2 (Hayes, 1994), which applies when sampling from a stationary Bernoulli process with probability of success p and gives the probability of achieving exactly r successes in N independent trials.
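In code, the cumulative binomial test formalized in equations (3.16) and (3.17), given next, reduces to a short tail sum. A minimal sketch (names are illustrative; `scipy.stats.binom.sf(r - 1, N, p)` returns the same tail probability):

```python
from math import comb

def p_at_least(r, n, p):
    """Eq. (3.17): probability of r or more successes in n independent
    Bernoulli trials with success probability p (and q = 1 - p)."""
    q = 1.0 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(r, n + 1))

# Example: with n = 100 images, r = 95 top-five hits, and p set to the
# fraction of image pixels that lie inside the foreground mask,
# p_at_least(95, 100, p) gives the chance of the result arising at random.
```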
We use this to obtain the probability from binomial sampling as:

(3.16)    $p(r\ \text{successes};\, N, p) = \binom{N}{r} p^{r} q^{N-r}$

This is applied iteratively for all values greater than or equal to the number of successes: what we are interested in is the probability of r or more successes, which fills out the cumulative probability distribution. The final cumulative probability is:

(3.17)    $p(\geq r\ \text{successes};\, N, p) = \sum_{i=r}^{N} \binom{N}{i} p^{i} q^{N-i}$

where q = 1 − p. So for N = 100 trials and r = 95 successes, the sum runs from 95 to 100. From table 3.1 we see that the probability of obtaining these results at random for the larger Gabors is at most 4.0×10⁻⁵. The results for the smaller Gabors of size 70 are not as potent: the top five salient points fall on a contour element between 75 and 80 percent of the time. However, the probability of obtaining these results at random is still very small, at most p = 2.3×10⁻⁵, for conditions where foreground element separation ranges between 1.5 and 5.5 λ. Only in the condition with a separation of 6 λ are the results non-significant, at p = 0.465. This is understandable since, at larger separations of foreground elements, detection of contours becomes less tangible, as can be seen in figure 3.14.

Figure 3.15: The size of the kernel at each of the three scales is shown compared with a Make Snake image. The line on the Make Snake image shows the width of each kernel for close reference against an image with a foreground separation of 4.5 λ, which is the same separation as the peak observed in figure 3.13. As can be seen, when the image is reduced to 16x16, the kernel stretches across much of the image, but with little specificity of effect on the image due to the scale reduction.

A question raised by our results is why there seems to be an optimal separation distance in the data when no optimal distance is explicitly defined in the neural connection weights. We believe this is due to two factors. The first, as explained in our first experiment, is that as elements get too close, they tend to inhibit each other as they overlap with inhibitory regions. The second seems to be that group suppression begins to over-activate and has a greater likelihood of treating real foreground contour Gabors as noise background Gabors. That is, at closer distances, the gain for a foreground Gabor may be high enough to trip its own suppression. This, we believe, creates the slight dip in the size-70 Gabor results. Additionally, suppression from over-facilitation of local Gabor elements should be expected, since it has been found in psychophysical experiments (Polat, Mizobe, Pettet, Kasamatsu & Norcia, 1998). The final tapering off of the size-70 Gabor results seems to come as the Gabor separation becomes too large for the kernel in the 64x64 pixel scale sub-corresponding group to connect them. Thus, at 5.14 λ, the first kernel can no longer bridge between two Gabor elements and its contribution ends altogether in the final saliency map (Figure 3.15). It should also be noted that, using this same display, Braun (1999) noticed that one of two subjects showed a slightly improved threshold when the ratio of foreground element distance λ to background distance λ was increased from 1 to 1.25. The ratio 1.25 corresponds to between 3 λ and 4 λ of foreground separation in our results, which is slightly less than the peak at 4.5 λ in the data presented in table 3.1. That is, our results also peak near a ratio of 1.25.
As such it is not a perfect fit, but it does display an increase of enhancement at about the same ratio and drops off near a ratio of 1.6, which is between 3.8 λ and 5.1 λ . This corresponds with the drop off in threshold of human 112 subjects, which occurs at a ratio of about 1.6. As such enhancement of contours by CINNIC is within a similar range for drop off in threshold as observed in human subjects. 3.3.3 Sensitivity to Non-contour Elements 3.3.3.1 Sensitivity to Junctions In addition to selectivity for contour elements, we have found that CINNIC is sensitive to junctions and conditionally to end stops, which has been described in the visual cortex (Gilbert, 1994). This is important since junctions seem to hold important visual information, especially for reconstruction of geometric interpretation of objects (Biederman et al., 1999, Rubin, 2001). For instance, following a Geon theoretical construct for object identification, simple lines without junctions may lack certain necessary information since it may be harder to determine where line segments connect to each other. However, junctions hold more information than single lines since they contain the line projections as well as the determined junctions. Thus, a junction is a line plus its intersection and thus holds more information. It is also interesting to note this sensitivity to junctions since it creates a possibility that the mechanisms described in this paper are generic enough to be applied to not only contour finding, but to junction finding as well. That is, it is interesting to think that only mild augmentation of a corresponding group can change it from a contour detector to a junction detector or that one corresponding group may detect both junctions and contours at the same time. From a functionally simplistic standpoint, this is an attractive idea. Especially since the most interesting junctions are probably found at the 113 end of longer contours rather than shorter contours, such a synergy may also prove advantageous. For instance, when not wanting to walk into a desk, the corners and the center of the contour edges are very important to notice. Figure 3.16: The five images shown above demonstrate CINNIC’s sensitivity to junctions in the elemental shapes seen. Here, the most salient point is always on a junction (red circle) and there is always another point of very high saliency (in the top five) on a junction. When not falling on a junction, the most salient point is near the center between two junctions, which is quite possibly the second most important part to find salient. Some of the anomalies observed such as a saliency point in blank space are due to the algorithm blanking out the saliency map as it selects points to prevent it from picking the same point more than once. CINNIC was not designed explicitly to filter for junctions and end-stops. However, analysis of processed images seemed to reveal this ability as can be seen in figure 3.16. For plus shaped cross junctions and T-junctions it is easy to show that CINNIC should be sensitive to these types of image features. This is because CINNIC was designed without orthogonal suppression. Thus, two orthogonal lines will not cancel out. Additionally, since two orthogonal lines are processed in two separate layers in the corresponding group which are summed, the junction of two line segments are additive. This can be seen in equation (3.8). 
Thus, if each pixel element in two intersecting lines is equal to one, the saliency map at the point of intersection will be equal to two. A similar argument applies to T-junctions, where the enhancement of the junction should be 1.5 times that of elements on either line segment. This is because a half line segment joining a full line segment should enhance less than a full line segment: a T-junction would intuitively have 1.5 times the excitation of a single line, rather than the 2 times of a plus junction.

Another interesting facet of these results is that they suggest a possible explanation for the reduced enhancement seen when the gestalt continuity of a line is violated. For instance, studies have shown that when a line is presented with two flanking lines its enhancement is greater than when one of the flankers is in the shape of a T (Kapadia et al., 1995). Such a result might be predicted by our model since the flanking T would be promoted to a higher saliency value than the central stimulus element. As such, continuity is not broken by suppression from the T so much as by the central element having a lower saliency than the T. It should additionally be noted that we do not predict enhancement of a central line element by two orthogonally oriented elements if they are divided by a large enough gap. That is, enhancement of junctions here likely relies on a joint overlap of the orthogonal lines.

While the evidence for plus and T-junctions is intuitive, it is less so for L junctions. Thus, we tested L junctions against the CINNIC kernel. In this case, we assumed perfect collinearity on the kernel, which allows us to test elements against only one kernel slice and keeps the analysis much simpler. Figure 3.17 shows the results of passing two types of L joints in front of the CINNIC kernel. Each L joint can be thought of as infinitely long; that is, the end-stops on the L junctions never pass in front of the kernel. Two types of L junction line segments are used: one is a two-pixel-wide line, while the other is one pixel wide. To determine the enhancement of a junction, we compare the enhancement of the pixel that lies on the junction with that of other pixels on the line. That is, we move the L over the kernel, and each pixel then reports some enhancement level. If CINNIC is sensitive to junctions, we would expect the junction pixel to be more enhanced than pixels on the line away from the junction.

Figure 3.17: These graphs show the enhancement of pixels from an image when convolved with an orthogonal slice from the CINNIC kernel. As can be seen in the top left graph, the corners on L junctions, both one and two pixels wide, are enhanced more than their neighbor pixels and other pixels along the L out to a distance of 4.8 λ. Additionally, in the top right, the corners on bars are enhanced over pixels outside of their receptive field (> 4.8 λ) along the same bar, as the two parallel edges are separated and as group suppression is added. The bottom row shows that end-stops with a point are not enhanced at baseline group suppression, but as suppression is added, the end point overtakes its three closest neighbors (.8, 1.6, 2.4 λ) once group suppression reaches 150%. This effect is not seen for the non-pointed bar. Thus, the current version of CINNIC is only conditionally sensitive to end-stops. Note that each pixel corresponds to a width of .8 λ with the 64x64 scale kernel.
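The sliding comparison described above is easy to prototype. The sketch below is illustrative only: a plain 2-D correlation with a single collinear kernel slice stands in for CINNIC's orientation-layered pseudo-convolution, and `kernel_slice` is assumed to have been extracted from the model.

```python
import numpy as np
from scipy.signal import correlate2d

def enhancement_profile(line_image, kernel_slice):
    # Each pixel's "enhancement" is the weighted sum of the line drawing that
    # falls under the kernel slice when the slice is centered on that pixel.
    return correlate2d(line_image, kernel_slice, mode="same")

# Toy L junction, one pixel wide; the arms run to the image border so their
# end-stops stay outside the kernel's field. Pixel (15, 15) is the junction.
L = np.zeros((31, 31))
L[15, 0:16] = 1.0    # horizontal arm
L[15:31, 15] = 1.0   # vertical arm
# With a kernel slice in hand:
#   profile = enhancement_profile(L, kernel_slice)
#   compare profile[15, 15] (junction) with profile[15, 9], profile[15, 4], ...
```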
116 For the one pixel width (.12 to .46 degree of visual field depending on the image scale, 64 x 64 to 16 x 16), it can be seen that the kernel will enhance the junction pixel more strongly than neighboring pixels along the line as far away as 5 pixels (.575 to 2.38 degrees). When the kernel is moved to a point, 6 pixels (.69 to 2.78 degrees) in distance from the junction pixel, enhancement is the same as for the junction pixel. This can be considered intuitively this way: A line segment that is half way through the kernel will enhance one half as much as a full line passing all the way through the kernel. However, at the junction, two halves sum to the same enhancement as a full line. Thus, by the 6 th pixel inward, enhancement is the same since the junction has moved outside of the kernels field and is now essentially a simple bar. So for any L junction, enhancement will be higher at the junction pixel than any other part in the line segment for a radius of 5 pixels. Very similar results are found with L junctions of width 2 (.23 to .93 degrees). However the maximal enhancement is found at the inner elbow junction and not the outer junction. That is, an L junction of two pixels in width has two pixel junctions, one on the inside and the other on the outside of the joint. The inner junction seems to have more enhancement for a radius of 5 pixels. Since the enhancement of the junction is isolated, this means that even if it has a similar enhancement of a line segment six pixels in, it may be enhanced more since it will not push the local region activity higher and increase the group suppression. Thus, enhanced lines are more likely to create levels of excitement that will trip group suppression than junctions, which are more isolated in their activity. From this it might be hypothesized that group suppression may aid in the discovery of L junctions in CINNIC. 117 3.3.3.2 Conditional Sensitivity to End-stops Using the same procedure as for the analysis of L junctions, we checked the sensitivity of CINNIC to end-stops. We found that there was some elevated sensitivity to end-stops, but only under certain conditions. Three conditions were tested. The first involved the outline of bars. Enhancement was tested for the junction area on the outline of a bar vs. an edge in the middle of the bar. The results in figure 3.17 show that when the bar is wide enough, sensitivity is increased for the end-stop junction. Additionally, this effect is increased as group suppression effects are added. Thus, the junctions on the end of bars are enhanced over elements in the middle of the bar by even greater amounts as group suppression is added. Further, the bar of width 4 (.46 to 1.86 degrees) becomes stronger then a middle segment when group suppression reaches 50% above normal. The second and third test involved passing a bar of width 2 in front of the Kernel. As can be seen the second bar was sharply pointed at its tip Figure 3.17. The kernel showed no enhancement for the end of the plain bar even if group suppression increased to 250%. However, for the pointed bar, enhancement was seen over the other 4 segments tested once group suppression reaches 150% above normal. Thus, it can be seen that CINNIC has sensitivity for some types of end-stops. This agrees well with research on V1 neurons which shows that most neurons in this region have some sensitivity to end-stops (Jones, Grieve, Wang & Silito, 2001, Sceniak & Hawken, 2001, Pack, Livingstone, Duffy & Born, 2003). 
Additionally it follows a very similar pattern of behavior seen in end-stop neurons in the cat visual cortex. In this case, end-stop sensitive neurons were found to detect end stops after an initial saturation 118 period. Thus, the neurons for a brief interval (<30 ms) were sensitive to non-end-stopped elements, but built up to end-stop sensitivity (Pack et al., 2003). Our model agrees with these observations since build up of group suppression increases end stop detection and would also create a delay for such detection as suppression builds. This is also similar to the model by Rao and Ballard (Rao & Ballard, 1999), which used a predictive feedback suppression mechanism to facilitate end-stop detection. However, the primary difference is that suppression in CINNIC comes from activity in the corresponding group and not from a higher-level process. 3.3.4 Real World Image Testing Real world image testing was conducted by inspecting the output of CINNIC on 132 real world images. We did this by inspecting each image and cataloging the results by hand. This was done due to the fact that classifying contours in an image a priori is extremely difficult due to the subjective nature of classifying image elements in a natural image. However, this has a new subjective drawback in that the efficacy is based on a post hoc analysis, which may carry a different expectation bias. In either case, the results are difficult to not bias. Either the experimenter subjectively leaves out or includes contours before the analysis is done, or the experimenter sees results in the post analysis due to a personal bias. The latter approach provides room for re-analysis and careful evaluation which we believe to be a strength in a situation where there seems to be a bias no matter which method is used. It is hoped that the reader will consider the previously presented material as showing some of the more controlled evidence of the model’s efficacy and take this as evidence of real world applicability. 119 Table 3 2: Post Hoc analysis of CINNIC for its sensitivity to certain kinds of features again suggests that it is not only sensitive to contours, but junctions as well. This can be seen as the most salient point in 42% of random real world images analyzed lies on a contour junction. Prior probability is not supplied since it is not known by us what the real incidence of contour junctions is in real world images. Thus, the true posterior significance is unknown. Each image was analyzed for salient content of contours, junctions, end-stops and short contours, which often include, for example, eyes or mouths. For each image it was noted what the nature of the most salient location was. So for instance, if the most salient contour location was on a junction, then that image was counted as having its most salient contour on a junction. Each image was thus counted in one of five exclusive groups (a) Contours without a junction, (b) junctions between two contours, (c) end-stopped points from a contour, (d) Short contours that tended to be eyes and mouths or (e) none of the above, which tended to mean it was a poor result. Table 3.2 shows the results. As can be seen, these results agreed with the analysis provided from junctions. In essence, it was observed that most of the top salient locations, as determined by CINNIC, seem to lie on a junction. Additionally, the conditional end-stop sensitivity can be seen in about 10% of all real world images. 
Thus, CINNIC has a strong sensitivity to contours at junction points and additionally has some sensitivity to end-stops, which is to be expected since most neurons in V1 have some end-stop sensitivity.

Figure 3.18: The five most salient points are shown in 12 real-world images processed by CINNIC (red is most salient, next is orange, etc.). Notice the prevalence of facial features, junctions and end-stops among them.

Since there do not seem to be any studies that establish the real prior probability of junctions in natural images, we are forced to read these results within a worst-case hypothetical framework. The significance of these results may then be interpreted as follows: since each junction in an image requires at least one line-segment edge pixel, there can never be more junction pixels than non-junction contour pixels. Thus, in a worst-case scenario, at most 50% of all detected contours would lie on junctions if the likelihood of falling on a junction vs. a non-junction were totally random. However, from our image analysis, contour junctions are more likely to be detected as the most salient object in an image than contours not on a junction. This analysis therefore again suggests that CINNIC is indeed more sensitive to junctions than to contour segments without junctions.

Additionally, it can be seen from figure 3.18 that in many images CINNIC finds facial features salient (all results for real-world images may be viewed at http://www.cinnic.org). In the 27 images where human or animal facial features are visible, CINNIC finds 14 to have salient facial features among the top five most salient points. Here we define facial features as noses, mouths, eyes or ears. That means that, based on contour analysis alone, half of all faces have a highly salient feature. This suggests that CINNIC may be able to play a role in a face-finding algorithm. It also suggests that contour integration mechanisms may be involved in a dual role that includes not only landscape contour finding but face finding as well. CINNIC seems sensitive to facial features such as short contours because they are isolated from other similar parallel lines on smooth faces; thus, even though they are short, they are not suppressed by anything else. The reason we believe face-feature finding is interesting is that it suggests CINNIC may approximate more generic mechanisms in visual cortex, and as such may be a closer fit to the processes that actually occur in the brain. For instance, it has been suggested that the interaction of simple horizontal and vertical lines derived from important facial objects such as eyes and noses plays a part in facial categorization (Peters et al., 2003). If this is correct, then a neural device that finds such features and can describe them in terms of lines may be necessary. Thus, it may be possible to augment the simple butterfly kernel connection with some of the other mechanisms described here to find a variety of different useful features.

3.4 Discussion

The CINNIC model performs contour integration and seems to satisfy the criteria of its design. First, it uses simple, biologically plausible mechanisms. Second, it performs its action with enough speed that a real-time implementation is within our grasp. Third, it helps to illuminate what processes are at work in human contour integration. Fourth, current examination of CINNIC shows its performance to be within the parameters of human contour integration as seen in psychophysical data.
The model is biologically plausible because all neural connections within the network are of types that are known to exist in the human brain. For instance, no neuron should connect to any neuron that is outside its physical reach. This means that no global mechanisms were introduced to control the gain of the network. Indeed, each neuron is independent from any other neuron for which it is not connected from its kernel interactions or through its group suppression. Our model then uses dopamine-like priming to connect neurons that do not directly connect. While this may not have been directly observed in V1, the actions of dopamine priming as well as other types of priming are well known to exist in the human brain (Schultz, 2002). Other models have explained linking using neural synchronization. While this has been observed in human neural networks, its observation and importance in the neocortex remains subject for debate. 123 Further, other computational models have shown dopamine modulation to be effective at linking sequences (Suri et al., 2001). Since visual contours are spatial sequences, this would show yet another way in which dopamine-like priming would be feasible in the long-range connection of contours. More evidence for the dopamine- priming hypothesis can be seen in the degradation of contour integration in patients with schizophrenia (Silverstein, Kovács, Corry & Valone, 2000). This lends support to a dopamine hypothesis since dopamine is one of the neurotransmitters suspected of playing a major role in schizophrenia (Kapur & Mamo, 2003), with effects seen in striatal dopamine neurons as well (Laruelle, Kegeles & Abi-Dargham, 2003). The group suppression we have used is also plausible because GABAergic interneurons of many types are found throughout the brain. Interneurons are also known to connect to many neurons at the same time, sending inhibitory synaptic currents to a possibly large population of pyramidal dopamine neurons (Durstewitz, Seamans & Sejnowski, 2000, Gao & Goldman-Rakic, 2003). The firing of these neurons has also been shown to have dramatic effects on the neurons they connect since they can exhibit spikes at very high rates (100 Hz) (Bracci et al., 2003) and can have low firing thresholds as well as a need for few inputs (Krimer & Goldman-Rakic, 2001) Also, the group suppression in our model uses an axonal reach that is about the same size as the reach for pyramidal neurons created by our kernel. Thus, it fits well within spatial constraints. Another feature, which makes out model unique, is that it not only works in saliency for contours, but also for junctions. As mentioned this was an unexpected result. However, it is very interesting for several reasons. The first is that it suggests that V1 and V2 neurons can have dual or multiple roles and that the feature detection dimensionality 124 within various processing units in visual cortex may be higher than is generally considered the case. Thus, following the logic behind the utilization of Gabors in vision, neural structures may exist, which have a broad utility. The structure for CINNIC my shed light on a structure that allows neurons to become sensitive to many different visual features but yet not be exotic from each other. That is, contours, end-stops and junctions may be detected by the same mechanisms, but the detectors are different due to subtle variations such that a base neuron is taken in infancy and morphed subtly to its new function through learning. 
However, a morphed neuron’s structure is still very similar to its original structure and is similar to other feature detectors that operate on seemingly unrelated features. Such a theory would be in agreement with observations that natural images can be described with a relatively small number of Gabor derived kernels very efficiently (Olshausen & Field, 1996). As such one might expect the flora of feature detectors to be somewhat constrained at this level of cortex. Additionally, the analysis here lends support to the notion of the importance of the temporal domain in perceptual dimensionality. That is, as has been suggested, (Prodöhl et al., 2003) perception may not just be a matter of the three dimensional structure of neurons, but may also hinge on the pattern of the working of neurons. As such, an end- stop detector is only an end-stop detector after a certain interval of suppression from interneurons. Prior to that its role may be different and it may be a simple contour detector. Since most neurons in V1 show end-stop sensitivity and end-stopped neurons take extra time to register those end stops, it seems feasible that a neuron may detect different kinds of features at different times. 125 3.4.1 Extending Dopamine to Temporal Contours via TD (dimensions) In addition to static contours, dynamic contours may also be enhanced by mechanisms of fast plasticity. For example, covert object tracking (in the absence of eye movements) could be enhanced by similar mechanisms as have been proposed here. This can be hypothesized since any neuron that receives an input in our model will attempt to prime its neighbors. When an object moves to the next neuron, it maintains a saliency enhancement (imagine the phosphors on an old TV still glowing in a trail as a dot moves across the screen). Additionally, neurons along the trajectory of the object will receive the greatest enhancement, which will maintain the saliency along that path. Because of dopamine’s involvement in fast temporal difference correlation (Suri et al., 2001, Schultz, 2002) it may be a natural candidate for such actions. Thus, the key to understanding temporal contours and smooth pursuit may merely lie with the basic contour integration mechanisms. Additionally, it is easy to imagine that the dopamine-like priming mechanism we have hypothesized here not only enhances contours, but may play an integral part in training the system in a similar manner as suggested in Rao & Ballard (1999). For instance, it has been proposed that observed movement of objects trains neurons to recognize contours (Prodöhl et al., 2003). As such, following our hypothesis, a dopamine-like priming may not only enhance contours, but may train contour integrating neurons. Since dopamine is known to play a role in reinforcement learning (Suri et al., 2001, Schultz, 2002) it is an excellent candidate for such a mechanism, and since it is already in place for the purpose of learning, an Occam’s razor reasoning would state that 126 if it can also fulfill the role of non-local interaction for contour integration, it is the most reasonable candidate to do so since that would be the simplest explanation. 3.4.2 Explaining Visual Neural Synchronization with Fast Plasticity It is important to note that temporal synchronization in vision does not necessitate correlated firing as a cause. 
For instance, Lee and Blake (Lee & Blake, 2001) observed that alternating motion of Gabor patches allowed greater facilitation of contours if the Gabor motion alternates in a correlated manner. That is, they displayed Gabor contour patterns much like the Make Snake patterns. However, the Gabors were given visual motion by changing the wave phase in a direction that created an orthogonal motion to the Gabor patches. The direction of the motion was randomized, but switching the direction of Gabor elements could be correlated. As such, in the highly correlated condition, direction was shifted simultaneously while in the low correlation condition, switching was somewhat random. Facilitation was observed when switching was correlated. We believe this can be explained by fast plasticity as follows. Due to collinear relation, neurons with different motion sensitivities will prime. For instance, two collinear Gabor patches, one moving in the direction of 0 degrees and one moving in the direction of 180 degrees will prime neurons in a Hebbian fashion. When the Gabors switch, two completely different sets of motion sensitive neurons will prime. Through this alternation, it will create two sets of mutually exclusive linked sets. By removing correlation it will begin to create cross-linked pairs of neurons and increase the number of 127 primed synapses which will increase noise in the network. As such the more synchronous the alternation of motion is, the more crisp the plastic connections will be (Figure 3.19). Figure 3.19: Temporal grouping can be explained by fast-plasticity mechanisms. If alternation is strongly correlated then plastic connections are strong and less ambiguous. Also, by the second alternation, all connections are primed unlike uncorrelated alternation where only some connections are primed. As such, correlated temporal alternation would facilitate neurons more strongly than a less correlated temporal structure if it used fast-plasticity based priming. 3.4.3 Contours + Junctions, Opening a New Dimension on Visual Cortex The research thus far agrees with work to date that suggests that V1 neurons are extremely powerful for extracting data from a scene (Olshausen & Field, 1996) Additionally, it also helps to validate hypotheses that suggest neurons in V1 have a high dimensionality for visual processing. That is, a neural group may not be responsible for just sensitivity to one feature, but may have sensitivity to multiple features. Additionally, interaction between partially sensitive neurons may create complete sensitivity. So for instance, if two or more groups have some sensitivity to end-stops, then the combination of their sensitivities may yield full sensitivity to end-stops. 128 It should also be noted that at least in terms of junctions, one would expect that the same mechanism would be responsible for finding L, T and + shaped junctions. This is due to recent research that suggests that searches for L vs. T vs. + junctions is inefficient (Wolfe & DiMase, 2003). That is, because we are unable to find different types of junctions faster among noise of different junction types, from a saliency stand point, one would expect that V1 or other saliency centers do not differentiate them and thus, would be explained by the brain using the same mechanism to find junctions irrespective of the type. 3.4.4 Model Limitations Like most computation models of biological systems CINNIC has its limitations. 
The first is that the model does not include effects on contour integration from color (Mullen et al., 2000). One reason for not accounting for color is that it would most likely add another dimension to the pseudo-convolution computation: in addition to orientation and position, color would become a third set of dimensions, making the hyper-kernel six-dimensional with the addition of blue-yellow and red-green channels. The model also does not account for enhancement of parallel elements; this, as mentioned previously, is when Gabor elements are aligned like the rungs of a ladder. The primary question about parallel enhancement is where it occurs. For instance, is there a second set of contour integrators for parallel elements, or do parallel elements enhance within the exact same corresponding group as collinear elements? Such questions still need to be answered. If they do enhance in the same corresponding group, then the shape of neural receptive fields in a contour integration model may need to be rethought, since the classic butterfly shape used in most contour integrators cannot account for such enhancements.

An additional limitation is that inhibition and excitation are treated with temporally similar dynamics at the kernel level. This may be considered a weakness of the model. However, it should be remembered that inhibition does have a build-up pattern via the group suppression mechanism; as such, temporal differences between excitation and inhibition mechanisms are partially addressed. Indeed, as mentioned, the key to detection of L junctions and end-stops by contour integrators may be the temporal difference between excitation and inhibition.

3.5 Conclusion

We believe we have created a reasonable model and simulation of contour integration for saliency in visual cortex. As the results have shown, we have fit the results of human observers to within two standard errors for a single Gabor element with two flankers. We have also achieved reasonable, statistically significant results for images with multiple Gabor elements. Taken with our results from real-world images, we suggest that this makes our model a reasonable approximation of human contour integration. Additionally, we believe that our model demonstrates how the neural mechanisms for contour integration may be extended to other types of feature processing.

Chapter 4: Using an Automatic Computation of an Image's Surprise to Predict Performance of Observers on a Natural Image Detection Task

What are the mechanisms underlying human target detection in RSVP (Rapid Serial Visual Presentation) streams of images, and can they be modeled in such a way as to allow prediction of subject performance? This question is of particular interest since, when images are presented at high speed, humans can detect some but not all images of a particular type (target images; e.g., images containing an animal) that they would be able to detect with far greater accuracy at a slower rate of presentation. Our answer to this question is that two primary forces related to attention are at work, as part of a model with two or more stages (Reeves, 1982, Chun & Potter, 1995, Sperling et al., 2001). Here we will suggest that the first stage is purely an attentional mask, with the blocking strength of attention given by image features that have already been observed. The second stage, on the other hand, can block the perception of an image if another image is already being processed and is monopolizing its limited resources.
We consider here the metric of Bayesian surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) to predict how easily a target image containing an animal may be found among 19 frames of other natural images (distracters) presented at 20 Hz. In a first experiment, we show that surprise measures are significantly different for target images which subjects find easy to detect in the RSVP sequences vs. those which are hard. We then present a second experiment which attempts to predict subject performance by utilizing the surprising features to determine the strength of attentional capture and masking. This is 131 done using a back-propagation neural network whose inputs are the features of surprise and whose output is a prediction about the difficulty of a given RSVP sequence. 4.1.1 Overview of Attention and Target Detection It has long been argued that attention plays a crucial role in short term visual detection and recall (Neisser & Becklen, 1975, Hoffman, Nelson & Houck, 1983, Duncan, 1984, Mack & Rock, 1998, Tanaka & Sagi, 2000, Sperling et al., 2001, VanRullen & Koch, 2003a, Wolfe, Horowitz & Michod, 2007). This also applies to detection of targets when images are displayed, one after another, in a serial fashion (Raymond et al., 1992, Duncan, Ward & Shapiro, 1994). Many studies have demonstrated that a distracting target image presented before another target image blocks its detection in a phenomenon known as the attentional blink (Raymond et al., 1992, Marois, Yi & Chun, 2004, Evans & Treisman, 2005, Sergent, Baillet & Dehaene, 2005, Maki & Mebane, 2006, Einhäuser, Koch & Makeig, 2007a, Einhäuser et al., 2007b). Thus, one image presented to the visual stream can interfere with another image that quickly follows, essentially acting as a forward mask (e.g. target in frame A blocks target in frame B). Additionally, with attentional blink, interference follows a time course, whereby optimal degradation of detection and recall performance for a second target image can occur when it follows the first target image from 200-400 ms (Einhäuser et al., 2007a), which is evidence of a second stage processing bottleneck. In most settings, an intermediate distracter between the first and second targets is needed to induce an attentional blink, a phenomenon known as lag-1 sparing (Raymond et al., 1992). 132 However, for some types of stimuli, such as a strong contrast mask with varying frequencies and naturalistic colors superimposed with white noise, interference can occur very quickly (VanRullen & Koch, 2003b). This may create a situation whereby for some types of stimuli, a target is blocked by a prior target with a very recent onset (<50 ms prior), or by contrast, a much earlier onset (>150 ms prior). As such, there seems to be a short critical period with a U-shaped performance curve where interference is reduced against the second target. That is, interference is reduced if the preceding distracter comes in a critical period of approximately a 50-150 ms window before the target, but is larger otherwise. This interval we will generically refer to as the sparing interval. In addition to interference with the second target, detection of the first target itself can be blocked by backward masking (e.g. target in frame B blocks target in frame A) (Raab, 1963, Weisstein & Haber, 1965, Hogben & Di Lollo, 1972, Reeves, 1980, Breitmeyer, 1984, VanRullen & Koch, 2003b, Breitmeyer & Ö ğmen, 2006 ). 
However, in natural scene RSVP, backward masking occurs at a very short time interval, < 50 ms without a good ability to dwell in time (Potter, Staub & O'Conner, 2002) That is, interference is not U-shaped in the same way as with forward masking. As we will mention later, longer intervals (>150 ms) may conversely enhance detection of the first target (e.g. target in frame B enhances target in frame A). However, backwards masking in the case of RSVP still retains a U like shape as the effects of the mask peak and decrease. The difference is that it does not have a second almost discrete episode of new masking following a short interval of sparing as forward masking does. That is, once the backwards mask fades in effect the first time, it is finished masking. The forward mask on the other hand has the ability mask twice. 133 Putting these pieces together, there is a lack of literature showing a strong reverse attentional blink which would be produced by a second interval of backwards masking. With different stimuli, both forward and backward masking can be observed over very short time periods by flanking images in a sequence. However, only forward masking interferes with targets observed several frames apart, following a sparing lag with a much higher target onset latency. It should be noted that these masking effects are not universally observed in all experiments. As such, the mechanisms responsible for masking are dependant to some degree on both masking and target stimuli, which may result in a target being spared or completely blocked. Much like masking from temporal offsets, if the first target is displayed spatially offset from the second target, interference is decreased for recall of the first target (Shih, 2000). As an example, a large spatial offset would occur if a target in frame A is in the upper right hand corner while a target in frame B is in the lower left hand corner. Thus, if we overlapped the frames, the targets themselves would not overlap. As a result, recognition of individual target images allows for some parallel spatial attention (Li, VanRullen, Koch & Perona, 2002, Rousselet, Fabre-Thorpe & Thorpe, 2002, McMains & Somers, 2004). This is also seen if priming signals for target and distracter are spatially offset (Mounts & Gavett, 2004). However, at intervals over 100 ms, spatial overlap may actually prime targets in an attentional blink task (Visser, Zuvic, Bischof & Di Lollo, 1999). We then should gather that objects offset in space lack some power to mask each other, but in contrast, may at longer time intervals lack the ability to prime each other. Additionally it has been found that more than one object at different temporal offsets can be present in memory pre-attentively, at the same time, but only if they do not overlap at 134 critical temporal, spatial and even feature offsets (VanRullen, Reddy & Koch, 2004). However, there is reduced performance as more items, such as natural images, are added in parallel (Rousselet, Thorpe & Fabre-Thorpe, 2004). Thus, even if objects do not interfere along critical dimensions, performance may degrade as a function of the number of complex distracters added. 4.1.2 Surprise and Attention Capture Prediction of which flanking images or objects will interfere with detection of a target might be accounted for in a Bayesian metric of statistical surprise. Such a metric can be derived for attention based on measuring how a new data sample may affect prior beliefs of an observer. 
Here, surprise (Itti & Baldi, 2005, Itti & Baldi, 2006), is based on the conjugate prior information about observations combined with a belief about the reliability of each observation (For mathematical details, see Appendix B and C). Surprise is strong when a new observation causes a Bayesian learner to substantially adjust its beliefs about the world. This is encountered when the distribution of posterior beliefs highly differs from the prior. The present paper extends our previous work (Einhäuser et al., 2007b), by optimally combining surprise measures from different low- level features. The contribution of different low-level features to “surprise masking”, and thus their role in attention, can be individually assessed. Additionally, we will demonstrate how we have extended on this work by creating a non-relative metric that can compare difficulty for RSVP sequences with disjoint sets of target and distracter images. That is, our original work was only able to tell us if a new ordered sequence was relatively more difficult than its original ordering. The current work will focus on giving 135 us a parametric and absolute measure based on how many observers should be able to spot a target image in a given sequence set. While surprise has been shown to affect RSVP performance, it remains to be seen how surprise from different types of image features interacts with recall. Importantly, critical peaks of surprise, along specific feature dimensions, can be measured and used to assess the degree to which flanking images may block one another. For instance, should an image with surprising horizontal lines have more power in masking a target than an image with surprising vertical lines? This is important since some features may be more or less informative, for instance if they have a low signal to noise ratio (SNR) between the target and distracters (Navalpakkam & Itti, 2006). Additionally, some features may be primed in human observers, making them more powerful. As an example, if features can align and enhance along temporal dimensions (Lee & Blake, 2001, Mundhenk, Landauer, Bellman, Arbib & Itti, 2004a, Mundhenk, Everist, Landauer, Itti & Bellman, 2005b) in much the same way they do spatially (Li, 1998, Yen & Fenkel, 1998, Li & Gilbert, 2002, Mundhenk & Itti, 2005 ), then some features that appear dominant may have a fortunate higher incidence of temporal and/or spatial colinearity in image sequences. In order to predict and eventually augment detection of target images, a metric is needed that measures the degree of interference of images in a sequence. Here we follow the hypothesis that temporally flanking stimuli interfere with target detection, if they “parasitically” capture attention so that the target may fall within an attentional blink period (e.g. target in frame A and C may interfere with target in frame B). The challenge is then to devise measures of such attentional masking for natural images using stimulus statistics. We herein derive and combine several such measures, based on a previously 136 suggested model of spatial attention that uses Bayesian surprise. Note that this “surprise masking” hypothesis does not deny other sources of recognition impairment, e.g., some targets are simply difficult to identify even in spatio-temporal isolation. Instead “surprise masking” addresses recall impairment arising from the sequential ordering of stimuli, rather than the inherent difficulty of identifying the target itself. 
If surprise gives a good metric of interference and masking between temporally offset targets in RSVP, we should expect that by measuring surprise between images by its spatial, temporal and feature offsets we will obtain good prediction of how well subjects will detect and recall target images. To this end, we first present evidence that surprise gives a good indication of the difficulty for detecting targets in RSVP sequences and that it elegantly reveals almost classic intrinsic masking effects in natural images. We then show for the first time a neural network model that predicts subject performance using surprise information. Additionally, we discuss the surprise masking observed between images in an RSVP sequence within the framework of a two stage model of visual processing (Chun & Potter, 1995). 4.2 Methods 4.2.1 Surprise in Brief Although the mathematical details of the surprise model are given in appendix B, we provide here an intuitive example for illustration. The complete surprise system subdivides into individual surprise models. Each of these individual models is defined by its feature (color opponents, Gabor wavelet derived orientations, intensity), location in the image and scale. The respective feature values will be maintained over time, in the 137 current state, at each location in the image, by feeding the observed value to a surprise model at that location. That is, every image location, feature and scale has its own surprise model. New data coming into surprise models can then by compared with past observations in a reentrant manner (Di Lollo, Enns & Rensink, 2000) whereby observed normalcy by the models can guide attention when there is a statistical violation of it. Different frequencies of surprising events are accounted for with several time scales by feeding the results of successive surprise models forward. Imagine that we present the system initially with the image of a yellow cat. If the system is continuously shown pictures of yellow cats, many surprise models will begin to develop a belief that what they should expect to see are features related to yellow cats. However, not all models will develop the same expectation. Models in the periphery will observe the corners or edges of the image, which in a general sample would be more likely to contain background material such as grass or sky. If after several pictures of yellow cats a blue chair is shown, surprise will be largest with models which had been locally exposed more often to the constant stimuli of the yellow cats. Additionally, surprise should be enhanced more for models which measure color, than for models which measure intensity or line orientations since, in this example, color was more constant prior to the observation of the blue chair. Finally, surprise can be combined across different types of features (channels), scales and locations into a single surprise metric. This combination can either be hardwired (Einhäuser et al., 2007b) or optimized for the prediction task (the present paper). As is typical for Bayesian models, there is an initial assumption about the underlying statistical process of the observations. Since we wish to model a process of 138 neural events, which should bring us closer to a neurobiological like event of surprise, a Poisson / Gamma process is used as the basis. The Gamma process is used since it gives the probability distribution of observing Poisson distributed events with refractoriness such as neural spike trains (Ricciardi, 1995). 
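For readers who prefer code to prose, the per-location update and surprise computation detailed in Appendix B take roughly the following form. This is a hedged illustration of the standard Gamma/Poisson conjugate update with a forgetting factor, written from the description below (figure 4.1) and from Itti & Baldi (2006); the exact decay constant, data scaling, and combination across scales used in the actual implementation may differ.

```python
import math
from scipy.special import gammaln, digamma

def update_and_surprise(alpha, beta, sample, zeta=0.7):
    """One local surprise model: decay the Gamma belief (shape alpha, rate
    beta), fold in the new observation, and report surprise as the KL
    divergence from the prior belief to the updated belief."""
    a_new = zeta * alpha + sample   # decayed evidence plus the new data value
    b_new = zeta * beta + 1.0       # decayed observation count plus one
    # KL( Gamma(a_new, b_new) || Gamma(alpha, beta) ), shape/rate form
    surprise = ((a_new - alpha) * digamma(a_new)
                - gammaln(a_new) + gammaln(alpha)
                + alpha * (math.log(b_new) - math.log(beta))
                + a_new * (beta - b_new) / b_new)
    return a_new, b_new, surprise
```

Running one such model per feature, location, scale and time scale, and combining the results, yields the surprise maps used in the rest of this chapter.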
Surprise is then extended into a visually relevant tool by combining its functioning with an existing model of visual saliency (Itti & Koch, 2001a) (see also another saliency model: Li, 2002). As a less abstract example, the Gamma process can be used to model the expected waiting time to observe event B after event A has been observed. If events A and B are neural spikes, we can see how it is useful in gauging the change in firing rates related to alterations in visual features.

Figure 4.1: With an image sequence, we run each frame through the surprise system. Feature maps such as orientation and color are extracted from the original images in the sequence (noted with "⊗"). Each feature map is then processed for surprise in both space and time. In spatial surprise, a location's beliefs about what its value should be are described by the Gamma probability process with the hyperparameters α′p, β′p. These are compared with those of the surround (αp, βp) within the current frame in the surprise model (noted with the box "!") (for details see Appendix B). In temporal surprise, each new location (α′t, β′t) is compared with the same location in a previous frame (αt, βt) using the same update and surprise computation as in space. Space and time each have their own α and β hyperparameters. Spatial and temporal surprise for each location within the same frame are merged in a weighted sum "Σ". Additionally, α′t, β′t and α′p, β′p are successively fed forward into further models several times to give the system sensitivity to surprise at different time scales; the time scales create sensitivity to recurring events at different frequencies. The complete surprise for each frame of the same feature is the product (shown with the box "Π") of the merged spatial/temporal surprise at each time scale.

Here the Gamma process is given with the standard α and β hyperparameters, which are updated with each new frame and are part of a conjugate prior over beliefs (figure 4.1). Abstractly, α corresponds to the expected value of the visual features while β reflects the variance we have over the features as we gather them. Since we have a conjugate prior, the old α and β values are fed back into the model as the current best estimate when computing the new α and β (denoted α′ and β′) values for the next frame. In a sense, the new expected sample value and variance are computed by updating the old expected sample value and variance with the new observations. This gives us a notion of belief as we gather more samples, since we can have an idea of confidence in the estimates of our parameters. Additionally, as observations are added, so long as our beliefs are not dramatically violated, surprise will tend to relax over time. To put this another way, metaphorically speaking, the more things a surprise model sees, the less it tends to be surprised by what it sees.

4.2.2 Using Surprise to Extract Image Statistics from Sequences

With RSVP image sequences, each image is fed like a movie into the complete surprise system, one at a time. As the system observes the images, it formulates expectations about the values of each feature channel over the image frames. As with the yellow cat example, images presented in series that are similar tend to produce low levels of surprise. High peaks in surprise then tend to be observed when a homogeneous sequence of images gives way to a rather heterogeneous one.
Note that the term homogeneous is used loosely here, since repetitions of heterogeneous images are, in terms of surprise, homogeneous. That is, certain patterns of change can be expected and believed to be normal. Even patterns of 1/f noise have an expected normalcy and as such should not necessarily be found surprising. It should also be noted that surprise is measured not just between images, but within images as well. We suggest that an image can stand on its own with its inherent surprise. As a result, surprise is a combination of the spatial surprise within an image and the temporal surprise between images in a sequence. From the surprise system we can also analyze surprise for different types of features. This is possible since the surprise model extracts initial feature maps in a manner identical to the saliency model of Itti and Koch (2001). So, for instance, for any given frame, we can tell how surprising the red/green color opponents are because we have a map of the red/green color opponents for each frame. A surprise model is then created just to analyze the surprise for the red/green color opponents. The entire surprise system thus contains models of several feature channels which can be taken as a composite in much the same way as the Itti and Koch saliency model does. A large number of features are supported within the surprise system, but here we use intensity, red/green and blue/yellow color opponents, and Gabor orientations at 0, 45, 90 and 135 degrees.

Figure 4.2: The top row of images are three from a sequence of 20 that were fed into the system one at a time. The bottom row shows the derived combined surprise maps, which give an indication of the statistical surprise the models find for each image in the sequence. Each map is a low-resolution image with a surprise value at each pixel, which corresponds to a location in the original image. Basic statistics are derived from the maximum, minimum, mean and standard deviation of the surprise values in each surprise map, as well as the distance between the locations of maximum surprise among images, giving us five basic statistic types per image.

As mentioned, the final surprise map is computed from each feature type independently, so while computing surprise we have access to the surprise for each feature. As an added note, the surprise for each independent image as expressed in the surprise map may look like a poor quality saliency map. However, what is surprising in each image frame is strongly affected by the images that precede it. This causes the spatial surprise component to be less visible in the output shown. Once surprise has been determined for each model (color, Gabor orientations, etc.) in the surprise system, each image in the RSVP sequence is automatically labeled in terms of surprise statistics (Figure 4.2). If desired, the resolution of surprise can be as fine grained as individual pixels. This can, as mentioned, be further parsed along each feature type within the surprise system. However, advantage may be gained by using broader aggregate descriptions of surprise. That is, for each RSVP image, we here obtain a single maximum or minimum value for surprise. This is simply the maximum or minimum surprise value over all the model results for a frame. Maximum surprise is a useful metric if attention uses a maximum selector method (Didday, 1976) based on surprise statistics. The maximum valued location of surprise should be the location in the image most likely to be attended.
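As an illustration of how such aggregate, per-frame descriptors might be computed from the surprise maps, the sketch below extracts the five basic statistic types named above (maximum, minimum, mean, standard deviation, and the spatial offset of the maximum between consecutive frames). The array shapes and function names are hypothetical; the real system computes these per feature channel and per scale.

```python
import numpy as np

def frame_statistics(surprise_maps):
    """surprise_maps: list of 2-D arrays, one combined surprise map per frame.

    Returns one dict of summary statistics per frame: max, min, mean, std,
    and the Euclidean offset (in map pixels) of the maximum-surprise location
    from the previous frame's maximum.
    """
    stats, prev_peak = [], None
    for smap in surprise_maps:
        peak = np.unravel_index(np.argmax(smap), smap.shape)
        offset = (np.hypot(peak[0] - prev_peak[0], peak[1] - prev_peak[1])
                  if prev_peak is not None else 0.0)
        stats.append({
            "max": float(smap.max()),
            "min": float(smap.min()),
            "mean": float(smap.mean()),
            "std": float(smap.std()),
            "peak_offset": float(offset),
        })
        prev_peak = peak
    return stats

# Example with random stand-in maps for a 20-frame sequence
maps = [np.random.rand(16, 24) for _ in range(20)]
for row in frame_statistics(maps)[:3]:
    print(row)
```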
By using this metric, we can assess the likelihood that attended regions will overlap in space from frame to frame increasing their potential interaction (e.g. masking or enhancement). The overlap from this metric can only be assumed based on proximity. That is, the closer a maximum location is between frames, the more likely two surprising locations are to overlap. However, as a spatial metric this is somewhat limiting as it does not give us a metric of the true spatial extent and distribution of surprise within a frame, for that we use the mean and standard deviation of surprise. So, within a frame, we use the point of maximum surprise to figure the attended location, but mean and standard deviation are used as a between image statistic to show the power and tendency to overlap that indicates to what extent natural images are competing (or cooperating) with each other. The three statistics (Maximum, Mean and Standard Deviation) combined; ideally will yield a basic metric of the attentional capture an image in a sequence exhibits. Additionally, it will allow us to consider whether attention capture is competitive (Keysers & Perrett, 2002). It is then useful to analyze surprise by its overlap between images. We do this because attention capture may be more parasitic along similar feature dimensions or along more proximal spatial alignment. Put another way, interaction or interference between locations in image frames may be stronger if they are spatially overlapping or their feature statistics are highly similar. Closer spatial, temporal and feature proximity between two locations naturally lends itself to more interaction. 4.2.3 Performance on RSVP and its Relationship with Surprise 143 The psychophysical experiments used here to exercise the system were described in detail in (Einhäuser et al., 2007b). In brief: for the first experiment, eight human observers were shown 1000 RSVP sequences of 20 natural images. Half of the sequences contained a single target image, which was a picture of an animal. The observers’ task was to indicate whether a sequence contained an animal or not. The speed of presentation at 20 Hz, yielded a comparably low performance, which was, however, still far above chance with a desirable abundance of misses (no target reported despite being present) as compared to false alarms (target reported despite not being present). This large number of missed target sequences is particularly convenient when we account for the sequences in which no subject is able to spot the target, 29 in all. This is a large enough set to allow for statistical contrast with sequences which subjects perform superior on. Here we refine the utility of observer responses to assign a difficulty to all of the 500 target-present sequences. If one or none of the eight observers detected the target in a particular sequence, then it is classified as hard. If seven or eight of the human observers correctly spotted the target, the sequence is classified as easy. All other sequences are classified as intermediate. That is, if the target in a sequence of images can be spotted more often, then something about that sequence makes it easier. Note that this definition is different from (Einhäuser et al., 2007b), where only targets detected by either all or none of the observers were analyzed. By the present definition, there were 73 hard, 188 easy, and 239 intermediate sequences. 
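The partition into easy, intermediate and hard sequences described above amounts to simple thresholding of per-sequence hit counts; a small sketch of that bookkeeping is given below. The input format (a hit count out of eight observers per target-present sequence) is an assumption made for illustration.

```python
def label_difficulty(hits, n_observers=8):
    """Label one target-present sequence by how many observers spotted the target.

    hard:         0 or 1 of 8 observers detected the target
    easy:         7 or 8 of 8 observers detected the target
    intermediate: everything in between
    """
    if hits <= 1:
        return "hard"
    if hits >= n_observers - 1:
        return "easy"
    return "intermediate"

# e.g. per-sequence hit counts out of 8 observers
counts = [0, 3, 8, 7, 1, 5]
print([label_difficulty(h) for h in counts])
# -> ['hard', 'intermediate', 'easy', 'easy', 'hard', 'intermediate']
```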
As will be seen, partitioning image sequences in such a way yields a visible contrast between sequences labeled as easy and those labeled as hard. For visibility and clarity of illustration, intermediate sequences will be omitted from the graphs of the initial analysis, but will be included again in the neural network model that follows. This is reasonable since the intermediate sequences yield no additional insight into the workings of surprise, but instead make the important aspects illustrated more difficult to visually discern for the reader.

4.3 Results

Initial experiments showed a cause and effect relationship between surprise levels in image sequences and the performance of observers on the RSVP task. Specifically, as can be seen in the example in figure 4.3, flanking a target image with images that register higher surprise makes the task more difficult for observers. Experimental support for a causal role of surprise is provided by experiment 2 in (Einhäuser et al., 2007b), in which re-ordering image sequences according to simple global surprise statistics significantly affects recall performance. To improve the power of prediction and augmentation in an RSVP task, we need to analyze the individual features to see if they yield helpful information beyond the combined surprise values. This is because it has remained open whether prediction performance could be improved further by extracting refined spatio-temporal measures from the surprise data. In particular, does a feature-specific surprise system, rather than an aggregate measure over all features, further enhance prediction? Optimal feature biasing in a search (Navalpakkam & Itti, 2007) or feature priming might lend more power to certain types of features.

Figure 4.3: (A) Mean surprise produced by each image in an example RSVP sequence is shown. The sequence on the top is easy and all 8 of 8 observers manage to spot the animal target. By rearranging the image order, we can create a hard condition in which none of the eight observers spot the animal target. This was achieved by placing images in an order which produced surprise peaks both before and after the target image at +/- 50 ms, with relative surprise dips at +/- 100 ms, creating a strong “M” shape in the plot centered at the target (we also refer to this as a “W” shape in the case of easier sequences, or M-W for both). Note that the strength in the graphs shown is in the gain and not just the absolute value. A slight downward slope is observed since surprise tends to relax over time given similar images. (B) The M-W shape can be seen more clearly when the gain of mean surprise for all hard and easy sequences is observed.

Table 4.1: The significance of the gain of surprise for the target frame over the flanking frames is presented. For each feature type and the combined saliency map, the mean of the two flanking images is combined and compared against the mean of the target. From the two means, the t value is presented. For P values which are non-significant, the field is left blank. It is notable that the mean, standard deviation and spatial offset of surprise spike at the target frame for easy sequences. This applies for most feature types, which seems to contribute to the M-W shape.

Analysis was carried out on all image sets which contained an animal target (“target-present” sequences). From each image in each sequence, surprise statistics were extracted for each feature channel, such as color opponents and Gabor wavelet line responses.
A mean surprise is computed by taking the average of the surprise values from all locations within a frame for any given feature channel. The mean surprise of individual Gabor wavelet channels exhibits the “W” or “M” shaped interference from flanking images to the target (Figure 4.4) that was also characteristic of the full system. The W-M shape is itself very similar to forms seen in masking studies as a bi-modal enhancement of Type-B masking functions (Weisstein & Haber, 1965, Cox & Dember, 1972, Reeves, 1982, Breitmeyer & Öğmen, 2006). The important thing to notice is that the “W” pattern is seen with target enhancement while the “M” pattern is seen with target masking. The significance of the M-W jumps for each feature type is presented in table 4.1. For most feature types, and in particular for easy sequences, mean surprise jumps significantly for the target frame when compared with the flanking frames. As can be seen, Gabor surprise is increased before and after the target for “hard” as compared to “easy” sequences (Figure 4.4). That is, surprise produced by edges in the image interferes with target detection. However, the difference in surprise between easy and hard targets is only significant for vertical orientations (P < .005 in both flanking images; Figure 4.4C) and diagonal orientations (for the image following the target, P < .01 and .005 respectively; Figures 4.4B and 4.4D) at +/- 1 frame (i.e. +/- 50 ms) from the target frame. The vertical orientation results clearly show a stronger significant probability. Horizontal orientations do not exhibit significant effects (P >= .05 for any image in the sequence; Figure 4.4A). We can also see that there is significance in the red-green color opponents (for the image following the target, P < .01; Figure 4.4E; see Appendix G for color opponent computation details). It is notable that the effect for vertical lines not only holds for the immediate flankers of the target, but also for images that precede the target by as much as 250 ms (P < .005 for vertical and diagonal orientations). Conversely, there is evidence for long-term backward enhancement at 250 ms following the target, which is significant for the 135 degree orientation channel (P < .005; Figure 4.4D). Thus, within the sequences viewed by the observers, vertical line statistics, when plotted and analyzed, seem to stand out more when observers are watching for animals among natural scenes. Interestingly, the significant jumps in surprise at +/- 250 ms illustrate that the effects of surprise seem too prolonged to be constrained to early visual cortex (Schmolesky, Wang, Hanes, Thompson, Leutgeb, Schall & Leventhal, 1998). Using the same analysis as for figure 4.4, a similar pattern can also be seen in figure 4.5 with the standard deviation of surprise within an image. However, the significance is concentrated on the target image itself. That is, if surprise values form a strong peak, which gives a greater spread of surprise values in an image, then observers are more likely to find the target easy to spot. Again, the vertical line statistics yield a smaller probability value (for the target image, P < .001; figure 4.5C) than the horizontal line statistics (for the image after the target, P < .05; figure 4.5A), and the red-green color opponents are significant as well (P < .05 for the target image and .01 for the image following the target; figure 4.5E).
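The flanker-versus-target comparisons reported in table 4.1 and figures 4.4-4.5 can be reproduced in outline with a paired t-test per feature channel and a Bonferroni correction for the number of tests, roughly as sketched below. The data layout (one mean-surprise value per frame, per sequence, per feature channel) and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

def target_vs_flankers(mean_surprise, target_idx, n_tests):
    """mean_surprise: array (n_sequences, n_frames) of per-frame mean surprise
    for one feature channel; target_idx: frame index of the target.

    Compares the average of the two flanking frames against the target frame
    with a paired t-test, Bonferroni-corrected for n_tests comparisons.
    """
    flankers = 0.5 * (mean_surprise[:, target_idx - 1] +
                      mean_surprise[:, target_idx + 1])
    target = mean_surprise[:, target_idx]
    t, p = ttest_rel(target, flankers)
    return t, min(1.0, p * n_tests)  # Bonferroni-corrected p value

# Stand-in data: 100 easy sequences, 20 frames, target at frame 10
rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.2, size=(100, 20))
data[:, 10] += 0.15            # simulate a surprise jump at the target
print(target_vs_flankers(data, 10, n_tests=8))
```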
148 Figure 4.4 Mean surprise for each frame per feature type is shown for RSVP sequences which observers found easy v. sequences observers found hard. All sequences have been aligned similar to figure 4.3 for analysis so that the animal target offset is frame 0. D is labeled to show the properties of all the sub-graphs in figure 4.4. At -250 ms we see strong masking within this feature since surprise is very high for difficult sequences. -100 ms is the lag-1 sparing interval and as such we should expect to see no effect at this frame. We see at +/- 50 ms strong masking by elevated surprise. The effect at this interval is the most common. At +250 we see forward priming or enhancement since surprise is higher for this frame in easy sequences. All plots show the characteristic “W” and “M” shapes centered at the target, but are most visible in C and D. However, there is obviously more power for vertical orientation features. Additionally, the Red/Green opponent channel gives M-W results like the Gabor feature channels. However, Blue/Yellow does not exhibit the M-W pattern (at least not significantly). Oddly, it is notable that surprise is significantly different at both +/- 250 ms (five frames), far beyond what would be expected for simple bottom-up feature effects. Also, the lag sparing is seen in all sub-graphs at -100 ms, even in C highlighting the strength of the effect. Error bars have been Bonferroni corrected for multiple t-tests. 149 Figure 4.4 continued 150 Figure 4.5: The standard deviation of surprise per frame like the mean, also seems to be significant for hard v. easy image sequences with the M-W pattern. However, significance seems more concentrated on the target image rather than the distracter images. Again, vertical lines are more prominent. Error bars have been Bonferroni corrected for multiple t-tests. 151 Figure 4.6: Here the mean spatial offset of maximum surprise from one frame to the next (illustrated in figure 4.2) is shown for both easy and hard sequences. Frame -5 (-250 ms) is left out since it is always zero in a differential measurement. An M-W pattern is visible, but much less so than for figure 4.4 and 4.5. It is notable that in three of the feature types, easy sequence targets have maximum surprise peaks further in distance from the peaks in the flanking images than hard sequences do. Again, like with mean and standard deviation of surprise, the greatest power seems to be in the vertical orientation statistics. Red/Green color results for space seem to be inverted suggesting that it has priming spatial effects to some extent even though strong mean red/green surprise blocks as seen in figure 4.4. Error bars have been Bonferroni corrected for multiple t-tests. 152 That the target image benefits most from a high standard deviation of surprise is due to strong peaks in surprise which give a wider spread of values. That is, the target image may draw more attention by having basic large peaks in surprise, while masking images gain from having a broader (larger area) generally high value. In more basic terms, attention is drawn to stronger singular surprise values, while it is blocked more effectively by flanking images with a wider surprising area. One may speculate that this reflects the locality of attention: wider surprise surfaces have a better chance to overlap with the target and are thus more likely to mask it. This hypothesis is borne out by observing the spatial information from the surprise metric. 
If the point of maximum surprise in the target image is offset more from the points of maximum surprise in the flanking images, observers find the target easier to spot (Figure 4.6). Thus, to summarize, a more peaked surprise, with more spatial offset in the target image seems to aid in the RSVP task. Additionally, different types of features, within a given sequence, may have more power than others. However, the Red/Green color opponent acts differently in this particular metric. It appears that it exhibits some attentional capture within the target image itself directed away from the animal target. That is, the red/green channel shows distraction behavior within the target image itself if it is not blocked by the flanking frames. 4.4 A Neural Network Model to Predict RSVP Performance So far we have considered statistical measures pooled over many sequences and obtained a hypothesis on how surprise modulates recall. If the hypothesis holds true, a basic computational system should be able to predict observer performance on the RSVP 153 task. Ideally, the system predicts, given a sequence of images, how well observers will detect and report them. Additionally, the system can be agnostic to whether or not a target image is in a sequence to determine if a sequence might have intrinsic memorability for any image for our given sequences. To do this, we added a test where we increased the difficulty for the system by testing to see if it can predict the difficulty of a sequence without knowing which image in the sequence contains the target. We added this as a comparison to our standard system which has knowledge of which frame contains the target. This also allows us to test the importance of knowing the target offset, which the evidence suggests is quite helpful in the prediction of observer performance. Training our algorithm to predict performance on the RSVP task required three primary steps. Each step is outlined below. These include the initial gathering of human observer data on the RSVP task, processing of the image sequences to obtain surprise values, and training using back propagation on a feed-forward neural network, which attempts to give an answer as to how difficult a given sequence should be for human observers. 4.4.1 Data Collection RSVP data were collected from two different experiments where image sequences were shown to eight observers (Einhäuser et al., 2007b). From the two experiments, we obtained a total of 866 unique sequences of images which were randomly split into three balanced equal sized sets of 288 sequences for training, testing and final validation (two sequences were randomly dropped to maintain equal sizing for all three sets). As mentioned, each sequence could be given a rank based on how many observers correctly 154 identified the target as being present in the sequence. The rank of difficulty was a number from zero to eight, rather than just easy, intermediate and hard. This matched how many observers managed to spot the animal target. This is used, as a training value, where the output of a network is its prediction as to how many observers in a new group of eight should correctly identify the target present in a given sequence. The control image sequences, the ones without animal targets, were not included for training since we are testing a computational model of bottom-up attention, which has no ability to actually identify an animal. 
Including non-target sequences in the system would only yield a prediction of how likely an observer is to spot a non-target, which from an attentional standpoint is the same as asking an observer to spot the target.

4.4.2 Surprise Analysis

Each sequence was run through an algorithm which extracted the statistics of surprise or, for comparison, contrast. The contrast model (Parkhurst & Niebur, 2003), a common statistical metric which measures local variance in image intensity over 16x16 image patches, was used as a baseline measure. Thus, we created and trained two network models with the same image sets, one for contrast and one for surprise. The statistics were obtained by running the same set of 866 sequences through each of the systems (surprise or contrast), which returned statistics per frame.

4.4.3 Training Using a Neural Network

Training was accomplished using a standard back-propagation trained feed-forward neural network (Figure 4.7). The network was trained using gradient descent with momentum (Plaut, Nowlan & Hinton, 1986). The input was first reduced in dimension to about 1/3 the original size, from 528 features (listed in figure 4.7) to 170, using Principal Component Analysis (PCA) derived from the training set (Jollife, 1986). The input layer to the neural network had 170 nodes and the first hidden layer had 200 nodes. This was a number arrived at by trying different sizes in increments of 50 neurons (the validation set was not used in any way during fitting). A network with 150 hidden units has a very similar, but slightly larger, error, while 250 neurons has an error which is a few percentage points larger. A second hidden layer had three nodes. For comparison, a two-layer network model was also tested. The output layer for all networks had one node. Tangent sigmoid (hyperbolic tangent) transfer functions were used for the hidden layers. The output layer had a standard linear transfer function. Training was carried out for 20,000 epochs, which is sufficient for the model to converge without overfitting. Training, testing and validation were done using three disjoint random sets with 288 sequences each. The training set was used for training the network model. The testing set augmented the training set to test different meta-parameters (e.g. number of neurons, PCA components). Only after all meta-parameter tuning was concluded was the validation set used to evaluate performance. All results shown are from the final validation set, which was not used during training or for adjusting any of the meta-parameters (network size, PCA components, etc.).

Figure 4.7: The network uses the mean, standard deviation and spatial offset of maximum surprise per feature type (the same statistics illustrated in figures 4.4-4.6) for each image in a given RSVP sequence. This is taken from 22 input frames along 8 different feature types. The features are then reduced in dimension using PCA. The output is the judgment about how difficult an image sequence should be for human observers.

As noted, the final output from the neural network was a number from zero to eight which represented how many observers it predicted should be able to correctly identify the target in the input sequence. That is, if four of the eight observers registered a ‘hit’ on that sequence, then the target value was four. A single scalar output (a number 0-8) is used because it seems to exhibit resistance to noise in our system compared with several different network types with eight or nine outputs (e.g.
a binary code using a normalized maximum). To keep the output from the neural network consistent, the output value was clamped (bounded) and rounded so that the final value given was forced to be an integer between zero and eight. 4.4.4 Validation and Results of the Neural Network Performance Performance evaluation used a validation set with 288 sequences, which had not been used at any time during training or parameter setting. The idea was to try and 157 predict how observers should perform on the RSVP task. This was done by running the validation set and obtaining general predicted performance of observers on the RSVP task. As mentioned, this was a number from zero to eight for each image sequence, depending on how many people the trained network expected to correctly recognize the presence of a target animal in the sequence. To make this a meaningful output, we needed to define a bound for testing the null hypothesis. In this case, we should see that the network and Surprise account for a large proportion of RSVP performance. The null hypothesis would predict that information prior to image sequence processing would be at least as sufficient as surprise for the task. To do this, we transformed the sequence prediction into a probability. The idea was to ascertain how well the network would perform if it was asked to tell us for each image whether it expected an observer to correctly identify a target. Basically, we will compare the results of different systems, such as Surprise to that of an ideal predictor. So for instance, if the network output the number six for a sequence, then 75% of the time we would predict that an observer should notice the target (six out of eight times). Error is specified from this as how often the network will predict a hit compared with how often it should be expected to if it is an ideal predictor. This gives us an error with an applied utility since it can be used directly to measure how often our network should correctly predict subject performance. An RMSE (Root Mean Square Error) gives us a good metric for network performance, but it does not provide a direct prediction of expected error against subjects in this case. 158 Figure 4.8: An error metric is derived as how well we should be able to predict a 9th new observer’s RSVP performance for the image sequences we already have data for. We compute the predicted hits for each image from the output of the neural network based on the difficulty rank it gave to each sequence. The final error is the difference between the network prediction and that of an ideal predictor. To compute error (see figure 4.8), we find the empirical error given the network output difficulty rating and how many hits should be expected given how the network classified the sequence’s difficulty. This gives us an error of expected hits by the network compared with expected hits given the actual target value for all trials in the validation set. Here, the neural network performance is compared with that of the ideal predictor. As an example, for all trials in which 4 out of 8 observers correctly spotted the target animal (a difficulty of 4), our most reasonable expectation is that observers should have a 50% chance of spotting the target. Ideally, if the neural network predictor is accurate, then it 159 should also predict that subjects will hit on those targets 50% of the time. 
If, however, it rates some of the “50%” targets as “100%” (gives them a difficulty of 8), then the model will be too optimistic and predict a greater than 50% chance of hits for the “50%” targets. To do this (Figure 4.8), the output from the network is sorted into an $n \times m$ confusion matrix of hits where the rows represent the expected prior target output t and the columns represent the actual network output y. Thus, if a target in a sequence had six out of eight hits from the original human observers, this would give it a difficulty, and a target value, of six. If the neural network then ranked it as a five, then it would incrementally go into cell bin $(i,j) = (5,6)$, where j is the prior observation of how easy this target was for subjects:

(4.1) $0 \le i \le 8; \quad 0 \le j \le 8$

Expected hits by the network per cell are given by the product of the number of sequences binned in a cell $N^{Y}_{ij}$ and the hit probability corresponding to the network-assigned difficulty, where the hit probability for a difficulty class j is defined as:

(4.2) $P(j) = \frac{j}{8}; \quad 0 \le P(j) \le 1$

Symbolically we will also assume that:

(4.3) $P(i) = P(j) \leftrightarrow i = j$

$N^{Y}_{ij}$ is how many sequences were binned by the network into each cell bin of the confusion matrix N at cell i,j. Summing the columns of the expected hits gives the number of expected hits from the neural network given the prior determined difficulty class j and the network determined difficulty i. Note that this must be less than the total number of trials, which is 288.

(4.4) $P^{Y}(hits \mid j) = \sum_{i=0}^{8} N^{Y}_{ij} \cdot P(i); \quad 0 \le P^{Y}(hits \mid j) \le 288$

Also note that the sum of the expected hits must be less than (or equal to) the total number of trials:

(4.5) $0 \le \sum_{j} P^{Y}(hits \mid j) \le 288$

For comparison, we take the prior target data and compute its expected values given the difficulty as well:

(4.6) $P^{T}(hits \mid j) = N^{T}_{j} \cdot P(j); \quad 0 \le P^{T}(hits \mid j) \le 288$

A final sum error is derived by taking the difference between the number of expected hits as determined by the neural network y (4.4) and the expected number of hits from the real targets t (4.6). That difference is divided by the total number of trials, which yields the final empirical error E: the number of expected hits in error divided by the total number of trials.

(4.7) $E = \frac{\sum_{j} \left| P^{Y}(hits \mid j) - P^{T}(hits \mid j) \right|}{\sum_{j} N^{T}_{j}}$

If E is 0, then the system being tested performs the same as an ideal predictor. It should be noted that the ideal predictor in this case is highly ideal, since it is assumed to have perfect knowledge of sequence difficulty. Baseline conditions for testing the null hypothesis were created by generating different sets $N^{\phi}$. The condition $N^{naive}$ was created by putting the same number of hits into each bin. Thus, it is the output one would expect given a random uniform guess by the neural network. $N^{bayes}$ was created by binning hits at the mean (the mean number of hits for all sequences, which is 5.44 out of 8). That is, in the absence of additional information, we will tend to guess each sequence difficulty as the mean difficulty. This is particularly important since neural networks can learn to guess the mean value when they cannot understand how to produce the target value from the input features. It should be noted that $N^{bayes}$ is not naïve. It still uses prior information to determine the most likely performance for sequences in the absence of better evidence. It is used as a null hypothesis since it is the best guess for observer performance prior to image processing, but knowing the summary distribution of observer performance.
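A compact numerical sketch of the empirical error of equations 4.1-4.7, including the naive and Bayes baseline bins, is given below. The per-class absolute difference and the uniform/mean binning follow the description above; anything beyond that (array layout, random stand-in data, variable names) is an assumption for illustration.

```python
import numpy as np

N_CLASSES = 9  # difficulty ranks 0..8 (number of observers who spotted the target)

def empirical_error(confusion):
    """confusion[i, j]: sequences the network rated difficulty i whose true
    (prior) difficulty was j. Returns E as in eq. 4.7."""
    p = np.arange(N_CLASSES) / 8.0                      # eq. 4.2: P(j) = j/8
    pred_hits = (confusion * p[:, None]).sum(axis=0)    # eq. 4.4: sum_i N_ij * P(i)
    true_hits = confusion.sum(axis=0) * p               # eq. 4.6: N^T_j * P(j)
    return np.abs(pred_hits - true_hits).sum() / confusion.sum()  # eq. 4.7

def bin_counts(network_out, true_rank):
    """Build the confusion matrix N^Y from network ratings and true ranks."""
    m = np.zeros((N_CLASSES, N_CLASSES))
    for i, j in zip(network_out, true_rank):
        m[int(round(i)), int(j)] += 1
    return m

# Baselines: a uniform random guesser (naive) and a constant guess at the mean rank (bayes).
rng = np.random.default_rng(1)
true_rank = rng.integers(0, 9, size=288)
naive_out = rng.integers(0, 9, size=288)                     # N^naive
bayes_out = np.full(288, int(round(true_rank.mean())))       # N^bayes

print("naive E:", round(empirical_error(bin_counts(naive_out, true_rank)), 3))
print("bayes E:", round(empirical_error(bin_counts(bayes_out, true_rank)), 3))
```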
Additionally, it presents a strong challenge to any models performance since as a maximum likelihood estimate; it has the potential to do very well depending on the nature of observer performance variance. In addition to the surprise system, a contrast system was also trained in the same way. A range of error was computed by training both the contrast and surprise systems 25 times and computing the final validation error each time. This gives a representation of the error range produced by the initial random weight sets in the neural network training. The error is the standard 5% error over the mean and variance of the validation set error for 25 runs. It should be noted that we attempted to maintain as close a parity as possible between the training of the contrast system and the surprise system. However, there is enough difference between the two systems such that there is room for broader interpretation as to performance comparison. Since, we introduce the contrast system as a general metric and this paper is not a model comparison paper the results should not be interpreted as a critique of it. 162 Figure 4.9: (A) is the expected error for prediction in performance for the RSVP task. Error bars represent the expected range of performance for the neural networks trained on the task. Both the surprise system and the contrast model perform much better than the null hypothesis would predict. Combining the image surprise data and contrast data does not yield better results. (B) shows that a three-layer network performs slightly better than a two-layer network. Additionally, as expected, when the frame offset for the target is provided to the network, performance increases further since it can now look for the characteristic M-W patterns in the flanking images as seen in figures 4.4-4.6. Note that (A) and (B) show the same type of information and can be directly compared. Figure 4.9 shows the expected error from both the contrast system and surprise system compared with the worst possible predictions. It can be seen that the surprise system (77.1% correct, error bars shown in figure) performs better than the contrast system (74.1% correct) and that the differences are significant within the bounds of the neural network variance. The surprise and contrast data can be combined by feeding both data sets in parallel into the network. If performance of the two combined is better than 163 the surprise system by itself, then it suggests that the contrast system contains extra useful information which the surprise system alone does not have. However, the combined model (76.6% correct) does not perform better than the original surprise system. Its performance appears slightly worse, but is not significantly different than the surprise system alone. Both the contrast and surprise systems perform much better than the worst possible Naïve model (67.1% correct) and noticeably better than the null hypothesis Bayes model (71.7% correct). It is important to note that while neither the surprise nor contrast systems perform perfectly, they should not be expected to do so since they have no notion of the difficulty of actually recognizing the particular animal in a target image. They are only judging difficulty based on simple image statistics for attention in each sequence. Additionally, as mentioned, the task was made more difficult by withholding knowledge of the position of the target in the sequence, from the training system. 
It can be seen that when the target frame is known, the model performs at its best, 79.5% correct. We can put this performance in perspective by using the Naïve model as a baseline. We compute a standard performance increase $f'_{bayes}$ as the difference between the performance of the Bayes model $f_{bayes}$ and that of the Naïve model $f_{naive}$:

(4.8) $f'_{bayes} = f_{bayes} - f_{naive}$

The same metric for the Surprise system is:

(4.9) $f'_{surprise} = f_{surprise} - f_{naive}$

The gain of the Surprise system over the Bayes model is computed as:

(4.10) $g_{surprise} = \frac{f'_{surprise}}{f'_{bayes}}$

This tells us that the performance increase of the Surprise system is 2.7 times that of the Bayes model.

4.5 Discussion

The results shown in figures 4.3-4.6 are not unexpected given that the M-W curves closely resemble the timing and shape of both visual masking (Breitmeyer & Öğmen, 2006) and RSVP attentional blink study results (Raymond et al., 1992, Chun & Potter, 1995, Keysers & Perrett, 2002, Einhäuser et al., 2007a). What is interesting is that surprise elegantly reveals the masking behavior in natural images. This allows efficient use of natural scenes in vision research since low-level bottom-up masking behavior can be accounted for and controlled. As an added note on computational efficiency, a single-threaded surprise system will process several images a second with current hardware. This is important since it has allowed us to extract surprise from thousands of sequences. Since the surprise system is open source and downloadable, other researchers should be able to take advantage of it to create RSVP sequences with a variety of masking scenarios. Ease of use has also been factored into the surprise system since it is built upon the widely used saliency code of Itti and Koch (2001). In addition to the basic metric of surprise, when we combine it with neural network learning, it predicts which RSVP sequences of natural images are difficult for observers and which are easy. This provides two major contributions. The first is that surprise has been shown to deliver a good metric for visual detection and recall experiments. Thus, a variety of vision experiments can be conducted which use surprise to gain insight into the intrinsic attentional statistics, especially for natural images. For instance, as we have done, surprise was successfully manipulated by re-ordering images in sequences to modulate recall (Einhäuser et al., 2007b). Here we have extended this work by creating a non-relative metric that can compare difficulty for sequences with disjoint sets of target and distracter images. That is, our original work was only able to tell us if a newly ordered sequence was relatively more difficult than its original ordering. The current work gives us a parametric and absolute measure based on how many observers should be able to spot a target. The second contribution is that a system can be created to predict performance on a variety of detection tasks. In this article, we focused on the specific RSVP task of animal detection. However, the stimulus which we used is general enough to suggest that the same procedure could be used for many other types of images and sequences.

4.5.1 Patterns of the Two-Stage Model

While the interaction between space, time and features is not completely understood, the interactions within these domains are well illustrated by our experimental results.
Surprise in flanking images aligned spatially and along feature dimensions block the perception of a target particularly in a window of +/- 50 ms around the target image. Additionally, the time course is consistent at least with respect to Gabor statistics, which show the M-W pattern of attentional capture and gating. However, we must note that this 166 study does not explicitly establish a constant time course for any image sequence, but there is some expectation of the time course to be bounded based on the reviewed masking, RSVP and attentional blink experiments. With significant peaks at widely divergent times, our results are compatible with the two-stage model described by (Chun & Potter, 1995) (for a conceptual diagram, see figure 4.10). In the ≈50 ms surrounding the target, it is easier for images with targets of higher surprise to directly interfere with each other. Since targets can both forward and backward mask at this interval, competition to pass the first stage is more direct. Already present targets can be suppressed by new more surprising ones or block less surprising targets from entering. Then we see a period of about 100 ms where the target image is in a sense safer, this is at the same point in time that lag sparing is observed in attentional blink studies. After that (>= 200 ms), the data shows that forward masking can block the perception of the target. Conversely, the data indicates that backwards masking does not occur at this interval, but instead we see enhancement. The results seen here might be a result of the manner in which the visual cortex gates and pipelines information and may be thought of as a two or three stage model seen in figure 4.10 (stages two and three may be the same stage, but we parse them here for illustrative purposes). (1) Very surprising stimuli capture attention from other less surprising stimuli at +/- 50 ms. Additionally, competition is spatially local allowing for some types of parallel attention to occur. That is, some image features can get past the first stage in parallel if they are far enough apart from each other. Parallelism, while not illustrated by the data presented here is expected based on the reviewed studies on split attention (Li et al., 2002, Rousselet et al., 2002, McMains & Somers, 2004, Rousselet et 167 al., 2004). (2) Stimuli is pipelined at 100 ms to the next stage of visual processing. This leads to a lag sparing effect or the safety period observed in figures 4.4 and 4.5. This is also observed in the literature we have reviewed which illustrates the lag-1 sparing in RSVP (Raymond et al., 1992) or a relaxed period of masking with a type-B function (Weisstein & Haber, 1965, Cox & Dember, 1972, Hogben & Di Lollo, 1972, Reeves, 1982, Breitmeyer, 1984, Breitmeyer & Ö ğmen, 2006 ). The pipeline may do further processing, but is highly efficient and most importantly avoids conflicts. (3) After 150 ms, a new processing bottleneck is encountered. Unlike (1) which is a strict attention gate meant to limit input into the visual system, we suggest as others have (Chun & Potter, 1995, Sperling et al., 2001) that (3) is a bottleneck by the fact of its intrinsic limited processing resources. Conflict can occur if an image that arrived first is not finished processing when a second image arrives. The situation is made more difficult for the second image if the first image has sequestered enough resources to prevent the second image from processing. 
The second image in may be integrated or blocked while at the same time enhancing the first image into the third stage. That is, an image or set of targets may dominate visual resources after ≈250 ms, causing new input stimuli into a subordinate role. This later stage may also resemble strongly several current theories of visual processing (Lamme, 2004, Dehaene, Changeux, Naccache, Sackur & Sergent, 2006) if it is thought of in terms of a global workspace or as visual short term memory (Shih & Sperling, 2002). To further refine what we have said, the first stage is selective for images based more strongly on surprise and is not strictly order-dependant, which is why figure 4.4 and 4.5 show the strong M-W pattern. The utility of this is perhaps to triage image 168 information so that the most important nuggets are transported to later stages which have limited processing capability. The third stage is selective based on more complex criteria such as order which is why we observe the asymmetry at +/- 250 ms in figure 4.4 and 4.5. However, evidence also suggests that if a second target is salient enough, it may be able to co-occupy later stages of visual processing with the first target allowing for detection of both targets in RSVP (Shih & Reeves, 2007). Importantly, the relationship between the first target and the second target appears to be asymmetric in the later stages, which was illustrated in figure 4.4 from the Surprise System at the +/- 250 ms extremes. Surprise and attention in the first stage does not necessitate that later stages do not also contain attention mechanisms. An attentional searchlight (Crick, 1984, Crick & Koch, 2003) theoretical model could allow for attention at many levels. That is, as information is processed, refined and sent to higher centers of processing, additional attention processing can allow a continuation of the refinement. Selection can be pushed from the bottom-up or pulled from top-down processes. Two stage models which place attention in a critical role in later stages have been proposed. For instance, (Ghorashi, Smilek & Di Lollo, 2007) suggest that lag sparing in RSVP attentional blink are due to a pause while the visual system is configured for search in latter stages of processing. While it is unclear if a pause causes lag sparing from the research presented, significance of specific features we have illustrated from surprise at +/- 250 ms might lend credence to feature specific attention in later stages, 169 4.5.2 Information Necessity, Attention Gating and Biological Relevance The results show that the gating of information in the visual cortex during RSVP may be based on a sound statistical mechanism. That is, target information which carries the most novel evidence is of higher value and is what is selected to be processed in a window of 100 ms. Target information separated by longer time intervals may also be indirectly affected by surprise because of interference in later stages of processing. This happens if more than one overlapping target is able to enter later visual processing at the same time. Since surprise effects are indirect at this stage, surprise may never be able to account for all RSVP performance effects. The mechanism for surprise in the human brain we believe is based on triage. Target information is held briefly during early processing allowing a second set of target features in the following stages to also be processed. 
Masking at the shortest interval happens when the two sets of image data are competed against each other. If the new target information is more surprising, it usurps the older target information (attention capture), otherwise it is blocked. Masking in this paradigm depends on competition within a visual pipeline. Its purpose is to create a short competition which may even introduce a delay for the sake of keeping the higher visual processes from becoming overloaded. After 50 ms, if target information is still intact, it moves beyond this staging area and out of the reach of competition from very new visual information. Masking and the loss of target detection after that point, might be based on how the visual system constructs an image from the data it has. After 150 ms, image data is lost in assembly by virtue of an integrative mechanism in a larger working memory or global workspace. 170 Interestingly, if surprise does account well for the type-B masking effects many have observed, it suggests that masking in the short interval is not a result of a flaw in the visual system, but rather it is the result of an internally evolved mechanism to directly filter irrelevant stimuli. That is, in the +/- 50 ms bounding a target, masking is an effect of an optimal statistical process. The absence of masking, if it was observed, would indicate that too much information is being processed by the visual system. 171 Figure 4.10 The top row is a hypothetical summary of figures 4.4 and 4.5. Flanking images block the target image based on increase in mean surprise within 100 ms and show the characteristic M-W pattern. A strong tail is observed as much as 250 ms prior. Notably, the flat portion at 100 ms is at the lag sparing interval for attentional blink. This suggests that a two-stage model similar to (Chun & Potter, 1995) is at work. In the first stage fierce competition creates a mask with the power of an images surprise. Wider masks are more likely to block information, while stronger surprise in an image is more likely to break through a prior blocking mask even if it is wide. In the later stages, parts of images which have passed the first stage are assembled into a coherent object or set of objects. Once the assembly is coherent enough, subordinate information from new images can help prime it and make it more distinguishable. However, information from the subordinate images will be absorbed if the second stage has reached a critical coherence. This can be summarized by stating that the effects from surprise in the first stage are a result of direct competition and blocking while the effects seen in the latter stages are a side effect of the result of priming, absorption or integration. 172 Figure 4.10 continued 173 4.5.3 Generalization of Results Returning to the topic of feed-forward neural networks, another facet to mention is on the limitations of generalization for networks such as the one we used here. Validation was carried out using images of the same type. As such, we do not necessarily know that it will perform the same for a completely different type of image set. As an example, one might imagine target pictures of rocks with automobile distracters. However, evidence suggests that at less than 150 ms only coarse coding is carried out by the visual system (Rousselet et al., 2002). Thus, we would expect generalization to other types of images so long as coarse information statistics are similar. 
Additionally, the input to the neural network was extremely general and contained a relative lack of strong identifying information. That is, we only used whole image statistics of mean, standard deviation and spatial offset for surprise. It is quite conceivable that images of other types of targets, and distracters, may very well exhibit the same basic statistics and, as a result, the neural network should generalize. 4.5.4 Comparison with Previous RSVP Model Prediction Work An important item to note here is the difference between the system we have presented and the one presented by Serre et al. (Serre, Oliva & Poggio, 2007). The primary difference is that our system is based on the temporal differentiation across images in a sequence whereas the model by Serre et al. is an instantaneous model. That is, it has no explicit notion of the procession of images in a sequence, which as we have shown has effect on the performance of RSVP. This is particularly true since the same 174 images in different order around the target can affect performance on the RSVP task (Einhäuser et al., 2007b). However, the RSVP task performed by the Serre model is sufficiently different, that any real cross-comparison between the two models is difficult on anything other than a superficial level. Indeed, the Serre model has much better target feature granularity, which perhaps gives it better insight into feature interactions within an image. Thus, one may be able to produce an even better RSVP predictor by combining the Serre model with the one we presented here, thereby exploiting both sequence and fine granularity aspects. 4.5.5 Network Performance It is also interesting to note that with the neural network, the contrast system performed almost as well as the surprise system. However, when the surprise and contrast data are combined, performance is not increased suggesting that the contrast system does not contain any useful new information that the surprise system does not already have. That is, given that the combined surprise/contrast system performs almost exactly the same as the surprise system by itself, and that the contrast system still performs much better than chance, suggests that the surprise system intrinsically contains the contrast system’s information (indeed, intensity contrast, computed with center-surround receptive fields, is one of the features of the surprise system which is similar but not the exact same thing as the contrast system). Otherwise, we should expect to see an improvement in the performance of the combined surprise contrast system. 175 Another network performance issue that should be mentioned is that in one condition, the neural network performance was hindered by forcing it to not know the target offset in each sequence. This was done to see if a sequence of images could be labeled with a general metric so that we could conceivably gauge the recall of any image in the sequence, not just the target. However, if knowing the target offset is a luxury one can be afforded, then we illustrated that prediction performance can be improved even further. This is to be expected given the M-W shape for surprise as seen in figure 4.10 where surprise from the flanking images yields the most information to RSVP performance on the target. 4.5.6 Applications of the Surprise System In addition to the theoretical implications of our results there are also applications which can be derived. 
For instance, the surprise analysis could be used to help predict recall in movies, commercials or – by adjustment of timescale - even magazine ads. Since it gives us a good metric of where covert attention is expected, it may also have utility in optimizing video compression to alter quality based on where humans are expected to notice things (Itti, 2004). As long as images are viewed in a sequence, the paradigm should hold. Even large static images, whose scanning requires multiple eye-movements, effectively provide a stimulus sequence in retinal coordinates and will be an interesting issue for further research. 176 4.6 Conclusion Surprise is a useful metric for predicting the performance of target recall on an RSVP task if natural images are used. Additionally, in past experiments, we have shown that this can be used to augment image sequences to change the performance of observers. An improved system of prediction can be made using a feed-forward neural network, which integrates spatial, temporal, and feature concerns in prediction, and can even be done so without knowing the offset of the target in the sequence. Type-B and spatial masking effects are revealed by surprise which provides further evidence that RSVP performance is related to masking and attentional capture. This also suggests that the masking and attentional capture are accounted for in a statistical process, perhaps as a triage mechanism for visual information in the first stage of a two-stage model of visual processing. 177 Chapter 5: Modeling of Attentional Gating using Statistical Surprise That not all visual information received by the eyes reaches human awareness has been known for quite some time (Raab, 1963, Weisstein & Haber, 1965, Reeves, 1980, Hoffman et al., 1983, Duncan, 1984, Chun & Potter, 1995, Mack & Rock, 1998, Itti & Koch, 2001b, Sperling et al., 2001, VanRullen & Koch, 2003a, Shih & Reeves, 2007). However, the mechanisms by which the brain selects visual content is not completely understood. The general consensus is that information arrives from the retina in a parallel fashion but is consolidated or edited in early visual centers to create a more serialized flow of information (Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b). Since features must be integrated to observe and process objects, serial processing has parallel components. An example is that to recognize a chair quickly, one needs to process parts of the corners, edges or other necessary visual features, to compose the chair’s elements with sufficient content for reconstruction. This is true whether speaking of strictly 3D assembly (Biederman et al., 1999) or hierarchical template assembly (Serre et al., 2007) of the visual representation. It has been suggested that an attentional gate can account for what items or features in an image are able to pass an attentional window into further processing by the visual system (Reeves & Sperling, 1986, Sperling & Weichselgartner, 1995, Cave, 1999, Di Lollo et al., 2000, Lamme & Roelfsema, 2000, Shih, 2000, Sperling et al., 2001, Keysers & Perrett, 2002, Crewther, Lawson & Crewther, 2007, Olivers & Meeter, 2008, Soto, Hodsol, Rotshtein & Humphreys, 2008). In such a model, the strength of attention 178 derived from visual features may determine the magnitude and duration of an attentional window. Additionally, the attention gate mechanism is hypothesized to be episodic and spatial so that it is selective as to both where and when items can pass. 
Further this has been related to RSVP (Rapid Serial Vision Presentation) performance and attentional capture (Shih & Reeves, 2007). Thus, an attention gate may account for what parts of individual images in a sequence may be detectable by observers. A somewhat recent trend has been to suggest that the brain and by relation the visual cortex operates on statistically optimal principles (Field, 1987, Olshausen & Field, 1996, Rao & Ballard, 1999, Geisler et al., 2001, Guyonneau, VanRullen & Thorpe, 2004, Itti & Baldi, 2006). However, the idea is not new (Attneave, 1954, Barlow, 1961). One may ponder whether the attention gate mechanisms which limit what visual information reaches later stages of processing are editing visual information in a manner to favor things which have a greater statistical likelihood of usefulness. That is, given a bottleneck or a bandwidth limit on visual processing (Duncan, 1984, Nakayama, 1990, Chun & Potter, 1995, Rousselet et al., 2004) the human brain may triage elements coming in from a visual scene by importance. Those parts of the scene that are richer in useful information should be selected more often than parts of the scene which are information poor. Previously, in chapter 4, we demonstrated the basis of such a method for information triage. Using a measure of statistical surprise, we were able to manipulate the order of the presentation of natural scene images in an RSVP paradigm to make a target image more difficult to detect for human observers (Einhäuser et al., 2007b, Mundhenk et al., 2009). In our study, we used a very basic metric of the mean surprise for an image. 179 By buttressing a target image in order with images of higher surprise, the target image was made more difficult to spot. The distracter images were made more statistically informative and thus had a greater attention capture based on information triage. The question has however remained, to what degree can surprise account for how much target information is gated? Additionally, can we observe what parts of a scene are being edited out and which parts are not? Figure 5.1: Mean surprise for image frames in the animal target RSVP sequences is shown. For the frames right before and right after the target image, a high mean surprise leads to difficulty in detection of the target in the target frame. Here, “hard” image sequences are extremely difficult since none of the eight observers are able to detect the target. Easy sequences on the other hand have targets which are always detectable. 180 5.1 From Surprise to Attentional Gating Our motivation for the current model is based in part on the observation that attentional capture, the ability of items in a scene to steal the spotlight of attention, occurs both within an image (in space) and between images (in time) that flank each other in a sequence (Shih, 2000, Mundhenk et al., 2009). In our prior research, we found that when human observers are shown an RSVP sequence of natural images, the target images in the sequence can be of a varying degree of difficulty to detect. Some are in fact so Figure 5.2: Images from an RSVP stream have a certain amount of surprise which describes what visual information in an image frame can be gated to the next stage of processing. This competition happens both in space and time. Thus, the attention capture in one frame can keep items in a flanking frame from passing. Hypothetically, an object can be recognized if there is sufficient visual information to pass the gate during viewing. 
5.1 From Surprise to Attentional Gating

Our motivation for the current model is based in part on the observation that attentional capture, the ability of items in a scene to steal the spotlight of attention, occurs both within an image (in space) and between images (in time) that flank each other in a sequence (Shih, 2000, Mundhenk et al., 2009). In our prior research, we found that when human observers are shown an RSVP sequence of natural images, the target images in the sequence can be of a varying degree of difficulty to detect. Some are in fact so difficult that observers are never able to spot the target. The opposite is also true; there are many sequences where the observers are always able to spot the target. When one inspects the mean surprise of images in the sequences which are hard and compares them with the sequences which are easy, it becomes apparent that the images flanking the target are much more surprising when the sequence is in fact hard (Mundhenk et al., 2009) (Figure 5.1).

Figure 5.2: Images from an RSVP stream have a certain amount of surprise which describes what visual information in an image frame can be gated to the next stage of processing. This competition happens both in space and time. Thus, the attention capture in one frame can keep items in a flanking frame from passing. Hypothetically, an object can be recognized if there is sufficient visual information to pass the gate during viewing. (Note: the information gated images shown are results from the attention gate surprise system on an easy RSVP sequence with animal targets.)

Going beyond correlation, we showed a causal link between surprise and observer performance (Einhäuser et al., 2007b). Placing images which are more surprising before and after an easy target makes the target significantly more difficult to detect. This suggested that the surprise system discussed in chapter 4 (Itti & Baldi, 2005, Itti & Baldi, 2006, Mundhenk et al., 2009) could be further enhanced if we could directly show what parts of the target image are being taken away by the images flanking the target. Ideally, attention capture by the flankers should overlap with the target more when it is difficult to detect.

Our model of attentional gating is conceptually derived from (Sperling et al., 2001), but adds to it the idea that a region in any given image has a certain amount of attention capture as given by its statistical surprise. However, this is a model of bottom-up attention, so the cue in our paradigm comes from the intrinsic information in an image. The more surprise, and thus the more attention capture, that is exhibited, the better able a region is to pass through (or control) the attentional gate to the next stages of visual processing. As a gate, if a region of the visual field is passing visual information to later processing, then that particular set of information should be the only collection to pass through the gate's window. Thus, attentional capture works both ways. A target can steal attention for itself and subvert another target at the same spatial location in a prior frame, but that target itself is also prone to being subverted in the same way. An illustration of this can be seen in figure 5.2.

Figure 5.3: Image frames in an RSVP sequence are seen by observers at 20 Hz (50 ms each). The surprise system takes in the same set of images and generates an attention gate map for each image in the sequence. The map is then a representation of what parts of the image should pass through the attention gate to further processing. On the right, this is shown for an easy transportation target. It is clear that much of the visual information about the target passes the attention gate. This is quantified on the bottom right by showing the overlap of the ground-truth for the target and the attention gate.

A logical method for testing an attentional gate is to see if there is any difference in the amount of target information that can pass through for hard RSVP sequences when compared with easy sequences (Figure 5.3). If the model is valid, then much more visual information pertaining to the target image in an RSVP sequence should be passing through the attention gate when a target is easy to spot than when it is difficult to spot. The rationale is that a far more constricted attention gate lets through less information, which gives visual recognition centers less material to use.
However, too large an attention gate may be detrimental since it may allow more irrelevant information to pass through. As such, a good metric should account for the direct overlap between the target and the attention gate.

5.2 Methods

5.2.1 Paradigm

For comparing the computer model with human performance, data from two psychophysical experiments on target detection during rapid serial visual presentation (RSVP) were used. This is essentially the same as the method used in chapter 4. All stimuli were based on the photographic database introduced for RSVP tasks earlier [http://visionlab.ece.uiuc.edu/datasets.html (Li et al., 2002)], which consists of 1123 outdoor scenes (distracters), 1323 animal targets and 763 "means-of-transportation" targets. Observers were instructed to decide at the end of each 1-s sequence of 20 images presented at 20 Hz whether or not the sequence had contained a target, either an animal (excluding humans) or a "means-of-transportation". The animal-detection data were acquired during an earlier study (Einhäuser et al., 2007b): In a first experiment, eight observers viewed 1000 sequences each, of which 500 contained one animal target. In a second experiment, another eight observers viewed those sequences in which all eight initial observers had correctly detected a target, and 3 modified versions that contained the same target and distracters but in different orders. The "means-of-transportation" data, which have not been published elsewhere, were recorded with similar methodology: Four observers with normal or corrected-to-normal vision participated (3 male, 1 female, age 26-34). Over two sessions, each observer viewed a total of 1500 image sequences, of which 750 contained a means-of-transportation target. As in the initial animal experiment, no target was used more than once per observer, and in no case was any distracter repeated within a sequence. In addition to the mere forced-choice yes/no response, observers simultaneously rated their confidence on a 3-point scale (not used for the present analysis). Stimuli had a resolution of 384x256 pixels, spanned 18°x12° of visual angle, and were presented in a dark room on a FlexScan F77S (EIZO, Hakusan, Ishikawa, Japan) 19.7" CRT monitor with 100 Hz refresh rate at 48 cm distance from the observer. All procedures were in accordance with national and institutional guidelines and the Declaration of Helsinki.

5.2.2 Computational Methods

All target sequences (500 animal and 500 transportation) were separated into three groups based upon human observer performance.
• If all observers were able to spot a target in a given sequence, it was labeled as easy (animal: N=122, transportation: N=116).
• If all observers missed a target in a given sequence, it was labeled hard (animal: N=29, transportation: N=92).
• All other sequences were labeled as intermediate (animal: N=349, transportation: N=292).
For the sequences with animal targets, this meant that an easy target was spotted by all eight of the eight participants in that group. For the transportation targets, an easy sequence had four out of the four participants spotting it. This is a method we have used in previous work (Einhäuser et al., 2007b) and is covered in chapter 4. For visibility, intermediate sequences are not shown in all the results that follow, but are put aside. From the easy/hard sequences we created balanced ground-truth sets. For each target image in the ground-truth set, the target was carefully traced using the Gimp paint program.
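A minimal Python sketch of the grouping rule and set balancing described above follows. The detection counts, sequence objects and pool names are hypothetical placeholders; the original work was done with custom scripts.

import random

def label_sequence(num_hits, num_observers):
    # Easy: every observer spotted the target; hard: no observer did.
    if num_hits == num_observers:
        return "easy"
    if num_hits == 0:
        return "hard"
    return "intermediate"

# Balance the ground-truth set against the smallest class (the 29 hard animal sequences).
easy_pool = [s for s in animal_sequences if label_sequence(s.hits, 8) == "easy"]
hard_pool = [s for s in animal_sequences if label_sequence(s.hits, 8) == "hard"]
ground_truth = random.sample(easy_pool, len(hard_pool)) + hard_pool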
The smallest set is that of the hard animal target sequences, of which there are only 29. To balance this, 29 easy sequences were randomly selected and added to the ground-truth set. For the transportation targets, we chose a similarly sized set of 30 easy and 30 hard sequences at random. In all, 118 ground-truth sequences were created.

The 118 ground-truth sequences were run through the surprise system, which returned an attention gate mask. This mask is the prediction of the surprise system covered in chapter 4 as to what parts of the target image will pass through the early stages of cortical visual processing. The surprise system was run with sensitivity to several features. These are intensities; 0, 45, 90 and 135 degree Gabor wavelet measured orientations; red/green and blue/yellow color opponents; long range lines (Gilbert et al., 1996); as well as X, T and L junctions. At the level of feature selection, the surprise system uses the same methods as the saliency algorithm of Itti and Koch (2001). Further details can be seen in the source code, which is GPL open source and downloadable (http://ilab.usc.edu/toolkit/). Also see appendices B and C for mathematical details about surprise.

The attention gate is computed as a measure of attentional capture in a given frame. The raw attentional capture is taken from the surprise map for a frame that is computed by the surprise system (Itti & Baldi, 2005, Itti & Baldi, 2006, Einhäuser et al., 2007b, Mundhenk et al., 2009). There are two stages to taking a visual input and producing an output from the attention gate (Figure 5.4). The first stage stops input at the gate itself as a blocking mechanism set up by the images that preceded the input image. The second stage implements attentional capture against the input image from the image that comes afterwards. Thus, the complete attention gate accounts for blocking by images that come both before and after the current frame.

Figure 5.4: Each frame in an RSVP sequence has a map of surprise computed for it. This is the input to the attention gate mechanism. Attention capture from the frame that came before it as well as capture from the frame that follows is subtracted from it to give the image frame's final gated portion.

The attention gate G is a layer of leaky integrator neurons with a very strong leak. Memory is very short term and does not last much more than 50 to 100 ms. The input to the attention gate G is the surprise conspicuity map image S given by the surprise system. The derivation of S is described in great detail in (Itti & Baldi, 2005, Itti & Baldi, 2006, Mundhenk et al., 2009) as well as in appendix B, but can be thought of as a Bayesian information outlier detector. If a location in S is greater than G, then the attention gate takes the value in S at that location. Thus, the attention gate G′ is created by a max operator between the input surprise map and itself:

(5.1)   G′ = max(S, G)

The attention gate map M is the map image which shows the mask of what visual information will be able to pass to later processing stages. It is defined as:

(5.2)   M = clamp(S − G)

Here the clamp operator sets any location in the map M to zero if it is less than zero. This makes sure that all locations in M have a positive value. What we can see is that the attention gate map M is each location in the input surprise map S minus each location in the attention gate G. As a result, only image locations with a surprise stronger than the gate itself can pass through and not be blocked.

(5.1) and (5.2) cover blocking of incoming image information by image frames that came before it. To account for attentional capture from an image frame that follows, the attention gate map M must be held for a single frame, or about 50 ms:

(5.3)   M′ = clamp(M − M_next)

The final output map from the attention gate is given as M′. This is the attention gate map M held for 50 ms, with the attention gate map of the frame that follows (M_next) subtracted from it. To recap, the important thing to notice is that the attention gate map has attention subtracted by frames that come before and after it, so that it is only positive in locations where its surprise value is superior to the frames which flank it.
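The per-frame update of (5.1)–(5.3), together with the thresholding described next, can be sketched in a few lines of Python using numpy. This is a simplified illustration only: the strong leak on G between frames and other implementation details of the actual C++ surprise code are omitted, and the surprise maps are assumed to be scaled to the range 0–1.

import numpy as np

def attention_gate(surprise_maps, threshold=0.125):
    # surprise_maps: list of 2-D float arrays S (one per RSVP frame), values in 0..1.
    # Returns one boolean pass-through mask per frame, following (5.1)-(5.3).
    G = np.zeros_like(surprise_maps[0])
    M = []
    for S in surprise_maps:
        M.append(np.clip(S - G, 0.0, None))   # (5.2): what the preceding frames did not block
        G = np.maximum(S, G)                  # (5.1): the gate keeps the stronger value
    masks = []
    for t in range(len(M)):
        M_next = M[t + 1] if t + 1 < len(M) else np.zeros_like(M[t])
        M_out = np.clip(M[t] - M_next, 0.0, None)   # (5.3): capture by the following frame
        masks.append(M_out >= threshold)            # 12.5% threshold described in the text
    return masks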
The attention gate output image is an 8-bit gray scale image. This was thresholded at a luminance of 12.5% gray (1/8th) to create a bit mask. Thus, all attention gate pixels with an intensity ranging from 100% pure white to 12.5% gray are considered to be open and letting that target image information pass. This threshold was decided upon a priori, before any of the 118 ground-truth images were analyzed. A set of Matlab scripts was used to compare the ground-truth targets with the attention gate output. The most basic comparison is to compute the union between the ground-truth mask image and the attention gate image and find the ratio of overlap between the two. Since both images are bit masks, this is trivially done.

5.3 Results

The area of the ground-truth for easy targets is on average larger than it is for hard targets. Easy transportation targets are 60% larger (t(58) = 2.817; p < 0.005) and easy animal targets are 36% larger (t(56) = 2.007; p < 0.05). Additionally, the surprise computed attention gate itself tends to be larger for easy targets: for easy transportation targets it is on average 82% larger (t(58) = 4.096; p < 0.001) and for easy animal targets it is on average 48% larger (t(56) = 2.722; p < 0.01) than hard target attention gates. To adjust for the inherently larger size of easy targets and to obtain a true metric of target pass-through, we used a method to measure the overlap between the attention gate and ground-truth. Notably, this takes into account the target size. This is accomplished by computing an overlap ratio between the attention gate map and the ground-truth image bitmaps (Figure 5.5).

Figure 5.5: For each target image a ground-truth image T was created by hand. This allows a direct comparison between the target image itself and the attention gate which shows what gets through. In this case, an easy animal target is shown. The bottom row shows the components of the overlap ratio. The number of black (true) pixels in T ∩ M is divided by the number of black pixels in T ∪ M to create the overlap ratio. All results and experimental material can be downloaded from: http://ilab.usc.edu/wiki/index.php/Attention_Gate.

The scalar number n functionally computed from the number of pixels overlapping jointly between the ground-truth image T and the attention gate map image M is given as:

(5.4)   n(T ∩ M)

Then the disjoint set is given as the pixels which either do or do not overlap. If the two image bitmaps T and M are overlapped, this is the count of all pixels which are true in either map, without double counting overlapping pixels.
This is given as:

(5.5)   n(T ∪ M)

The overlap ratio is then the ratio of the joint set to the disjoint set:

(5.6)   r = n(T ∩ M) / n(T ∪ M) = n(T ∩ M) / ( n(T) + n(M) − n(T ∩ M) ) = n(T ∩ M) / ( n(T ∩ ¬M) + n(¬T ∩ M) + n(T ∩ M) )

Here r is the overlapping ratio between the final target attention gate map M, which is the output from (5.3), and the ground-truth bitmap image T. Note that r ranges from 0 to 1. Thus, the area where the target and attention gate overlap is divided by the total disjoint area. This ratio controls for the size of either the attention gate or ground-truth since the ratio itself can only increase if there is a much greater amount of overlap than non-overlap. If and only if both images completely overlap and perfectly match, then r is 1 [e.g. n(T ∩ M) = n(T ∪ M)]. If and only if the two sets are completely mutually exclusive with no overlap, then r is 0.

Figure 5.6: On the left, the mean ratio of the attention gate to the target is shown for each frame including the target. For both animal and transportation targets, if a target is hard to detect, it will overlap with the attention gate about the same as it would with the attention gate in any other frame. However, easy targets overlap with the attention gate by more than twice as much and show a noticeable peak above the baseline. On the right, the distribution of ratios at the target frame for easy and hard sequences can be seen. It is obvious that there are two distinct distributions and that most hard sequences are bunched on the left with an overlap r of less than 25%, while very few easy sequences have an overlap r of less than 20%.

The results for comparing the overlap can be seen in figure 5.6. The average overlap between attention gate and ground-truth for easy sequences is more than twice as large as the overlapping area for hard sequences for both transportation targets (t(58) = 6.089; p < 0.001) and animal targets (t(56) = 4.617; p < 0.001). Thus, on average, 36-42% of the attention gate directly overlaps easy targets while only 17% overlaps hard targets.

Figure 5.7: On the left, transportation targets which are easy or hard are shown with their overlapping attention gate. It is observable that the easy attention gates are larger than the attention gates for hard sequences. Also, many of the easy targets can be discerned through the attention gate on inspection. However, results can vary. On the right are the original transportation targets. It is noticeable that the easy targets are more vivid than the hard targets, which is part of what the surprise system keys off of.

On inspection of the results for the target images (Figures 5.7 and 5.8), one can see that the attention gate for easy targets is visibly much larger and the target itself is much more prominent. This is to be expected if the attention gate hypothesis is true, since a larger proportion of usable visual features will pass through, increasing the chance that the target will be identifiable. However, the results are not perfect. Some easy targets are still blocked out while some hard targets are prominently visible beyond the attention gate.

Figure 5.8: These are results like those in figure 5.7. Here animal targets are shown rather than transportation targets.
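The overlap ratio r of (5.4)–(5.6) used throughout these results is a simple intersection-over-union of two bit masks; a minimal Python sketch, assuming numpy boolean arrays for the ground-truth T and the thresholded attention gate M:

import numpy as np

def overlap_ratio(T, M):
    # T, M: boolean masks of equal shape (ground truth and thresholded attention gate).
    joint = np.logical_and(T, M).sum()     # (5.4): n(T intersect M)
    disjoint = np.logical_or(T, M).sum()   # (5.5): n(T union M)
    return 0.0 if disjoint == 0 else joint / disjoint   # (5.6): r in [0, 1]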
5.3.1 Relation of Results to Previous Studies Which Showed Causal Links between Surprise and Target Detection

In our previous study (Einhäuser et al., 2007b) we were able to show a causal link between surprise and target detection in RSVP. This was done in two stages. First we established which sequences were easy for participants; then we took those particular sequences and made them more difficult by rearranging the flanking images so that the mean surprise for the target was made lower while the mean surprise for the flankers was made higher. The idea was to show that attentional capture by flanking images drives performance on the RSVP task we used. The new set of easy sequences made harder was significantly more difficult for participants. However, as a puzzle, some of the new harder sequences (21 out of 122) stayed easy while only a few others went from being easy to being completely difficult (6 out of 122). It was not precisely known at the time why this pattern emerged from the mean surprise based method we used, and it suggested a limitation of the model. Analysis within the attention gate paradigm seems to explain why this is so.

For this analysis we created two equal sized sets of 20 sequences. In the first, we placed sequences which were easy but became hard after we had rearranged the flanking images. However, to create a set size large enough for statistical analysis, this group includes sequences in which one or two observers were able to detect the target. We then created an equal size set of easy sequences which stayed easy. This was done by randomly dropping one of the 21 sequences. This created 20 easy to easy and 20 easy to hard sequences.

Targets which were easy but became hard were on the very low end of overlap ratio (Figure 5.9). The overlap, which was still significantly higher than that of the baseline hard sequences, was nudged into no longer being significantly different. That is, easy sequences which became hard after rearranging the flanking images went from having borderline small attention gate overlap to significantly small attention gate overlap. Easy sequences which stayed easy, on the other hand, started with a mean overlap almost the same as typical baseline easy sequences and stayed within the range of standard error for the easy sequences.

Figure 5.9: On the left, the baseline overlap r between animal targets and attention gate is the same as in figure 5.6. Next to it in the middle is the overlap from sequences where a new arrangement of the flanking images made the target hard to detect. The right column shows the overlap where changing the arrangement did not make the targets more difficult to detect. While reducing surprise indeed had its intended effect of making sequences more difficult to detect, the largest effect was seen when an easy sequence was borderline difficult in overlap.

As such, easy sequences which became hard were borderline easy and were nudged into the realm of hard targets, while easy targets which stayed easy were prototypically easy and were not nudged far enough to be within the realm of prototypical hard sequences. Thus, the attention gate hypothesis seems to explain the results of the previous study, particularly if targets become difficult to detect when sufficient visual information is choked off. Easy targets which became hard had just the right amount of visual information passing through the attention gate to satisfy detection requirements in the prior condition, but not afterwards.

5.4 Discussion

The results reinforce the attention gate hypothesis and illustrate the viability of using surprise to simulate it. Additionally, the latter point adds weight to the notion that the brain uses statistically optimal methods for processing visual information.
Finally, the much higher overlap between ground-truth and attention gate suggests that the region passing through the attention gate is the one shown by the surprise system. This last item is important since, if it is true, then we have begun to have a window into consciousness. That is, by creating a realistic attention gate simulation, we know what visual items will be edited out and cannot reach awareness. However, this is only one gate in a much larger attention mechanism. Thus, we still do not know which items that get past this attention gate will be lost in later visual and even cognitive processing.

5.4.1 Variability of the Attention Gate Size Fits within the Paradigm

That some hard targets can pass through the attention gate is less of a problem than if some easy targets are completely blocked. This is because a hard target may be difficult to detect for reasons other than attention gating. That is, a target may pass the attentional gate, but may still be inherently difficult to recognize. Thus, a failure to detect an object may occur at many stages of visual processing following the attention gate. This exception however cannot be applied to easy targets. Once a target has been blocked from further processing by the attentional gate, it should be dead. In both the transportation and animal easy targets, 3-7% of the targets are completely blocked by the attention gate. There may be one of several different causes for these errors.

(1) The system does not account for an important feature type which is salient in the target. For instance, the system does not account for the interaction of line orientations with colors, which conceivably might be important (Mullen et al., 2000).

(2) The system is not accounting for feature priming or task dependence which may be occurring (Navalpakkam & Itti, 2006). In this case, participants may be learning to cue off of feature statistics which are more common for certain targets. In the current system, no one feature is set to be primed and all are in a generic state. As an example, if participants are learning to prime their visual system for horizontal lines, which might be more common for transportation targets, they could have an enhanced ability to detect such targets. The surprise system can be primed, but given the time it takes to do so, this will be kept as a subject for future investigation.

(3) The system may simply need further optimization or improved math. Many of the feature detectors are well tuned, but may need yet more tuning. Further, one cannot expect a perfect fit of the human brain's own attention gate, particularly given inter-subject variability.

Supporting the hypothesis that only sufficient image information is needed for detection, we showed that easy sequences which became hard had a low amount of overlap between the target and the attention gate, but that the overlap was sufficient for enough target information to pass through the attention gate to further visual processing. The causal factor was to reduce the aperture of the attention gate enough to prevent the meager but adequate feature information from further passage. This suggests that a topic for future research would be to determine which target locations are being processed and consider those rather than the target as a whole. This way the aperture of the attention gate can be measured more directly. This could be done using methods where it is better understood how the features integrate in given target images and which ones are critical for target identification, such as vertex or contour deleted objects (Biederman & Cooper, 1991, Fiser, Biederman & Cooper, 1996).

5.4.2 The Attention Gate may Account for Some Split Attention Effects

In many visual tasks, observers are able to attend to more than one location at the same time, but with a diminishing return on performance (McMains & Somers, 2004, VanRullen, Carlson & Cavanagh, 2007). An attention gate may help account for this since it can encompass more than one quantal area at the same time, but it limits the area of the image that can pass through. Thus, recalling figures 5.6 and 5.7, it can be seen that multiple discrete locations can pass through in several images, but that others may have only one location passing through. That is, the attention gate contains continuous, but independent, windows which can be processed separately. The primary limit on the number of items that can be added to an image and detected may be set by the number of discrete attention gate windows that can be formed or by the number of targets that can pass through one window, as is illustrated in figure 5.10. In our case, the attention gate seems to rarely form more than four discrete windows, which would be consistent with the proposed upper limit on divided vision tasks (McMains & Somers, 2004). This hypothesis might be tested by creating single frame multiple target images for recall and seeing if the number of discrete attention gate windows corresponds with observer performance on the recall task. Thus, the number of discrete windows should line up with the number of objects which subjects are able to recall. Additionally, such an experiment should take into account the amount of overlap between the targets and the attention gate windows, as we have done here, since important components of the objects to be used may be blocked at the gate.

Figure 5.10: The attention gate for the image on the top will only allow information on one target to pass. As a result, only one target is detectable. However, in the bottom image as many as three targets could be detected given the attention gate.
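One way to operationalize the window count discussed above is to label the connected regions of the thresholded attention gate and count them. A brief Python sketch using scipy follows; this is a hypothetical illustration rather than part of the original analysis.

from scipy import ndimage

def count_gate_windows(gate_mask):
    # gate_mask: boolean array, True where the attention gate is open (>= the 12.5% threshold).
    # Each 4-connected blob of open pixels is treated as one discrete attention window.
    labeled, num_windows = ndimage.label(gate_mask)
    return num_windows

# Hypothetical check: compare num_windows with the number of targets observers can recall.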
5.4.3 Unifying Episodic Attention Gate Models with Saliency Maps

A notable item in the current work is that the attention gate model is a synthesis of the models proposed for episodic attention gating (Reeves & Sperling, 1986, Sperling & Weichselgartner, 1995, Sperling et al., 2001, Shih & Sperling, 2002) with the attention model of Itti and Koch (Itti, 2000, Itti et al., 2000, Itti & Koch, 2001b). Thus, the current attention gate may be said to unify the two models. Importantly, the attention gate contains spatial and temporal information pertinent to RSVP sequences and allows for the quantal regions of attention posited in episodic models. At the same time, these components are derived from the visual features and engine of the Itti and Koch model. The two components are melded via the computation of surprise as a dynamic mode of attention. The basic computation of surprise as a Gamma process is not unlike the temporal factors of the episodic models, which are also Gamma in nature.

Chapter 6: A Comparison of Surprise Methods and Models Using the Metric of Attention Gate (MAG)

In this chapter, we will compare different methods and models for computing surprise. The basic model of surprise uses a gamma/Poisson process to compute surprise.
This is a reasonable assumption, particularly given the fact that we are modeling neural spike trains as well as events over time. We can check the rationality of this assumption by testing a few other model types. For instance, we can test a Gaussian model instead and see if it produces better results. In addition to different statistical models, we will also try different feature types to see what features help the process. In the most basic model, we check for surprise over intensity, Gabor orientation filters and colors as computed using basic red/green and blue/yellow color opponents. On top of this, does it help to have surprise be sensitive to junctions, or might other models of color opponents help? Using other color models may be advantageous since it has been suggested that a color opponent based on CIE Lab color (MacAdam, 1935, Adams, 1942, McLaren, 1976) yields superior results (Frintrop, 2006) to the RGBY color method used by the current model.

6.1 The MAG Method for Comparison of Different Models

Each method and model can be compared based on its performance at creating a separation between easy and hard sequences in RSVP tasks, similar to what was discussed in chapter 5. To recap, human observers were shown several RSVP sequences with either an animal or transportation target. Of these sequences, some are very easy and all subjects seem to be able to spot them. However, some are extremely difficult and subjects are never able to spot the target when presented. We can account for a good deal of this performance difference in a model of attention gating. To do this, we measure the overlap r between the attention gate as computed from the surprise system and the ground truth for the target. A good model should be one which creates a good separation between easy and hard targets given the r overlap ratio. Thus, in basic terms, a superior model should be able to create the greatest difference between easy and hard targets. As such, hard targets should overlap very little (low r) and easy targets should overlap much more (high r). Put more simply, a better model should improve the results seen in chapter 5.

Figure 6.1: The model on the left appears to be better since there is less overlap between the frequency distributions of r for the two classes of easy and hard sequences. The less two distributions overlap, the better the results are in general. This is because it suggests that the two processes are more distinct.

The fitness of any given model or method can be judged as its ability to separate hard RSVP sequences from easy ones. This can be done in several ways. A very basic method is to directly count the overlap in the histogram, such as the one in figure 6.1. That is, using the basic Riemann sum (or similar analytical methods) over the graphs, we could compute the areas for the easy and hard distributions and divide by the overlapped area.
This method has the strength in that it makes a generalization of the data, but if the assumption of a normal distribution is completely incorrect, the results can be misleading. Fortunately, this method is robust for unimodal distributions even if they are not normal (Hayes, 1994, Bishop, 1995). Thus, so long as the distribution of data is not too exotic, this method should suffice and in general the data does not appear to be very unusual. 6.1.1 Fishers Linear Discriminant and Fitness Fishers Linear Discriminant (Fisher, 1936) [see also: (Bishop, 1995, Itti, 2000)] can be augmented to directly yield a probability of error given the linear boundary. Here we will discuss the discriminant and how it can be augmented to yield a normalized probability of error. To do this we first start with the basic Fisher criterion which is given as: (6.1) () 2 12 22 12 μμ σ σ − + 204 Notice that the value yielded by this equation will grow smaller if the two means 1 μ and 2 μ grow closer in value or if the variance gets larger. Thus, the further apart two distributions are and the less variance they have, the larger the number from this equation should be expected to become. A derivation can then be made from this that allows the probability of error given the two class distributions (Itti, 2000). This is given as: (6.2) () () 21 22 12 1 erfc 2 2 P error μμ σσ ⎛⎞ − ⎜⎟ = ⎜⎟ ⎜⎟ + ⎝⎠ Here erfc is the basic complementary error function defined as: (6.3) () () 2 2 erfc 1 erf t z z zedt π ∞ − ≡− = ∫ To use this, we compute the mean and standard deviation of overlap r between the attention gate and the ground truth targets per usual. The greater the distance between the means for hard and easy targets as well as the smaller the variance, the greater the separation is and thus the better the ability of the model to separate the two classes. Figure 6.2 gives a conceptual illustration of how this looks. 205 Figure 6.2: On the top graph, given two normal distributions, the P(error) is the area of the statistical overlap between the two. This area is given by (6.2). Notice that the further away the two classes get and the smaller their variance as captured in σ , the smaller P(error) gets. As such, the bottom graph shows how the fisher information gives a lower P(error) as the difference between the mean diff μ and standard deviation diff σ increases between two distributions. (See appendix H for figure graphing functions) 206 6.1.2 Data Sets Used The datasets used in all comparisons are the same as in chapter 5. These are the RSVP sequences where the target is either an animal or a transportation device. After analysis of observer performance 29 hard and 29 easy animal sequences were selected while 30 hard and 30 easy transportation sequences were selected. For each one, a ground truth was created allowing analysis for how much the attention gate overlaps with the target ground truth. This also creates two separate metrics for each stimulus type. As such, two fitness metrics will be presented for each method as well as a combined fitness metric, the Metric of Attention Gate (MAG), which represents the combined fitness over both types of targets. 6.1.2.1 Combining Performance Metrics for Multiple Data Sets to Create the MAG Each of the two data sets yields its own metric for P(error) so that one can get an idea of how each method works per data set. This is useful if a method does particularly wonderful on one data set, but quite awful on another. 
6.2 Comparison of Opponent Color Spaces using MAG

In this section we will discuss different color spaces as well as their fitness in the surprise model, using MAG to compare each one. The differences between the color spaces can be seen in figure 6.4. The three we will discuss are:

Figure 6.4: The color transformation of the three color models is shown using the "Gimp color chooser" as the base image. Each image to the right of the original shows the response of each output channel from the color space. Light regions correspond with a strong response. So, for instance, the Red/Green opponent is white in places in the image where it is red and black where it is green.

• iLab RGBY – This is the basic color opponent computation used by the current saliency implementation and described in (Foley, van Dam, Feiner & Hughes, 1990) and (Itti, 2000). It has been used in the standard saliency source since at least 2001 (Itti & Koch, 2001b).
• CIE Lab – This is the color space suggested by Frintrop (2006) for use in saliency computation. It computes two color opponents as well as a luminance channel. Appendix E contains details about how it is computed.
• iLab H2SV2 – This color space is a derivation of HSV (Smith, 1978) color space with the Hue channel split into color opponents via polar to Cartesian coordinate transformation (Mundhenk et al., 2005b). Details on H2SV2 color space are given in appendices F and G.

An important note going forward is that all of these color spaces have variants. For instance, there are many variations on CIE Lab, such as Hunter Lab (Hunter-Lab, 2008). In many cases, the various constants have been tweaked by different authors to obtain a variety of results. Thus, if one color space performs less optimally than another, there may be a variant which performs better. As such, none of the color spaces analyzed should be completely written off based on their performance here.

6.2.1 iLab RGBY

Figure 6.5: iLab RGBY components are shown here with three test images.

iLab RGBY (Figure 6.5) is the most basic color model to compute. Red, green and blue are directly computed from their RGB values in the original image. Yellow is computed as a composite of red and green. Color opponents are computed by subtracting one color channel from its opponent. So, for instance, red/green is computed by subtracting green from red. The RGBY method has its strength in its simplicity. However, the color opponents are not orthogonal. This is because the yellow channel contains information directly from the red and green color channels. For details on the conversion, see appendix E.
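A minimal sketch of RGBY-style opponents along the lines described above is given below. It follows the common formulation in the saliency literature; the exact normalization used in the iLab code is in appendix E and the toolkit source, so the details here should be treated as an assumption.

import numpy as np

def rgby_opponents(img):
    # img: float array of shape (H, W, 3) with R, G, B values in 0..1.
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = (r + g) / 2.0    # yellow as a composite of red and green
    rg = r - g           # red/green opponent
    by = b - y           # blue/yellow opponent
    return rg, by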
6.2.2 CIE Lab

Figure 6.6: CIE Lab components are shown here with three test images.

Figure 6.7: iLab H2SV2 components are shown here with three test images.

CIE Lab color should in theory be one of the best candidates for computing the saliency of colors. It has been directly calibrated to match human perception of color, and the response of each channel should match the magnitude of observed differences between colors. Additionally, it computes an orthogonal set of color opponents that take into account perceived saturation. However, it does not contrast as well as the other color spaces. Thus, looking at figure 6.6, one can see that colors in this space seem not to pop out as much and that the channel responses are lower in contrast. Additionally, the red/green opponent channel does not place red in the center of the maximum response. As such, the red/green channel may actually be more of a magenta/green channel.

Figure 6.8: The results for the three color spaces are shown for both animal and transportation targets. It can be seen that for both target types, iLab H2SV2 has the lowest error, followed by CIE Lab and iLab RGBY. As a baseline, a "no color" condition is used. This shows that even some color processing is better than none.

Table 6.1: Generally speaking, the H2SV2 color channel does the best and performs about 11% better than the baseline RGBY color channel. The baseline here for percent improved is the RGBY color channel.

6.2.3 iLab H2SV2

The iLab H2SV2 color channel (Figure 6.7) is based on the HSV color space. The idea is to break the H hue component into red/green and blue/yellow opponents. This is done by transforming the hue, which is in radial coordinates, into Cartesian coordinates. H2SV2 creates orthogonal color opponents as well as an orthogonal intensity. However, the separation of saturation may handicap this color space. For instance, if there is a blue/yellow opponent patch in the image, the blue/yellow channel may give the same response whether the colors are vivid or dull. Additionally, the saturation channel does not seem to correlate with observer performance, and as such it needs to be given a very small weight in the model.

6.2.4 MAG Comparison of Color Spaces

The three color spaces were tested using MAG and the results can be seen in figure 6.8 and table 6.1. Each one was tested with a uniform set of parameters and channels in the attention gate. All three were tested in the presence of the orientation channels as well as the junction channels. The channel sets given as inputs to the surprise system are CIOXTLE for iLab RGBY, HOXTLE for iLab H2SV2 and QOXTLE for CIE Lab. The no color condition used the parameters IOXTLE. H2SV2 color seems to produce the best results for both animal and transportation targets. However, the CIE Lab results are very similar and could possibly exceed the H2SV2 results with minor augmentation. This might include adjustment of the a parameter in Lab so that the red component aligns more directly with red, as both RGBY and H2SV2 do. The MAG score agrees with what is already visible: iLab H2SV2 scores 0.665, CIE Lab scores 0.64 and iLab RGBY scores 0.597.
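One plausible reading of the polar-to-Cartesian hue split behind H2SV2 (section 6.2.3) is sketched below. The actual H2SV2 definition is given in appendices F and G and in the toolkit source, so this is an illustrative assumption rather than the exact transform.

import math

def hue_to_opponents(h_degrees):
    # Split an HSV hue angle into two Cartesian opponent axes.
    # With 0 degrees taken as red, the cosine axis behaves roughly like a
    # red/green opponent and the sine axis roughly like a yellow/blue opponent.
    h = math.radians(h_degrees)
    rg = math.cos(h)
    by = math.sin(h)
    return rg, by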
Figure 6.9: The results are shown for the baseline model using all junction channels plus end-stops. In the upper right corner is the result when no junction channels are used. In the middle are implementations where only one channel was used. As can be seen, using any one of the junction channels helps compared with using none. The optimal results are obtained when all four are used at the same time.

Table 6.2: Each junction model is shown here sorted by MAG performance. Notice that the performance using all junctions is almost 28% better than the performance of the method which uses none of the junctions. The baseline for percent improved is the no junction condition.

6.3 Addition of Junction Feature Channels

The junction feature channel was implemented in the iLab Neuromorphic Toolkit by Vidhya Navalpakkam and is detailed in appendix D. It is unpublished work, so there is no other reference to it. In general, the junction channel allows selection of different kinds of channels in different combinations. One can, for instance, create a model that is only sensitive to T junctions, but not to X or L junctions. However, and more usefully, one can create channels for all four feature types: X junctions, T junctions, L junctions and end-stops. As an important note, the junction channels are not orthogonal. For instance, the L junction channel can still pick up T junctions. This might at first seem like an error. However, it is known that search for different types of junctions is inefficient (Wolfe & DiMase, 2003). This is a result which is also borne out by other research (Mundhenk & Itti, 2005). Thus, overlapping and non-orthogonal sensitivity to different types of junctions during search is probably the reasonable model to use. However, and importantly, we only make this assumption for visual search and do not imply it for object identification.

The junction channel improves performance significantly (Figure 6.9 and Table 6.2) for both types of RSVP targets using the attention gate model. While a performance increase is not unexpected, the amount of improvement was a pleasant surprise. Notably, even a single junction channel improves performance, but all four combined produce the largest increase of all. This also provides evidence that at the level of the attention gate, junctions may be included in the brain's own computation of attentional selection.

6.4 Comparison of Different Statistical Models

Figure 6.10: While many other statistical models were tried, the baseline Gamma/Poisson model performs far better than any of the others.

Table 6.3: The final MAG performance metric is shown for the different statistical models tried. The Gamma/Poisson model performs best and is 34% better than the next highest model, the outlier model.

To test the correctness of the Gamma/Poisson assumption, four other statistical models were created and compared with the baseline Gamma/Poisson model. These include:
• A Joint Gamma/Gaussian Model – This assumes the spatial component of surprise to be normally distributed, but the time component to be gamma distributed. The Bayesian probability is computed jointly between the two, which reduces the KL computation to a single combined result. Appendix C details the joint KL computation.
• A statistical outlier model – This is a single Poisson outlier model. It attempts to track jumps in data, but does not compute a KL metric. Thus, its major difference from the base Gamma/Poisson model is that it does not as directly measure information change.
• A Gaussian Model – This is similar to the baseline Gamma/Poisson model in that it computes a KL metric based on the Bayesian statistics from the incoming data. However, it assumes a Gaussian process rather than a Gamma/Poisson process.
• A Max Gamma/Poisson model – This model is the same as the baseline Gamma/Poisson model, but rather than taking the product across time scales, it takes the max.
• The baseline Gamma/Poisson Model – This is the baseline model. It assumes a Gamma/Poisson process and computes a KL metric to determine information change over time.

The results can be seen in figure 6.10 and table 6.3. The basic Gamma/Poisson model performs better than any of the other models. While this does not assure that the Gamma/Poisson model is in fact the optimal model, it does suggest that it is probably a better fit than many other models, since none of the alternatives outperformed it. Additionally, since neither the Gaussian model nor the joint Gamma/Gaussian model performed very well, we suggest that there is probably very little about surprise which is truly Gaussian in nature. However, we must emphasize the word "suggest" since some results may be implementation dependent.

6.5 Checking the Problem with Beta

The surprise model in its current implementation has some mathematical peculiarities which may need to be addressed. The most notable is that the beta hyperparameter is logistic and as such is asymptotically stable as time reaches infinity. In reality it approaches the stability point quite fast, meaning that for any video sequence of even short length, setting it constant may have the same effect as allowing it to float. This would be helpful since it would allow the surprise system to perform fewer computations. That is, we would be able to skip the beta update and just make it a constant.

6.5.1 Asymptotic Behavior of β

In the surprise model as it is usually defined with the Gamma/Poisson process, the β and β′ hyperparameters are asymptotic in their behavior and will approach a constant value as the number of iterations τ approaches infinity. The β update is given as:

(6.5)   β′ = ζ·β + 1

The asymptotic behavior is due to the fact that the update is logistic. That is, the rate of gain will eventually reach the rate of decay from the ζ term. This can be seen in figure 6.11. We can also see this by starting with a basic limit expression:

(6.6)   lim (τ→∞) ζ·β_τ + 1

Figure 6.11: Given a decay rate ζ of 0.7, β will approach its asymptotic value 3.333 very quickly. In this example, if each iteration was a movie frame, β would saturate in half a second.

To start to get an idea of where we will go with the current proof, imagine what happens as β gets large. The amount it loses in each iteration from the decay in ζ·β becomes larger. In fact, at a certain point, this loss is equal to 1. That is, the amount of loss due to the decay term is the same as the gain from adding one. Thus, the gain and loss will eventually reach a point of equilibrium as time approaches infinity. Now we will show the mathematical proof for this. This is easy to see by solving the difference equation for β′ using the decay constant ζ. If it is stable, then we can solve for a β where β = β′. What we want is a constant C which should represent the asymptotic value of β. So we will substitute β with C and solve:

(6.7)   ζ·C = C − 1

Which then becomes:

(6.8)   ζ − 1 = −1/C

thus:

(6.9)   lim (τ→∞) ζ·β_τ + 1 = C = 1/(1 − ζ)

Here C is the final stable constant value of β given the update from adding 1. Note that we can generalize this for any update factor u such that:

(6.10)   lim (τ→∞) ζ·β_τ + u = C_u = u/(1 − ζ)

What this means is that β will reach a stable asymptotic value for any positive real value of u, and we can compute it in closed form using the above solution for C in (6.10).
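The closed form in (6.9) and (6.10) is easy to verify numerically; a short Python sketch (the starting value of β here is an arbitrary assumption):

def beta_limit(zeta, u=1.0, iterations=50):
    # Iterate the update beta' = zeta*beta + u and compare with the closed form u/(1 - zeta).
    beta = 1.0
    for _ in range(iterations):
        beta = zeta * beta + u
    return beta, u / (1.0 - zeta)

print(beta_limit(0.7))   # both values approach 3.333, matching figure 6.11
print(beta_limit(0.5))   # approaches 2.0, the value cited for temporal surprise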
6.5.2 What Happens if We Fix the β Hyperparameter to a Constant Value?

The default value for ζ in the surprise implementation is 0.5 for temporal surprise and 0.7 for spatial surprise. Using this, we can compute the asymptotic value from (6.9). This gives us a constant value of β = 2 for temporal surprise and β = 3.333 for spatial surprise. Here we test the constant under two conditions:
• We fix β′ using the equation in (6.9). However, to create the proper spread between new and old β values, we allow β to be ζ·β′.
• We try fixing β to be an integer from 1 to 5. This way we can check to make sure that any arbitrary value for β has no guarantee of being optimal. This corresponds to decay rates ζ of 0, 0.5, 0.666, 0.75 and 0.8.

The β value is fixed right before the KL computation so that the spatial component of surprise can be computed on its own, since α and β are computed over a spatial pool. So we compute β as:

(6.11)   β = ζ·β′

And:

(6.12)   β′ = C

For temporal surprise this means that, strictly speaking:

(6.13)   β = ζ·C

Figure 6.12: The performance of the model is shown if β is fixed to one of five different values rather than incremented by 1. The performance is similar if β is fixed to the baseline, and even performs slightly better if it is set to 3. On the far right is the performance if the value of β is set by the equation in (6.9). It is exactly the same to within several decimal places of precision. For the constant condition C, β = 2.333 and β′ = 3.333.

Table 6.4: The MAG for different fixed values of β is shown compared to the baseline model on the top, where β is incremented each iteration. Setting β to C yields almost exactly the same MAG as the baseline (the baseline is in the top row). The bottom row shows the results from using the constants defined in equations 6.19 and 6.20.
Taking the KL computation: (6.14) ()( ) () ( ) () ()() ;, , ; , log log KL α βα γ λα β γ λα β α α β α α α βα β Γ′′ ′′ ′ ′ ′ =− + + + + − Ψ ′′ Γ We can replace two terms if β is constant: (6.15) 1 log C β β β ′ = 225 And: (6.16) 2 1 C β β β = − ′ This yields the new KL computation: (6.17) ()( ) () ( ) () ()() 12 ;, , ; , log KL C C ββ α γ λα β γ λα β α α α α α α Γ ′′ ′ ′ ′ =⋅ + + ⋅ + − Ψ ′ Γ This drops two divisions, one addition and a costly log operation. However, it should be remembered that the highest cost is in computation of the gamma Γ and digamma Ψ functions which still remain. We can further simplify this by assuming that β ‘=β. Under such an assumption, the KL dramatically simplifies to: (6.18) ()( ) () ( ) () ()() ;, , ; , log KL α γ λα β γ λα β α α α α Γ ′′′′ =+−Ψ ′ Γ This yields a MAG of 0.5845 which suggests that the β term components do perform a function that necessitates them to be different even if they take on a constant value. As such, we might not be able to simplify the computation much more than (6.17). 6.5.2.1 If the β Hyperparameter is Constant, Might it be Related to Some Other Principled Constant? The interesting thing to notice about β is that when it is fixed from the baseline decay rate, it approximates some common constants. So for instance: 226 (6.19) 3.3333 1.42857 2 2.3333 β β ′ == ≈ Also: (6.20) 2.3333 2 0.7 3.3333 2 β β ==≈ ′ As an experiment to see if the β terms may be approximating some unknown principled constants, we tried them in place giving the KL update: (6.21) ()( ) () () ( ) () ()() 2 ; , , ; , log 2 log 2 KL α γ λα βγ λα β αα α α α α α Γ ′′ ′ ′ ′ ′ =− + + + ⋅ + − Ψ ′ Γ This yields a superior MAG: 0.6896 and an improved P(error) for both animals: 0.179 and transportation targets: 0.127 over the baseline. 6.6 Method Performance Conclusion The results suggest that the current statistical model is probably a good fit compared with many other statistical models. However, the attention gate model as well as the surprise and even basic saliency model should be enhanced by the inclusion of the junction channels. As a caution, the junction channel implementation leaves a little to be desired and the end-stop detection is somewhat dubious since it does not seem to include a true aperture. For this reason, we have excluded calling the implementation a 227 junction/end-stop channel. A subject for future research is certainly to find out if a more elegant junction detector can be created which performs as well on the MAG score. It is also suggested to use another color model rather than the basic iLab RGBY color model. As had been suggested (Frintrop, 2006), CIE Lab color space outperforms iLab RGBY. Additionally, iLab H2SV2 performs the best. However, none of the color spaces analyzed seem perfect. CIE Lab does not appear to have proper alignment of color opponents in Red/Green and H2SV2 does not integrate saturation into its opponent channels. Further enhancement of the model may be obtained by creating a new Lab color space with properly aligned Red/Green color or by re-mixing saturation back into the H2SV2 color opponents. 228 References [1] Adams, E.Q. (1942). X-Z planes in the 1931 ICI system of Colorimetry. JOSA 32 (3), 168-173. [2] Amari, S., & Arbib, M.A. (1977). Competition and Cooperation in Neural Nets. In: J. Metzler (Ed.) Systems Neuroscience (pp. 119-165): Academic Press. [3] Arbib, M.A., & Mundhenk, T.N. (2005). Schizophrenia and the mirror system: an essay. Neuropsychologia, 43 (2), 268-280. [4] Attneave, F. (1954). 
Informational aspects of visual perception. Psychological Review, 61:183-193 [5] Barlow, H.B. (1961). Possible principles underlying the transformation of sensory messages. In: W.A. Rosenblith (Ed.) Sensory Communication (pp. 217-234). Cambridge MA: MIT Press. [6] Bell, A.J., & Sejnowski, T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7 (6), 1129-1159. [7] Ben-Shahar, O., & Zucker, S. (2004). Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation, 16, 445-476. [8] Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147. [9] Biederman, I., & Cooper, E.E. (1991). Priming Contour-deleted images: Evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23, 393- 419. [10] Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., & Fiser, J. (1999). Subordinate-level object classification reexamined. Psychological Research, 62, 131-153. [11] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press. [12] Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on Computational Learning Theory (pp. 144-152). 229 [13] Bracci, E., Centonze, D., Bernardi, & Calabresi, P. (2003). Voltage-dependant membrane potential oscillations of rat striatal fast-spiking interneurons. Journal of Physiology, 549 (1), 121-130. [14] Braun, J. (1999). On detection of salient contours. Spatial Vision, 12 (2), 211-225. [15] Breitmeyer, B.G. (1984). Visual Masking. New York: Oxford University Press. [16] Breitmeyer, B.G., & Ö ğmen, H. (2006). Visual Masking: Time Slices Through Conscious and Unconscious Vision. New York: Oxford University Press. [17] Broadbent, D.E. (1958). Perception and Communication. New York: Pergamon. [18] Bruce, N., & Tsotsos, J. (2006). Saliency based on information maximization. NIPS (pp. 155-162). [19] Burt, P.J., & Adelson, E.H. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31, 532-540. [20] Cave, K.R. (1999). The FeatureGate Model of Visual Selection. Psychological Research, 62, 182-194. [21] Choe, Y., & Miikkulainen, R. (2004). Contour integration and segmentation with self-organized lateral connections. Biological Cybernetics, 90, 75-88. [22] Choi, S.C., & Wette, R. (1969). Maximum likelihood estimation of the parameters of the Gamma Distribution and their bias. Technometrics, 11 (4), 683-690. [23] Chun, M.M., & Potter, M.C. (1995). A two-stage model for the multiple target detection in rapid serial visual presentation Journal of Experimental Psychology: Human Perception and Performance, 21, 109-127. [24] Cox, S., & Dember, W.N. (1972). U-Shaped Metacontrast Functions with a Detection Task. Journal of Experimental Psychology, 95, 327-333. [25] Crewther, D.P., Lawson, M.L., & Crewther, S.G. (2007). Global and local attention in the attentional blink. Journal of Vision, 7 (14), 1-12. [26] Crick, F. (1984). Function of the Thalamic Reticular Complex: The Searchlight Hypothesis. PNAS, 81, 4586-4590. [27] Crick, F., & Koch, C. (2003). A Framework for Consciousness. Nature Neuroscience, 6 (2), 119-126. 230 [28] Daw, N.W. (1967). Goldfish Retina: Organization for Simultaneous Color Contrast. Science, 158 (3803), 942-944. [29] Daw, N.W. (1968). 
Colour-coded ganglion cells in the goldfish retina: Extension of their receptive fields by means of new stimuli. Journal of Physiology, 197, 567-592. [30] Dehaene, S., Changeux, J.-P., Naccache, L., Sackur, J., & Sergent, C. (2006). Conscious, Preconscious, and Subliminal Processing: a Testable Taxonomy. Trends in Cognitive Sciences, 10 (5), 204-211. [31] Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maxmimum Likelyhood from incomplete data via Em algorithm. Journal of the Royal Statistical Society B, 39 (1), 1- 38. [32] Deutsch, J.A., & Deutsch, D. (1963). Attention: Some Theoretical Considerations. Psychological Review, 70, 80-90. [33] Di Lollo, V., Enns, J.T., & Rensink, R.A. (2000). Competition for Consciousness Among Visual Events: The Psychophysics of Reentrant Visual Processes. Journal of Experimental Psychology: General, 129 (4), 481-507. [34] Didday, R.L. (1976). A Model of Visuomotor Mechanisms in the Frog Optic Tectum Math. Biosci., 30, 169-180. [35] Dougherty, R.F., Koch, V.M., Brewer, A.A., Fischer, B., Modersitzki, J., & Wandell, B.A. (2003). Visual field representations and locations of visual areas V1/2/3 in human visual cortex. Journal of Vision, 3 (10), 586-598. [36] Duncan, J. (1984). Selective Attention and the Organization of Visual Information. Journal of Experimental Psychology: General, 113 (4), 501-517. [37] Duncan, J., Ward, R., & Shapiro, K. (1994). Direct measurement of attentional dwell time in human vision. Nature 369 (26), 313-315. [38] Durstewitz, D., Seamans, J.K., & Sejnowski, T.J. (2000). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. Journal of Neurophysiology, 83 (3), 1733-1750. [39] Einhäuser, W., Koch, C., & Makeig, S. (2007a). The duration of the attentional blink in natural scenes depends on stimulus category. Vision Research, 47, 597-607. [40] Einhäuser, W., Mundhenk, T.N., Baldi, P., Koch, C., & Itti, L. (2007b). A bottom-up model of spatial attention predicts human error patterns in rapid scene recognition. Journal of Vision, 7 (10), 1-13. 231 [41] Evans, K.K., & Treisman, A. (2005). Perception of objects in natural scenes: is it really attention free? Journal of Experimental Psychology: Human Perception and Performance, 31 (6), 1476-1492. [42] Fagg, A.H., & Arbib, M.A. (1998). Modeling perietal-premotor interactions in primate control of grasping. Neural Networks, 11 (1277-1303) [43] Felleman, D.J., & Van Essen, D.C. (1991). Distributed hierarchical processing in primate visual cortex. Cerebral Cortex, 1, 1-47. [44] Fellous, J.-M., & Arbib, M.A. (2005). Who Needs Emotions? The Brain Meets the Robot. Oxford: Oxford University Press. [45] Field, D.J. (1987). Relations Between the Statistics of Natural Images and the Response Properties of Cortical Cells JOSA-A, 4, 2379-2394. [46] Field, D.J., Hayes, A., & Hess, R.F. (1993). Contour integration by the human visual system: Evidence for local association field. Vision Research, 33 (2), 173-193. [47] Field, D.J., Hayes, A., & Hess, R.F. (2000). The roles of polarity and symmetry in the perceptual grouping of contour fragments. Spatial Vision, 13 (1), 51-66. [48] Fiser, J., Biederman, I., & Cooper, E.C. (1996). To what extent can matching algorithms based on direct outputs of spatial filters account for human object recognition? Spatial Vision, 10 (3), 237-271. [49] Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188. [50] Foley, J.D., van Dam, A., Feiner, S., & Hughes, J. (1990). 
Computer Graphics, Principles and Practice (2nd ed) New York: Addison-Wesley. [51] Freeman, E., Driver, J., Sagi, D., & Zhaoping, L. (2003). Top-down modulation of lateral interactions in early vision does attention affect integration of the whole or just perception of parts. Current Biology, 13, 985-989. [52] Frintrop, S. (2006). VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search. Lecture Notes in Computer Science, 3899. Berlin: Springer- Verlag. [53] Gao, D., Mahadevan, V., & Vasconcelos, N. (2008). On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision, 8 (7), 1- 18. 232 [54] Gao, W., & Goldman-Rakic, P.S. (2003). Selective modulation of excitatory and inhibitory microcircuits. PNAS, 100 (5), 2836-2841. [55] Geisler, W.S., Perry, J.S., Super, B.J., & Gallogly, D.P. (2001). Edge Co-occurrence in Natural Images Predicts Contour Grouping Performance Vision Research, 41, 711-724. [56] Ghorashi, S.M.S., Smilek, D., & Di Lollo, V. (2007). Visual Search is Postponed During the Attentional Blink Until the System is Suitably Reconfigured. Journal of Experimental Psychology: Human Perception and Performance 33 (1), 124-136. [57] Gibson, J.J. (1950). The perception of the visual world. Boston MA: Houghton Mifflin. [58] Gilbert, C.D. (1994). Circuitry, architecture and functional dynamics of visual cortex. In: G.R. Bock, & J.A. Goode (Eds.), Higher-order processing in the visual system (Ciba Foundation symposium 184) (pp. 35-62). Chichester: Wiley. [59] Gilbert, C.D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. PNAS, 93, 615-622. [60] Gilbert, C.D., Ito, M., Kapadia, M., & Westheimer, G. (2000). Interactions between attention context and learning in primary visual cortex. Vision Research, 40, 1217-1226. [61] Greenspan, H., Belongie, S., Goodman, R., Perona, P., Rakshit, S., & Anderson, C. (1994). Overcomplete steerable pyramid filters and rotation invariance. IEEE Computer Vision and Pattern Recognition (pp. 222-228). Seattle, WA. [62] Grigorescu, C., Petkov, N., & Westenberg, M.A. (2003). Contour detection based on non-classical receptive field inhibition. IEEE Transactions on image processing, 12 (7), 729-739. [63] Guy, G., & Medioni, G. (1993). Infering global perceptual contours from local features. CVPR (pp. 786-787). [64] Guyonneau, R., VanRullen, R., & Thorpe, S.J. (2004). Temporal Codes and Sparse Representations: A Key to Understanding Rapid Processing in the Visual System. Journal of Physiology - Paris, 98, 487-497. [65] Hadjikhani, N., Liu, A.K., Dale, A.M., Cavanagh, P., & Tootell, R.B.H. (1998). Retinotopy and color sensitivity in human visual cortical area V8. Nature Neuroscience, 1, 235-241. [66] Hayes, W.L. (1994). Statistics. Harcourt Brace College Publishers. 233 [67] Hemple, C.M., Hartman, K.H., Wang, X.-J., Turrigiano, G.G., & Nelson, S.B. (2000). Multiple forms of short-term plasticity at excitatory synapses in rat medial prefrontal cortex. Journal of Neurophysiology, 83, 3031-3041. [68] Henn, V., & Grüsser, O.J. (1968). The summation of excitation in the receptive field of movement sensitive neurons of the frog retina. Vision Research, 9, 57-69. [69] Hering, E. (1872). Sitzungsberichte der Kaiserlichen Akademie der Wissenschaften (Minutes of the meeting of the Imperial Academy of Sciences). Mathematisch– Naturwissenschaftliche Classe: K.-K. Hof- und Staatsdruckerei in Commission bei F. Tempsky. [70] Hess, R., & Field, D. (1999). 
Integration of contours, new insight. Trends in Cognitive Sciences, 3 (12), 480-486. [71] Hoffman, J.E., Nelson, B., & Houck, M.R. (1983). The Role of Attentional Resources in Automatic Detection. Cognitive Psychology, 15 (3), 379-410. [72] Hogben, J., & Di Lollo, V. (1972). Effects of Duration of Masking Stimulus and Dark Interval on the Detection of a Test Disk. Journal of Experimental Psychology, 95 (2), 245-250. [73] Holmes, G. (1917). Disturbances of vision by cerebral lesions British Journal of Optimology, 2, 353-384. [74] Holmes, G. (1945). The organization of the visual cortex in man. Proceedings of the Royal Society of London Series B, 132, 348-361. [75] Horton, J.C., & Hoyt, W.F. (1991). The Representation of the Visual Field in Human Striate Cortex: A Revision of the Classic Holmes Map. Arch Ophthalmol, 109 (816-824) [76] Hubel, D., & Weisel, T. (1977). Functional architecture of macaque monkey visual cortex. Proc R. Soc. London Ser B. 198, Jan-59 [77] Hubel, D.H., & Wiesel, T.N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. Journal of Comparative Neurology, 158, 267-294. [78] Hunter-Lab (2008). Hunter L,a,b Color Scale. http://www.hunterlab.com/appnotes/an08_96a.pdf. [79] Hyvärinen, A. (1999). Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks 10 (3), 626-634. 234 [80] Ingle, D., Schneider, G.E., Trevathen, C.B., & Held, R. (1967). Locating and identifying: Two modes of visual processing (a symposium). Psychologische Forschung, 31, 1-4. [81] Itti, L. (2000). Models of Bottom-Up and Top-Down Visual Attention. Computational Neuro Science, Doctor of Philosophy (p. 202). Pasadena, Ca: California Institute of Technology. [82] Itti, L. (2004). Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention. IEEE Transactions on Image Processing, 13 (10) [83] Itti, L., & Baldi, P. (2005). A principled approach to detecting surprising events in video. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 631-637). [84] Itti, L., & Baldi, P. (2006). Bayesian Surprise attracts human attention. Advances in Neural Information Processing Systems (NIPS), 19 (pp. 547-554): MIT Press. [85] Itti, L., & Koch, C. (2001a). Computational modeling of visual attention. Nature Neuroscience, 2 (3), 194-203. [86] Itti, L., & Koch, C. (2001b). Computational Modeling of Visual Attention. Nature Reviews Neuroscience, 2 (3), 194-203. [87] Itti, L., Koch, C., & Braun, J. (2000). Revisiting Spatial Vision. Towards a Unifying Model. JOSA-A, 17 (11), 1899-1917. [88] Itti, L., Koch, C., & Niebur, E. (1998). A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20 (11), 1254-1259. [89] Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31 (3), 264-323. [90] James, W. (1890). The Principles of Psychology. Cambridge MA: Harvard UP. [91] Jollife, I.T. (1986). Principle component analysis. New York: Springer-Verlag. [92] Jones, H.E., Grieve, K.L., Wang, W., & Silito, A.M. (2001). Surround Supression in primate V1. J. Neurophysiol, 86, 2011-2028. [93] Kapadia, M.K., Ito, M., Gilbert, C.D., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15, 843-856. 235 [94] Kapadia, M.K., Westheimer, G., & Gilbert, C.D. (2000). 
Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J. Neurophysiol, 84, 2048-2062. [95] Kapur, S., & Mamo, D. (2003). Half a century of antipsychotics and still a central role for dopamine D2 receptors. Prog Neuropsychopharmacol Biol Psychiatry. 27, 7, 1081-1090. [96] Keysers, C., & Perrett, D.I. (2002). Visual masking and RSVP reveal neural competition. Trends in Cognitive Sciences, 6 (3), 120-125. [97] Koch, C., & Ullman, S. (1985). Shifts in selective visual attention towards the underlying neural circuitry. Human Neurobiology, 4 (4), 219-227. [98] Koffka, K. (1935). Principles of gestalt psychology. London: Lund Humphries. [99] Kovács, I., & Julesz, B. (1993). A closed curve is much more than an incomplete one Effect of closure in Figure-ground segmentation. PNAS, 90, 7495-7497. [100] Krimer, L.S., & Goldman-Rakic, P.S. (2001). Prefrontal Microcircuits membrane properties and excitatory input of local medium and wide arbor interneurons. The Journal of Neuronscience, 21 (11), 3788-3796. [101] Kuffler, S.W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16, 37-68. [102] Kullback, S., & Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79-86. [103] Lamme, V.A., & Roelfsema, P.R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neuroscience, 23 (11), 571-579. [104] Lamme, V.A.F. (2004). Seperate Neural Definitions of Visual Consciousness and Visual Attention; A Case for Phenomenal Awareness Neural Networks, 17, 861-872. [105] Laruelle, M., Kegeles, L.S., & Abi-Dargham, A. (2003). Glutamate, dopamine and schizophrenia: from pathophysiology to treatment. Ann N Y Acad Sci, 1003, 138-158. [106] Lee, S.-H., & Blake, R. (2001). Neural synergy in visual grouping: when good continuation meets common fate. Vision Research, 41, 2057-2064. [107] Leventhal, A.G. (1991). The Neural Basis of Visual Function (Vision and Visual Dysfunction Vol 4). Boca Raton, FL: CRC Press. 236 [108] Li, F.F., VanRullen, R., Koch, C., & Perona, P. (2002). Rapid natural scene categorization in the near absence of attention. PNAS, 99 (14), 9596-9601. [109] Li, W., & Gilbert, C.D. (2002). Global Contour Saliency and Local Colinear Interactions. Journal of Neurophysiology, 88, 2846-2856. [110] Li, Z. (1998). A Neural model of contour integration in the primary visual cortex. Neural Computation, 10, 903-940. [111] Li, Z. (2002). A Saliency Map in Primary Visual Cortex. Trends in Cognitive Sciences, 6 (1), 9-16. [112] Lowe, D.G. (1999). Object recognition from local scale-invariant features. Seventh IEEE International Conference on Computer Vision, 2 (pp. 1150-1157). [113] Luschow, A., & Nothdurft, H.C. (1993). Pop-out of Orientation but no pop-out of Motion at Isoluminance. Vision Research, 33 (1), 91-104. [114] MacAdam, D.L. (1935). Maximum visual efficiency of colored materials. JOSA, 25, 316-367. [115] Mack, A., & Rock, I. (1998). Inattentional Blindness. Cambridge, MA: MIT Press. [116] Maki, W.S., & Mebane, M.W. (2006). Attentional capture triggers an attentional blink. Psychonomic Bulletin and Review, 13 (1), 125-131. [117] Marois, R., Yi, D.-J., & Chun, M.M. (2004). The neural fate of consciously perceived and missed events in the attentional blink. Neuron, 41, 465-472. [118] McLaren, K. (1976). The development of the CIE 1976 (L*a*b*) uniform colour- space and colour-difference formula. 
Journal of the Society of Dyers and Colourists, 92, 338-341. [119] McMains, S.A., & Somers, D.C. (2004). Multiple Selection of Attentional Selection in Human Visual Cortex. Neuron, 42, 677-686. [120] Meyer, G. (1976). Psychophysical measurement of cortical color mechanisms not sensitive to spatial frequency. Perception, 5 (2), 143-145. [121] Miniussi, C., Rao, A., & Nobre, A.C. (2002). Watching where you look, modulation of visual processing of foveal stimuli by spatial attention. Neuropsychologia, 40 (13), 2448-2460. [122] Mounts, J.R.W., & Gavett, B.E. (2004). The role of salience in localized attentional interference. Vision Research, 44, 1575-1588. 237 [123] Mullen, K.T., Beaudot, W.H.A., & McIlhagga, W.H. (2000). Contour integration in color vision, a common process for the blue-yellow red-green and luminance mechanisms? Vision Research, 40, 639-655. [124] Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12 (2), 181-202. [125] Mundhenk, T.N. (2005). Filling in missing signal data holes using Fourier series regression and Kalman filtering. http://www.nerd-cam.com/misc/Filling-in-missing- spectral-data.pdf. [126] Mundhenk, T.N., Ackerman, C., Chung, D., Dhavale, N., Hudson, B., Hirata, R., Pichon, E., Shi, Z., Tsui, A., & Itti, L. (2003a). Low-cost high-performance mobile robot design utilizing off-the-shelf parts and the Beowulf concept: the Beobot project. SPIE Conference on Intelligent Robots and Computer Vision XXI, 5267 (pp. 293-303). Providence, RI. [127] Mundhenk, T.N., Dhavale, N., Marmol, S., Calleja, E., Navalpakkam, V., Bellman, K., Landauer, C., Arbib, M.A., & Itti, L. (2003b). Utilization and viability of biologically-inspired algorithms in a dynamic multi-agent camera surveillance system. SPIE Conference on Intelligent Robots and Computer Vision XXI, 5267 (pp. 281-292). Providence, RI. [128] Mundhenk, T.N., Einhäuser, W., & Itti, L. (2009). Automatic Computation of an Image’s Statistical Surprise Predicts Performance of Human Observers on a Natural Image Detection Task. Vision Research, 49, 1620-1637. [129] Mundhenk, T.N., Everist, J., Landauer, C., Itti, L., & Bellman, K. (2005a). Distributed biologically based real time tracking in the absence of prior target information. SPIE Conference on Intelligent Robots and Computer Vision XXIII, 6006 (pp. 330-341). Boston, MA. [130] Mundhenk, T.N., Everist, J., Landauer, C., Itti, L., & Bellman, K. (2005b). Distributed biologically based real time tracking in the absence of prior target information. SPIE, 6006 (pp. 330-341). Boston, MA. [131] Mundhenk, T.N., & Itti, L. (2003). CINNIC, a new computational algorithm for modeling of early visual contour integration in humans. Neurocomputing, 52-54, 599- 604. [132] Mundhenk, T.N., & Itti, L. (2005). Computational modeling and exploration of contour integration for visual saliency. Biological Cybernetics, 93 (3), 188-212. 238 [133] Mundhenk, T.N., & Itti, L. (2006). Surprise bottom-up reduction and control in images and videos. 13th Joint Symposium on Neural Computation (pp. online http://www.jsnc.caltech.edu/2006/posters/mundhenk-t.pdf). La Jolla, CA. [134] Mundhenk, T.N., & Itti, L. (2007). 3D Saliency Version 0.1a. http://www.mundhenk.com/publications/WEBPUBLISHED-2007-3D-Saliency.pdf. [135] Mundhenk, T.N., Landauer, C., Bellman, K., Arbib, M.A., & Itti, L. (2004a). Teaching the computer subjective notions of feature connectedness in a visual scene for real time vision. 
SPIE, 5608 (pp. 136-147). Philadelphia, PA. [136] Mundhenk, T.N., Landauer, C., Bellman, K., Arbib, M.A., & Itti, L. (2004b). Teaching the computer subjective notions of feature connectedness in a visual scene for real time vision. SPIE Conference on Intelligent Robots and Computer Vision XXII: Algorithms, Techniques, 5608 (pp. 136-147). Philadelphia, PA. [137] Mundhenk, T.N., Navalpakkam, V., Makaliwe, H., Vasudevan, S., & Itti, L. (2004c). Biologically inspired feature based categorization of objects. SPIE Human Vision and Electronic Imaging IX, 5292 (pp. 330-341). San Jose, CA. [138] Nakayama, K. (1990). The iconic bottleneck and the tenuous link between early visual processing and perception. In: C. Blakemore (Ed.) Vision: Coding and Efficiency (pp. 411-422): Cambridge University Press. [139] Navalpakkam, V., & Itti, L. (2002). A Goal Oriented Attention Guidance Model. Lecture Notes in Computer Science, 2525, 453-461. [140] Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45 (2), 205-231. [141] Navalpakkam, V., & Itti, L. (2006). Optimal cue selection strategy. Advances in Neural Information Processing Systems (NIPS) (pp. 987-994). [142] Navalpakkam, V., & Itti, L. (2007). Search goal tunes visual features optimally. Neuron, 53 (4), 605-617. [143] Neisser, U. (1967). Cognitive Psychology. New York: Appleton-Century-Crofts. [144] Neisser, U., & Becklen, R. (1975). Selective Looking: Attending to Visually Specified Event. Cognitive Psychology, 7 (4), 480-494. [145] Nothdurft, H.C. (1991a). The role of local contrast in popout of orientation motion and color. Investigative Opthalmology & Visual Science, 32, 714. 239 [146] Nothdurft, H.C. (1991b). Texture segmentation and popout from orientation contrast. Vision Research, 31, 1073-1078. [147] Nothdurft, H.C. (1992). Feature analysis and the role of similarity in preattentive vision. Perception and Psychophysics, 52, 355-375. [148] Olivers, C.N.L., & Meeter, M. (2008). A Boost and Bounce Theory of Temporal Attention. Psychological Review, 115 (4), 836-863. [149] Olshausen, B.A., & Field, D.J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607-609. [150] Pack, C.C., Livingstone, M.S., Duffy, K.R., & Born, R.T. (2003). End-stopping and the aperture problem, Two-dimensional motion signals in Macaque V1. Neuron, 39, 671- 680. [151] Parkhurst, D.J., & Niebur, E. (2003). Scene content selected by active vision. Spatial Vision, 16 (2), 125-154. [152] Pernberg, J., Jirmann, K.U., & Eysel, U.T. (1998). Structure and dynamics of receptive fields in the visual cortex of the cat area 18 and the influence of GABAergic inhibition. Eur J Neurosci, 10 (12), 3596-3606. [153] Peters, R.J., Gabbiani, F., & Koch, C. (2003). Human visual object categorization can be described by models with low memory capacity. Vision Research, 43, 2265-2280. [154] Plaut, D., Nowlan, S., & Hinton, G.E. (1986). Experiments on learning by back propagation. Technical Report CMU-CS-86-126 (Pittsburgh, PA: Department of Computer Science, Carnegie Mellon University. [155] Polat, U., Mizobe, K., Pettet, M.W., Kasamatsu, T., & Norcia, A.M. (1998). Collinear stimuli regulate visual responses depending on cell’s contrast threshold. Nature, 391 (5), 580-584. [156] Polat, U., & Sagi, D. (1993a). The architecture of perceptual special interactions. Vision Research, 34 (1), 73-78. [157] Polat, U., & Sagi, D. (1993b). 
Lateral interactions between spatial channels, suppression and facilitation revealed by lateral masking experiment. Vision Research, 33 (7), 993-999. [158] Polat, U., & Sagi, D. (1994). Spatial interaction in human vision, From near to far via experience-dependant cascades of connections. PNAS, 91, 1206-1209. 240 [159] Potter, M.C., Staub, A., & O'Conner, D.H. (2002). The time course of competition for attention: Attention is initially labile. Journal of Experimental Psychology, 28 (5), 1149-1162. [160] Powell, M.J.D. (1978). A Fast Algorithm for Nonlinearly Constrained Optimization Calculations. In: G.A. Watson (Ed.) Lecture Notes in Mathematics, 630 (pp. 144–157). Berlin, Germany: Springer-Verlag. [161] Prodöhl, C., Würtz, R.P., & von der Malsberg, C. (2003). Learning the gestalt rule of collinearity from object motion. Neural Computation, 15, 1865-1896. [162] Raab, D.H. (1963). Backward Masking. Psychological Bulletin, 60 (2), 118-129. [163] Rao, R.P.N., & Ballard, D.H. (1999). Predictive coding in the visual cortex, a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2 (1), 79-87. [164] Raymond, J.E., Shapiro, K.L., & Arnell, K.M. (1992). Temporary suppression of visual processing in an RSVP task: an attentional blink? Journal of Experimental Psychology: Human Perception and Performance, 18, 849-860. [165] Reeves, A. (1980). Visual Imagery in Backward Masking. Perception and Psychophysics, 28, 118-124. [166] Reeves, A. (1982). Metacontrast U-shaped Functions Derive from Two Monotonic Functions. Perception, 11, 415-426. [167] Reeves, A., & Sperling, G. (1986). Attention Gating in Short-Term Visual Memory. Psychological Review, 93, 180-206. [168] Ricciardi, L.M. (1995). Diffusion models of neuron activity. In: M.A. Arbib (Ed.) The Handbook of Brain Theory and Neural Networks (pp. 299-304). Cambridge, MA: The MIT Press. [169] Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington D.C.: Spartan. [170] Rousselet, G.A., Fabre-Thorpe, M., & Thorpe, S.J. (2002). Parallel processing in high-level categorization of natural images. Nature Neuroscience, 5 (7), 629-630. [171] Rousselet, G.A., Thorpe, S.J., & Fabre-Thorpe, M. (2004). Processing of one, two or four natural scenes in humans: the limit of parallelism. Vision Research, 44, 877-894. [172] Rubin, N. (2001). The role of junctions in surface completion and contour matching. Perception, 30, 339-366. 241 [173] Sceniak, M.P., & Hawken, M.J.S., R (2001). Visual spatial characterization of macaque V1 neurons. J. Neurophysiol, 85, 1873-1887. [174] Schmolesky, M.T., Wang, Y., Hanes, D.P., Thompson, K.G., Leutgeb, S., Schall, J.D., & Leventhal, A.G. (1998). Signal timing across the macaque visual system. Journal of Neurophysiology, 79 (6), 3272-3278. [175] Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36, 241- 263. [176] Sergent, C., Baillet, S., & Dehaene, S. (2005). Timing of the brain events underlying access to consciousness during the attentional blink. Nature Neuroscience, 8 (10), 1391-1400. [177] Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. PNAS, 104 (15), 6424-6429. [178] Shashua, A., & Ullman, S. (1988). Structural Saliency. Proceedings of the International conference on computer vision (pp. 482-488). [179] Shevelev, I.A., Jirmann, K.U., Sharaev, G.A., & Eysel, U.T. (1998). 
Contribution of GABAergic inhibition to sensitivity to cross-like figures in striate cortex. Neuroreport, 9 (14), 3153-3157. [180] Shiffrin, R.M., & Schneider, W. (1977). Controlled and Automatic Human Information Processing: II. Perceptual Learning, Automatic Attending, and a General Theory. Psychological Review, 84 (2), 127-190. [181] Shih, S.-I. (2000). Recall of two visual targets embedded in RSVP streams of distractors depends on their temporal and spatial relationship. Perception and Psychophysics, 62 (7), 1348-1355. [182] Shih, S.-I., & Reeves, A. (2007). Attention capture in Rapid Serial Visual Presentation. Spatial Vision, 20 (4), 301-315. [183] Shih, S.-I., & Sperling, G. (2002). Measuring and Modeling the Trajectory of Visual Spatial Attention. Psychological Review, 109 (2), 260-305. [184] Siagian, C., & Itti, L. (2007). Rapid Biologically-Inspired Scene Classification Using Features Shared with Visual Attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (2), 300-312. [185] Silverstein, S.M., Kovács, I., Corry, R., & Valone, C. (2000). Perceptual organization the disorganization syndrom and context processing in chronic schizophrenia. Schizophrenia Research, 43, 11-20. 242 [186] Smith, A.R. (1978). Color Gamut Transform Pairs. SIGGRAPH, 12 (pp. 12-19). [187] Smith, T., & Guild, J. (1931). The C.I.E. colorimetric standards and their use. Transactions of the Optical Society, 33, 73-134. [188] Soto, D., Hodsol, l.J., Rotshtein, P., & Humphreys, G.W. (2008). Automatic guidance of attention from working memory. Trends in Cognitive Sciences, 12 (9), 342- 348. [189] Sperling, G. (1960). The information available in brief visual presentations. Psychological Monographs, 74, 498. [190] Sperling, G. (1965). Temporal and Spatial Visual Masking. I. Masking by Impulse Flashes. Journal of the Optical Society of America, 55 (5), 541-559. [191] Sperling, G., Reeves, A., Blaser, E., Lu, Z.-L., & Weichselgartner, E. (2001). Two computational models of attention. In: J. Braun, C. Koch, & J.L. Davis (Eds.), Visual Attention and Cortical Circuits (pp. 177-214). Cambridge, MA: MIT Press. [192] Sperling, G., & Weichselgartner, E. (1995). Episodic Theory of the Dynamics of Spatial Attention. Psychological Review, 102 (3), 503-532. [193] Stroop, J.R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18 (643-662) [194] Suri, R.E., Bargus, J., & Arbib, M.A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience, 103 (1), 65-85. [195] Tanaka, Y., & Sagi, D. (2000). Attention and short-term memory in contrast detection. Vision Research, 40, 1089-1100. [196] Tanimoto, S., & Pavlidis, T. (1975). A hierarchical data structure for image processing. Computer Graphics and Image Processing, 4, 104-119. [197] Tootell, R.B., Silverman, M.S., Switkes, E., & De Valois, R.L. (1982). Deoxyglucose analysis of retinotopic organization in primate striate cortex Science, 218 (4575), 902-904. [198] Treisman, A., & Souther, J. (1985). Search asymmetry: A diagnostic for preattentive processing of seperable features. Journal of Experimental Psychology: General, 114 (3), 285-310. [199] Treisman, A.M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12 (1), 97-136. 243 [200] Treisman, A.M., & Gormican, S. (1988). Feature Analysis in Early Vision: Evidence From Search Asymmetries. Psychological Review, 95 (1), 15-48. [201] Usher, M., Bonneh, Y., Sagi, D., & Herrmann, M. (1999). 
Mechanisms for spatial integration in visual detection, a model based on lateral interaction. Spatial Vision, 12 (2), 187-209. [202] VanRullen, R., Carlson, T., & Cavanagh, P. (2007). The blinking spotlight of attention. PNAS, 104 (49), 19204-19209. [203] VanRullen, R., & Koch, C. (2003a). Competition and selection during visual processing of natural scenes and objects. Journal of Vision, 3, 75-85. [204] VanRullen, R., & Koch, C. (2003b). Visual selective behavior can be triggered by a feed-forward process. Journal of Cognitive Neuroscience, 15 (2), 209-217. [205] VanRullen, R., Reddy, L., & Koch, C. (2004). Visual search and dual tasks reveal two distinct attentional resources. Journal of Cognitive Neuroscience, 16 (1), 4-14. [206] Varela, J.A., Sen, K., Gibson, J., Fost, J., Abbott, L.F., & Neslon, S.B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. The Journal of Neuroscience, 17 (20), 7926-7940. [207] Visser, T.A.W., Zuvic, S.M., Bischof, W.F., & Di Lollo, V. (1999). The Attentional Blink with Targets in Different Spatial Locations. Psychonomic Bulletin and Review, 6 (3), 432-436. [208] von Békésy, G. (1967). Sensory inhibition. Princeton University Press. [209] Von der Malsberg, C. (1981). The correlation theory of brain function. Internal Report 81–2 (Göttingen, Germany: Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry. [210] Von der Malsberg, C. (1987). Synaptic plasticity as basis of brain organization. In: S. Bernhard (Ed.) The Neural and Molecular Basis of Learning (pp. 411-432): Dahlem Konferenzen. [211] Wang, X.J., Tegner, J., Constantinidis, C., & Goldman-Rakic, P.S. (2004). Division of labor among distinct subtypes of inhibitory neurons in cortical microcircuits of working memory. PNAS, 101 (5), 1368-1373. [212] Weisstein, N., & Haber, R.N. (1965). A U-Shaped Backward Masking Function in Vision. Psychonomic Science, 2, 75-76. 244 [213] Wertheimer, M. (1923). Law of organization in perceptual form. In: E.W. D (Ed.) A source book of gestalt psychology (pp. 71-88). New York: The Humanities Press [214] Whitney, D., Goltz, H.C., Thomas, C.G., Gati, J.S., Menon, R.S., & Goodale, M.A. (2003). Flexible Retinotopy: Motion-Dependent Position Coding in the Visual Cortex. Science, 302 (5646), 878-881. [215] Wolfe, J.M. (1994a). Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1 (2), 202-238. [216] Wolfe, J.M. (1994b). Visual search in continuous, naturalistic stimuli. Vision Research, 34 (9), 1187-1195. [217] Wolfe, J.M., & DiMase, J.S. (2003). Do intersections serve as basic features in visual search? Perception, 32, 645-656. [218] Wolfe, J.M., Horowitz, T.S., & Michod, K.O. (2007). Is visual attention required for robust picture memory? Vision Research, 47, 955-964. [219] Wolfe, J.M., O'Neill, P., & Bennett, S.C. (1998). Why are there eccentricity effects in visual search? Visual and attentional hypotheses. Percept Psychophys, 60 (1), 140-156. [220] Wright, W.D. (1929). A re-determination of the trichromatic coefficients of the spectral colours. Transactions of the Optical Society, 30, 141-164. [221] Yen, S., & Fenkel, L.H. (1998). Extraction of perceptually salient contours by striate cortical networks. Vision Research, 38 (5), 719-741. [222] Yu, C., & Levi, D.M. (2000). Surround modulation in human vision unmasked by masking experiments. Nature Neuroscience, 3 (7), 724-728. [223] Zenger, B., & Sagi, D. (1996). 
Isolating excitatory and inhibitory nonlinear spatial interactions involved in contrast detection. Vision Research, 36 (16), 2497-2513. 245 Appendix A: Contour Integration Model Parameters Max range for collinear separation for excitation 0º - 31º 2 e P (Kernel polynomial parameter) -0.75 3 e P (Kernel polynomial parameter) 0.095 W (Kernel inhibition multiplier) 0.65 2 s P (Kernel polynomial parameter) 0.16 3 s P (Kernel polynomial parameter) -0.1 A(Pass through multiplier) 30.0 L (Constant leak) 94.0 F (Fast plasticity gain) 1.0001 Max group size 768 neurons (8x8x12) T (Max group suppression threshold) 50,000 v (Group suppression gain) 0.0003 u w 64x64 scale weight 0.58 u w 32x32 scale weight 0.85 u w 16x16 scale weight 0.35 246 Appendix B 11 : Mathematical Details on Surprise Given a set M of possible models (or hypotheses), an observer with prior distribution () P M over the models, and data D such that: (7.1) () ( ) ( ) () | PD M P M PM D PD = Notably, the data can in this case be many things such as a pixel value or the value of the output of a feature detector. We are also not constrained to just vision. To quantitatively measure how much effect the data observation ( ) D had on the observer beliefs ( ) M , one can simply use some distance measure d between the two distributions () PM D and () P M . This yields to the definition of surprise as (Itti & Baldi, 2005, Itti & Baldi, 2006): (7.2) ( ) ( ) ( ) ,, SD d P M D P M ⎡ ⎤ = ⎣ ⎦ M One such measure is the Kullback-Liebler (KL) distance (Kullback & Leibler, 1951) generically defined as: (7.3) () ( ) () ln p Lp dx p =− ∫ x x x 11 A general surprise framework is downloadable for Matlab from http://sourceforge.net/projects/surprise- mltk. The full vision code in gcc is downloadable from http://ilab.usc.edu 247 By combining (7.3) and (7.1) we get the surprise given the data and the model as: (7.4) ( ) ( ) ( ) ( ) ,log log S D PD P M PDM dM =− ∫ M M Figure B.1: The Gamma Probability Distribution is shown for different values of λ, α and β. In all, three different views are shown for the same Gamma PDF. Since there are three free hyperparameters, each view is rather distinct. The lower right hand corner shows the KL distance of the Gamma PDF. (See appendix H for figure graphing functions) Surprise can have as its base a variety of statistical processes, so long as the KL distance can be derived. In this work, we used a Gamma / Poisson distribution since it 248 models neural firing patterns (Ricciardi, 1995) and since it gives a natural probability over the occurrence of events. We can define the Gamma process (Figure B.1) as: (7.5) () () () () 1 ;, for 0 e PM αβλ α β λγλαβ λ λ α − − = => Γ Figure B.2: This is an illustration of the surprise values derived from KL. Notice that when the new value α′jumps above the old value α , then surprise is much greater than when the expected value falls by the same amount. This is not unexpected since the KL divergence is asymmetric. (See appendix H for figure graphing functions) and its corresponding KL distance as: (7.6) ()( ) () ( ) () ()() ;, , ; , log log KL α ββ γ λα β γ λα β α α α α α α βα β Γ ′ ′′ ′ ′ ′ ′ =− + + + + − Ψ ′′ Γ For the Gamma PDF we have defined β such that: 249 (7.7) 1 β θ = This is the inverse scale Gamma PDF. For the non inverse Gamma PDF, the KL distance is given by simply switching β with β’ in equation (7.6). 
Here we note that Γ is the gamma function (not to be confused with the gamma probability distribution given in (7.5)) and Ψ is the digamma function which is the log derivative of the gamma function Γ . Surprise itself in this case, is the KL distance multiplied by 1/sqrt(2) (Units in Wows). How the changes in parameters effect surprise can be seen in figure B.2. To make this model work, we need to know what α and β are and how to update them to get α′ and β ′ . Importantly, in this notation, α′is just the update of α at the next time step. Given some input data d such as a feature filter response, we can derive α′ , which is abstractly analogous to the mean in the normal distribution. This is given as: (7.8) d ααζ β ′ = ⋅+ Here ζ is a decay forgetting term. To reiterate, this assumes that the underlying process is Poisson. Otherwise one must use a more complex method to compute α′ such as the Newton-Raphson method (Choi & Wette, 1969) or use a completely different process for computing surprise such as Gaussian process. The value of β ′ , which is abstractly analogous to the standard deviation in the normal distribution can be computed as: 250 (7.9) 1 β βζ ′ = ⋅+ In this formulation, β is independent of new data for temporal surprise, but not for spatial surprise which computes the β term from the image surround. (7.9) is sufficient for analysis of RSVP sequences since model evidence collection can start in a naïve state. Otherwise, for long sequences such as movies, β should be allowed to float based on longer term changes in the state of evidence. To deal with different time scales which have different gains, the results from a model are fed forward into successive models from 1 to i where i is the ith model out of n time scales () 1 in ≤≤ . In this case, 6 time scales are used. For each scale we will compute a different α′and thus a different surprise value. For each time scale, this takes the form of: (7.10) 1 1 2 2 ii i d αζ β αζ β αα ζ β ⋅+ ⋅+ ′=⋅ + Over a series of images, surprise can be computed in temporal and spatial terms. Temporal surprise is a straight forward computation with updates (7.8) and (7.9) where d is the value of a single location in an image andα is the hyper parameter computed on the previous frame as α′ . 251 Figure B.3: Shown are examples of DoG filters which can be used to create Center/On – Surround/Off effects. (See appendix H for figure graphing functions) Spatial surprise is computed with d also being the value of a single location, but α and β are treated like the Gamma mean and variance given values of the surrounding locations in an image. This makes spatial surprise the surprising difference between a location and its surround. Given the data from m locations in the surround D and locations j such that j D ∈ and a Difference of Gaussian (DoG) (Figure B.3) weighting kernel w, we compute a weight sum W of the kernel as: (7.11) 1 m j j Ww = = ∑ The spatial variance β is computed as: (7.12) 1 1 m j j j w W ββ = =⋅⋅ ∑ 252 Since this operates at multiple time scales, all j β are initialized to 1 for the first time scale, but are set to the previous time scales β ′ value thereafter. The expected value at an image locationα computed from β is: (7.13) 2 1 m j j j w W β αα = = ⋅⋅ ∑ Again, j α is initialized as the data value for the surrounding image locations in the first time scale, but is set to the previous value as in (7.8) after that. 
The values α and β can then be plugged into (7.12) and (7.13) for α and β respectively which allows surprise to be computed as in (7.6) for space at each location in an image. Total surprise for a feature channel S at each location is then computed as the product of surprise for the sum of temporal t and space p across all time scales n.: (7.14) ()( ) ( ) 13 11 n pt pn tn SS S S ⋅ ⎡⎤ =+ ⋅⋅ + ⎣⎦ … S 253 Appendix C: Kullback-Liebler Divergences of Selected Probability Distributions The Kullback-Liebler (KL) divergence (Sometimes also called the KL Distance) gives the amount of information gain between two statistical distributions. To a certain degree this can be imagined as how much two distributions overlap. In this case we will talk about the distance as being a metric of similarity between two probability distributions functions (PDF). Given two PDF’s Unknown True ( ) p x and Observed ( ) p x we can tell how alike they are in terms of information by the Kullback-Leibler divergence: (7.15) () ( ) () ln p Lp dx p =− ∫ x x x This is useful for us as the basis for surprise [(7.15) can be visualized in figure C.1]. This is because one way of interpreting the KL distance is that it is the difference between ( ) p x , which is what we believe is our best estimate and ( ) p x , a possible new observed distribution. If the difference between the two distributions is large then we are in general surprised since our observation deviates strongly from our prior belief. C.1 Conceptual Notes on the KL Distance The equation seems a bit odd to people who have not yet been introduced to information theory. It is most informative to understand what is in fact measured. To do this, it is helpful to go back to the original equations and justifications for entropy in 254 physical systems which form its basis. Basically, entropy as we are interested in is a differential. In physical systems heat is changed into entropy. Thus, entropy is computed from the change in temperatures in a system. In other words (and very importantly), entropy is a measure of change. In physical systems it is the loss of energy, but here it is the gain of information. Figure C.1: This is an illustration of the area which is integrated when computing the KL divergence. On the top left is an example of two Gaussian PDF’s and to the right of that is the area which when integrated gives the KL metric. Below it are three more examples showing that as the mean between two distributions grows linearly, the area integrated grows much faster. (See appendix H for figure graphing functions) Information itself is quantified as a minimal subset required to transmit some information. The Information is transmitted in a minimal fashion when each bit of information can be structured into a tree. This gives one a header bit and each bit that follows gives a unique piece of information from that bit. The minimum length of some 255 information we wish to transmit thus becomes the depth of such a tree. This is one place where the Log comes from since the depth of a tree is a logarithmic function of the total number of nodes. 
C.2 KL of the Gaussian Probability Distribution Given the Gaussian (Normal) probability distribution: (7.16) () () 2 2 2 1 ;, 2 x xe μ σ νμ σ σπ − − = We derive the KL divergence as: (7.17) ()( ) () () 2 2 22 1 ;, , ; , log 22 2 KL x x μμ σσ νμσ ν μσ σσ σ ⎛⎞ ′ ′ − ′ ′′ =− + − − ⎜⎟ ⎜⎟ ⋅⋅ ⎝⎠ Notice that this has some similarities to the Fisher criterion which is a basic measure of the similarity between two Gaussian PDF’s: (7.18) () 2 12 22 12 μμ σ σ − + C.3 KL of the Gamma Probability Distribution Given the Gamma PDF which gives the probability of a waiting time given a rate α and inverse scale parameter β: (7.19) () () 1 ;, x e xx α β α β γαβ α − − = Γ 256 The KL divergence is given as: (7.20) ()( ) () () () () () ;, , ; , log log KL x x α ββ γ αβ γ α β α α α α α α ββ α Γ ′ ′′ ′ ′ ′ ′ =− + ⋅ + ⋅ + + − ⋅Ψ ′′ Γ The gamma function Γ which is not to be confused with the gamma PDF gives a factorial with support for floating points. It can be defined as the definite integral in Euler’s integral form as: (7.21) () [] 1 0 ; 0 zt ztedtz ∞ −− Γ≡ℜ> ∫ Notice that interestingly, by this definition we can rewrite the Gamma PDF as: (7.22) () 1 1 0 ;, x t x e x tedt α β α α γαβ β −− ∞ −− ⋅ =⋅ ∫ 257 Figure C.2: The KL distance is given for a joint Gamma-Gamma distribution. The four α hyperparameters are changed, but the β hyperparameters are held constant in this example. (See appendix H for figure graphing functions) 258 In this form it is obvious how the equation normalizes in the same way as the Gaussian (normal) PDF. The digamma (psi) Ψ is the derivative of the gamma function given as: (7.23) () () ( ) () ln z d zz dz z ′ Γ Ψ≡ Γ ≡ Γ Fortunately, both digamma and gamma functions are commonly implemented in many programming languages including gcc, Matlab and Mathematica. However, closed form solutions do not exist and as such both are approximated by series. This also means both are expensive to compute. C.4 KL of the Joint Gamma-Gaussian or Gamma-Gamma Distribution The KL divergence for the joint independent Gamma-Gaussian and Gamma- Gamma distributions (Figure C.2) is separable. For instance, using the definitions from (7.16) and (7.19) for the Gamma and Gaussian PDF’s we would computed KL of the joint PDF as: (7.24) ()( ) ( ) ( ) ()( ) 0 ;, ;, ;, ;, log ;, ;, xy xydydx xy νμσ γ αβ νμσ γ αβ νμσ γ αβ ∞∞ −∞ ⎡⎤ ′′ ′ ′ ⋅ −⋅ ⋅ ⎢⎥ ⋅ ⎣⎦ ∫∫ But it turns out that the solution is simply: (7.25) () ( ) () ( )( ) ( ) ;, , ; , ; , , ; , KL x x KL x x νμσν μ σ γ α βγ α β ′′ ′ ′ + This holds for the joint Gamma-Gamma distribution as well. 259 The proof for this is to work out the integral in (7.24). This is not trivial, but it can be integrated by parts. In this case, it can be broken into 10 parts. Eight of which are integrable with no teasing using Mathematica 6. 
The 10 parts including a constant factor are: (7.26) 2() C α β πσα ′ = ′ ⋅⋅Γ (7.27) 2 2 () 1 2 1 log 2 x z f te z μ β α σ π σ ′ − ′ −− − ′ ⎛⎞ = ⎜⎟ ′ ⋅ ⎝⎠ (7.28) 2 2 () 1 2 log () x z g te z μ α β α σ β α ′ − ′ ′ −− ′− ′ ′ ⎛⎞ = ⎜⎟ ′ Γ ⎝⎠ (7.29) () 2 2 () 1 2 log x z z h te z e μ β αβ σ ′ − ′ −− ′ ′ −− ′ = (7.30) 22 22 () () 1 22 log xx z i te z e μμ β α σσ ′′ −− ′ −− − ′− ′′ ⎛⎞ ⎜⎟ = ⎜⎟ ⎝⎠ (7.31) 2 2 () 1 2 1 log 2 x z j te z μ β α σ π σ ′ − ′ −− ′− ′ ⎛⎞ = ⎜⎟ ⋅ ⎝⎠ (7.32) 2 2 () 1 2 1 log 2 x z k te z μ β α σ π σ ′ − ′ −− ′− ′ ⎛⎞ = ⎜⎟ ⋅ ⎝⎠ (7.33) 2 2 () 1 2 log () x z l te z μ α β α σ β α ′ − ′ −− ′− ′ ⎛⎞ = ⎜⎟ Γ ⎝⎠ 260 (7.34) () 2 2 () 1 2 log x z z m te z e μ β αβ σ ′ − ′ −− ′−− ′ = (7.35) 22 22 () ( ) 1 22 log xx z n te z e μμ β α σσ ′−− ′ −− − ′− ′ ⎛⎞ ⎜⎟ = ⎜⎟ ⎝⎠ (7.36) () 2 2 () 2 22 2 1 log x z o te z z μ β αα σ ′ − ′ −− ′−− ′ = The first part can be plugged in and worked out immediately in Mathematica 6 as: (7.37) 0 kl m o f g h j tt t t t t t tdydx ∞∞ −∞ −+++−−−− ∫∫ which turns out to be (if somewhat messy): (7.38) () 1 1 2 (0) (0) 111 2 ( ) log( ) log log log () log ( ) ( 1) log( ) ( ) () α α α β βπβ α α β σσσα β αψ α α β β ψ α β β β α − ′ ′ −− ⎡ ⎛ ⎛⎞ ⎛ ′ ⎛⎞ ⎛⎞ ⎛ ⎞ ′′′ ′ ⋅Γ − − + − ⎢ ⎜ ⎜⎟ ⎜ ⎜⎟ ⎜⎟ ⎜ ⎟ ⎜⎟ ⎜ ′′′ Γ ⎝⎠ ⎝ ⎠ ⎢ ⎝⎠ ⎝ ⎝⎠ ⎝ ⎣ ⎤ ⎞ ⎞ ⎛⎞ ′ ′ ′′ ′′ ′ ++⋅ +Γ+ ⋅− ⋅+− ⎥ ⎟ ⎟ ⎜⎟ ⎟ Γ ⎥ ⎝⎠ ⎠ ⎠⎦ The second part: (7.39) 0 ni t t dydx ∞∞ −∞ −− ∫∫ Can be integrated by applying some simplifications and integrating: 261 (7.40) () ( ) ( ) () 2 2 22 22 2 22 x xx dx e α μ σ μμ α σσ β ∞ ′ ′− −∞ ′ ⋅ ′−− − ′ Γ ′⋅⋅ ⋅− ′ ∫ Which turns out to be: (7.41) () () () 2 22 2 2 2 1 α π σσ μ μ α β σ σ ′ ′′ ⋅− + − ′ Γ ′ ′ From (7.38) and (7.41) we combine and simplify and find that in fact, (7.24) is the same as (7.25). The same method also establishes the same results for a joint Gamma-Gamma distribution. 262 Appendix D: Junction Channel Computation and Source Figure D.1: The junction filter can be thought of as a convolution with a set {r} of offset Gabor filters at a distance d from the center. For instance, when looking for junction, we will inspect the response at locations r 0 , r 2 , r 4 and r 6 . The junction channel used in the surprise and attention gate models were implemented by Vidhya Navalpakkam, but their implementation is not previously documented. Here we spell out the details of how the junctions conspicuity is computed in the iLab Neuromorphic Vision Toolkit. To start out with, the implantation is generic so that the same basic filter that is sensitive to end-stops can also be used to look for junctions. This is done by turning on or off selection for different lines. The inputs to the junction channel filter are Gabor orientation filtered images at 0, 45, 90 and 135 degrees. These are used to compute a response based on an offset from a central location (figure D.1). For instance, given a location L at coordinates (i,j) in an image and a constant distance d, r 0 is at location (i,j+d), r 2 is at (i+d,j), r 4 is at (i,j-d) and 263 r 6 is at (i-d,j). There would then be two computable X junctions. One comprised of r 0 , r 2 , r 4 and r 6 and other comprised of r 1 , r 3 , r 5 and r 7 . To compute two X junctions x 1 and x 2 we would use the following computation: (7.42) 1 024 6 21 3 5 7 x rrrr x rr r r = ⋅⋅⋅ = ⋅⋅ ⋅ Note that if any of the responses are 0, then the junction response is also 0. Also notice that there is no inhibition factor. So, if for instance r 5 gives a response, it will not interact with x 1 . 
Since the Gabor filtered locations are proximal, there should be resistance to orthogonal responses at near locations. The L, T junctions and end-stops are computed in a similar fashion. The T junctions, for all eight orientations k for n = 0 to 7 are given as: (7.43) 24 kn n n trr r + + = ⋅⋅ The eight L junctions are given as: (7.44) 2 kn n lrr + = ⋅ The eight end stops are given as: (7.45) kn s r = To recap, there are only two X junction computations since it is symmetric with respect the central location, but there are eight T, L and end-stop computations. As a final note, the end-stop portion of the detector is probably not very good for this purpose since 264 it does not buttress a collinear counter response. Also, without suppression, the detectors are not orthogonal. For instance, the L junction detector can pick up X junctions as well. D.1 Junction Channel Source Code namespace { // Trivial helper function for junctionFilterPartial() template <class DstItr, class SrcItr> inline void JFILT(DstItr dptr, SrcItr sptr, const int w, const int jmax, const int imax) { for(int j = 0; j < jmax; ++j) { for (int i = 0; i < imax; ++i) dptr[i] *= sptr[i]; sptr += w; dptr += w; } } } template <class T> Image<T> junctionFilterPartial(const Image<T>& i0, const Image<T>& i45, const Image<T>& i90, const Image<T>& i135, const bool r[8], const int dx = 6, const int dy = 6, const bool useEuclidDiag = false) { GVX_TRACE(__PRETTY_FUNCTION__); // the preamble is identical to junctionFilterFull: Image<T> result(i0.getDims(), ZEROS); const int w = i0.getWidth(), h = i0.getHeight(); int dx_diag, dy_diag; // Compute the diagonal offsets if needed as a euclidian distance // from the center in dx,dy or should we just put it on a "grid" if(useEuclidDiag) { dx_diag = (int)round(fastSqrt((dx*dx)/2.0f)); dy_diag = (int)round(fastSqrt((dy*dy)/2.0f)); } else { dx_diag = dx; dy_diag = dy; } 265 // non diagonal elements const int o0 = dx; const int o2 = -dy*w; const int o4 = -dx; const int o6 = dy*w; // diagonal elements const int o1 = dx_diag - dy_diag*w; const int o3 = -dx_diag - dy_diag*w; const int o5 = -dx_diag + dy_diag*w; const int o7 = dx_diag + dy_diag*w; const int offset = dx + dy*w; typename Image<T>::iterator const rpp = result.beginw() + offset; // compute the number of relevant features (for normalization): int nr = 0; for (int i = 0; i < 8; i++) if (r[i]) nr++; const int imax = w - dx*2; const int jmax = h - dy*2; // initialize the valid portion of the response array to 1.0's { typename Image<T>::iterator rp = rpp; for (int j = 0; j < jmax; ++j) { for (int i = 0; i < imax; ++i) rp[i] = T(1.0); rp += w; } } // loop over the bulk of the images, computing the responses of // the junction filter: { if (r[0]) JFILT(rpp, i0.begin() + o0 + offset, w, jmax, imax); if (r[1]) JFILT(rpp, i45.begin() + o1 + offset, w, jmax, imax); if (r[2]) JFILT(rpp, i90.begin() + o2 + offset, w, jmax, imax); if (r[3]) JFILT(rpp, i135.begin() + o3 + offset, w, jmax, imax); if (r[4]) JFILT(rpp, i0.begin() + o4 + offset, w, jmax, imax); if (r[5]) JFILT(rpp, i45.begin() + o5 + offset, w, jmax, imax); if (r[6]) JFILT(rpp, i90.begin() + o6 + offset, w, jmax, imax); if (r[7]) JFILT(rpp, i135.begin() + o7 + offset, w, jmax, imax); } // normalize the responses by the number of relevant features, // optimizing by using sqrt() or cbrt() where possible (this gives // an average speedup of ~3x across all junction filter types): { typename Image<T>::iterator rp = rpp; const double power = 1.0 / nr; for(int j = 0; j < jmax; ++j) { 266 for(int i = 
0; i < imax; ++i) rp[i] = T(pow(rp[i], power)); rp += w; } } return result; } 267 Appendix E: RGBY and CIE Lab Color Conversion E.1 RGBY Color Conversion RGBY is computed in a very basic manner, but does not have a reverse conversion back to RGB color space. The fundamental purpose of RGBY color space is to create red/green and blue/yellow color opponents. The first step is to compute a luminance factor from the original r, g and b values in the image: (7.46) 255 fac l rg b = + + This is then used normalize the chroma by luminance: (7.47) nfac rrl = ⋅ (7.48) nfac g gl = ⋅ (7.49) nfac bbl = ⋅ The R, G, B, and Y values are computed as: (7.50) 2 nn n g b Rr + =−− (7.51) 2 nn n rb Gg + =−− (7.52) 2 nn n rg Bb + =−− 268 (7.53) ( ) 2 nn n n n Yr g r g b =+ − ⋅ − + The red/green and blue/yellow color opponents are computed by basic subtraction: (7.54) RGR G = − (7.55) BYB Y = − E.2 CIE Lab Color Conversion CIE Lab color (MacAdam, 1935, Adams, 1942, McLaren, 1976) requires two steps to convert from RGB. It is also reversible so that CIE Lab images can be converted back to RGB. The first step is to convert RGB color into CIE XYZ color space (Wright, 1929, Smith & Guild, 1931). This is done by a linear matrix transformation of RGB color space into XYZ color space. Note that there are variations on the XYZ matrix. Here we use the values supplied by OpenCV. (7.56) 0.412453 0.357580 0.1804423 0.212671 0.715160 0.072169 0.019334 0.119193 0.950227 Xr g b Yr g b Zr g b =⋅ + ⋅ + ⋅ =⋅ + ⋅ + ⋅ =⋅ + ⋅ + ⋅ 269 Figure E.1: The CIE 1931 color space chromaticity diagram from Wikipedia (http://en.wikipedia.org/). The outer curved boundary is the spectral (or monochromatic) locus, with wavelengths shown in nanometers. Note that the colors depicted depend on the color space of the device on which you are viewing the image, and no device has a gamut large enough to present an accurate representation of the chromaticity at every position. 270 Here r,g and b are the original red, green and blue color values from the RGB color space image. The X, Y and Z values are a derivation of Red, Green and Blue set to match human perception of colors (Figure E.1). However, the output values in XYZ space are difficult to interpret so it was later refined into CIE Lab color (Figure E.2) which processes the XYZ value into a Luminance L value, a red/green a value and a blue/yellow b value. A notable property of CIE Lab is that the L value relates to perceived luminance. So for instance a yellow patch with the same real luminance as a blue patch may have a higher L luminance due to the psychological perception of yellow being brighter than blue. Figure E.2: On the left is the ideal CIE Lab Color Sphere which shows the range of values for Lab and how they map to visible colors. The image on the right shows how the Lab colors can be constrained depending on the gamut used, in this case Adobe RGB 1998 (gamut rendered with Gamutvision 1.3.7 http://www.gamutvision.com/). Thus, the ideal sphere of color is typically not covered and the CIE Lab color space is covered by a more deformed object like the one on the right. The XYZ to Lab conversion is described in several different ways depending on implementation, but yields the same results. In addition to the XYZ input, a “whitepoint” 271 is needed. This is in general used to normalize the process. So for instance, a standard 8 bit RGB image would supply a white point of 255,255,255. 
Here we will assume the 8 bit RGB image which gives the white point computation as: (7.57) 255 0.412453 255 0.357580 255 0.1804423 255 0.212671 255 0.715160 255 0.072169 255 0.019334 255 0.119193 255 0.950227 n n n X Y Z =⋅ + ⋅ + ⋅ = ⋅ +⋅ +⋅ =⋅ + ⋅ + ⋅ We then compute a set of X,Y and Z values based on conditions: (7.58) 13 if 0.00885654 16 7.78704 otherwise 116 nn t n XX XX X X X ⎧ ⎛⎞ ⎪ > ⎜⎟ ⎪ ⎝⎠ = ⎨ ⎪ ⋅+ ⎪ ⎩ (7.59) 13 if 0.00885654 16 7.78704 otherwise 116 nn t n YY YY Y Y Y ⎧ ⎛⎞ ⎪ > ⎜⎟ ⎪ ⎝⎠ = ⎨ ⎪ ⋅+ ⎪ ⎩ (7.60) 13 if 0.00885654 16 7.78704 otherwise 116 nn t n ZZ ZZ Z Z Z ⎧ ⎛⎞ ⎪ > ⎜⎟ ⎪ ⎝⎠ = ⎨ ⎪ ⋅+ ⎪ ⎩ Then Lab is computed as: (7.61) 116 16 t LY = ⋅− (7.62) ( ) 500 tt aXY = ⋅− 272 (7.63) ( ) 200 tt bYZ =⋅ − These can be closely normalized between 0 and 255 by taking into to account the gamut limits of RGB color space: (7.64) () 255 116 16 100 t LY =⋅− ⋅ (7.65) ( ) 500 87 255 186 tt XY a ⋅− + =⋅ (7.66) ( ) 200 108 255 203 tt YZ b ⋅− + = ⋅ This yields Lab values that are very close to falling between 0 and 255. However if using another input color space other than 8-bit RGB, (7.64) - (7.66) will need to be normalized by different values. 273 Appendix F: HSV Color Conversion Source HSV color space (Smith, 1978) (http://en.wikipedia.org/wiki/HSV_color_space) transforms standard RGB (http://en.wikipedia.org/wiki/RGB_color_space) (Red, Green, Blue) color space into a new color space comprised of Hue, Saturation and Intensity (Value) (Figure F.1). Figure F.1: This is a standard representation of HSV color space as a cone. As you move outward, saturation S is larger and colors are more pure. As you move down, V decreases and colors get darker. Moving around the radius of cone changes the Hue color along the rainbow. Image sourced from The Wikipedia (http://en.wikipedia.org/). 274 • The Hue (http://en.wikipedia.org/wiki/Hue) component can be thought of as the actual color of the object. That is, if you looked at some object, what color would it seem to you? • Saturation (http://en.wikipedia.org/wiki/Saturation_%28color_theory%29) is a measure of purity. Whereas Hue would say that an object is green, Saturation would tell us how green it actually is. • Intensity (http://en.wikipedia.org/wiki/Color_value), which is also referred more accurately as value tells us how light the color is. We can see how RGB color is broken down into HSV color by inspecting the code below. F.1 RGB to HSV Transformation HSV is converted from RGB in a simple way. 1. First compute the base H,S and V a. H – Hue: if Color1 is Max then H = ( Color2 - Color3 ) / ( Max - Min ) b. S – Saturation: S = Max - Min / Max c. V – Value: V = max of either R,G or B 2. Normalize the values. a. General H,S,V has ranges: i. 0 <= Hue <= 360 ii. 0 <= Sat <= 100 iii. 0 <= Val <= 255 275 3. H - Hue is set based on the dominant color. It has three different basic ranges based on what that color is. a. if Red H *= 60 b. if Green H += 2; H *= 60 c. if Blue H += 4; H *= 60 In essence this forces the recognizable 0-360 value seen in hue 4. S - May be S *= 100, in some cases it can also be normalized from 0 to 1. 5. V - is V *= 255 • There are some other bells and whistles to take care of some divide by zero conditions and other such singularities. For instance if Max = Min then we need to fudge hue. • Note: In the source example we set NORM to true for normal HSV conversions. This will make the values use normal HSV range. Setting NORM to false forces H,S and V to range between 0 and 1. This is useful if next you will convert to H2SV color space. 
Appendix F: HSV Color Conversion Source

HSV color space (Smith, 1978) (http://en.wikipedia.org/wiki/HSV_color_space) transforms standard RGB (http://en.wikipedia.org/wiki/RGB_color_space) (Red, Green, Blue) color space into a new color space comprised of Hue, Saturation and Intensity (Value) (Figure F.1).

Figure F.1: This is a standard representation of HSV color space as a cone. As you move outward, saturation S is larger and colors are more pure. As you move down, V decreases and colors get darker. Moving around the radius of the cone changes the Hue along the rainbow. Image sourced from Wikipedia (http://en.wikipedia.org/).

• The Hue (http://en.wikipedia.org/wiki/Hue) component can be thought of as the actual color of the object. That is, if you looked at some object, what color would it seem to you?
• Saturation (http://en.wikipedia.org/wiki/Saturation_%28color_theory%29) is a measure of purity. Whereas Hue would say that an object is green, Saturation would tell us how green it actually is.
• Intensity (http://en.wikipedia.org/wiki/Color_value), which is more accurately referred to as value, tells us how light the color is.

We can see how RGB color is broken down into HSV color by inspecting the code below.

F.1 RGB to HSV Transformation

HSV is converted from RGB in a simple way.

1. First compute the base H, S and V:
   a. H - Hue: if Color1 is Max, then H = ( Color2 - Color3 ) / ( Max - Min )
   b. S - Saturation: S = ( Max - Min ) / Max
   c. V - Value: V = the max of R, G and B
2. Normalize the values.
   a. General H, S, V ranges are:
      i. 0 <= Hue <= 360
      ii. 0 <= Sat <= 100
      iii. 0 <= Val <= 255
3. H - Hue is set based on the dominant color. It has three different basic ranges based on what that color is.
   a. if Red: H *= 60
   b. if Green: H += 2; H *= 60
   c. if Blue: H += 4; H *= 60
   In essence this forces the recognizable 0-360 value seen in hue.
4. S - is usually S *= 100; in some cases it can also be normalized from 0 to 1.
5. V - is V *= 255.

• There are some other bells and whistles to take care of divide by zero conditions and other such singularities. For instance, if Max = Min then we need to fudge hue.
• Note: In the source example we set NORM to true for normal HSV conversions. This will make the values use the normal HSV range. Setting NORM to false forces S and V to range between 0 and 1. This is useful if you will next convert to H2SV color space.
• R, G and B are assumed to be between 0 and 255.

F.1.1 HSV Transformation C / C++ Code

The source is given in macro form (http://en.wikipedia.org/wiki/C_preprocessor), but it can be taken out of this form and placed inside a class or function fairly easily. Replace:

#define PIX_RGB_TO_HSV_COMMON(R,G,B,H,S,V,NORM)

with something like:

void pixRGBtoHSVCommon(const double R, const double G, const double B, double& H, double& S, double& V, const bool NORM)

Then get rid of all the line-continuation backslashes "\". Floats can be used in place of doubles; it depends on whether you want precision or speed. The boolean value NORM is used to decide whether to output traditional HSV values, where 0 <= S <= 100 and 0 <= V <= 255. Otherwise we keep the values at a norm where 0 <= S <= 1 and 0 <= V <= 1. The latter is faster for executing your own code, but the former should be used for compatibility.

// ######################################################################
// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro RGB to HSV

#define PIX_RGB_TO_HSV_COMMON(R,G,B,H,S,V,NORM) \
/* Blue is the dominant color */ \
if((B > G) && (B > R)) \
{ \
  /* Value is set as the dominant color. */ \
  V = B; \
  if(V != 0) \
  { \
    double min; \
    if(R > G) min = G; \
    else min = R; \
    /* Delta is the difference between the most dominant color and the */ \
    /* least dominant color. This will be used to compute saturation.  */ \
    const double delta = V - min; \
    if(delta != 0) \
      { S = (delta/V); H = 4 + (R - G) / delta; } \
    else \
      { S = 0; H = 4 + (R - G); } \
    /* Hue is just the difference between the two least dominant colors  */ \
    /* offset by the dominant color. That is, here 4 puts hue in the blue */ \
    /* range. Then red and green just tug it one way or the other. Notice */ \
    /* if red and green are equal, hue will stick squarely on blue.       */ \
    H *= 60; if(H < 0) H += 360; \
    if(!NORM) V = (V/255); \
    else S *= (100); \
  } \
  else \
    { S = 0; H = 0; } \
} \
/* Green is the dominant color. */ \
else if(G > R) \
{ \
  V = G; \
  if(V != 0) \
  { \
    double min; \
    if(R > B) min = B; \
    else min = R; \
    const double delta = V - min; \
    if(delta != 0) \
      { S = (delta/V); H = 2 + (B - R) / delta; } \
    else \
      { S = 0; H = 2 + (B - R); } \
    H *= 60; if(H < 0) H += 360; \
    if(!NORM) V = (V/255); \
    else S *= (100); \
  } \
  else \
    { S = 0; H = 0; } \
} \
/* Red is the dominant color. */ \
else \
{ \
  V = R; \
  if(V != 0) \
  { \
    double min; \
    if(G > B) min = B; \
    else min = G; \
    const double delta = V - min; \
    if(delta != 0) \
      { S = (delta/V); H = (G - B) / delta; } \
    else \
      { S = 0; H = (G - B); } \
    H *= 60; if(H < 0) H += 360; \
    if(!NORM) V = (V/255); \
    else S *= (100); \
  } \
  else \
    { S = 0; H = 0; } \
}

F.2 HSV to RGB Transformation

1. Get some easy work done:
   a. If Value V = 0, then we are done; the color is black, so set R, G and B to 0.
   b. If Saturation S = 0, no color is dominant; set to some gray color.
2. Hue is valued from 0 to 360; we chunk the space into 60 degree increments. At each 60 degrees we use a slightly different formula. In general we assign and set R, G and B exclusively as:
   a. We set the most dominant color:
      i. If H is 300 to 60, set R = V
      ii. If H is 60 to 180, set G = V
      iii. If H is 180 to 300, set B = V
   b. The least dominant color is set as: pv = Value * ( 1 - Saturation )
   c. The last remaining color is set as either:
      i. qv = Value * ( 1 - Saturation * ( (Hue/60) - floor(Hue/60) ) )
      ii. tv = Value * ( 1 - Saturation * ( 1 - ( (Hue/60) - floor(Hue/60) ) ) )
3. Clean up: here we allow i to be -1 or 6 just in case we have a very small floating point error; we can also sometimes deal with an undefined input this way.
4. Normalize R, G and B from 0 to 255.
5. Note: S and V are normalized between 0 and 1 in this example. H keeps its normal 0 to 360 range.

F.2.1 HSV to RGB Transformation C/C++ Code

// ######################################################################
// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro HSV to RGB

#define PIX_HSV_TO_RGB_COMMON(H,S,V,R,G,B) \
if( V == 0 ) \
{ R = 0; G = 0; B = 0; } \
else if( S == 0 ) \
{ \
  R = V; \
  G = V; \
  B = V; \
} \
else \
{ \
  const double hf = H / 60.0; \
  const int i = (int) floor( hf ); \
  const double f = hf - i; \
  const double pv = V * ( 1 - S ); \
  const double qv = V * ( 1 - S * f ); \
  const double tv = V * ( 1 - S * ( 1 - f ) ); \
  switch( i ) \
  { \
    /* Red is the dominant color */ \
    case 0: \
      R = V; \
      G = tv; \
      B = pv; \
      break; \
    /* Green is the dominant color */ \
    case 1: \
      R = qv; \
      G = V; \
      B = pv; \
      break; \
    case 2: \
      R = pv; \
      G = V; \
      B = tv; \
      break; \
    /* Blue is the dominant color */ \
    case 3: \
      R = pv; \
      G = qv; \
      B = V; \
      break; \
    case 4: \
      R = tv; \
      G = pv; \
      B = V; \
      break; \
    /* Red is the dominant color */ \
    case 5: \
      R = V; \
      G = pv; \
      B = qv; \
      break; \
    /* Just in case we overshoot on our math by a little, we put these */ \
    /* here. Since it's a switch, it won't slow us down at all.        */ \
    case 6: \
      R = V; \
      G = tv; \
      B = pv; \
      break; \
    case -1: \
      R = V; \
      G = pv; \
      B = qv; \
      break; \
    /* The color is not defined; we should throw an error. */ \
    default: \
      LFATAL("i Value error in Pixel conversion, Value is %d",i); \
      break; \
  } \
} \
R *= 255.0F; \
G *= 255.0F; \
B *= 255.0F;

Appendix G: H2SV Color Conversion Source

The primary difficulty with HSV color space is that Hue is modulus. This forces red hues to be singular around the value 0. This can be a problem, for instance, if one wants to obtain the mean color of an object. If the object is red-ish, then half of the hue distribution will have a value between 330 and 360 and the other half will be between 0 and 30. Thus, without windowing, we have a bimodal distribution where we should not have one. One solution, though not very elegant, is to window around hue; that is, we slide the scale to put red in the middle. However, this too has its problems, particularly if we do in fact have a genuinely bimodal distribution. We can get around this by converting hue, which is in radial (polar) coordinates, to Cartesian coordinates (Figure G.1). Thus, we break Hue into an x and a y component, which we term H2 and H1 respectively. Note that two variants on H2SV are described here, H2SV1 and H2SV2.

Figure G.1: The transformation from Hue in HSV to H1 and H2 components is shown here. Additionally, it is shown that different types of biases can be applied to the transformation. For instance, a slight bias turns the H1 and H2 components into Red/Green (hollow box line) and Blue/Yellow (solid box line) color opponents.

There are several advantages to using H2SV color space. For instance, H2SV color space is designed with video streams and tracking in mind. The transformation that is used moves smoothly with changes in the environment, so as an object moves around and its colors change slightly, the change is linearly related to the change in H2SV color space. Additionally, the H2SV2 variant, which uses Red/Green and Blue/Yellow color opponents, has the advantage of being invertible. As such, images can be transformed into the H2SV2 Red/Green and Blue/Yellow opponents, undergo some image processing, and then be transformed back to HSV and then to RGB color. The H2SV2 color opponents have also been designed to align with human perception of color opponents. This allows H2SV2 to be used in biologically inspired computer vision applications.
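Before looking at the individual transformations, the following is a minimal usage sketch, not part of the toolkit, of the round trip just described. It assumes the macros from Appendix F above and the H2SV2 macros defined in the sections below have been included, and the wrapper name is purely illustrative. NORM is set to false so that S and V stay in the 0 to 1 range, as recommended before converting to H2SV.

#include <math.h>   // floor(), used by the HSV to RGB macro

// Illustrative round trip: RGB -> HSV -> H2SV2, process the opponent
// channels, then H2SV2 -> HSV -> RGB. r, g and b are 0-255 on input and
// are overwritten with the reconstructed 0-255 values on output.
void roundTripExample(double& r, double& g, double& b)
{
  double H, S, V, H1, H2;
  PIX_RGB_TO_HSV_COMMON(r, g, b, H, S, V, false);   // RGB -> HSV (S, V in 0..1)
  PIX_HSV_TO_H2SV2_COMMON(H, H1, H2);               // HSV -> H2SV2

  // ... image processing on H1 (blue/yellow), H2 (red/green), S and V ...

  PIX_H2SV2_TO_HSV_SIMPLE_COMMON(H1, H2, H);        // H2SV2 -> HSV
  PIX_HSV_TO_RGB_COMMON(H, S, V, r, g, b);          // HSV -> RGB (0..255)
}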
G.1 HSV to H2SV Transformation

G.1.1 HSV to H2SV1 Variant

We convert HSV to H2SV1 (the same approach applies to H2SV2) by converting Hue, which is in radial coordinates (0 - 360), to Cartesian coordinates. The precise coordinate transform is flexible, so long as it is invertible. Here we treat r (hue) as falling along a straight line (P=1); we could also treat it as if it lies on a circle (P=2). All final values normalize from 0 to 1.

// ######################################################################
// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro HSV to H2SV1

#define PIX_HSV_TO_H2SV1_COMMON(H,H1,H2) \
if(H > 180) \
{ \
  H2 = ((H - 180)/180); \
  if(H > 270) H1 = ((H - 270)/180); \
  else H1 = (1 - (H - 90)/180); \
} \
else \
{ \
  H2 = (1 - H/180); \
  if(H > 90) H1 = (1 - (H - 90)/180); \
  else H1 = (0.5 + H/180); \
}

G.1.2 HSV to H2SV2 Variant

We convert HSV to H2SV2 in the same way, by converting Hue from radial coordinates (0 - 360) to Cartesian coordinates. The precise coordinate transform is flexible, so long as it is invertible. Here we treat r (hue) as falling along a straight line (P=1); we could also treat it as if it lies on a circle (P=2). This variant is designed to make H1 and H2 mimic Red/Green and Blue/Yellow opponents. Thus:

• H1 is 0 at Blue and 1 at Yellow
• H2 is 0 at Green and 1 at Red
• All final values normalize from 0 to 1

Note: We can also use these macros on HSL color space (http://en.wikipedia.org/wiki/HSL_color_space) since HSV and HSL have the exact same basis for hue.

// ######################################################################
// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro HSV to H2SV2

#define PIX_HSV_TO_H2SV2_COMMON(H,H1,H2) \
if(H > 120) \
{ \
  H2 = ((H - 120)/240); \
  if(H > 240) H1 = ((H - 240)/180); \
  else H1 = (1 - (H - 60)/180); \
} \
else \
{ \
  H2 = (1 - H/120); \
  if(H > 60) H1 = (1 - (H - 60)/180); \
  else H1 = ((2.0/3.0) + H/180); \
}

G.2 H2SV to HSV Simple Transformation

A simple transformation will convert H1 and H2 back to HSV Hue. However, if you perform operations in H2SV color space, such as image convolutions, H1 and H2 cannot be transformed into Hue using the simple transformation. For that, one needs the robust transformation shown next.

G.2.1 H2SV1 to HSV Simple

Convert H2SV1 to HSV using a simple, quick method. This just reverse maps the H1 and H2 components back to Hue in HSV using a simple inversion.

// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro H2SV1 to HSV

#define PIX_H2SV1_TO_HSV_SIMPLE_COMMON(H1,H2,H) \
if(H1 > 0.5) \
  if(H2 > 0.5) H = 180 * H1 - 90; \
  else H = 90 + 180 * (1 - H1); \
else \
  if(H2 <= 0.5) H = 90 + 180 * (1 - H1); \
  else H = 270 + 180 * H1;

G.2.2 H2SV2 to HSV Simple

Convert H2SV2 to HSV using a simple, quick method. This just reverse maps the H1 and H2 components back to Hue in HSV using a simple inversion.

// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro H2SV2 to HSV

#define PIX_H2SV2_TO_HSV_SIMPLE_COMMON(H1,H2,H) \
if(H1 > 2.0/3.0) \
  if(H2 > 0.5) H = 180 * H1 - 120; \
  else H = 60 + 180 * (1 - H1); \
else \
  if(H2 <= 0.5) H = 60 + 180 * (1 - H1); \
  else H = 240 + 180 * H1;
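As a quick illustration of why the Cartesian form helps with the red wrap-around problem described at the start of this appendix, the following sketch (not from the toolkit; the values in the comments are approximate) averages two red-ish hues in H2SV2 and maps the result back with the simple inverse. Strictly speaking, once H1 and H2 have been blended the robust transformation in G.3 below is the appropriate inverse, since it also rescales saturation.

// Averaging hues of 350 and 10 degrees directly gives 180 (cyan), while
// averaging their H2SV2 components and converting back recovers a red hue.
double meanRedExample()
{
  double h1a, h2a, h1b, h2b, hMean;
  PIX_HSV_TO_H2SV2_COMMON(350.0, h1a, h2a);   // h1a ~ 0.61, h2a ~ 0.96
  PIX_HSV_TO_H2SV2_COMMON( 10.0, h1b, h2b);   // h1b ~ 0.72, h2b ~ 0.92
  const double h1m = (h1a + h1b) / 2.0;       // ~ 2/3
  const double h2m = (h2a + h2b) / 2.0;       // ~ 0.94
  PIX_H2SV2_TO_HSV_SIMPLE_COMMON(h1m, h2m, hMean);
  return hMean;   // ~ 0 (equivalently 360) degrees: red, as it should be
}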
G.3 H2SV to HSV Robust Transformation

In this example we show how to convert back from H2SV2 color space into HSV using a robust method. It becomes necessary to use this method if H2SV colors are blended. This is because the x and y coordinates move off the track from where they were initially transformed, so they do not map backwards perfectly. Instead, we triangulate the x and y coordinate and map it back to where it should be in the simple H2SV scheme. That is, we figure out how to put it back on its track by projecting where it should be. Again, this method is only necessary if either of the H1 or H2 components has been changed independently of the other.

Here we convert H2SV2 to HSV using a robust method that allows us to deal with H1 and H2 having been adjusted independently. If they are never adjusted independently, then we can use the simple version. The main differences from the simple version are that:

1. We compute xp and replace H1 with it.
2. We compute a new saturation term S_NEW.

What makes this robust is that the H1 and H2 coordinates do not have to match up with the original unit coordinates. Instead we compute a slope term and figure out where it should intersect the original unit coordinates. Saturation may be reduced if we are too far from a viable color coordinate and have to transform too much.

G.3.1 General Computations

m: the slope of the H1/H2 line
xp: the adjusted H1 (x value) - useful if H1 was adjusted independently of H2
yp: the adjusted H2 (y value)
d: the difference between the ideal (xp,yp) value and the real (H1,H2)
ad: the length of (xp,yp) from the origin
sfac: an adjustment applied to saturation if d is large - useful if H1 and H2 are close to the origin, in which case saturation should be reduced

G.3.2 C / C++ Code for Robust Transformation

// ######################################################################
// T. Nathan Mundhenk
// mundhenk@usc.edu
// C/C++ Macro H2SV to HSV

#define PIX_H2SV2_TO_HSV_ROBUST_COMMON(H1,H2,S,H,S_NEW) \
double sfac = 1; \
const double x = H1 - 2.0/3.0; \
/* Hue is between red and green, that is, 0 to 120 on the Hue interval of 360. */ \
if(x > 0) \
{ \
  const double y = H2 - 0.5; \
  const double m = y/x; \
  /* Hue is between 0 to 60 on the Hue interval of 360. */ \
  if(y > 0) \
  { \
    const double xp = 0.5/(m + 3.0/2.0); \
    if(xp > x) \
    { \
      const double yp = 0.5 - 0.5*xp/(1.0/3.0); \
      const double d = sqrt(pow(xp - x,2) + pow(yp - y,2)); \
      const double ad = sqrt(pow(xp,2) + pow(yp,2)); \
      sfac = (ad - d)/ad; \
    } \
    H = 180 * (xp + 2.0/3.0) - 120; \
  } \
  /* Hue is between 60 to 120 on the Hue interval of 360. */ \
  else \
  { \
    const double xp = -0.5/(m - 3.0/2.0); \
    if(xp > x) \
    { \
      const double yp = -1.0 * (0.5 - 0.5*xp/(1.0/3.0)); \
      const double d = sqrt(pow(xp - x,2) + pow(yp - y,2)); \
      const double ad = sqrt(pow(xp,2) + pow(yp,2)); \
      sfac = (ad - d)/ad; \
    } \
    H = 60 + 180 * (1 - (xp + 2.0/3.0)); \
  } \
} \
/* Hue is between green and red, that is, 120 to 360 on the Hue interval of 360. */ \
else \
{ \
  if(x != 0) \
  { \
    const double y = H2 - 0.5; \
    const double m = y/x; \
    /* Hue is between 120 to 240 on the Hue interval of 360. */ \
    if(y <= 0) \
    { \
      const double xp = -0.5/(m + 0.5/(2.0/3.0)); \
      if(xp < x) \
      { \
        const double yp = -1.0 * (0.5 + 0.5*xp/(2.0/3.0)); \
        const double d = sqrt(pow(xp - x,2) + pow(yp - y,2)); \
        const double ad = sqrt(pow(xp,2) + pow(yp,2)); \
        sfac = (ad - d)/ad; \
      } \
      H = 60 + 180 * (1 - (xp + 2.0/3.0)); \
    } \
    /* Hue is between 240 to 360 on the Hue interval of 360. */ \
    else \
    { \
      const double xp = 0.5/(m - 0.5/(2.0/3.0)); \
      if(xp < x) \
      { \
        const double yp = 0.5 + 0.5*xp/(2.0/3.0); \
        const double d = sqrt(pow(xp - x,2) + pow(yp - y,2)); \
        const double ad = sqrt(pow(xp,2) + pow(yp,2)); \
        sfac = (ad - d)/ad; \
      } \
      H = 240 + 180 * (xp + 2.0/3.0); \
    } \
  } \
  else \
    H = 240; \
} \
S_NEW = S * sfac;

Appendix H: Selected Figure Graphing Commands for Mathematica

Selected figures and graphs can be reproduced using Mathematica 6.0 with the given command lines.

Figure 6.2

Plot[{PDF[NormalDistribution[0, 1], x], PDF[NormalDistribution[5, 10], x]}, {x, -4, 4}, Filling -> Bottom, PlotRange -> Full]

Plot[Evaluate[Table[PDF[NormalDistribution[n - 1, 1], x], {n, 2}]], {x, -4, 4}, Filling -> Bottom, PlotRange -> Full]

Plot[Evaluate[Table[PDF[NormalDistribution[0, n], x], {n, 2}]], {x, -4, 4}, Filling -> Bottom, PlotRange -> Full]

Plot[Evaluate[Table[PDF[NormalDistribution[0, 1], x], {n, 2}]], {x, -4, 4}, Filling -> Bottom, PlotRange -> Full]

Plot3D[1/2.*Erfc[m/(Sqrt[2*(1 + s^2)])], {m, 0, 2*Pi}, {s, 0, 10}, PlotRange -> Full]

Figure B.1

Plot3D[Evaluate[Table[PDF[GammaDistribution[a, (1/b)], x], {x, 6}]], {a, 0.1, Pi}, {b, 0.1, Pi}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Plot3D[Evaluate[Table[PDF[GammaDistribution[a, (1/b)], x], {a, 6}]], {x, 0.1, Pi}, {b, 0.1, Pi}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Plot3D[Evaluate[Table[PDF[GammaDistribution[a, (1/b)], x], {b, 6}]], {a, 0.1, Pi}, {x, 0.1, Pi}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Plot3D[Evaluate[Table[-a + a*Log[(b + 1)/b] + A*(b/(b + 1)) + LogGamma[A] - LogGamma[a] + (a - A)*PolyGamma[a], {b, 6}]], {a, 1, 10}, {A, 1, 10}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Figure B.2

Plot3D[Evaluate[Table[-a + a*Log[(b + 1)/b] + A*(b/(b + 1)) + LogGamma[A] - LogGamma[a] + (a - A)*PolyGamma[a], {b, 6}]], {a, 1, 10}, {A, 1, 10}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Figure B.3

Plot[Evaluate[Table[-PDF[NormalDistribution[0, n + 1], x] + PDF[NormalDistribution[0, 1], x], {n, 4}]], {x, -5, 5}, Filling -> Axis, PlotRange -> Full, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Plot3D[-(PDF[NormalDistribution[0, 2], x]*PDF[NormalDistribution[0, 2], y]) + (PDF[NormalDistribution[0, 1], x]*PDF[NormalDistribution[0, 1], y]), {x, -5, 5}, {y, -5, 5}, PlotRange -> Full, ColorFunction -> Function[{x, y, z}, Hue[0.33*z + 0.09]], PerformanceGoal -> "Quality", PlotPoints -> 50, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 24]]

Figure C.1

Plot[{PDF[NormalDistribution[1, 1], x], PDF[NormalDistribution[0, 1], x]}, {x, -4, 4}, PlotRange -> Full]
Plot[{0, Evaluate[Table[-Log[PDF[NormalDistribution[n, 1], x]]*PDF[NormalDistribution[0, 1], x] + Log[PDF[NormalDistribution[0, 1], x]]*PDF[NormalDistribution[0, 1], x], {n, 3}]]}, {x, -4, 4}, PlotRange -> Full]

Plot[{0, Evaluate[Table[-Log[PDF[NormalDistribution[n, 1], x]]*PDF[NormalDistribution[0, 1], x] + Log[PDF[NormalDistribution[0, 1], x]]*PDF[NormalDistribution[0, 1], x], {n, 3}]]}, {x, -4, 4}, PlotRange -> Full]

Figure C.2

bb = 3.

Table[Plot3D[Evaluate[Table[(-a + a*Log[(bb + 1)/bb] + A*(bb/(bb + 1)) + LogGamma[A] - LogGamma[a] + (a - A)*PolyGamma[a]) + (-aa + aa*Log[(bb + 1)/bb] + AA*(bb/(bb + 1)) + LogGamma[AA] - LogGamma[aa] + (aa - AA)*PolyGamma[aa]), {AA, 6}]], {a, 1, 6}, {A, 1, 6}, PlotRange -> Full, ColorFunction -> "DarkRainbow", PlotStyle -> Directive[Opacity[0.5], Red], AxesLabel -> Automatic, ImageSize -> {1024, 1024}, LabelStyle -> Directive[FontSize -> 48]], {aa, 6}]
Abstract
What draws in human attention, and can we create computational models of it which work the same way? Here we explore this question with several attentional models and applications of them. Each is designed to address a fundamental function of attention missing from the original saliency model designed by Itti and Koch. These include temporally based attention and attention from non-classical feature interactions. Additionally, attention is utilized in an applied setting for the purposes of video tracking. Attention for non-classical feature interactions is handled by a model called CINNIC. It faithfully implements a model of contour integration in visual cortex. It is able to integrate illusory contours of unconnected elements such that the contours "pop out" as they are supposed to, and it matches the performance of human observers. Temporal attention is discussed in the context of an implementation and extensions of a model of surprise. We show that surprise predicts subject performance on natural image Rapid Serial Visual Presentation (RSVP) well and gives us a good idea of how an attention gate works in the human visual cortex. The attention gate derived from surprise also suggests how visual information is passed on for further processing in later stages of the human brain. We also discuss how to extend the model of surprise using a Metric of Attention Gating (MAG) as a baseline for model performance. This allows us to find model components and parameters which better explain the attentional blink in RSVP.