Gaze following in natural scenes: Quantifying its role in eye movements, towards a more complete model of free-viewing behavior

Dissertation by Daniel Parks

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Friday 15th May, 2015

Neuroscience Graduate Program
University of Southern California
Los Angeles, CA, USA

Acknowledgments

I would like to thank my mentor, Laurent Itti, for allowing me the freedom and financial support to complete this journey; my advisors, Bosco Tjan, Shrikanth Narayanan, Norberto Grzywacz, and Bartlett Mel, for their help and insight; and my collaborators, Ali Borji, Kirsten O'Hearn, and Bea Luna, who have inspired and guided me along the way.

Contents

1 Introduction
  1.1 Aims
2 Background
  2.1 Gaze following background
    2.1.1 Psychophysical studies of head pose and eye gaze following
    2.1.2 Developmental studies of gaze following
    2.1.3 Neural basis of gaze following
    2.1.4 Gaze estimation and theory of mind
    2.1.5 Gaze following and fixation prediction
    2.1.6 Gaze following and joint attention
  2.2 Fixation prediction background
    2.2.1 Saliency
    2.2.2 Bottom-up fixation prediction with fixed weights
    2.2.3 Bottom-up fixation prediction with learned weights or maps
    2.2.4 Task driven fixation prediction
3 Quantifying effect of actor gaze on observer fixations using a controlled dataset
  3.1 Experiment one: gaze following vs saliency with controlled pairs dataset
    3.1.1 Stimuli
    3.1.2 Observers
    3.1.3 Apparatus and procedure
    3.1.4 Analysis and results
  3.2 Experiment two: gaze following vs saliency with Flickr dataset
    3.2.1 Stimuli
    3.2.2 Observers
    3.2.3 Apparatus and procedure
    3.2.4 Analysis and results
    3.2.5 Analysis of gaze following
  3.3 Conclusions
4 Modeling gaze following, head pose, and saliency
  4.1 Learning head pose and gaze following spatial probability maps
  4.2 Conditioning cue combination on type of fixation location
  4.3 Learned transition probabilities
  4.4 Model prediction results using annotations
  4.5 Head pose detection model
  4.6 Detection based fixation prediction results
  4.7 Discussion
  4.8 Conclusions
5 Using DWOC to evaluate autism spectrum disorder (ASD) eye movement behavior
  5.1 Gaze following and head region preference in ASD
  5.2 Cue preference experiment using Flickr dataset
    5.2.1 Stimuli
    5.2.2 Procedure
    5.2.3 Observers
    5.2.4 Quantification of fixation heatmap differences
    5.2.5 Gaze following and head fixation probability maps
    5.2.6 Differences in cue transitions between ASD & TD
    5.2.7 Differences in fixation performance between ASD & TD
  5.3 Cue preference experiment using Social Video dataset
    5.3.1 Stimuli
    5.3.2 Procedure
    5.3.3 Observers
    5.3.4 Differences in transition probabilities of ASD vs TD
    5.3.5 Differences in fixation prediction between ASD & TD
    5.3.6 Classification of ASD and TD participants
  5.4 Evaluating model choices
    5.4.1 Removing gaze from the model
    5.4.2 Comparison to saliency alone
    5.4.3 Removing saliency from the model
  5.5 Discussion
  5.6 Conclusions
6 Dissertation Summary
  6.1 Potential Applications
    6.1.1 Application to the bipolar affective disorder population
    6.1.2 Refinement as an ASD screening technique
References
Bibliography
List of Figures
List of Tables

Chapter 1

Introduction

A great deal of research has gone into understanding the cues that humans use when planning eye movements. Saliency modeling has extensively utilized local, low-level statistical information, such as anomalies in color, intensity, orientation, spatial frequency, motion, and contrast, as cues that guide eye movements (Treisman & Gelade, 1980; Koch & Ullman, 1985; Milanese, Wechsler, Gill, Bost, & Pun, 1994; Mannan, Ruddock, & Wooding, 1996; Itti, Koch, & Niebur, 1998; Reinagel & Zador, 1999; Krieger, Rentschler, Hauske, Schill, & Zetzsche, 2000; Itti & Koch, 2001a; Parkhurst, Law, & Niebur, 2002; Peters, Iyer, Itti, & Koch, 2005; Borji & Itti, 2012; Borji, 2012).
Other cues are more removed from local image features, such as: context and global scene information (Torralba, Oliva, Castelhano, & Henderson, 2006), task parameters (Yarbus, 1967; Land & D. N. Lee, 1994; Ballard, M. Hayhoe, & J. Pelz, 1995; Land & Mary Hayhoe, 2001; Tri- esch, Ballard, Hayhoe, & Sullivan, 2003; Einh¨ auser, Rutishauser, & Koch, 2008; Borji, D. Sihite, & Itti, 2014; Borji & Itti, 2014), domain knowledge (Underwood, Tom Foulsham, & Humphrey, 2009), bias towards the center of a screen (Tatler, 2007), memory (Droll, Hayhoe, Triesch, & Sul- livan, 2005; Carmi & Itti, 2006), emotion (Ramanathan, Divya, Nicu, & David, 2014), and culture (Chua, Boland, & Nisbett, 2005). Eye movement behavior involves a combination of these factors that is dependent of the task and the cognitive agenda of the participant. The role of faces in predicting eye movements has also been investigated as a cue (Hershler & Hochstein, 2005; VanRullen, 2006), and has been shown to improve predictions above and beyond local statistical cues (Cerf, Harel, Einh¨ auser, & Koch, 2008). However, faces themselves convey a 1 large number of cues, such as facial expression and gaze direction (Engell & Haxby, 2007), which are processed automatically in the superior temporal sulcus (STS) region of the brain. Gaze direction (Bock, Dicke, & Thier, 2008) is a well-known cue that modulates behavior in human and other primates. Attending to items in the direction of another person’s gaze, known as gaze following, is a behavior that arises as early as 3 months of age (Butterworth & Jarrett, 1991) and serves several functions for purposes such as social interactions, social learning, collaboration, coordination, threat assessment, understanding the intentions of others, signaling what is important in the shared visual field to direct the attention of others, and communicating thoughts, judgments, emotions, desires and needs (Emery, 2000). Gaze following is also an important aspect of joint attention (when two or more people are paying attention to the same entity) and allows coupling visual appearances to verbal descriptions (Bakeman & Adamson, 1984; Kobayashi & Kohshima, 1997). As an aside, let’s define “actors” to be humans depicted in images or movies, and “ob- servers” to be humans who are viewing those images or movies and whose eyes are being tracked and who can follow the gaze of the actors in the scenes. We will use these terms throughout this document. Although gaze direction is a well known cue, quantifying its effect on observers’ eye move- ments has been little studied. Castelhano, Wieth, & Henderson ( 2007) looked at this effect, al- though they did not control for other effects, such as saliency or spatial bias. This dearth of study is at least partially due to the difficulty of extracting gaze information from actors in natural scenes, currently limited to arduous manual annotation. Further, since gaze following only applies to sac- cades that leave the face by definition, the predictive power of the cue is limited when looking at all saccades. However, given the maturity of the field of modeling eye movements, these types of secondary cues must be addressed in order for models to further approach human performance. Further, we’ll review research that shows that autism spectrum disorder (ASD), bipolar, and potentially other populations differ from typically developing (TD) individuals in their processing of heads and gaze following information. 
This could allow an eye movement prediction model that incorporates gaze following to both discriminate these populations and allow better eye movement prediction 2 for these populations. Indeed, we show that the model developed here can discriminate participants with ASD from the TD population at both a population level and an individual level. 1.1 Aims The aims of this thesis are the following: 1. Quantify the effect of gaze direction, both in causality and strength, on eye movements 2. Create a dynamic fixation prediction model which integrates gaze direction, head pose, and bottom-up saliency where gaze direction and head pose cues are learned by parameterizing the head pose angle 3. Learn the relative transition probabilities of following each cue based on the current fixation location, and achieve state of the art in fixation prediction performance 4. Identify human populations that are known to vary in facial or gaze processing behavior, and learn category specific transition probabilities and cue maps in order to better under- stand variations in cue preferences and fixation behavior for the different populations (e.g. ASD/TD) 5. Evaluate how well a gaze following integrated model can distinguish ASD/TD participants at the level of an individual participant in order to better classify participants and potentially screen them 3 Chapter 2 Background In order to properly understand gaze following and situate it within a fixation prediction model it is important to review the relevant literature involving both gaze following and fixation prediction. There exists a large body of work in both fields, and the current understanding of both elements and how they impact other fields will be reviewed. For gaze following, which has not been extensively modeled, we will focus on the neural, psychophysical, and developmental literature, so that we can understand how to model the behavior. For fixation prediction, however, which has already been extensively analyzed and modeled, we will focus on the evolution of these models, especially bottom up saliency models. 2.1 Gaze following background 2.1.1 Psychophysical studies of head pose and eye gaze following Proper gaze estimation relies on the relative position of the iris within the eye socket as well as the head pose (Gibson & Pick, 1963). As can be seen in Figure 2.1, which is adapted from their paper, the final gaze is a result of both the eyes and the head. In the rest of this document, “gaze” is considered to be this final gaze, and “eye gaze” is considered to be just the eye component of gaze, while “head pose” is considered to be just the head component. In the top example, for instance, the eye gaze is looking right while the head pose is facing left which, when combined, results in a 4 Figure 2.1: Relative eye gaze + head pose combine to form effective eye gaze [Taken from (Gibson & Pick, 1963)]. gaze straight ahead. The bottom example shows eyes that are looking straight ahead along with a left facing head pose, which combine to produce a left facing gaze. The eye gaze and head pose estimation accuracy of a human observer have been measured un- der various conditions. Initial papers focused on eye gaze only when the subject’s head is directly facing the observer (Gibson & Pick, 1963; Cline, 1967). 
Head pose estimation accuracy was de- termined for a limited horizontal range (-30°:30°), where estimates of 0°:15° head poses had a 2° discrimination threshold, while the threshold for 30° head poses was 4.9° (Wilson, Wilkinson, Lin, & Castillo, 2000). Additional factors that affect these estimates have also been investigated, such as: the effect of head contour and nose angle on head pose estimation (Langton, Honeyman, & Tessler, 2004), the effect of head turn on gaze perception (Kluttz, Mayes, West, & Kerby, 2009), and the effect of face eccentricity on gaze perception (Todorovic, 2009). These experiments have been motivated by an attempt to understand dyadic interaction with humans, when a subject is actively engaged with another human. One of the few papers involving triadic gaze estimates with objects was done in a virtual en- vironment, where only the head pose of a virtual human was estimated based on a participant’s selection of which of a set of identical objects (arranged along an arc) a virtual actor’s head was pointing at (Poppe, Rienks, & Heylen, 2007). It has also been shown that an object near an ac- 5 Figure 2.2: The sclera, or white part of the human eye can be seen in the human face on the left. The chimpanzee on the right, however, has a dark sclera tor’s gaze biased an observer’s estimate of the actor’s gaze towards the object (Schwaninger, J. Lobmaier, & M. Fischer, 2005; J. S. Lobmaier, M. H. Fischer, & Schwaninger, 2006). 2.1.2 Developmental studies of gaze following Unlike most other mammals, and all non-human primates, the sclera in humans is white, and offers a high contrast background with which to view the iris, as can be seen in Figure 2.2. The cooperative eye hypothesis proposes that this evolved to allow quick non-verbal communication of gaze between humans to improve cooperativity in tasks (Kobayashi & Kohshima, 2001). It has also been shown that human infants are capable of crude gaze following at 3 months and complex gaze following across intervening objects at 12 months (Butterworth & Jarrett, 1991). The nature of how gaze cues develop ontogenically has been argued to be completely innate or nativistic (Trevarthen, 1979) and alternatively, largely culturally learned (Kaye, 1982). Nativist accounts focus on how infants are able to emote and fixate within a few weeks after birth, and how their expressions and behavior are similar across cultures. On the other hand, others attempt to show that infants must experience an activity before they can model it in others, which for them necessitates a learning stage (Tomasello, Carpenter, Call, Behne, Moll, et al., 2005). Regardless, it is clear that the functional capability arises early on in human development. In humans, gaze direction estimation has been shown to be at least crudely present (capable of discriminating left, right, or straight) as early as 3 months of age (D’Entremont, Hains, & 6 Muir, 1997). At 12 months, an infant can accurately follow a perceived gaze across an intervening stimulus and to the correct target. Prior to this age, objects that occur between the initial infant fixation and the true target seem to either override the search task, or are presumed to be the target (Butterworth & Jarrett, 1991). It is not until 14 months of age, however, that both eye position and head pose are reliably taken into account when estimating gaze (Caron, Butler, & Brooks, 2002). By 18 months, the infant can even estimate perceived gaze that is outside their field of view (Butterworth & Jarrett, 1991). 
The processing of gaze perception appears to be at least partly configural, because contrast inversion of the pupil and the sclera greatly reduces gaze estimation performance (Ricciardelli, Baylis, & Driver, 2000). Faces have also been shown to be configural, and identifying inverted faces has been shown to be difficult (Yin, 1969). However, inversion of the eyes did not produce a deficit when estimating gaze perception, which indicates the configural information is contained within the eye and is invariant to inversion (Schwaninger et al., 2005). 2.1.3 Neural basis of gaze following Monkeys as young as 3 weeks old can accurately distinguish between direct gaze and averted gaze (Mendelson, Haith, & Goldman-Rakic, 1982). This is an important social cue for monkeys, as direct gaze is a sign of dominance, and it is important to learn to avert gaze to show submission (Mendelson et al., 1982). It was determined fairly early that neurons in the superior bank of the Superior Temporal Sulcus (STS) in rhesus monkeys form a population code sensitive to gaze direction and head orientation (D. I. Perrett et al., 1985). The location of the STS in humans is shown in Figure 2.3. The code was not uniformly sampled, as few cells represented the backside of the head. Cells that preferred a forward facing head orientation also preferred a direct gaze, while oblique head orientations preferred averted gazes. However, there exists a subset of cells that are selective for gaze direction independent of facial position (D. Perrett, Hietanen, Oram, Benson, & Rolls, 1992). Human fMRI studies have also analyzed gaze perception and have found a homolog of monkey gaze selective activity in the posterior section of the STS in humans (Puce, Allison, Bentin, Gore, 7 SUP FRONTAL MIDDLE FRONTAL INF FRONTAL GYRUS GYRUS POST SUP TEMPORAL GYRUS MID. TEMPORAL GYRUS INF SUP PARIETAL INF PARIETAL LOBULE LOBULE PARIETO - OCC. sulcus Infr Postcentral sulcusl sulcusl Lat. occ. Trans occ. sulc. Parieto occ. fiss. Lat. Cereb. fiss. Supr Posterior ramus Mid. temp. sulcus SUPRAMARG. GYRUS ANG. GYRUS Pars triangularis Ant. asc. ramus Pars orbitalis Ant. horiz. Pars opercularis TEMPORAL GYRUS GYRUS Supr frontal ramus CENTORAL GYRUS sulcus frontal sulcus temp. sulcus Precentral sulcus Figure 2.3: Superior temporal sulcus in red [taken from (Gray, 1918)] & McCarthy, 1998; E. A. Hoffman & Haxby, 2000). Lesion studies of the STS region in monkeys have also produced deficiencies in gaze following (Heywood, Cowey, & Rolls, 1992). Further evidence for the involvement of human STS in gaze estimation studies comes from a single human patient MJ, who had a lesion almost completely contiguous with the right superior temporal gyrus. After a year of experiencing left field neglect, she recovered, but it was noticed that she had an inability to maintain eye contact, with her eyes seeming to drift. This deficit did not appear when she fixated on an object. MJ when tested biased her estimates of gaze consistently to the right (Akiyama et al., 2006). In an fMRI study, differential activity in the STS also occurred when a person was perceived to be looking at an object as opposed to empty space (Pelphrey, Singerman, Allison, & McCarthy, 2003). When looking at empty space, there was a significant extended hemodynamic response duration, which the authors argued shows the impact of context. 
The facial processing network includes the fusiform face area (FFA), which has been shown to be able to individuate faces (Sergent, Ohta, & Macdonald, 1992); the STS, which as discussed is critical for gaze estimation; and the amygdala, which has been implicated in both facial expression processing and gaze estimation (Young et al., 1995). It is important to note that the STS sends a large number of connections to the amygdala (Aggleton, Burton, & Passingham, 1980). Be- fore and after an amygdalotomy, a patient DR was tested on a battery of facial processing skills. Although she was still able to identify faces she had known prior, discrimination of new faces, recognition of facial expressions, and estimation of gaze were all impaired. In addition to amyg- dala connections, the STS area also has a large number of connections to the parietal cortex, which 8 is associated with attention and orienting (Harries & D. I. Perrett, 1991). Prosopagnosia, which is a deficit in the ability to identify faces (Hecaen & Angelergues, 1962), has also been investigated in relation to the ability to perceive gaze. Initial research with prosopag- nosics showed physiological abnormalities mainly in the FFA (A. R. Damasio, H. Damasio, & Hoesen, 1982; Puce, Allison, Gore, & McCarthy, 1995). Further research showed a decreased performance at a gaze estimation task for prosopagnosics compared to unaffected humans, but was unable to rule out that these participants didn’t also have damage in STS or other areas in addition to FFA damage. As a result, it was unclear from the study if gaze estimation and facial identifi- cation were dissociable (Campbell, Heywood, Cowey, Regard, & Landis, 1990). Later research, however, was able to clearly show that prosopagnosics were able to discriminate gaze as well as TD subjects (Duchaine, Jenkins, Germine, & Calder, 2009), which lends credence to the idea that facial individuation and gaze estimation are indeed separate. It also argues that gaze estimation is not heavily dependent on the fusiform face area for processing, which is hard to separate in imaging studies. 2.1.4 Gaze estimation and theory of mind Humans also have the ability to model other people’s current knowledge of the world as well as their goals, which is frequently referred to as Theory of Mind (Premack & Woodruff, 1978). A classic example of this ability is the false belief test, where an observer is shown a scene where an actor has incomplete knowledge of the world and the observer is asked what action that actor will take. This tests whether the observer can answer using a model of the knowledge the actor has, or will instead use the more complete knowledge that the observer has access to (Wimmer & Perner, 1983). A common form of this task is called the “Sally Anne Task”, in which Sally is shown to put a marble in her basket and then leaves the room. Anne then takes the marble from the basket and puts it in her box. Later Sally enters and the observer is asked where will Sally look for the marble (Leslie & Frith, 1988). Individuals younger than 4 do not seem to have developed a Theory of Mind yet, as they reliably guess the basket. Individuals with ASD are also likely to fail this task (Leslie & Frith, 1988). 9 Beyond just the ability to model the current memory of others, more complex knowledge of people’s beliefs also helps determine where their attention will likely be. For instance, knowing that Joe likes Bob, but dislikes Sam will bias the expectation of where Joe will head towards if given the option. 
Having an intact Theory of Mind allows a person to accurately infer the target of attention in the desired person before they are even visible. This ability can result in a much deeper understanding of what is going on in a scene and is not fully developed even in our closest relatives the chimpanzees (Call & Tomasello, 2008), who can model the goals of others, but fail in the false belief task. In order for Theory of Mind to be useful for predicting attention, it does not need to be com- plete, however. This is important, because it is unlikely that we will have a robust modeling of human behavior in the near future. A very simple use for Theory of Mind would be to simply keep track of objects attended to and actions taken by a particular agent over time and attribute interest in these objects and activities by the agent. This can then be used to bias judgment of that agent in the future. This type of inference does not require extensive modeling of real world knowledge that a fully functional Theory of Mind would necessitate. Gaze following by itself, can be viewed as a very simple form of Theory of Mind, in the sense that aping this behavior implicitly acknowledges that what other people are looking at tends to be important to them. 2.1.5 Gaze following and fixation prediction Castelhano et al. (2007) were the first to study the effect of gaze direction of actors in natural scenes in guiding the eye movements of observers of those scenes. They conducted a study in which participants viewed a sequence of scenes presented as a slide show that portrayed the story of a janitor (the actor) cleaning an office. They found that: a) the actor’s face was highly likely to be fixated (as also later suggested and modeled by (Cerf, Frady, & Koch, 2009)) and b) the observer’s next saccade was more likely to be toward the object that was the focus of the actor’s gaze than in any other direction. Castelhano et al.’s study is interesting as it provides the seed to look deeper in the role of gaze direction in free-viewing and fixation prediction. Gaze direction of actors in a scene could provide an additional source of information for visual attention models, 10 similar to the manner in which human faces and written text have recently been added to saliency models (Cerf et al., 2009; Judd, Ehinger, Durand, & Torralba, 2009; H.-C. Wang & Pomplun, 2012). 2.1.6 Gaze following and joint attention There are different levels of joint attention associated with different levels of sophistication in the knowledge of the outside world (Kaplan & Hafner, 2006). Shared gaze is simply the co- occurrence of attention on the same object by two or more individuals. Dyadic joint attention is when two persons are looking at each other while being aware that the other person is looking at them (Bakeman & Adamson, 1984). This dyadic stage is a key developmental stage in infants during which infants and caregivers exchange emotions (Tronick, 1989). Triadic joint attention is considered the most sophisticated, and is where two people are attending to the same object and both are aware that the other is attending to the object as well (Okamoto-Barth, Tomonaga, Tanaka, & Matsuzawa, 2008). Although here we are interested in the gaze estimation of a passive viewer of a scene, unable to interact with the agents present, and thus not technically joint attention, a lot of the research is still applicable. One of the distinguishing characteristics associated with autism is a malfunctioning joint atten- tion system. 
When compared with TD and intellectually disabled children of matched intelligence, individuals with ASD had profound deficits in joint attentional skills, namely gaze and gestural understanding (Mundy, Sigman, Ungerer, & Sherman, 1986). In trying to determine where the failure was occurring in individuals with ASD, Baron-Cohen and others have shown that partic- ipants with ASD could accurately assess whether a cartoon was looking at them and extend an arrow in space to determine what it was pointing to as well as TD children. This indicates they are able to see and process eye gaze to a certain extent. However, they failed to use that knowledge to infer internal states about the viewed cartoon. Specifically, the children with ASD were unable to infer that because the cartoon was gazing on a candy (Figure 2.4), it is likely that the cartoon prefers that candy (Baron-Cohen, Campbell, Karmiloff-Smith, Grant, & Walker, 1995). Instead, they consistently chose the candy that they 11 Figure 2.4: Participants with ASD could estimate the gaze of the cartoon, but not infer that the cartoon desired the gazed at candy [Taken from (Baron-Cohen, Campbell, Karmiloff-Smith, Grant, & Walker, 1995)] themselves wanted, which seems to imply that they ignore cues from the other person, and instead model that person with the same goal states as they themselves have. This is in agreement with the classic Sally-Anne task (Wimmer & Perner, 1983), or false belief task, that was discussed previously, where participants with ASD were unable to accurately model other people’s memory state. To understand how joint attention and gaze awareness interact, it is helpful to look at the or- dering of these attentional capabilities in development. For instance, although 6-month-old infants can jointly attend to an object with their caregiver, they fail to find the target object if they come upon another distractor object while they are shifting their gaze (Butterworth & Jarrett, 1991). The interpretation that has been given for this is that younger infants are simply using the gaze as a cue to find something interesting, while older infants are actually trying to find the object that the caregiver is looking at, and engaging in true joint attention. When looking at it from a bottom- up saliency perspective, however, this could be interpreted as being unable to suppress bottom-up salient objects while engaged in a top-down task until 12 months of age. This goes into the heart of how these cues should be integrated, a subject we will return to when building a gaze aware 12 fixation prediction model. 2.2 Fixation prediction background Ever since (Yarbus, 1967), eye movements have been recorded and studied. Eye movements can be broken down into saccades, which are rapid movements of the eye within the orbital socket, and fixations, which are pauses in between saccadic movement. The purpose of saccades is to direct the highly sensitive fovea of the retina to a particular location, at which point the visual information is processed during the fixation. The fovea, which only represents about 2°of the visual field, comprises about half of the information carried to the brain along the optic nerve. Selecting where to point this fovea is critical for any animal with a fovea (great apes, birds, diurnal lizards), as a misallocation of this visual resource can result in missing critical information about a potential predator, food source, etc. 
2.2.1 Saliency It has been shown that low-level salient cues (Treisman & Gelade, 1980), such as motion, color (e.g. a red line among many green lines), luminance, etc., are good drivers of eye movements, causing observers to automatically shift their attention based on these cues in what is called a “pop out” effect (Figure 2.5). This bottom-up, automatic process is a highly parallel system, at least within a feature dimension (color, motion, etc). This means that subject reaction time in a visual search task or in free viewing of a pop-out array is largely invariant to the numerosity and density of distractors, when the target salience lies along a single feature dimension. A pop-out effect does not occur, however, when an item is unique only when looking at the conjunction of two or more features, as seen in the right most image of Figure 2.5, where the unique item is both red and mostly vertical. This discrepancy led to the proposal that individual features are processed in a parallel manner but are integrated into a combined whole (i.e. a conjunction of features) in a serial process (Treisman & Gelade, 1980). 13 Figure 2.5: Color or orientation by themselves can cause a “pop out” effect, but a conjunction of the two does not. [Taken from (Itti, 2007)] 2.2.2 Bottom-up fixation prediction with fixed weights The Feature Integration Theory (FIT) of (Treisman & Gelade, 1980), inspired many saliency mod- els. These models typically first extract a set of visual features such as contrast, edge content, intensity, and color for a given image (Koch & Ullman, 1985; Milanese et al., 1994; Itti et al., 1998). They then apply a spatial competition mechanism via a center-surround operation (e.g., us- ing Difference of Gaussian filters) to quantify conspicuity in a particular feature dimension. Third, they linearly (often, with equal weights) integrate conspicuity maps to generate a final saliency map (e.g., (Treisman & Gelade, 1980; Koch & Ullman, 1985; Itti et al., 1998; Ehinger, Hidalgo- Sotelo, Torralba, & Oliva, 2009; Cerf et al., 2009; Borji & Itti, 2012)), which is a map of the scalar quantity of saliency at every location in the visual field. Optionally, a Winner-Take-All (WTA) mechanism chooses the most salient region and then inhibits this area so that other regions become the most salient region at the next time step, which allows the model to shift attention. Variations of this model have used linear weights (Itti & Koch, 2001b), “max” (Z. Li, 2002), and maximum a posterior (Vincent, Baddeley, Troscianko, & Gilchrist, 2009) to combine the different cues in their respective models. 2.2.3 Bottom-up fixation prediction with learned weights or maps Some models learn weights for different channels from a set of training data. For example, (Itti & Koch, 2001b) weighted different feature maps according to their differential level of activation within compared to outside manually-outlined objects of interest in a training set (e.g., traffic 14 signs). (Navalpakkam & Itti, 2007) proposed an optimal gains theory that weights feature maps according to their target-to-distractor signal-to-noise ratio, and applied it to search for objects in real scenes. (Judd et al., 2009) used low-level image features, a mid-level horizon detector, and two high-level object detectors (faces using (Viola & M. Jones, 2001) and humans using (Felzenszwalb, McAllester, & Ramanan, 2008)) and learned a saliency model with liblinear SVM. 
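Before continuing with further learned-weight variants, the fixed-weight pipeline of Section 2.2.2 can be made concrete with a short sketch: per-feature center-surround conspicuity via a difference of Gaussians, equal-weight combination into a saliency map normalized as a probability density, and a winner-take-all readout with a simple inhibition of return. This is an illustrative toy, not the AWS model or the model developed later in this thesis; the feature channels, filter scales, and inhibition radius are arbitrary assumptions.

```python
# Minimal sketch of a fixed-weight, Itti/Koch-style bottom-up saliency pipeline.
# Feature channels, Gaussian scales, and the inhibition radius are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(feature, sigma_center=2, sigma_surround=16):
    """Conspicuity within one feature dimension via a difference of Gaussians."""
    return np.abs(gaussian_filter(feature, sigma_center)
                  - gaussian_filter(feature, sigma_surround))

def saliency_map(rgb):
    """Equal-weight combination of conspicuity maps; returns a pdf over pixels."""
    rgb = rgb.astype(float)                      # expects an (H, W, 3) array
    intensity = rgb.mean(axis=2)
    red_green = rgb[..., 0] - rgb[..., 1]        # crude color-opponent channels
    blue_yellow = rgb[..., 2] - rgb[..., :2].mean(axis=2)
    maps = [center_surround(f) for f in (intensity, red_green, blue_yellow)]
    smap = sum(m / (m.max() + 1e-9) for m in maps)
    return smap / (smap.sum() + 1e-12)

def scanpath(smap, n_fixations=3, ior_sigma=20):
    """Winner-take-all fixation selection with Gaussian inhibition of return."""
    smap = smap.copy()
    ys, xs = np.mgrid[0:smap.shape[0], 0:smap.shape[1]]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(smap), smap.shape)
        fixations.append((x, y))
        smap = smap * (1.0 - np.exp(-((ys - y) ** 2 + (xs - x) ** 2)
                                    / (2.0 * ior_sigma ** 2)))
    return fixations
```

The winner-take-all step here simply suppresses a Gaussian neighborhood around each selected maximum so that the next most salient region wins on the following step, mirroring the shift-of-attention mechanism described above.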
Following Judd et al., (Zhao & Koch, 2011) learned feature weights using constrained linear regression and showed enhanced results on different datasets using different sets of weights. Later, (Borji, 2012) proposed an AdaBoost (Freund & Schapire, 1997) based model to approach feature selection, thresholding, weight assignment, and integration in a principled, nonlinear learning framework. The AdaBoost- based method combines a series of base classifiers to model the complex input data. Some models directly learn a mapping from image patches to fixated locations. Following (Reinagel & Zador, 1999) who proposed that fixated patches have different statistics than random patches, (Kienzle, Wichmann, Sch¨ olkopf, & Franz, 2007) learned a mapping from patch content to whether it should be fixated or not (i.e., +1 for fixated and -1 for not fixated). They learned a completely parameter-free model directly from raw data using a support vector machine with Gaussian radial basis functions (RBF). 2.2.4 Task driven fixation prediction Task-driven attention models have been reviewed recently, where they focus on modeling the cog- nitive agenda involved in task specific behavior (Mary Hayhoe & Ballard, 2014). Task-related modeling is highly effective in situations where there is strong top-down guided behavior (driving, making a sandwich, etc.), and useful anytime it is known or can be estimated. The work done here focuses on free-viewing behavior, where the cognitive agenda is not known, and in my view is not a strong driver of fixations under these conditions. Both aspects are highly relevant to human behavior, and in order to be fully understood, they both must be integrated. When a baseball flies at the head of a person making a sandwich, they will look, in spite of a top-down agenda. In the same way, the cognitive agenda is never fully extinguished even in free-viewing conditions. 15 Chapter 3 Quantifying effect of actor gaze on observer fixations using a controlled dataset It has been known that gaze following is an important social cue, but in a free-viewing context of natural scenes, it is unclear if gaze following causally effects eye movements 1 . For example, actors in scenes might strongly prefer viewing salient objects, and when observers view objects that actors gaze at, it could be because they are salient (or spatially close or semantically relevant), and not due to gaze following behavior. In order to assess causality and the strength of gaze following we created a controlled set of images pairs of people looking at objects, where only the direction of gaze changed. “From head” saccades are defined as those that start from actor head regions, and “leaving head” saccades are defined as a subset of the “from head” saccades that leave the head that contains the start point. Here we attempt to test the hypothesis that free-viewing observers follow the gaze direction of people in the scene above chance on a set of controlled stimuli. Since gaze following might be due to saliency of the object at the gaze endpoint, we also account for this confounding factor. This experiment focuses on saccades that leave the head. 1 This chapter has been published as (Borji, Parks, & Itti, 2014), with only my portion of the analysis reproduced here. 16 3.1 Experiment one: gaze following vs saliency with controlled pairs dataset 3.1.1 Stimuli The controlled images (1920 1080 pixels) were taken in pairs with a single actor in each. 
Actors were instructed to look at one of two main objects in a scene, one for each image pair. Fig. 3.1 shows example image pairs and saccade probability maps (made using “leaving head” saccades) and saliency maps. There were 30 pairs of images. Only the head pose and gaze direction of the actor in the scene changed within the image pair from the same scene. These test images were randomly interspersed with 60 images from the web, to mask the purpose of the experiment. From these images, two sets of stimuli were created by randomly assigning the controlled image pairs to one of the two sets. No observer viewed both of the images in a pair. 3.1.2 Observers Two groups of observers, comprising 15 subjects each, participated in this experiment. Observers in group 1 (6 male, 9 female; mean age = 19:73, SD = 1:03) viewed images in Set 1. Observers in group 2 (3 male, 12 female; avg age = 19.8, SD = 1.26) viewed images in Set 2. Observers were undergraduate students at the University of Southern California (USC). Observers had normal or corrected-to-normal vision and received course credit for participation. They were na¨ ıve to the purpose of the experiment and had not previously seen the stimuli. They were instructed to simply watch and enjoy the pictures (free viewing). 3.1.3 Apparatus and procedure Observers sat 106 cm away from a 42 inch LCD monitor screen and images subtended approxi- mately 45:5 31 visual angle. A chin rest was used to stabilize head movements. Stimuli were presented at 60Hz at a resolution of 19201080 pixels. Eye movements were recorded via a non- 17 Figure 3.1: Image pairs from the controlled set with their corresponding saccade probability map and saliency map from the AWS model (Garcia-Diaz, Leboran, Fdez-Vidal, & Pardo, 2012). Saccade maps are generated from fixations of all observers that start from the head region and leave that head region. For each image, the actor was explicitly instructed to look at a particular object. The head and the two object regions as well as gaze directions are marked. The looked-at object is shown with a red polygon, and the ignored object has a green polygon. Fixations are shown as blue dots on the images. 18 invasive infrared Eyelink (SR Research, Osgoode, ON, Canada) eye-tracking device at a sample rate of 1000 Hz (spatial resolution less than 0.5 ). Each image was shown for 30 seconds fol- lowed by a 5 second delay (gray screen). The eye tracker was calibrated using 5 point calibration at the beginning of each recording session. Observers viewed images in random order. Saccades were classified as events where eye velocity was greater than 35 /s and eye acceleration exceeded 9500 /s 2 as recommended by the manufacturer for the Eyelink-1000 device. Faces took up, on average 0.99% of the image, while they accounted for 9.22% of all fixations. The controlled pairs of objects took up 1.12% of the image each, but the gazed at object received 7.75% of all fixations, while the ignored object received only 5.20% of all fixations. The average head size was 143 180 pixels ( 3:6 5:2 visual angle). The average object size was 167 174 pixels ( 4:2 5:1 visual angle). The Adaptive Whitening Saliency model (AWS) by (Garcia-Diaz, Leboran, Fdez-Vidal, & Pardo, 2012) was used as a proxy for bottom-up saliency because it has been shown to outperform other saliency models in recent benchmarks (Borji, Tavakoli, Sihite, & Itti, 2013). The head regions and the two looked-at objects were annotated in all image pairs (object boundaries). 
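As an aside on the event classification mentioned above (eye velocity greater than 35°/s together with acceleration above 9500°/s², the manufacturer's recommendation for the EyeLink 1000), the criterion could be sketched roughly as follows. The sample format, the use of numerical gradients, and the function name are assumptions for illustration, not the parser actually used for these recordings.

```python
# Sketch of the velocity/acceleration saccade criterion described in the text.
# Gaze samples are assumed to be in degrees of visual angle at a fixed rate.
import numpy as np

def detect_saccade_samples(x_deg, y_deg, rate_hz=1000,
                           vel_thresh=35.0, acc_thresh=9500.0):
    """Return a boolean array marking samples that belong to saccades.

    Samples exceeding both the velocity and the acceleration threshold
    are flagged; contiguous runs of flagged samples form saccade events.
    """
    dt = 1.0 / rate_hz
    vx = np.gradient(x_deg, dt)
    vy = np.gradient(y_deg, dt)
    speed = np.hypot(vx, vy)                  # deg/s
    accel = np.abs(np.gradient(speed, dt))    # deg/s^2
    return (speed > vel_thresh) & (accel > acc_thresh)
```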
Note that the ground truth gaze direction is known here, since actors in the images were explicitly instructed to look at one of the two objects.

3.1.4 Analysis and results

To evaluate the effect of gaze direction on free-viewing behavior, we can quantify the extent to which fixations are drawn to the gazed-at object vs. the ignored object. There are two target objects in each image: gazed-at and ignored. The fraction of all saccades that start from the head and end inside each of these two objects (overall 3,331 such saccades) can be computed. This results in two fractions, one for the gazed-at and one for the ignored object. For all 60 images, two vectors of size 60 of these fractions (one for the gazed-at and one for the ignored object) are generated, and the medians of the two vectors can then be compared. The median of the gazed-at vector is 0.069, which is significantly higher than the median of 0.022 for the ignored vector (sign test; p < 1e-8). Medians of the normalized saliency map activation for the two cases are not significantly different from each other (0.019 vs. 0.020; p = 0.90). For 78.3% of cases, the gazed-at object attracted a higher fraction of fixations than the ignored object. These results clearly show that low-level saliency does not account for the influence of gaze direction on fixations.

Actor gaze direction predicts observer saccade direction. For each image, we generated probability distributions over angular directions for all saccades starting on the head region and leaving that region. The angular saccade directions were discretized into a histogram with 20 bins of 18 degrees each, which was then converted to a probability density function (pdf) by dividing the histogram by its sum. A higher pdf value at the ground truth gaze direction means stronger gaze following, irrespective of the exact endpoint. Fig. 3.2 shows 12 image pairs along with their distributions of saccade directions for the data of all observers. As this figure shows, there is a peak in the direction of the looked-at object in the majority of images. Here we see that the value of the saccade distribution (pdf) in the direction of the ground truth gaze is significantly higher than in a random direction chosen uniformly (i.e., chance level). The 60 images produce a vector of 60 pdf values; similarly, a vector of the same size is read out at random directions in the pdf. As shown in Fig. 3.3.A, the median of the saccade direction vector is 0.220, which is significantly higher than the median of 0.023 for the saccade vector at uniform random directions using the rank-sum test (i.e., versus uniform chance level; p < 1e-16). The median of the saccade direction vector is also significantly higher than a smart chance level in which directions are sampled randomly from the average ground truth gaze direction pdf shown in Fig. 3.3.B (i.e., Naive Bayes chance level of 0.061; sign test, p < 1e-9). Hence, observers tended to look significantly more (overall) in the direction of actor gaze than in any other direction. To assess the relative strength of the saccade direction pdf values when the actor looked at an object and when he/she did not (i.e., looked at the other object), the pdf value at the object's direction was compared when the object was in the gaze direction, P+, and when it was not, P−. The relative gaze strength is denoted here by z = P+ / (P+ + P−).
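The angular analysis just described can be summarized in a short sketch: bin "leaving head" saccade directions into 20 bins of 18° each, normalize to a pdf, read out the pdf value in the ground-truth gaze direction, and form the relative gaze strength z = P+ / (P+ + P−) across an image pair. The data structures (arrays of saccade start and end points, a known object direction per image) are assumptions for illustration, not the thesis's analysis code.

```python
# Sketch of the saccade-direction pdf and the relative gaze strength z.
import numpy as np

N_BINS = 20  # 18-degree bins

def saccade_direction_pdf(start_xy, end_xy):
    """Histogram of saccade directions (start to end, radians), as a pdf."""
    dx = end_xy[:, 0] - start_xy[:, 0]
    dy = end_xy[:, 1] - start_xy[:, 1]
    angles = np.arctan2(dy, dx) % (2 * np.pi)
    counts, _ = np.histogram(angles, bins=N_BINS, range=(0, 2 * np.pi))
    return counts / counts.sum()

def pdf_value_at(pdf, direction_rad):
    """Read out the pdf value in a given direction (e.g., ground-truth gaze)."""
    bin_idx = int((direction_rad % (2 * np.pi)) / (2 * np.pi / N_BINS))
    return pdf[bin_idx]

def gaze_strength(pdf_gazed, pdf_ignored, object_dir_rad):
    """z = P+ / (P+ + P-): pdf value toward the object in the image where the
    actor gazed at it (P+) versus the paired image where it was ignored (P-)."""
    p_plus = pdf_value_at(pdf_gazed, object_dir_rad)
    p_minus = pdf_value_at(pdf_ignored, object_dir_rad)
    return p_plus / (p_plus + p_minus + 1e-12)
```

A z of 1 would mean saccades toward that object occurred only when the actor gazed at it; a z of 0.5 would mean the actor's gaze made no difference.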
Az value of 1 means that observers always looked at the object that the actor was looking at, i.e., observers always followed the actor’s gaze direction, and a z value of zero means they never followed the actor’s 2 Using the Matlab®ranksum function. 20 0.1 0.2 30 210 60 240 90 270 120 300 150 330 0 180 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 270 0.1 0.2 90 0 0 8 1 0.2 0.4 90 270 0 0 8 1 270 0.2 0.4 90 0 0 8 1 0.1 0.2 90 270 0 0 8 1 0.2 0.4 90 270 0 180 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 180 0.1 0.2 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 80 1 0.1 0.2 90 270 0 180 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 18 0.2 0.4 90 270 0 0 8 1 0.1 0.2 90 270 0 0 18 0.4 0.8 90 270 0 0 8 1 0.2 0.4 90 270 0 0 18 Figure 3.2: Sample images in the first experiment. In each panel, the two images of the image pair are shown in the left with their corresponding saccade probability distributions (for saccades starting somewhere in the annotated head; See Fig. 3.1) shown in the right (polar plot). Red lines in polar plots indicate the ground truth gaze direction. Gaze following is strong for some images (e.g., the person looking at the CRT monitor and the trash can) while it is weaker for some others (e.g., the person looking at the yellow food box and the tissue perhaps due to the complexity of the background and saliency of one of the objects). 21 gaze direction. Averaged over all image pairs, z was found to be z = 0:65 ( z = 0:14) for z which is higher than 50% chance level (t-test 3 ,p<1e10). This implicitly means that if one were to guess the ground truth gaze direction on each image based on the relative saccade direction pdf value (i.e., decision criterion being the direction of the gazed at object with higher pdf value) it would have an accuracy of 65%. This result shows that actor gaze direction has a significant causal effect on the directions of observer eye movements. Gaze direction vs. most salient location direction One might argue that it was saliency that attracted observers to look in a particular direction and not gaze following (i.e., the effect could be partly due to low-level saliency). To account for this confounding factor, we measured the saccade pdf value in the direction of maximum saliency (not necessarily inside annotated objects) and compared it with the saccade pdf value in the ground truth gaze direction. Median saccade pdf value at maximum saliency direction is0:10 which is significantly lower than the median pdf value 0:22 at the ground truth gaze direction (sign test, p<1e 5). Saccade pdf value at the direction of maximum salient location is significantly higher that the uniform chance level (p<1e 8) and Naive Bayes Chance (p<2e2). Thus, while saliency is an important factor in predicting observer gaze direction, it can not fully explain the data. This means that gaze direction and saliency are two complementary sources of information in guiding eye movements. 3 z was empirically verified via Kolmogorov–Smirnov test to be normally distributed. 22 0.05 0.1 0.15 0.2 0.25 30 210 60 240 90 270 120 300 150 330 180 0 Naive Bayes prior over ground truth gaze direction Gaze Following Strength * * * * * median gaze (0.220) median saliency (0.10) Raw Count Exp. 
I median - NB (0.061) median- uniform (0.023) 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 5 10 15 20 Saliency Gaze Average face annotation map Average fixation map Average object annotation map A C B N = 60 Figure 3.3: Results of the first experiment. A) the histogram of saccade direction pdf values (gaze following strength) in the ground truth gaze direction and the most salient location direction (saliency maps are nor- malized to be pdfs). B) the distribution of ground truth gaze directions over all 60 images in experiment one (i.e., prior gaze direction distribution). Prior distribution is used to compute the Naive Bayes chance level. C) average annotation map (average of all object polygons over all images) for faces and objects, as well the mean eye position map over all images for saccades that start from somewhere inside the head region. As can be seen, there is a high fixation density around faces even for saccades that leave the head region. This can be partly due to uncertainty of observers in landing saccades or eye tracker error. 3.2 Experiment two: gaze following vs saliency with Flickr dataset Our aim in this experiment was to explore the generality of gaze following behavior on a wider range of uncontrolled natural scenes including scenes with more cluttered backgrounds and with multiple people interacting with each other as well as multiple other objects. Having several per- sons in a scene raises an additional challenge that gaze following behavior might not be due to a person looking towards someone, but it might be because the looked-at person is salient and captures attention by itself. Indeed, recent evidence suggests that human faces strongly attract eye movements, often capturing the first fixation on a new scene (Cerf et al., 2009; Judd et al., 2009; Borji, 2012). Here we investigate this challenge by breaking down the analysis into cases where 23 the entity at the gaze endpoint is another person or an object. Figure 3.4: Sample images in the second experiment with heads, faces, eyes, and gaze directions annotated. Blue lines indicate ground truth gaze direction. We use the head region for calculating saccade pdfs. We annotated the gaze directions of the people that looked at something or a person in the scene (and not to the camera or out of the image plane). All faces were annotated. 24 1 original image (saccades) eye movements (all points) AWS saliency map from head 2 3 4 5 6 7 8 9 10 11 12 Figure 3.5: Twelve sample images in the second experiment along with their corresponding eye movement map (2nd rows), fixation map composed from all saccades (third rows), fixation map composed from all saccades that start from one face (fourth rows), and AWS saliency maps (fifth rows). Eye movement data is over all observers. Note that the AWS saliency model does not have an explicit face channel. Try to guess to which face the third row belongs to! 25 3.2.1 Stimuli Stimuli consisted of a set of 200 color photographs collected mostly from the ImageNet dataset (Rus- sakovsky et al., 2014) 4 . Photographs span a variety of topics and locations such as indoor, outdoor, social interactions, object manipulations, instrument playing, athletics, reading, and dancing. As in experiment one, images were resized to 1920 by 1080 pixels (See Figs. 3.4 and 3.5 for sample images) while keeping the aspect ratio. We chose images in which at least one of the faces were large enough, visible, and gazing at something visible in the scene (another person or an object). 
In several of the images, some people are looking into the camera or out of the scene, but these heads were discarded in the gaze following analysis. Images were shown to observers in two sessions with 100 images each. Observers had 5 minutes break in between two sessions. The eye tracker was re-calibrated before the second session. We annotated heads, faces, eyes, and gaze directions for all 200 images. We annotated the gaze direction from entire image content including the head area and candidate gazed-at objects. When multiple objects are spatially close to each other, the gaze can be ambiguous. In this case, even if we can infer general head direction, it is quite hard to know which object the person is gazing at. Thus, spatial accuracy of gaze following requires a very precise computation of gaze data with a high-resolution eye image, which are not available for these images. The average number of heads (with faces visible or not) in scenes was 2.65 (SD = 2.02, Median = 2). Fifty one images had only one head in them, 78 had 2 and 71 images had 3 or more. Overall there were 530 heads of which only 305 were looking at something in the scene (had their gaze annotated). From these 305 heads, 138 were looking at another head and 167 where looking elsewhere. In every image there was at least one person whose face was visible and who was looking at someone or something visible within the image, which could be used for analysis of gaze following. Although a face occupies 2.68% of the image on average, it contained 15.3% of all fixations. The average head size was 220 270 pixels ( 5:5 7:9 visual angle). The average gaze length was 459 pixels ( 11:5 ). 4 http://www.image-net.org/. Some images were also borrowed from the AFW dataset (Zhu & Ramanan, 2012) 26 3.2.2 Observers A total of 30 students (4 male, 26 female) from the University of Southern California (USC) took part in the study (Mean age = 19:46, SD = 0:97). Observers had normal or corrected-to-normal eyesight and were compensated by course credits. 3.2.3 Apparatus and procedure Procedure and apparatus were the same as in experiment one, except that here images were shown for 10 seconds with 5 seconds gray screen in between two consecutive images. 3.2.4 Analysis and results Fig. 3.4 shows sample images from the stimulus set in experiment 2 and their annotations. Fig. 3.5 shows sample images along with their corresponding fixation maps (from all observers), blurred fixation maps, blurred maps for fixations starting from head, and AWS saliency maps. Fig. 3.6 shows angular saccade probability distributions for sample images in the second experiment (over all observers). Note how there is a bias toward the upper regions of fixation maps where faces are more likely to occur in these scenes (Figs. 3.3.C & 3.7.B). This bias is away from the classical center bias (Tatler, 2007; Borji, Dicky Nauli Sihite, & Itti, 2011) but is close to the hotspot present in the head annotation map. 3.2.5 Analysis of gaze following We repeat the same analysis as in experiment one by reading out and comparing saccade pdf values in the ground truth gaze direction, in the direction of the maximum saliency location (anywhere in the scene), as well as in random directions. Results are shown in Fig. 3.7. 
We break-down the analysis (stimulus set) into three cases: 1) All data, 2) Single-Head where there is only one person in the scene looking at something, and 3) Face Saliency Control where a person is gazing at something other than a face, and there are multiple persons in the scene. Case 1 addresses if 27 1 0.2 0.4 90 270 0 0 8 180 0.2 0.4 90 270 0 180 0.2 0.4 90 270 0 0.4 0.8 90 270 0 0 18 0.2 0.4 90 270 0 0 8 1 30 210 60 240 120 300 150 330 8 1 0.4 0.8 90 270 0 0 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 180 0.2 0.4 90 270 0 0.1 0.2 90 270 0 180 0.2 0.4 90 270 0 0 8 1 0.4 0.8 90 270 0 0 8 1 0.4 0.8 90 270 0 0 8 1 0.2 0.4 90 270 0 0 8 1 1 0.2 0.4 90 270 0 180 1 0.2 0.4 90 270 0 180 1 0.2 0.4 90 270 0 0 8 A D B C Figure 3.6: Sample images and their corresponding saccade direction pdfs in the second experiment (case 1). In each panel, the face under investigation in the image (left) is marked with a polygon. The polar plot (right) shows saccade direction pdf for all saccades (case 1; see text). Data is for saccades that start from inside the polygon and land somewhere else in the image. 28 our results in experiment one generalize to uncontrolled complex natural scenes. Case 2 verifies generality of our results in natural scenes with only one person. Case 3 controls that gaze following is due to the gaze direction and not face saliency. In other words, this control checks whether the gaze following effect is still present when the direction of other faces is incongruent with the actor’s gaze direction. For example, consider the image in Fig. 3.6.A in which there are two men, one looking at a newspaper and the other looking at the first man. The question is whether an observer starting from the left man’s head will saccade to the newspaper or to the right man’s head. This partitioning of the dataset results in 51 samples for case 2 and 116 samples for case 3. Note that some images in case 3 may contain several faces looking at something resulting in several data points (e.g., Fig. 3.6.B). The overall number of gaze following data points (case 1) is 305. Total number of saccades over all observers and images (case 1) that start from annotated heads is 21,737. From this, 4,411 saccades belong to case 2 and 7,776 saccades belong to case 3. Median saccade pdf value at the ground truth gaze direction (case 1) is 0:326 and is significantly above uniform chance level of 0:029 (sign test,p = 4:051e62; over a vector of 305 probability values one for each head, across all observers). Median saccade pdf in the gaze direction in case 2 is0:378 which is again significantly above uniform chance level of 0:046 (sign test, p = 1:586e 12). This is in alignment with our results in experiment one. Median saccade direction pdf at the gaze direction for case 3 is 0:241 which is above uniform chance level of 0:035 (sign test, p = 1:167e 18). Median gaze following strengths in all three cases are significantly above Naive Bayes chance levels using sign test (p-values in order are3:975e39,1:767e09, and8:656e10). Comparing gaze following strengths for cases 2 and 3 shows a significant difference (sign test, p = 1:220e 05). This indicates that observers follow the gaze direction less on images with multiple faces suggesting that they were sometimes distracted by face saliency. Median saccade pdf values in the direction of maximum saliency over all data (case 1) is 0:132 which is significantly above uniform chance level of 0:037 (sign test,p = 1:102e 19). 
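The chance-level comparisons used throughout this section (per-head pdf values in the gaze direction versus a uniform or Naive Bayes chance readout) can be reproduced in outline with a paired sign test, as sketched below. The helper assumes one pdf value per annotated head for the gaze direction and one for the chance direction; implementing the sign test with scipy's binomial test is my choice here, and the original analysis was done in Matlab.

```python
# Sketch of the sign-test comparison: per head, compare the saccade-direction
# pdf value at the ground-truth gaze direction against a chance readout.
import numpy as np
from scipy.stats import binomtest

def sign_test(gaze_values, chance_values):
    """Two-sided sign test on paired per-head pdf values (ties discarded)."""
    diffs = np.asarray(gaze_values) - np.asarray(chance_values)
    diffs = diffs[diffs != 0]
    n_pos = int((diffs > 0).sum())
    return binomtest(n_pos, n=len(diffs), p=0.5).pvalue

# Hypothetical usage, with one value per annotated head (305 heads in case 1):
# p_gaze   = pdf values read out at the ground-truth gaze direction
# p_chance = 1 / N_BINS for uniform chance, or draws from the Naive Bayes prior
# print(np.median(p_gaze), sign_test(p_gaze, p_chance))
```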
The median saccade pdf value in the direction of maximum saliency over all data (case 1) is 0.132, which is significantly above the uniform chance level of 0.037 (sign test, p = 1.102e-19). This means that observers gazed from the actor's face towards something salient in the scene significantly more often than expected by chance. These values for case 2 and case 3, in order, are: 0.149 (significantly different from uniform chance using the sign test; p = 4.436e-05) and 0.138 (significant vs. uniform chance; p = 1.428e-07). Saccade pdf values in the direction of the maximum salient location are significantly above Naive Bayes chance levels in all three cases (p-values in order are 1.955e-03, 2.324e-02, and 3.438e-02). Median saccade pdf values in the ground truth gaze directions are significantly higher than median saccade pdf values in the maximum saliency direction in all three cases (p-values in order are 9.696e-30, 1.220e-05, and 1.380e-04). This, in accordance with experiment one, confirms that gaze direction drives saccade direction more than the direction of the most salient location in free viewing of natural scenes.

3.2.5.1 Addressing memory confound
In our analysis so far, we considered all saccades that start from the head region. One confounding factor here is the memory of previously visited locations, which may attract or repel subsequent fixations. Observers might want to look back at objects because they find them somehow interesting or important, or they may want to preferentially discover new items in the scene. As a consequence, one might for example first explore the scene in many directions, then look at a face, and from there follow the actor's gaze direction accidentally, because it points towards a yet unexplored portion of the scene. Here, to make sure the gaze following effect is not due to memory, we limited our analysis to first saccades in the scene that also happened to start on the face (note: this analysis was done for the second experiment, since we did not have enough such fixations in the first experiment). The difference between the median gaze following strength and the median saliency strength (in the direction of the maximum location) in experiment two for just these first fixations (n = 151) is statistically significant (0.428 vs. 0.000; sign test, p = 1.142e-08). Median values for both the uniform and Naive Bayes chance levels are zero. Hence, even discounting any possible effect of memory, we still see a significant gaze following effect.

Figure 3.7: Results of the second experiment. A) Distribution of gaze following and saliency strengths over all data (case 1).
The inset shows the average distribution of saccades over all data (for saccades starting from a head region). The horizontal bias could be largely caused by the location of people and their heads in scenes. This prior is used for calculating the Naive Bayes chance level in all three cases. B) Average head annotation map (case 3) and average fixation map over all data (for saccades starting from a face). The bias toward the top of the maps arises because faces tend to occur near the top of these scenes. C & D) Distribution of gaze following and saliency strengths for cases 2 and 3. Gaze following strength is significantly higher than both chance levels and than the maximum saliency direction in all three cases.

3.2.5.2 Temporal analysis of gaze following strength
To investigate gaze following over viewing time, we analyzed the effect of saccade order on gaze following strength in experiments 1 and 2. Figure 3.8 illustrates the results for the first 30 saccades, partitioned into bins of 10 saccades (culled from the first 30 saccades over all data, but only saccades that initiated somewhere in the head region were selected). We find that gaze following is a stronger cue during early saccades and drops over time in both experiments (it stays above maximum saliency strength over all 30 saccades). For the first experiment, the first bin of ten saccades had a significantly higher median strength of 0.42, while the second and third bins had medians of 0.28 and 0.25, respectively (Bonferroni corrected (Bland & Altman, 1995) significance value of 0.05/3 = 0.017; p-values for bin comparisons: 1 vs. 2: 2.20e-5, 1 vs. 3: 2.20e-5, and 2 vs. 3: 5.96e-1). This was also true for the second experiment, where the medians were 0.33, 0.26, and 0.27, respectively (p-values for bin comparisons: 1 vs. 2: 2.60e-4, 1 vs. 3: 1.52e-4, and 2 vs. 3: 9.70e-1).

Figure 3.8: Temporal analysis of gaze following strength for the first 30 saccades. The average is taken over all saccades that start from a face and land somewhere else in the image. For example, an order of two means that the saccade started from a face and was the second overall saccade made by a subject on an image. In both panels, the top part shows the gaze following strength and the bottom part shows the number of samples at each order. The data was binned into three groups when comparing strength over time: bin 1 (saccades 1-10), bin 2 (saccades 11-20), and bin 3 (saccades 21-30), color coded yellow, blue, and pink, respectively. A) Experiment one. B) Experiment two.

3.2.5.3 Predicting fixation locations with a simple gaze map
Having shown that gaze direction influences eye movements in free viewing, here we explore how gaze direction can be utilized to explain fixation locations. We construct a simple map, referred to here as the gaze map, which has ones inside a cone (9 degrees wide, starting from a head and centered along the gaze vector) and zeros everywhere else. This cone corresponds to a single bin of the polar histograms shown in Figs. 3.3 and 3.7. Gaze maps have the advantage of being directly comparable to saliency maps, so that we can run a direct quantitative assessment of the relative strengths of gaze following versus saliency in driving observer eye movements. Gaze and saliency maps are converted to a pdf by dividing each of their values by the total sum of their values (see Fig. 3.10 for example cones).
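A minimal sketch of the gaze map construction described above is given below, assuming the gaze direction is expressed in image coordinates (with y increasing downward); the helper name and the handling of the map normalization are ours, not taken from the original implementation.

import numpy as np

def gaze_cone_map(shape, head_xy, gaze_angle_rad, cone_deg=9.0):
    """Binary gaze map: 1 inside a cone centered on the gaze vector, 0 elsewhere.

    shape: (height, width) of the image.
    head_xy: (x, y) pixel position of the head (the cone apex).
    gaze_angle_rad: annotated gaze direction, in the same image coordinate
                    convention as the map (y grows downward).
    cone_deg: full cone width in degrees (9 degrees, one polar-histogram bin).
    The returned map is normalized to sum to 1 so it is comparable to a
    saliency map treated as a pdf.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    ang = np.arctan2(ys - head_xy[1], xs - head_xy[0])
    # smallest signed angular difference to the gaze direction
    diff = np.angle(np.exp(1j * (ang - gaze_angle_rad)))
    cone = (np.abs(diff) <= np.deg2rad(cone_deg) / 2).astype(float)
    cone[int(head_xy[1]), int(head_xy[0])] = 1.0   # guard against an all-zero map
    return cone / cone.sum()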
Fig. 3.9 shows the histogram of gaze and saliency map values at fixation locations over both experiments. As can be seen, the histogram of gaze map values is bimodal, with a large leftmost peak near zero and another peak at higher values. The peak around zero arises because many fixations do not fall inside the cone. Note that this is a relative effect: although a large portion of fixations fall off the cone, the gaze cone still contains the largest fraction of fixations relative to any other cone of the same size in the image. Saliency map values, on the other hand, only show a left peak near zero. This observation hints toward efficient ways to integrate saliency and gaze maps (see the Discussion section). Although the gaze map has higher frequencies at larger values, its median is dominated by zeros, which makes its median lower than the saliency median (dashed vertical lines in Fig. 3.9).

As a complementary analysis, we also calculated Receiver Operating Characteristic (ROC) curves by thresholding the maps and measuring the true positive rate (fraction of fixations above threshold) and the false positive rate (fraction of uniformly random chosen points above threshold). Results are shown in the insets of Fig. 3.9. The Area Under the ROC Curve (AUC) values in both experiments are significantly above chance, which is AUC = 0.5 (t-test over gaze direction cases; 60 in Exp 1, p = 3.855e-18; 305 in Exp 2 case 1, p = 4.79e-61), but significantly below the AUC values of the saliency map (t-test; Exp 1, p = 1.18e-21; Exp 2, case 1, p = 2.27e-56). Thus we conclude that both gaze direction and saliency strongly influence observer eye movements, with saliency here being a stronger predictor of saccade endpoints but gaze direction still providing significant prediction performance.

Figure 3.9: Fixation prediction results for our simple gaze map and the AWS saliency model over both experiments. Both maps are normalized to be pdfs. Histograms of saliency and gaze map values at fixation locations (saccades starting from head regions) are shown for: A) controlled pairs in the first experiment, B) all data in experiment two (case 1), C) scenes where there is only one person looking at something (Exp 2, case 2), and D) scenes where a person is gazing at something other than a face and there are multiple persons in the scene (Exp 2, case 3). Note that, as expected, in all cases there is a peak at the left around zero for the gaze map, because many saccades fell off the gaze map (i.e., misses), mainly because observers did not follow the gaze (although overall they followed the gaze direction more than any other direction). Dashed lines represent medians. Insets show the ROC curves. ROC is measured by thresholding all gaze maps and then calculating the fraction of ground truth and random fixations that fall above the threshold (corresponding to the true positive rate/hit rate and the false positive rate/false alarm rate).
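The thresholding procedure used for this ROC analysis can be sketched as follows; this is an illustrative re-implementation rather than the original code, and the number of random points and thresholds are assumptions.

import numpy as np

def map_auc(pred_map, fixations, n_random=1000, n_thresh=100, rng=None):
    """AUC of a prediction map (gaze or saliency) against fixated locations.

    pred_map: 2D map normalized to a pdf.
    fixations: (N, 2) array of fixated (x, y) locations in integer pixels.
    Hit rate = fraction of fixation values above threshold; false-alarm rate =
    fraction of uniformly random locations above threshold, as in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = pred_map.shape
    fix_vals = pred_map[fixations[:, 1], fixations[:, 0]]
    rand_x = rng.integers(0, w, n_random)
    rand_y = rng.integers(0, h, n_random)
    rand_vals = pred_map[rand_y, rand_x]

    thresholds = np.linspace(pred_map.min(), pred_map.max(), n_thresh)
    tpr = np.array([(fix_vals >= t).mean() for t in thresholds])
    fpr = np.array([(rand_vals >= t).mean() for t in thresholds])
    order = np.argsort(fpr)                 # integrate TPR over increasing FPR
    return np.trapz(tpr[order], fpr[order])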
Is there any benefit from gaze direction in predicting fixation locations on top of early saliency? To answer this question, in Fig. 3.10 we illustrate scatter plots of gaze and saliency map predictions on all images in both experiments. The gaze map yields above-chance accuracy (AUC > 0.5) for 56/60 (93.33%) of images in experiment 1 and for 236/305 (77.38%) of images in experiment 2 (case 1). The corresponding numbers for the saliency map are 60/60 (100%) and 299/305 (98.03%). The gaze map outperforms the saliency map for 2/60 (3.33%) of images in experiment 1 and for 25/305 (8.20%) of images in experiment 2. Some success and failure cases for both maps are shown on the right side of Fig. 3.10. Images for which the gaze map has high prediction power (3 in Fig. 3.10.A and 2 in Fig. 3.10.B) usually contain low background clutter and few salient objects in the gaze direction. Several factors can lead to low performance of the gaze map, such as high scene clutter, an ambiguous gaze angle, and a large gaze map area when the cone starts near one image corner and points to the opposite corner (see 1 in Fig. 3.10.A and 1 & 7 in Fig. 3.10.B). We learn that a uniform distribution of activation in the cone is not efficient, as this simple cone has no sense of features or objects. In some instances, however, the gaze map was able to account for observer fixations that were almost completely missed by saliency, pointing towards the possibility of future synergies (e.g., the belly of the pregnant lady or the woman watching TV in Fig. 3.10.B). Perhaps the best way to combine saliency and gaze maps is to multiply them first and add the result to the saliency map (see the Discussion section). Table 3.1 summarizes the results of the first two experiments.
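As an illustration of the combination suggested above (multiplying the saliency and gaze maps and adding the product back onto the saliency map), a minimal sketch is given below; the weight on the gated term is a free parameter that we have not fit here, and the function name is ours.

import numpy as np

def combine_saliency_and_gaze(sal, gaze, alpha=1.0):
    """Combine a saliency pdf and a gaze-cone pdf as suggested in the text:
    the product boosts only regions that are both salient and in the gaze
    direction, and the sum keeps the full saliency map as a fallback.
    alpha is an assumed free weight on the gated term.
    """
    gated = sal * gaze
    if gated.sum() > 0:
        gated = gated / gated.sum()
    combined = sal + alpha * gated
    return combined / combined.sum()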
Figure 3.10: Gaze maps predict fixations in free viewing. A) Area under the ROC curve for the gaze and saliency maps' prediction of fixated locations, pooled over all observers on each image. Each data point corresponds to one head gaze (each map is thresholded separately). Histograms of the AUCs are depicted as marginals (same axes as the scatter plot). For points below the diagonal, the gaze map's accuracy is better than the saliency map's (2 images), and vice-versa above the diagonal (58 images). The gaze map has above-chance accuracy for 56 images (60 for the saliency map). B) Same as A but over experiment 2. Here, 25 images are below the diagonal. The gaze map performs better than chance over 236 images (296 for the saliency map). Example images where either one or both of the maps perform well are shown on the right, along with their gaze maps, saliency maps, and ROC curves. Overlaid points represent eye movements. Interestingly, in some cases the gaze map predicts fixations much better than the saliency model (e.g., the pregnant lady and the woman watching TV).

Table 3.1: Summary results of experiments 1 & 2 for prediction of gaze direction and fixation locations (for fixations that start from the head region).

                                      Exp I    Exp II - Case 1   Exp II - Case 2   Exp II - Case 3
  Predicting saccade direction (medians)
    Gaze direction                    0.220    0.326             0.378             0.241
    Most salient direction            0.101    0.132             0.149             0.138
    Uniform chance                    0.023    0.029             0.046             0.035
    Naive Bayes chance                0.061    0.081             0.098             0.080
  Predicting fixation locations (AUC values)
    Gaze map                          0.612    0.625             0.646             0.568
    Saliency map                      0.797    0.789             0.795             0.794
    Chance                            0.5      0.5               0.5               0.5

For saccade direction prediction: both the gaze direction and the most salient direction are significant predictors of observers' saccade directions (p < 0.05 using the sign test vs. both uniform and Naive Bayes chance levels) in both experiments and in all cases. In both experiments, gaze direction performs significantly better than the maximum saliency direction in predicting observers' saccade directions. For saccade endpoint prediction: both gaze and saliency maps perform significantly above chance (AUC for chance is 0.5, corresponding to a white noise map). AUCs here are calculated by thresholding all maps at a given threshold level and measuring true positive and false positive rates across all maps. Overall, our simple gaze map explains fixations significantly less well than the best existing purely bottom-up saliency model, but there are some cases where our gaze map wins over the saliency map.

3.3 Conclusions
Our quantitative results in these two experiments indicate that free-viewing observers strongly follow the gaze direction of human actors in natural scenes. In experiment one, we found that the fraction of fixations that start from a head region and land on an object was significantly higher for the attended object compared to the ignored object. While actor gaze was a stronger predictor of saccade direction than saliency or chance in both experiments, it performed worse than saliency in predicting saccade endpoints, although still above chance. We also noted that observers follow the gaze direction less in images with multiple faces, suggesting that they were sometimes distracted by face saliency (results of our second experiment; higher gaze following strength and AUC values for case 2 vs. case 3 in Fig. 3.7 and Fig. 3.9, respectively). However, it remains to be investigated which cue (face, saliency, or gaze direction) observers prioritize when viewing natural scenes. In addition to our quantitative results, we also noted a few qualitative observations. Our observers sometimes followed the gaze direction of inanimate objects (statues, robots, dolls, masks, etc.). We saw a cyclic behavior in fixation patterns, in alignment with (Yarbus, 1967), (DeAngelus & J. B. Pelz, 2009), and (Borji & Itti, 2014), such that observers look back and forth between the actor and the gazed-at object.
Contrary to our expectation, we did not find a correlation between low-level saliency and gaze following strength. Note that we found that gaze direction, while providing a strong directional cue, is overall a weaker predictor of the saccade endpoint than saliency, and it is important to qualify this statement. First, we employed the best existing bottom-up saliency model according to a recent comparative benchmark of 35 saliency models (Borji & Itti, 2013). This is particularly important since using weaker models can sometimes reverse the conclusions of a study (see for example (Borji, Dicky N Sihite, & Itti, 2013)). Yet, like all models, this saliency model is only a coarse approximation to human saliency. Second, most saliency models analyze the image along a number of feature dimensions, some of which can be fairly complex (e.g., face detection, text detection). Because saliency modeling is a mature field and these models are complex, it would have been quite surprising if gaze direction alone (in the form of our simple gaze map) had surpassed complex and sophisticated saliency models, which integrate many known cues that attract attention. However, our results point towards a possible future synergy between gaze direction models and saliency models. Our analysis highlights an important concept: predicting fixations is different from predicting directions. Gaze can be viewed as something like a face or text channel, which alone cannot explain a large number of fixations, but which, when combined with bottom-up saliency (assuming that both are reliable and make rare mistakes), can enhance the prediction power of a model (i.e., biasing saliency in one direction).

Chapter 4
Modeling gaze following, head pose, and saliency

In studies, perceived gaze facilitates covert attention in the direction of gaze, even when the gaze cue is not informative for the task (Friesen & Kingstone, 1998; Driver et al., 1999). [Footnote 1: This chapter has been published as (Parks, Borji, & Itti, 2014).] Perceived gaze has also been shown to facilitate overt attention if the gaze is in the same direction as the task direction, and to hamper it if the gaze is not (Hietanen, 1999; Langton & Bruce, 1999; Ricciardelli, Bricolo, Aglioti, & Chelazzi, 2002; Kuhn & Kingstone, 2009). This work suggests that gaze following cues are automatically processed, regardless of the current task, and are not under the explicit control of the participant. As a result, a complete model of bottom-up attention should include gaze cueing as well as the more standard simple features (orientations, color, intensity) and face detections used in current models (Cerf et al., 2008). The automatic nature of the gaze following cue makes it amenable to modeling like the other bottom-up, automatic cues that have been modeled in the past (color, motion, orientation, heads, etc.) via saliency modeling. First, we will attempt to quantify both the gaze following and head pose cues, parameterized by the gaze angle. We will then investigate how to combine the different cues.

4.1 Learning head pose and gaze following spatial probability maps
In order to model head fixations, both those that land on the head and those that leave the head, we parameterized these fixations based on the 2D head pose angle and the head size. This resulted in two probability maps of those fixations over the training set.
First, the 2D rotation angle between the current head pose angle and a reference head pose angle in the image plane was determined, and all saccade vectors leaving the head were rotated by that angle. The saccade vectors were then normalized by the size of the head. Figure 4.1A shows the probability map for fixations that leave the head (i.e., the average gaze following map) extracted from the training set. Note that since this is normalized by head pose and not by final eye gaze, the uncertainty of the final eye gaze given head pose is implicitly taken into account in this probability map. The second map that was created was for fixations that land on the head. Again, the saccade vectors were rotated based on the head pose angle, and then the length of the vectors was normalized by the head size. As shown in Figure 4.1B, the head fixations had a high density in the eye region, with low density in the hair and chin regions.

For each head in the image, the gaze and head probability maps can be generated using the original image, the learned head pose and gaze following maps, and a set of head pose detections. For each head detection, each map is translated and scaled based on the position and size of the head pose detection. The maps are then rotated about the center of the head based on the head pose angle, as shown in Figure 4.1C. The model uses three components: the gaze following probability map, the head probability map, and the saliency map. The first two were discussed above. The saliency map was generated using the adaptive whitening saliency (AWS) algorithm (Garcia-Diaz, Fdez-Vidal, Pardo, & Dosil, 2012). AWS was chosen as the saliency benchmark due to its performance in a recent review of saliency models (Borji et al., 2013), and because it does not try to model higher-level concepts, such as human faces or objects, and so is suitable for serving as a proxy of low-level attention. The gaze following probability and head probability maps both used annotated head regions and head pose angles to place, scale, and rotate the learned probability maps from Figure 4.1A and Figure 4.1B, respectively, which are then combined to form pose maps. Figure 4.4 shows what these combined pose maps look like for some sample images. Since both human faces and gaze direction have been shown to influence eye movements, a means to integrate these higher-level cues with saliency is needed.

Figure 4.1: (A) Average fixation probability after leaving a normalized head over the whole dataset, where all head poses were rotated to point to the right (white arrow) and sized to a nominal head size (black oval). (B) Average fixation probability for all head fixations over the whole dataset, where all head poses were rotated to point to the right and sized to the nominal head size. (C) An example image to which head pose detections were applied. The gaze following and head region maps are translated, scaled, and rotated according to the head position, head size, and head pose image angle, respectively. This is done for each head detection.
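A sketch of how a learned, head-normalized probability map can be placed into image coordinates for one head detection is given below, using scipy.ndimage for the scaling and rotation. The nominal head size of the canonical frame, the border handling, and the function name are our assumptions, and the sign of the rotation may need to be flipped depending on the image coordinate convention; this is an illustration of the translate/scale/rotate step, not the original implementation.

import numpy as np
from scipy import ndimage

def place_head_map(learned_map, image_shape, head_center_xy, head_size_px,
                   pose_angle_deg, nominal_head_px=100):
    """Place a learned, head-normalized probability map into image coordinates.

    learned_map: map learned in a canonical frame (head at the map center,
                 pose pointing to the right, head at nominal_head_px pixels).
    The map is scaled by the detected head size, rotated by the in-plane pose
    angle, and translated so its center sits on the detected head.
    """
    scale = head_size_px / float(nominal_head_px)
    m = ndimage.zoom(learned_map, scale, order=1)
    m = ndimage.rotate(m, pose_angle_deg, reshape=True, order=1)

    out = np.zeros(image_shape, dtype=float)
    cy, cx = np.array(m.shape) // 2
    x0 = int(round(head_center_xy[0])) - cx
    y0 = int(round(head_center_xy[1])) - cy
    # paste the transformed map, clipping at the image borders
    ys = slice(max(0, y0), min(image_shape[0], y0 + m.shape[0]))
    xs = slice(max(0, x0), min(image_shape[1], x0 + m.shape[1]))
    mys = slice(ys.start - y0, ys.start - y0 + (ys.stop - ys.start))
    mxs = slice(xs.start - x0, xs.start - x0 + (xs.stop - xs.start))
    out[ys, xs] = m[mys, mxs]
    return out / out.sum() if out.sum() > 0 else out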
4.2 Conditioning cue combination on type of fixation location
The classic method of predicting eye fixations is to analyze the stimulus using a battery of bottom-up cues and combine them into a fixed map, which can then be compared to a human fixation map. When looking at fixed displays of low-level cues, this method has been shown to work very well. However, it ignores all temporal ordering in the process. Certain cues, like gaze following, only make sense when an observer is fixated on a head. If the gaze cue were applied blindly to all saccades, even to those from other heads or from far-off regions in the scene, it would likely only degrade performance. This points to an even deeper question: what does the current fixation location tell us about the behavior of the observer?

Ever since (Yarbus, 1967), it has been argued that the scan path of saccades, and not just the collection of fixations, is important to understanding eye movements; however, most saliency models have focused on fixed maps or an inhibition scheme to predict a series of saccades. It was already known from eye tracking work that head fixations are frequently followed by other head fixations, and non-head fixations are frequently followed by non-head fixations, suggesting that this affinity might reflect a change in implicit task on the part of the observer. Here we propose that the choice of cue to follow when making saccades is a window into the implicit task of the observer. To take advantage of this implicit task behavior, a gaze-contingent model was constructed, one in which the last fixation location is provided. We argue that this is more suitable in the quest to achieve parity with the predictive power of other human observers' fixations (i.e., an inter-observer model), which have "models" (i.e., humans) with full knowledge of their own saccade history.

A portion of this inherent temporal structure can be captured by a discrete-time Markov chain formulation (Norris, 1998) with two states, head (H) and non-head (N). Unlike an implicit hidden Markov model (Rabiner & Juang, 1986), the states in this model are known and the meaning of the weights is easily understood. In addition to offering predictive power, this simple structure provides an understanding of how the different cues are combined. From here on, this model will be referred to as the Dynamic Weighting of Cues model (DWOC). The following notation will be used for saccades: sac_origin->destination, with saccades originating from the head written sac_head-> and those from other regions written sac_nonhead->. From these two regions, there is a certain probability that participants will saccade to a head, p(sac_head), to a point gazed at by an actor in the scene (i.e., following the actor's gaze), p(sac_gaze), or to a salient point, p(sac_salient). These probabilities express how likely an observer is to follow each particular cue. The possible transitions are shown in Figure 4.2.

Figure 4.2: Discrete-time Markov chain formulation of the saccadic eye movements of an observer performing a free-viewing task. From the head state, saccades can transition to a head, to a salient location, or to a gazed-at location (sac_head->head, sac_head->saliency, sac_head->gaze); from the non-head state, they can transition to a head or to a salient location (sac_nonhead->head, sac_nonhead->saliency).

4.3 Learned transition probabilities
To integrate head, gaze, and saliency information, the probability with which a participant transitions between regions of the image, namely head regions and other regions, was used to define how likely an observer is to follow each cue. For tractability of calculation, the possible endpoint maps were treated as non-overlapping and exhaustive sets: p(sac_all) = p(sac_head) + p(sac_gaze) + p(sac_salient) = 1.
Of course, these are not completely disparate sets with no overlap, nor are they likely to be exhaustive, which is a source of error in the model. The conjunctions of each of these components could also be modeled, but it would then be hard to ascribe the relative contributions of the components. These transition probabilities were learned on the training set separately for transitions from the head, p(sac_head->all), and from outside the head, p(sac_nonhead->all). This was done to distinguish fixations that are presumably more gaze focused from the rest of the fixations. The transition probabilities were treated as weights on the corresponding maps for each component. The learned transition probabilities are shown in Figure 4.3. Saccades going to any head were scored as a head transition. For saccades leaving a head, the value of the gaze following map at the saccade destination was compared to the value of the saliency map at that point; if the value in the gaze map was higher, the saccade was considered a gaze transition. In all other cases, it was scored as a saliency transition.

The transition probabilities try to capture the relative importance that a participant assigns to a particular cue, given their current fixation type (head or non-head). Many cognitive processes can be involved in modulating these probabilities, which are not modeled here, and the probabilities are likely to depend on the subjects' own biases and on the dataset. However, they can be extracted without being privy to the task of the participant, while still allowing one to utilize the current fixation behavior at a more abstract level than simple spatial position. This strategy could be expanded to other cue types, for instance text, where the simple act of looking at the text could provide information as to which cues are currently important to a subject.

Transition probabilities were learned for several systems, involving the following cues: head detections (H), head pose (P), gaze following (G), and saliency (S). The transition probabilities for heads and head pose were considered interchangeable, with only the map changing and not the weight. One 3-component model (PSG) was learned, along with several 2-component models (PG, HS, PS). Figure 4.3 shows the learned weights for the HSG (or PSG) model as well as the HS (or PS) model. It shows that gaze and head information are stronger for saccades starting from a head, while saliency dominates in the non-head case. This is intuitive, as head and gaze seem more relevant when one is already looking at a head. It is also useful to note that gaze following is much less of a factor than head or saliency information, and is more of a second-order effect. Thus, only a small gain in fixation prediction accuracy is likely to result from taking gaze into consideration in the final model; however, these kinds of second-order effects need to be addressed to further improve upon the already very good fixation prediction abilities of state-of-the-art models.
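The scoring rule above can be summarized in the following sketch, which labels each training saccade and turns the label counts into the state-dependent transition probabilities. Function names are ours; the head mask, gaze map, and saliency map are assumed to be given in image coordinates, and this is an illustration rather than the original code.

import numpy as np

def score_transition(starts_on_head, end_xy, head_mask, gaze_map, sal_map):
    """Label one saccade as a 'head', 'gaze', or 'saliency' transition.

    Any saccade landing on an annotated head is a head transition; otherwise,
    for saccades that leave a head, it is a gaze transition if the gaze map
    value at the endpoint exceeds the saliency value there; everything else
    is a saliency transition.
    """
    x, y = int(end_xy[0]), int(end_xy[1])
    if head_mask[y, x]:
        return 'head'
    if starts_on_head and gaze_map[y, x] > sal_map[y, x]:
        return 'gaze'
    return 'saliency'

def transition_probabilities(labels_by_state):
    """labels_by_state: {'head': [...], 'nonhead': [...]} lists of labels.

    Normalized counts give the state-dependent cue weights used by DWOC.
    """
    cues = ['head', 'gaze', 'saliency']
    probs = {}
    for state, labels in labels_by_state.items():
        counts = np.array([labels.count(c) for c in cues], dtype=float)
        probs[state] = dict(zip(cues, counts / counts.sum()))
    return probs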
The gaze following, head pose, and saliency maps were all made into probability maps (i.e., summing to 1). The transition probabilities for a particular model were used to weight the respective maps, which were then summed to create the final maps. For instance, maps for the DWOC_PSG model are shown in the second column from the right in Figure 4.4. This model was weighted using the following equation:

DWOC_PSG_map = p(sac_head) * cue_head + p(sac_salient) * cue_salient + p(sac_gaze) * cue_gaze    (4.1)

For comparison purposes, several simpler 2-component systems were constructed, which used only two of the cues. The DWOC_HS baseline system takes head detections and bottom-up saliency into account without head pose or gaze information. The DWOC_PG system models how well head pose and gaze can perform alone. The DWOC_PS model is an intermediate model that shows what adding the head pose cue, but not the gaze following cue, can achieve. After the transition probabilities were learned, the maps were combined to form the combined probability maps using the following equations:

DWOC_PG_map = p(sac_head) * cue_head_pose + p(sac_gaze) * cue_gaze    (4.2)
DWOC_HS_map = p(sac_head) * cue_head_det + p(sac_salient) * cue_salient    (4.3)
DWOC_PS_map = p(sac_head) * cue_head_pose + p(sac_salient) * cue_salient    (4.4)

For all of these models, the weights change depending on the origin of the saccade (sac_head-> or sac_nonhead->).

Figure 4.3: Transition probabilities when saccading to head regions, gaze regions, or salient regions, starting from a head region (n = 25,841) or from a non-head region (n = 152,222). Note that saliency is much stronger when originating from a non-head region. Head detection (H) and head pose (P) cues are considered to have interchangeable transition probabilities. 5th and 95th percentile confidence intervals are shown across the cross-validation folds.

Figure 4.4: Sample images with their corresponding Head+Gaze, saliency, and final DWOC_PSG maps, along with maps of all fixations. The combined DWOC_PSG map is shown with the weights from the Head (H) state. Note that the maps in the Head+Gaze column are given equal weights for illustration purposes, but all DWOC maps, including the final DWOC_PSG column, use the learned weights.
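Equations 4.1 to 4.4 amount to a state-dependent weighted sum of normalized cue maps, which the sketch below illustrates. Dropping a cue and renormalizing the remaining weights is our simplification for the 2-component cases, whereas in the actual model the transition probabilities were re-learned separately for each cue set.

import numpy as np

def dwoc_map(cue_maps, weights, origin_state):
    """Combine cue probability maps with state-dependent weights (Eqs. 4.1-4.4).

    cue_maps: dict of 2D maps already normalized to pdfs, e.g.
              {'head': head_pose_map, 'gaze': gaze_map, 'saliency': sal_map}.
    weights:  dict of the form {'head': {...}, 'nonhead': {...}}, e.g. the
              output of transition_probabilities() above.
    origin_state: 'head' if the current fixation is on a head, else 'nonhead'.
    """
    w = {c: weights[origin_state].get(c, 0.0) for c in cue_maps}
    total = sum(w.values())
    combined = sum((w[c] / total) * cue_maps[c] for c in cue_maps)
    return combined / combined.sum()

# Usage sketch: the map used to predict the next fixation depends on where the
# current fixation is, e.g.
# pred = dwoc_map({'head': pose_map, 'gaze': gaze_map, 'saliency': sal_map},
#                 weights, origin_state='head')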
4.4 Model prediction results using annotations
Figure 4.5A shows the AUC performance for gaze, head detections, head pose, saliency, DWOC_PG, DWOC_HS, DWOC_PS, and DWOC_PSG, as well as an inter-observer model, for fixations originating from the head. 5th and 95th percentile confidence intervals over the cross-validation folds are shown for all AUC data. The gaze cue is the learned map of gaze following, while the head cue is just the actual head detections. The DWOC_HS model can be viewed as an updated version of Cerf's saliency model with heads (Cerf et al., 2008), and provides our baseline of a state-of-the-art system that includes head detection but does not include head pose or gaze following information.

The Wilcoxon signed rank test was used in all of the following comparisons, which have the same p-value because the signed rank significance value is determined by the number of comparisons and the number of times that each comparison has a certain rank, and for all model comparisons here the winning model won on every fold. For from-head fixations: Gaze < Head Detections (p<0.002), Head Detections < Head Pose (p<0.002), Saliency < DWOC_PG (p<0.002), DWOC_PG < DWOC_HS (p<0.002), DWOC_HS < DWOC_PS (p<0.002), DWOC_PS < DWOC_PSG (p<0.002), and DWOC_PSG < Inter-observer (p<0.002). [Footnote 2: The inter-observer model is a map built from the fixations of other observers on the same image seen by an observer.]

Note that the DWOC_PG head pose and gaze model outperforms bottom-up saliency when looking at from-head fixations. There were 184,061 total fixations in the data, with 53,928 or 29.3% of the fixations originating from a head. As a result, the effect of the improvement due to the from-head fixations is muted in the overall data. When looking at all fixations, as in Figure 4.5B, the difference is much smaller, and the AWS saliency model does markedly better by itself. It is also interesting to note that the inter-observer model is more predictive for the saccades originating from the head, implying that they are more stereotyped than the remaining saccades. For all fixations, the DWOC_PSG model outperforms the DWOC_PS model (p<0.002), which outperforms the DWOC_HS model (p<0.002). The cues that are significantly different: Gaze < Head Pose (p<0.002), Head Detections < Head Pose (p<0.002), Head Pose < Saliency (p<0.002), DWOC_PG < Saliency (p<0.002), DWOC_PG < DWOC_HS (p<0.002), DWOC_HS < DWOC_PS (p<0.002), DWOC_HS < DWOC_PSG (p<0.002), and DWOC_PSG < Inter-observer (p<0.002).

As we can see from these results, the head pose and gaze model DWOC_PG works better for from-head fixations, while saliency works better for the other fixations. This validates the use of different weights for these maps when starting from the head versus other regions. Head pose by itself also performs better when starting from the head, but does not account for all of the performance improvement. Figures 4.5C and 4.5D show the per-image AUC performance of the DWOC_HS and DWOC_PSG models for from-head fixations and all fixations, respectively. Looking at the data this way, the pose information improves fixation prediction when fixations start from the head, and this is enough to improve the performance on all fixations (DWOC_HS < DWOC_PSG, p<2e-31, for fixations from the head, and DWOC_HS < DWOC_PSG, p<4e-32, for all fixations). In 185 of the 200 images, AUC performance for both from-head fixations and all fixations improved when using head pose and gaze following. For from-head fixations, this improvement on a per-image basis was, on average, 2.1%.

To determine where the combined model and bottom-up saliency diverge, we also plot the performance of saliency versus DWOC_PSG in Figure 4.6 on a per-image basis. Image 1 is one where the saliency map does not pick up saccades in the direction of gaze. Image 2 shows where the pose model is aided by the gaze following from the human to the dog, although one could argue that non-human faces should have been labeled to account for this. Image 3 shows where the head pose information complements low-level saliency because the heads are not found to be salient. In image 4, both saliency and DWOC_PSG perform poorly, and it is interesting to note that both models miss another cue, the text on the stomachs of the women. Image 5 is one where both models perform well, as the heads are already salient given low-level statistical information. Image 6 is one where the saliency model picks up the major attended areas, but also many more unattended regions. Overall, the combined model provides a small but consistent improvement over all fixations.
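The per-fold model comparisons reported in this section use the Wilcoxon signed rank test on paired AUC values; a minimal, illustrative sketch using scipy is shown below (not the original analysis code).

from scipy.stats import wilcoxon

def compare_models_across_folds(auc_model_a, auc_model_b):
    """Paired Wilcoxon signed rank test on per-fold AUC values.

    auc_model_a, auc_model_b: AUCs of two models on the same cross-validation
    folds (one value per fold). Returns the two-sided p-value. When one model
    wins on every fold, the p-value depends only on the number of folds, which
    is why many of the comparisons above share the same p-value.
    """
    stat, p = wilcoxon(auc_model_a, auc_model_b)
    return p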
Figure 4.5: Performance of each component using ground truth annotations. AUC performance on saccades originating (A) from the head and (B) all saccades. An inter-observer model is shown to provide a performance ceiling. 5th and 95th percentile confidence intervals are shown. There is also a per-image AUC comparison between the DWOC_HS model and the DWOC_PSG model for (C) from-head saccades and (D) all saccades (185/200 = 92.5% of images above the diagonal, p < 1e-30, in both cases).

Figure 4.6: (A) Scatter plot showing the per-image AUC performance of saliency versus the DWOC_PSG model. (B) Sample points from the scatter plot, with images and their corresponding maps shown for illustration. The combined DWOC_PSG map is shown using the weights from the Head (H) state.

4.5 Head pose detection model
We established that, given the ground truth head pose, head pose and gaze following information can improve state-of-the-art saliency algorithms. We also found that saccades originating from the head were better predicted by this pose information than saccades originating elsewhere, which were better predicted by low-level saliency. However, in order to improve the applicability of the model, it is useful to remove the need for manual annotations, which are expensive and time consuming. As before with the annotations, we need the detection model to generate head pose polygons and image-plane head pose angles for each head in order to generate the maps for an image.

When determining the 3D angle to which a head is oriented, the pose angles are defined egocentrically, as shown in Figure 4.7A. The yaw and pitch of the head are important for determining the 2D image angle. The roll of the head simply rotates the head about the 2D image angle formed by the yaw and the pitch. Although large roll angles will drastically change the view of the head pose and can make it difficult to accurately extract the correct yaw and pitch angles, this effect is ignored. The Zhu & Ramanan (2012) model that we used to detect heads and extract head pose only provides a yaw estimate. Figure 4.7B shows the basic structure of their model. Local parts of the face (68 landmarks for -45 to 45 degree frontal faces; 39 landmarks for side views) are detected using mixture-of-trees part detectors (similar to the detectors used in (Felzenszwalb et al., 2008)), and the relative locations of the detected parts create deformations in the larger pose model at a certain cost. This model is trained for heads that are at least 80x80 pixels, and given the number of landmarks used, this is a limiting constraint. Further, the yaw angle is limited to -90 to 90 degrees, where 0 degrees is facing the camera, since the landmarks are exclusive to the face and jawline. To add a pitch estimate, a set of random binary ferns (Ozuysal, Calonder, Lepetit, & Fua, 2010) was learned over the part detections. Each fern randomly samples a fixed set (n=3) of binary comparisons.
Randomly selected head model parts are compared in either their X or Y relative positions, as shown in Figure 4.7C, yielding a binary 1 if (green) > (red) and 0 otherwise. This is done for a fixed set of comparisons, n, which is the size of the fern in bits. These fern bits form a value f_val (e.g., a 2-bit fern with both bits set gives f_val = 3).

Figure 4.7: (A) [Modified from (Murphy-Chutorian & Trivedi, 2009)] The 3D head pose angle of a head can be defined egocentrically using roll, pitch, and yaw. (B) The head pose model (Zhu & Ramanan, 2012) has 146 shared local parts across 13 learned poses, with learned deformation costs (in red) between neighboring parts. The model only defines head pose in the yaw angle. (C) Sample fern comparison for each bit. The relative X or Y position of the green part, g, is compared with the corresponding position of the red part, r, with g > r giving 1 and 0 otherwise.

Over the training images, a histogram is created for each possible value of the fern, f_val (a 2-bit fern has 2^2 = 4 possible values). Each histogram has bins for each possible ground truth pitch angle category, ang_cat. Each time a model evaluates to a particular fern value f_val, the histogram for that fern value is selected, and the ground truth angle bin ang_cat is incremented. During testing, the histogram for the observed fern value f_val is extracted, and these proportions are summed across all ferns (n = 500). The angle bin with the maximum value is then used to predict the pitch angle. The random ferns were trained using unused images from Zhu & Ramanan's AFW dataset ((Zhu & Ramanan, 2012); n = 185 images) with 3 ang_cat categories (22.5°, 0°, and -22.5°), and were only trained on heads with yaw between -45° and 45°.

The combined yaw and pitch angle estimates provided the 3D head pose angle. These 3D head pose angles were first converted to 2D image head pose angles using a simple orthographic projection. Alternatively, the camera parameters could be learned and a perspective projection used to obtain a more accurate angle estimate. The X and Y extrema of the center points of all of the detected local parts provided the bounding polygon for the head as well. With the 2D head pose angle and head polygon, the system was then run in the same manner as when using ground truth. Sample detections of the head pose estimation system are shown in Figure 4.8. The system generates a yaw and pitch angle in addition to a confidence score. The performance of the system over the 10 folds of the Flickr set is shown in Table 4.1. F1 is simply the harmonic mean of precision P and recall R, F1 = 2PR/(P + R), and is bounded by 0 and 1, inclusive. [Footnote 3: Precision is defined as P = TP/(TP + FP), and recall is defined as R = TP/(TP + FN), where TP is the number of true positives, FP false positives, and FN false negatives.] With this metric, precision and recall are equally weighted in importance, and a value of 1 indicates perfect recall and precision.

Table 4.1: Head Pose Detection: Component Performance
                     Mean F1   Std F1   Chance
  Face Detection     0.75      0.05     -
  Yaw Detection      0.82      0.07     0.33
  Pitch Detection    0.56      0.07     0.33

Note that the head detection F1 performance has no real chance lower bound, but the angle detections were only scored on correct head detections, with coarse bins of left, straight, and right (60°, 0°, and -60°) for the yaw angle, and up, level, and down (22.5°, 0°, and -22.5°) for the pitch angle. Therefore, chance for both angles is 33.3%.
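The random-fern pitch estimator can be sketched as follows; this is an illustrative re-implementation with our own class and parameter names, not the code used for the reported results, and it assumes the detected part locations are provided as an array of (x, y) positions.

import numpy as np

class RandomFernPitch:
    """Minimal random-fern pitch classifier over detected part locations.

    Each fern holds n_bits random comparisons (part_a vs. part_b on X or Y);
    the resulting bit string indexes a histogram over pitch categories.
    Training accumulates counts; testing sums the normalized histograms
    across ferns and returns the argmax category.
    """
    def __init__(self, n_parts, n_cats, n_ferns=500, n_bits=3, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.pairs = rng.integers(0, n_parts, size=(n_ferns, n_bits, 2))
        self.axes = rng.integers(0, 2, size=(n_ferns, n_bits))   # 0 = X, 1 = Y
        self.hist = np.zeros((n_ferns, 2 ** n_bits, n_cats))

    def _values(self, parts):
        # parts: (n_parts, 2) array of part positions
        a = parts[self.pairs[..., 0], self.axes]     # (n_ferns, n_bits)
        b = parts[self.pairs[..., 1], self.axes]
        bits = (a > b).astype(int)                   # 1 if first part > second
        weights = 2 ** np.arange(bits.shape[1])
        return bits @ weights                        # one fern value per fern

    def train(self, parts, cat):
        # increment the ground-truth angle bin of each fern's histogram
        self.hist[np.arange(len(self.hist)), self._values(parts), cat] += 1

    def predict(self, parts):
        h = self.hist[np.arange(len(self.hist)), self._values(parts)]
        h = h / np.maximum(h.sum(axis=1, keepdims=True), 1)   # proportions
        return int(np.argmax(h.sum(axis=0)))                  # summed over ferns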
Figure 4.8: Sample head pose model detections. Y is the yaw angle, with 0° pointing out of the image and positive values moving to the left from the head's perspective. P is the pitch angle, with 0° being level and positive values looking up. Both are in degrees. The confidence of the detection is shown as C, an unbounded number with higher values indicating more confidence. The detector was thresholded at -0.75.

4.6 Detection based fixation prediction results
Using the same cross-validation folds as with the annotations, the head pose model was run over the images in the test fold, providing head polygons and 2D pose angles, which are used in the same manner as the ground truth head pose: the per-head head, head pose, and gaze following maps were rotated by the pose angle, scaled relative to the actual head detection (based on the mean scale over image dimensions), and applied to the appropriate map. Figure 4.9A shows the AUC performance when using head pose detections from the model instead of the ground truth annotations when predicting only fixations from the head. Again, 5th and 95th percentile confidence intervals are shown for all AUC data. For from-head fixations, significance was determined using the Wilcoxon signed rank test: Head Detections < Gaze (p<0.002), Head Detections < Head Pose (p<0.002), Gaze < Head Pose (p<0.002), Saliency < DWOC_PG (p<0.002), DWOC_PG < DWOC_HS (p<0.002), DWOC_HS < DWOC_PS (p<0.002), DWOC_PS < DWOC_PSG (p<0.002), and DWOC_PSG < Inter-observer (p<0.002).

When looking at all fixations, as in Figure 4.9B, the difference is again smaller, and the saliency model alone accounts for most of the performance. Still, the differences were significant. For all fixations, the DWOC_PSG model outperforms the DWOC_PS model (p<0.002), which outperforms the DWOC_HS model (p<0.002). The cues that are significantly different: Head Pose < Gaze (p<0.002), Head Detections < Head Pose (p<0.002), Head Pose < Saliency (p<0.002), DWOC_PG < Saliency (p<0.002), DWOC_PG < DWOC_HS (p<0.002), DWOC_HS < DWOC_PS (p<0.002), DWOC_HS < DWOC_PSG (p<0.002), and DWOC_PSG < Inter-observer (p<0.002). The drop in the combined DWOC_PSG model's AUC for from-head fixations, from a mean of 0.913 to 0.886 when changing from the ground truth head pose angles of section 4.4 to the automatic head pose estimates of section 4.6, shows the deficit due to detection performance.

Figure 4.9: Performance of each component using head pose detections. AUC performance on saccades originating (A) from the head and (B) all saccades. An inter-observer model is shown to provide a performance ceiling. 5th and 95th percentile confidence intervals are shown.
4.7 Discussion
We showed that head pose and gaze following information alone can contribute to fixation prediction, especially for saccades originating from head regions, where it outperforms purely bottom-up saliency. Head pose detections were shown to be a fairly good proxy for the ground truth as well. The learned gaze map performed better than the cone map used in previous work, but it is probably limited to images and video, since an environment in which the observer and the actor are both present would have no image-plane limitation (e.g., the observer could see an actor looking off into the distance, and could turn around to determine whether the gazed-at entity is behind them). While most saliency models use a static map, our results validate our approach, in which a different combination of cues with different weights is used depending on the origin of a saccade. We believe that this kind of gaze-contingent modeling is a promising direction to further bridge the now relatively small gap that remains between saliency models and inter-observer predictions.

There are, however, opportunities for further improvement. Integrating the final eye gaze direction should help, as the head pose does not always match the final eye gaze direction. This is actually implicitly learned in the gaze following map, since we only aligned the data for head pose and not for final eye gaze; however, the averaging that takes place is likely to reduce performance. Also, note that automatic, reliable eye direction detection models do not yet exist in computer vision, especially over complex natural scenes, due to the high variability of eyes in scenes. Our model can, however, be easily extended by adding the final gaze direction results of more accurate models in the future. In the prior chapter, it was shown that when the direction of other faces was incongruent with the gaze direction, gaze following was weaker. It was also weaker, but less so, when no other faces were present. This implies that gaze following is especially useful when predicting fixations in social scenes.

There are several other factors that we did not model that could further contribute to fixation prediction. Text detection in the wild (e.g., (Meng & Song, 2012)) could extract a cue that has been shown to be useful in fixation prediction (Cerf et al., 2009). The presumed semantic relevance of an object to an actor could also potentially be used; for instance, if an actor were holding a knife, a jar of peanut butter would likely be more semantically relevant than a book. The facial expression of a face could also differentially drive fixations and could be evaluated. Alternatively, threatening objects (gun, knife, etc.) and high-value objects (money, jewelry) could be evaluated. Quantifying these concepts would be difficult, but they could account for some portion of the remaining performance.

Improving the accuracy of gaze direction prediction and saliency models can be useful in several engineering applications, for example in computer vision (e.g., action recognition, scene understanding in videos (Marin-Jimenez, Zisserman, Eichner, & Ferrari, 2014), reading the intentions of people in scenes (Yun, Peng, Samaras, Zelinsky, & Berg, 2013), attentive user interfaces), human-computer and human-robot interaction
(e.g., (M. W. Hoffman, Grimes, Shon, & Rao, 2006; Lungarella, Metta, Pfeifer, & Sandini, 2003; Nagai, Asada, & Hosoda, 2002; Breazeal & Scassellati, 2002; Bakeman & Adamson, 1984)), determining the attention levels of a driver (e.g., (Murphy-Chutorian, Doshi, & Trivedi, 2007)), and enriching e-learning systems (e.g., (Asteriadis, Karpouzis, & Kollias, 2013)). Such models can also be useful in scientific research to study psychological disorders and diagnose patients with mental illness (e.g., anxiety and depression (Compton, 2003; Horley, Williams, Gonsalvez, & Gordon, 2004; Kupfer & Foster, 1972), schizophrenia (Franck et al., 2002; Langton, 2000), and autism (Klin, Lin, Gorrindo, Ramsay, & Jones, 2009; Fletcher-Watson, Leekam, Benson, Frank, & Findlay, 2009)). Eye movements can also be used to help build assistive technology for the purpose of ASD diagnosis and monitoring of social development, as was done by (Ye et al., 2012), where they were used to detect eye contact preference. As another example, (Alghowinem, Goecke, Wagner, Parker, & Breakspear, 2013) demonstrated that eye movements can be used as a means of depression detection. Several works have built models to distinguish patient populations from controls, such as visual agnosics (T. Foulsham, J.J., Kingstone, Dewhurst, & G., 2009) as well as ADHD, FASD, and Parkinson's disease (Tseng et al., 2012). Our proposed model can be used to distinguish patient populations by looking at how each group differs in the relative weighting of the component cues. For instance, one might speculate that individuals with ASD would exhibit significantly different transition probabilities than the TD participants in the present study. Further, we can tailor our model to these specific populations to better predict each population's eye movements. The model can also be used to optimize the stimuli presented to the different populations. We address these issues in the next chapter.

4.8 Conclusions
We proposed a combined head pose estimation and low-level saliency model that outperforms other models that do not take head pose and gaze following information into account. We showed earlier that this information causally predicts fixations and is a well-known element of human social understanding. Automatic head pose estimation from a single image was also incorporated into the model, allowing the system to be run directly. Our model formulates human saccade movement as a two-state Markov chain that can be viewed as a dichotomy between two states (head and non-head) with different cue priorities. It extracts the transition probabilities between these two states automatically, and the learned weights show a preference for head- and gaze-related fixations when originating on a head, while being more saliency driven when originating elsewhere. This is intuitive, and we see this as a step towards a more dynamic understanding of the eye movement behavior of individuals free-viewing natural scenes, beyond fixed maps of fixation predictions. In many cases, the cognitive agenda and current drives of an individual are not known; however, the learned weights of our model can give us insight into the cognitive biases of subjects when analyzing the differences between groups, cultures, and genders. Other cues that were not addressed here, such as text and motion, can also be integrated into this model to further enhance fixation prediction performance.
Chapter 5
Using DWOC to evaluate autism spectrum disorder (ASD) eye movement behavior

Because the DWOC model can capture some proportion of the cognitive biases reflected in the saccade preferences of observers viewing images or video with actors present (e.g., social scenes), it makes sense to investigate populations that are known to show differences in this domain. Here we investigate individuals with ASD, but the model should apply to any other population in which facial or gaze following behaviors deviate from those of the average population.

Autism spectrum disorders are neurodevelopmental disorders, involving deficits in social interaction, communication, and behavior, that lie along a range of severity (Caronna, Milunsky, & Tager-Flusberg, 2008). Individuals with ASD can exhibit a dramatic range of severity in all three of these categories of impairment. Clinical assessments strive to identify children with ASD by age 2, using a set of behavioral and developmental milestones from 6 to 36 months of age. The prevalence of ASD is fairly high, and has been estimated at anywhere from 0.4% to 1%. The prevalence has also been growing over the past two decades, at least partially due to changes in diagnostic practices (King & Bearman, 2009).

As was discussed before, Baron-Cohen and others have shown (Baron-Cohen et al., 1995) that children with ASD were able to correctly determine whether a cartoon face was looking at them, and to extend an arrow in space in the direction of the cartoon face's gaze, as well as TD children could. This indicates that they are able to see and process eye gaze to a certain extent, although they did not use this knowledge to assign internal beliefs to other people (the candy study; see Figure 2.4). TD children and adults exhibit a bias towards looking at the eyes, which is corroborated by the head fixation heat maps shown in Chapter 3, which are dominated by the eye region. Participants with ASD, however, do not exhibit this preference to the same degree (Klin, Jones, Schultz, Volkmar, & Cohen, 2002; Grice et al., 2005; W. Jones, Carr, & Klin, 2008; Riby, Doherty-Sneddon, & Bruce, 2009). It has also been found that participants with ASD fixate other regions of the head, like the mouth, to a similar degree as TD participants (Dalton et al., 2005). Recently, (W. Jones & Klin, 2013) have shown that the deficit in eye contact that is known to be present in individuals with ASD does not develop until between 2 and 6 months of age. Jones and Klin used the eye fixations of a population of infants who were at risk of developing ASD, and were able to dissociate those infants who would go on to be diagnosed with ASD from those who would not, using just the percentage of fixation time devoted to actors' eyes. Other differences between ASD and TD individuals, such as differences in mouth fixations, have been inconsistent in the literature, both within and across age ranges, as is well reviewed in (Fedor et al., 2015).

Several groups have studied the relative viewing behaviors of ASD and TD subjects with respect to top-down cues that were compared to saliency models. Freeth and others (Freeth, Tom Foulsham, & Chapman, 2011) compared head fixations versus salient fixations. They called these "social" versus "salient" fixations, a distinction which we will adopt here.
Quantification of the preference of subjects with ASD to fixate the mouth relative to saliency has also been studied (Neumann, Spezio, Piven, & Adolphs, 2006), where it was argued that saliency was consistent across the ASD and TD groups, but that ASD participants engaged in a different top-down strategy than TD participants, choosing to look at mouths. This result was reinforced by (Birmingham, Bischof, & Kingstone, 2009), who showed that saliency alone does not account for observer saccades preferentially targeting actors' eye positions. (Sasson, Turner-Brown, Holtzclaw, Lam, & Bodfish, 2008) analyzed differences in attention between ASD and TD participants and found that participants with ASD inspected a smaller number of items for a longer period when using either object stimuli or social stimuli, suggesting that the attentional differences extend across cues.

Given these data, it seems reasonable that gaze following could be a useful cue for dissociating ASD and TD participants. Indeed, (Fletcher-Watson et al., 2009) showed that adults with ASD did not view the face or background regions differently than TD adults, but that viewing times for the eye regions were significantly different. They also constructed a 30° cone based partially on eye gaze and partially on the possible range of eye positions. This showed a difference between the two groups, but had several issues. First, the cone that was used is similar to the one we used initially, which performed significantly worse than the learned gaze map, which had more of an arrowhead shape. Second, the cone had uniform probability over its extent, and so cannot account for the more normal-like distribution about the gaze direction that we found empirically (standard deviation 11.77°) (Borji et al., 2014), likely leading to a weaker result. Finally, saliency was not used to distinguish saccades that could otherwise be explained by saliency from more purely gaze-driven saccades.

In order to investigate the eye movement behavior of those with ASD, the DWOC model will be used to estimate the differences in cue preferences between ASD and TD participants, the differences in the learned maps of the models, as well as differences in the ability of the model to predict their fixations.

5.1 Gaze following and head region preference in ASD

It is known that children with ASD can correctly determine the gaze direction of an actor, so they are in principle capable of following gaze and head pose (see (Baron-Cohen et al., 1995)). Further, as was discussed, we found the standard deviation of gaze angle estimates for TD subjects to be 11.77°, which increased to 18.73° when the actor's eyes were masked (Borji et al., 2014), suggesting that head pose provides a fairly accurate assessment of gaze angle. Given this, and the DWOC performance on TD participants, we believe that actor head pose by itself is a fairly good proxy for actor gaze direction in predicting observer eye movements, at least in the Flickr dataset. It stands to reason that individuals with ASD might not need to look at the eyes to assess head pose, and could be capable of gaze following even without attending to the eyes. Nonetheless, given that ASD is known for discounting the importance of gaze-following cues, it is expected that the transition probability of following gaze and heads will be higher for TD participants than for participants with ASD. Further, it is anticipated that the to-head fixation probability map for subjects with ASD should have less of a peak around the eye region.
5.2 Cue preference experiment using Flickr dataset

As discussed, gaze following (Fletcher-Watson et al., 2009), head regions (Klin et al., 2002), and saliency (Sasson et al., 2008) are known to be differentially expressed in subjects with ASD. We would like to quantify the differential cue preferences between individuals with ASD and baseline individuals when attending to particular regions of the face, as well as their probability of following actor gaze, using our DWOC model. The model learns both the probability maps for each of these cues and the relative preferences between the cues, while taking into account the role of saliency, which prior work has not done. Further, we would like to determine how the model's fixation prediction performance changes when predicting subjects with ASD. Here we use the Flickr gaze-following dataset that we used earlier to compare and contrast subjects with ASD with a baseline set.

5.2.1 Stimuli

Stimuli consisted of the Flickr dataset that was discussed in Chapter 3. Each image contained at least one human head that was clearly visible and not staring directly into the camera. A diverse set of settings, social situations, and numbers of people were included.

5.2.2 Procedure

Participants sat 49.5 cm away from a 19 inch LCD monitor so that scenes subtended approximately 42° × 34° of visual angle. A chin rest was used to stabilize head movements. Stimuli were presented at 60 Hz at a resolution of 1280 × 1024 pixels. Eye movements were recorded via a non-invasive infrared Eyelink (SR Research, Osgoode, ON, Canada) eye-tracking device at a sample rate of 1000 Hz (spatial resolution less than 0.5°). Each image was shown for 8 seconds followed by 4 seconds of gray screen. The eye tracker was calibrated using a 5-point calibration at the beginning of each recording session. Observers viewed images in random order. Saccades were classified as events where eye velocity was greater than 35°/s and eye acceleration exceeded 9500°/s², as recommended by the manufacturer for the Eyelink-1000 device. Each participant viewed the images over two sessions of 100 images each, separated by a 5 minute break.

5.2.3 Observers

Our baseline set of participants (n=30) were students (mean age 19.46) at the University of Southern California, using data that we had previously collected (Borji et al., 2014). We collected novel eye movement data from 10 participants with ASD on this same dataset. The participants with ASD were evaluated using the Autism Diagnostic Observation Schedule (ADOS) (Lord et al., 2000), broken down by Communication, Social, Full Scale, and Calibrated Severity Score. These scales had means (standard deviations) of 3.8 (1.4), 6.6 (1.5), 10.4 (2.3), and 5.9 (1.5), respectively.

5.2.4 Quantification of fixation heatmap differences

Before using our model to estimate differences between the two populations, we would first like to quantify the differences, if any, in the fixation maps generated by the baseline and ASD populations when looking at the Flickr dataset. Here we compare maps for all fixations as well as maps of just the fixations that originate on an actor's head. Figure 5.1 shows sample stimuli along with saccade heatmaps for both the baseline and ASD populations.

To quantify the differences between the heatmaps, we used the Kullback-Leibler (KL) divergence, which measures how much one probability density function diverges from another. The saccade heat maps were normalized into probability maps (i.e., made to sum to 1), and the KL divergence was measured between them.
As can be seen in the figure, the KL divergence was usually higher when comparing participants with ASD to the baseline than when comparing a subset of the baseline participants to the remaining baseline participants. This was true for all saccades, as well as for saccades that originated from a head.

Figure 5.1: The first column shows sample stimuli from the Flickr dataset. The second and third columns show all saccade endpoints for each population (baseline and ASD) as a heatmap. The fourth and fifth columns show the heatmaps for just the saccades that start from a head in the image. Below each image is the KL divergence estimate. The baseline population was randomly divided in half, and the distance between the heatmaps from the two halves of the population is shown under the baseline heatmaps. The KL divergence for the ASD map was measured against one of these halves. The KL divergence mean for the entire Flickr dataset is shown at the bottom for each map type.

The DWOC model attempts to measure the relative saccade preferences of a population. Since there is a difference between the two groups based on a comparison of their fixation densities, we can use the DWOC model to try to understand these relative differences between the populations in terms of their relative cue preferences.

5.2.5 Gaze following and head fixation probability maps

First, we learned the heat maps for saccades that follow gaze and for saccades that land on an actor's head. These maps used all of the saccades from the training set (the remaining 9 folds in the 10-fold cross-validation scheme). Each saccade endpoint was rotated so that the head pose pointed to the right, and scaled from the actual head to a standardized head size, shown in blue in the maps. The heat maps for gaze following for both the original baseline subjects and the subjects with ASD are shown in the first row of Fig 5.2A-B. The difference between the two maps is shown in Fig 5.2C, where red values indicate relative strength in the ASD map, and green values indicate relative strength in the baseline map.

In order to test the significance of different regions of the gaze following maps, the maps were broken down into 24 sectors, as shown in Fig 5.2D, with 12 inner 30° sectors and 12 outer 30° sectors. The sectors were tested for significance at the P<0.05 level against the mean value of all regions, as shown in the second row of Fig 5.2A-B, with Bonferroni correction (24 comparisons) using 10-fold cross validation and the Wilcoxon signed rank test. Significant sectors are shown with their brightness scaled according to the sector mean.

To analyze the relative differences between the gaze following maps of the two populations, we subtracted the two maps (Baseline-ASD) and tested the significance of the differences. The sector means for the baseline and ASD sectors are compared over the folds, and a sector is non-black if the difference is significant. If the baseline is stronger, the sector is shown in green, scaled by the difference of the means; if the ASD sector is stronger, the sector is shown in red, scaled by the absolute difference of the means.
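A minimal sketch of this per-sector test is given below; the data layout (a folds-by-sectors array of per-fold sector means) and the function name are assumptions introduced here for illustration, not the dissertation's actual code. Each sector's mean is compared against the per-fold mean over all sectors with a Wilcoxon signed rank test across the 10 folds, Bonferroni-corrected for the number of sectors.

```python
import numpy as np
from scipy.stats import wilcoxon

def significant_sectors(sector_means, alpha=0.05):
    """sector_means: (n_folds, n_sectors) array of per-fold sector means."""
    n_folds, n_sectors = sector_means.shape
    overall = sector_means.mean(axis=1)            # per-fold mean over all sectors
    significant = []
    for s in range(n_sectors):
        _, p = wilcoxon(sector_means[:, s], overall)   # paired test across folds
        significant.append(p < alpha / n_sectors)      # Bonferroni correction
    return significant
```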
Figure 5.2: The first row shows heatmaps generated from saccades leaving an actor's head. Each saccade endpoint was rotated about its origin so that the actual head pose angle of the head aligns with the direction of the black arrow (to the right), shown in D). The saccade vector is then scaled so that the size of the actor's head matches the standard head, which is outlined in blue. The maps are for A) baseline individuals, B) individuals with ASD, and C) the subtraction of the two (Baseline-ASD) maps. Green indicates a higher baseline saccade probability in a region, and red indicates a higher ASD saccade probability. D) shows the gaze map broken down into 24 sectors. The second row shows significance testing for each sector (P<0.05), where the mean value of each region was compared to the mean value over all regions for A) and B). If significant, the mean value of the region is shown for that sector, normalized by the maximum sector mean. For C), the values of the baseline and ASD means for each region were subtracted.

We can see that the gaze following maps are similar for the baseline participants and the participants with ASD, although the map is more diffuse for the participants with ASD, especially in the direction of gaze (pointing to the right from the head, in blue) and in the direction opposite to the gaze. We saw this strength in the opposite direction earlier (Figure 4.1A), and find it to be an artifact of head height alignment when multiple heads are present, where saccades frequently go both to the gazed-at head and to other, ignored heads.

The head region preferences of the baseline and ASD groups are shown in Fig 5.3A-B. As before, the heatmaps are shown in the first row, where saccades to the head are rotated by the actor's head pose and scaled to the standardized head size. The difference between the head heat maps is shown in Fig 5.3C, where red indicates a stronger ASD head heatmap and green indicates a stronger baseline head heatmap. The head heatmaps were broken down into 6 sectors, as shown in Fig 5.3D, where the top and bottom sectors are larger and represent the forehead and chin regions, respectively. The 4 center regions are equally sized and contain some proportion of the eye, nose, and mouth regions, although these features were not explicitly labeled in the data.

Figure 5.3: The first row shows heatmaps generated from saccades landing on an actor's head, where the endpoint was again rotated by the actor's head pose and scaled by the size of the actor's head. The normalized head is outlined in blue, and the normalized head pose points to the right. The maps are for A) baseline subjects, B) subjects with ASD, and C) the subtraction of the two (Baseline-ASD) maps. Green indicates a higher baseline saccade probability in a region, and red indicates a higher ASD saccade probability. D) shows the head map broken down into 6 sectors. The second row shows significance testing for each sector (P<0.05). The mean value of each region was compared to the mean value over all regions for A) and B). Significant sectors are displayed with their mean values. For C), the values of the baseline and ASD means for each region were compared over each fold.

These head sectors were also tested for significance at the P<0.05 level, again comparing the sector mean values against the map mean, using Bonferroni correction (6 comparisons) and the Wilcoxon signed rank test.
The results are shown in the second row of Fig 5.3A-B. Significant sectors are displayed with their brightness set according to the mean value over the sector. The significant sectors in the difference between the two maps are shown in the second row of Fig 5.3C, where red sectors indicate that the ASD heatmap is consistently stronger, and green sectors indicate that the baseline heatmap is consistently stronger.

We can see that the heat maps for head fixations for the baseline individuals and the individuals with ASD are similar, with the ASD map being more diffuse. The differences are accentuated in the difference heat map, where we can see that baseline individuals had more fixations on the center regions of the head than the individuals with ASD.

5.2.6 Differences in cue transitions between ASD & TD

With the probability maps for each cue determined, the cue following probabilities of participants with ASD can be evaluated relative to the baseline group. The maps that were learned previously were used to score whether a saccade was considered a transition to a particular cue. Baseline participants used the learned maps from all of the training folds and were scored on the training folds (the testing fold was saved for the fixation prediction performance). For the participants with ASD, it was unclear whether we should use the ASD subject maps or the baseline subject maps, since the baseline subject maps seem to more closely adhere to the cues, but the ASD maps come from the same population. To evaluate this, both cases were run.

The transition probabilities are shown in Fig 5.4 for both saccades that originate on an actor's head and saccades that do not originate on a head. These two origins reflect the two states in the Markov chain. Transitions to the head are less likely for participants with ASD regardless of saccade origin. Note that, for the Flickr dataset, the saliency transition probability goes up and the gaze following probability goes down slightly for participants with ASD. The Wilcoxon signed rank test was used to compare the baseline transition probabilities to the respective probabilities for the ASD participants, both when using participants with ASD to generate the learned maps and when using baseline participants. This was done across the folds with Bonferroni correction (n=10 comparisons), and all comparisons were found to be significant (P<0.02).

5.2.7 Differences in fixation performance between ASD & TD

Using the learned heatmaps and transition probabilities, the model can combine the cues to form fixation prediction maps. These prediction maps can then be compared to the actual fixation maps for each population. We calculated standard Receiver Operating Characteristic (ROC) curves, which illustrate the performance of a binary classifier as the threshold between the two classes is changed. To assess the performance independently of this threshold, the area under the ROC curve (AUC) is taken. An area of 0.5 indicates purely chance performance, while an area of 0 or 1 indicates perfectly negative or positive agreement with the ground truth, respectively.
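As a concrete illustration, one common way to compute such an AUC score treats the prediction-map values at fixated pixels as positives and the values at randomly sampled pixels as negatives. The sketch below follows that convention; it is an assumption for illustration, not necessarily the exact evaluation code used here, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(pred_map, fixations, n_negatives=1000, seed=0):
    """AUC of a prediction map: fixated pixels vs. randomly sampled pixels."""
    rng = np.random.default_rng(seed)
    h, w = pred_map.shape
    pos = np.array([pred_map[y, x] for (x, y) in fixations])   # scores at fixations
    neg = pred_map[rng.integers(0, h, n_negatives),            # scores at random
                   rng.integers(0, w, n_negatives)]            # image locations
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)
```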
Figure 5.4: Transition probabilities of the DWOC model on the Flickr dataset for A) saccades that originate on a head, and B) saccades that do not originate on a head. Error bars represent 95% confidence intervals over cross validation.

The fixation performance for both the DWOC model and inter-observer models for each population is shown in Fig 5.5. The inter-observer model for each ASD participant is generated from the saccades of the other ASD participants on a per-image basis (ASD:InterObs). Likewise, an inter-observer model for baseline participants is generated from the other baseline participants (Base:InterObs). The performance is shown for all fixations as well as for just those originating from an actor's head. The results are shown over 10-fold cross validation with error bars signifying 95% confidence intervals. Again, the Wilcoxon signed rank test was used to compare the performance of the DWOC model for the baseline and ASD participants, as well as the performance of the DWOC model against the inter-observer model for each subject set. Bonferroni correction (n=6 comparisons) was applied, and significance (P<0.012) was found for all comparisons except the DWOC vs inter-observer performance for participants with ASD for From Head saccades, as shown in the figure.

As can be seen, both the DWOC model and the inter-observer models have better fixation prediction performance for baseline participants than for participants with ASD. For the DWOC model, we interpret this to mean that the cues we are using to predict fixations are less important for participants with ASD. For the inter-observer model, we believe this shows that ASD fixations are less stereotyped, and thus harder to predict.

Figure 5.5: Fixation prediction performance of the DWOC model on the Flickr dataset trained for participants with ASD (ASD:Model) and for baseline participants (Base:Model), for all fixations and for just saccades that originate from the head. Inter-observer models were also made for each population separately (Base:InterObs and ASD:InterObs). Error bars represent 95% confidence intervals over cross-validation.

5.3 Cue preference experiment using Social Video dataset

In the first ASD experiment, we established that there were differences between participants with ASD and a baseline population that could be extracted with the model. In order for this method to be used as a screening test, it needs to be easy to administer, short, and able to hold the participant's attention. Further, the screening must be done on an individual basis, not at the population level, increasing the sensitivity to variance within the population. To compensate for this increased difficulty, we used age, gender, and IQ matched TD participants when conducting this next eye tracking experiment.

5.3.1 Stimuli

This dataset, which we will refer to as the Social Video set, contained 20 minutes of 3 to 4 second video clips at 30 Hz.
Using the same inclusion criteria that we applied to the Flickr set, we chose clips in which there was at least one clearly visible head that was not looking directly at the camera. Sample frames can be seen in Figure 5.6; the clips involve at least one human, but typically more than one (mean 2.8 visible heads), visibly looking in a particular direction. This provided just over 2 minutes and 52 seconds of video clips. All visible heads were labeled, along with a human estimate of the 3D head pose direction.

Figure 5.6: Sample frames from the Social Video dataset.

5.3.2 Procedure

Participants sat 49.5 cm away from a 19 inch LCD monitor so that scenes subtended approximately 42° × 34° of visual angle. A chin rest was used to stabilize head movements. Stimuli were presented at 60 Hz at a resolution of 1280 × 1024 pixels. Eye movements were recorded via a non-invasive infrared Eyelink (SR Research, Osgoode, ON, Canada) eye-tracking device at a sample rate of 1000 Hz (spatial resolution less than 0.5°). The eye tracker was calibrated using a 5-point calibration at the beginning of each recording session. Observers viewed video clips in random order. Again, saccades were classified as events where eye velocity was greater than 35°/s and eye acceleration exceeded 9500°/s², as recommended by the manufacturer for the Eyelink-1000 device.

5.3.3 Observers

155 TD participants and 19 participants with ASD took part in this video study at the Laboratory of Neurocognitive Development at the University of Pittsburgh. From this set, 15 participants with ASD along with 15 age, gender, and IQ matched TD participants were preselected for the purpose of testing the discrimination ability of the transition probabilities. There were 3 female pairs, with the 12 remaining pairs being male. On average, the TD participant was 0.05 years older than the ASD participant in each matched pair, with a standard deviation of 0.67 years (overall mean age: 14.27). For IQ, Wechsler's WAIS-IV test (Wechsler, 2008) was used, with the TD individual scoring on average 3.3 points higher, with a standard deviation of 11.1 (overall mean IQ: 108.8). Individuals with ASD were again evaluated for severity using the ADOS, with scores for Communication, Social, Full Scale, and Calibrated Severity Score. The means (standard deviations) for these components were 3.6 (1.5), 7.5 (1.6), 11.1 (2.6), and 6.5 (1.5), respectively.

5.3.4 Differences in transition probabilities of ASD vs TD

Each participant made, on average, 238.6 saccades over 5177 frames of video, with a standard deviation of 50.3 saccades. TD participants made on average 6.7 fewer saccades than the participants with ASD, with a standard deviation of the difference of 67.5 saccades. There are far fewer saccades per image, since each frame is only displayed for a fraction of a second (33.3 milliseconds), unlike the still image set, where each image was shown for 8 seconds. This drastically limits the ability to generate saccade density maps. Also, since the image frames are constantly changing, learning the heatmaps would require compensating for motion in the scene. As a result, we use the heatmaps that were created in the first experiment for the rest of this work.

The transition probabilities that were learned for the Social Video dataset are shown in Fig 5.7 for both saccades that originate on an actor's head and saccades that do not originate on a head. The Wilcoxon signed rank test was used across the training folds, as was done in the prior experiment, and Bonferroni correction was again applied (n=10).
For the social videos, there were no significant (P>0.45 even without Bonferroni correction) transition probability differences in the from head case. For the from nonhead case, TD participants had higher to-head transition probabilities and lower to-background probabilities (P<0.02 with Bonferroni correction).

Figure 5.7: Transition probabilities of the DWOC model on the social videos for A) saccades that originate on a head, and B) saccades that do not originate on a head. Error bars represent 95% confidence intervals over cross validation.

Not surprisingly, there are also differences in the relative weighting of the cues between the Flickr dataset (Fig 5.4) and these videos (Fig 5.7). We attribute these differences principally to the addition of motion as a cue, as well as to the large difference in the average size of the head regions between the two sets: the Social Video dataset is composed of heads that are less than one third the size of the heads in the Flickr set (Flickr: 5.5° × 7.9° of visual angle vs. Social Video: 3.4° × 4.2° of visual angle).

5.3.5 Differences in fixation prediction between ASD & TD

Figure 5.8 shows the fixation prediction performance of the DWOC model and an inter-observer model for both TD participants and participants with ASD. The performance for all fixations is shown on the left of the figure, and for just the fixations originating from the head on the right. Because the dataset is much smaller, there is much more uncertainty in these results than there was for the Flickr dataset, with no significance for the same comparisons made for the Flickr set (TD:Model to TD:InterObs, ASD:Model to ASD:InterObs, and TD:Model to ASD:Model, for both all fixations and from head fixations).

Figure 5.8: Fixation prediction performance of the DWOC model using learned maps from the Flickr dataset, but learned transition probabilities from the social videos. TD:Model represents the fixation prediction performance of the model using the transition probabilities learned from the TD cue maps and tested on a test set of TD participants. ASD:Model is done in the same way, using transition probabilities from participants with ASD and tested on a test set of ASD participants. The inter-observer models (TD:InterObs and ASD:InterObs) use the training set fixations directly to predict the test set fixations on a per-image basis for each population (TD and ASD), respectively. The results are shown for all fixations, as well as for just the subset of fixations that are from the head. Error bars represent 95% confidence intervals over cross-validation.

5.3.6 Classification of ASD and TD participants

The video dataset did not show clear differences between the two groups at a population level, but we would also like to determine how well we can dissociate individual participants with ASD using the DWOC model, with the hope that this could eventually be incorporated into a short screening test that could be applied to infants and young children. Earlier we found that the transition probabilities of the model were significantly different between the two populations for still images.
Here we will compare the transition probabilities of participants with ASD with those of age, gender, and IQ matched TD participants to determine how well these cue preferences bifurcate the two groups on an individual participant basis.

Figure 5.9 compares the transition probabilities of the full DWOC model for each ASD participant on the y-axis with those of the age, gender, and IQ matched TD participant on the x-axis. When a point sits above and to the left of the diagonal, this indicates that the ASD subject has a higher transition probability for that cue than their respective TD subject, and vice-versa. The from head to saliency transition probability stands out, as the two groups are well separated, but why is this the case?

Figure 5.9: The learned transition probabilities for each ASD participant are plotted against those of an age matched TD participant for the full DWOC model. If the point is located up and to the right of the diagonal, then the ASD participant has a lower transition probability as compared to the TD participant, and vice-versa. Each ASD/TD pair is plotted individually. Transitions are shown for: (A) From Head to Head; (B) From Head to Gaze; (C) From Head to Background; (D) From NonHead to Head; and (E) From NonHead to Background.

We can modify the from head transitions that were just shown by combining the to head and to gaze transitions together and leaving to saliency alone; the transitions can then be interpreted as following social cues (head + gaze) or following saliency cues, as shown in Figure 5.10. Here we can see that the separation is clear, and the logic behind the difference between the populations is intuitive. Using social cues versus saliency, the separation is 93.3% using this single factor alone.

Figure 5.10: Here the head and gaze transitions are combined, and what was called background is now labeled saliency. If the point is located up and to the right of the diagonal, then the ASD participant has a lower transition probability as compared to the TD participant, and vice-versa. Each ASD/TD pair is plotted individually. Transitions are shown for: (A) From Head to a Social Cue; (B) From Head to a Salient Cue.

5.4 Evaluating model choices

Since the DWOC model was designed to predict the eye fixations of TD subjects, we can also determine whether the design choices that were made for the purpose of TD fixation prediction also hold for ASD/TD dissociation. For instance, both saliency and gaze are included in the model, and both were shown to be important in predicting the fixation behavior of the general population, but it is unclear if they are both necessary, or even desirable, for ASD classification.
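The separation percentages reported above and in the comparisons below can be read as the fraction of matched pairs that fall on the expected side of the diagonal; that reading is our interpretation of the figures rather than a definition stated in the text, and the helper name below is hypothetical. A minimal sketch under that assumption:

```python
def pairwise_separation(td_values, asd_values):
    """Fraction of matched pairs in which the TD value exceeds the ASD value."""
    pairs = list(zip(td_values, asd_values))
    return sum(td > asd for td, asd in pairs) / len(pairs)

# Illustrative toy values: 14 of 15 matched pairs falling on the TD side of
# the diagonal gives 14/15, i.e., 93.3% separation.
print(pairwise_separation([0.30] * 14 + [0.10], [0.20] * 14 + [0.15]))
```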
5.4.1 Removing gaze from the model

Given that the transition probability of gaze is very low (approximately 3-5%), gaze following itself might not be important for ASD/TD discrimination. To test this, we removed the gaze following cue from the model. This model retains the head transition probability, which we saw was significantly different for ASD and TD participants.

Figure 5.11: Learned transition probabilities for each ASD participant are plotted against those of an age matched TD participant for transitions going to a head or to the background. If the point is located up and to the right of the diagonal, then the ASD participant has a lower transition probability as compared to the TD participant, and vice-versa. Each ASD/TD pair is plotted individually. Transitions are shown for: (A) From Head to Head; (B) From Head to Background; (C) From NonHead to Head; and (D) From NonHead to Background.

Fig 5.11 shows that looking at just the to head and to background fixations only provided 80% separation, as compared to the original DWOC model with gaze following, which had a separation of 93.3%. This indicates that gaze following is useful when assessing ASD. We could see this gaze following difference in the heat maps on the Flickr set as well, but comparing two maps requires a much larger number of samples than comparing transition probabilities, which are very amenable to being used in a short screening test.

5.4.2 Comparison to saliency alone

We also wanted to determine whether agreement with saliency alone was a good discriminator of ASD and TD participants. Here the area under the curve (AUC) of the ROC for saliency as a predictor of eye movements is used to separate out the ASD/TD pairs. Fig 5.12 illustrates the separation created by how well each ASD participant's fixations matched the saliency model, which results in only a 60% classification performance.

Figure 5.12: ASD/TD matched pairs for saliency AUC score for A) from head saccades and B) all saccades.

5.4.3 Removing saliency from the model

Given the performance of saliency alone, is having saliency in the model useful? The DWOC model determines the transition probabilities by comparing the relative strengths of the learned cue maps. The map with the highest value at the saccade endpoint is credited with that transition.

Figure 5.13: A) Example stimulus with a single head. B) Cue maps used in the DWOC model, which includes saliency; the highest cue at the saccade endpoint is counted. C) Cue maps if saliency were removed and the gaze cue were thresholded.
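A minimal sketch of this scoring rule, with an assumed dictionary of equally sized 2-D cue map arrays (not the dissertation's implementation), is shown below.

```python
import numpy as np

def score_transition(cue_maps, endpoint):
    """Credit a saccade to whichever cue map is strongest at its endpoint."""
    x, y = endpoint
    values = {name: cue_map[y, x] for name, cue_map in cue_maps.items()}
    return max(values, key=values.get)   # cue with the highest value wins

# Toy example in which the head map dominates at the saccade endpoint.
maps = {"head": np.zeros((10, 10)),
        "gaze": np.zeros((10, 10)),
        "saliency": np.full((10, 10), 0.1)}
maps["head"][2, 3] = 0.9
print(score_transition(maps, (3, 2)))    # -> "head"
```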
Since the head map covers a relatively small area, it is almost always higher than saliency inside the head, and the gaze map is by design outside of the head, so transitions that land on a head are scored as head transitions. For nonhead transitions, the competition between saliency and gaze determines the scored transition. While we feel this is a useful element of the model, we wanted to investigate whether this competition is useful for discriminating subjects with ASD. A simpler model would have only a thresholded gaze map (top 5% of the values) and a uniform background map, which removes the need for saliency. This is illustrated in Fig 5.13, where B) is the original DWOC transition map scoring, and C) is the proposed saliency-free transition scoring. This saliency-free approach is similar to that used in prior work on fixation differences in ASD (Fletcher-Watson et al., 2009), and provides an opportunity for us to compare our method.

Figure 5.14: The head and gaze transitions are combined as before, along with background. If the point is located up and to the right of the diagonal, then the ASD participant has a lower transition probability as compared to the TD participant, and vice-versa. Each ASD/TD pair is plotted individually. Transitions are shown for: (A) From Head to a Social Cue; (B) From Head to a Background Cue.

Transitions without the saliency comparison are shown in Figure 5.14. Here we see that only 80% separation is achieved if we do not use saliency to "explain away" some of the gaze following saccades. This justifies the incorporation of the gaze/saliency competition into the transition probability calculation.

5.5 Discussion

In general, these findings corroborate prior work, which has found that participants with ASD pay less attention to facial features and gaze following than TD participants. We also found that participants with ASD were less stereotyped in their fixation behavior. The learned maps for the head fixation and gaze following cues were more diffuse, with subjects with ASD concentrating less probability mass on the central facial features and the head pose direction than TD subjects. The combined model was also a weaker predictor of fixation behavior for the ASD group than for the baseline group.

We have shown that it is useful to include saliency when evaluating gaze following as a feature, which has not been done before when looking at participants with ASD. Given the extensive analysis that has been done using saliency in the field of eye movement prediction, it seems remiss to disregard these aspects of bottom-up attention when evaluating other cues like gaze following. We also found that estimating the relative importance of several known cues for a subject can provide a more accurate assessment of their behavior than looking at a single cue in isolation. We also showed that saliency, gaze following, and head preferences by themselves are not as good at discriminating between ASD and TD participants as when they are combined. The combination is done in an intuitive way, by simply adding the "social" cues (heads and gaze) and comparing them against the "saliency" cues, which can be thought of as a split between more top-down and more bottom-up behavior, respectively.
It is very useful to be able to diagnose ASD, especially at a young age. Early interventions in autism have been shown to be helpful in reducing ASD symptoms (and possibly "curing" a subset of the group), albeit in a small study (Rogers et al., 2014). Such efforts require quick, inexpensive methods with which to assess ASD risk and monitor treatment progress. Here we show that with a short set of video clips (just under 3 minutes), we were able to separate participants with ASD from their matched TD participants using just the transition probabilities of the existing DWOC model.

Investigating several alternative design choices in the model, which were optimized for fixation prediction performance, we found that these choices also improved ASD/TD dissociation on a different dataset. We also validated that the model itself could be extended to fixation prediction on video.

5.6 Conclusions

The DWOC model views human saccade behavior as a relative weighting of competing cues based on the type of location that is currently fixated (head or non-head). Although we did not model other factors such as the individual's cognitive agenda (Hayhoe & Ballard, 2014) or the innate preferences of the individual participants, we expected these differences to be present to some extent. Here we show that we can use the model to estimate population-level differences in these cue preferences, and that the estimate is sharp enough that we can discriminate two populations (ASD and TD), on an individual participant basis, using less than 3 minutes of data. We feel that this approach can be refined to aid in the screening of ASD, and can be extended to other populations in which gaze following and head region fixation behavior is known to be atypical.

Chapter 6

Dissertation Summary

This dissertation situates gaze following and head pose cues with respect to attentional models. The first study established that gaze following was a causal cue that outperformed saliency when predicting the direction of a saccade leaving a head, although not its precise endpoint. This effect was shown both in a controlled fashion and with a newly described Flickr gaze dataset. The second study described a method to weight different kinds of cues based on the probability that they were looked at, conditioned on the type of the current fixation location (head or nonhead). It was shown that these probabilities varied greatly depending on this condition, implying that a real behavioral difference was being captured. It also provided a straightforward means of inserting the gaze-following cue into the model. This model was shown to outperform a state-of-the-art fixation prediction model for all fixations, and especially for from head fixations.

The third study used the model to better quantify the ASD population with respect to the TD population. Head pose and gaze following maps were learned for subjects with ASD and compared to those of the general population, which illuminated spatial differences within the cues themselves. Transition probabilities were learned as well, and it was shown that, at a population level, subjects with ASD were less likely to saccade to a head or a gaze following cue, and more likely to follow a salient cue, than TD subjects. An additional video dataset was compiled, which validated the performance of the model in an additional modality.
Most significantly, discrimination of subjects with ASD versus age, gender, and IQ matched TD subjects was very promising, with 93.3% separation using just the model's transition probabilities on less than 3 minutes of eye tracking data.

6.1 Potential Applications

Because gaze following is involved in many aspects of social communication in humans as well as apes, the potential applications of this model are quite broad. Specifically, bipolar affective disorder has also shown differential effects with regard to facial processing (Loughland, L. M. Williams, & Gordon, 2002a), and would be a good candidate for future work. In addition, although the dichotomy between social (head + gaze) versus salient transitions was shown to be very good at discriminating participants with ASD, integrating this feature with all existing known markers into a "kitchen sink" classification system would be necessary before this work could be used most effectively in a medical setting.

6.1.1 Application to the bipolar affective disorder population

Bipolar affective disorder is a psychiatric disorder involving repeated episodes of depression and elevated mood (Anderson, Haddad, & Scott, 2012). It is divided into two categories, bipolar I and bipolar II, where bipolar I requires a clear manic episode, while bipolar II requires only a minor manic incident (called hypomanic) in addition to a depressive episode. It has been noted that it is hard to differentially diagnose bipolar depression and unipolar depression from symptoms alone, so doctors rely on a person's individual and family history. Bipolar disorder has a median onset age of 25 years (Anderson et al., 2012). The prevalence of bipolar I is 0.6% and of bipolar II is 0.4%, making them also fairly common disorders.

Several lines of research have investigated how bipolar individuals differ from TD individuals in attending to and processing visual information. (Rubinow & Post, 1992) showed that participants with bipolar affective disorder who were experiencing depression were less able to discriminate facial expressions than TD participants. (George et al., 1998) performed a longitudinal study of a single bipolar participant across depressive, manic, and euthymic moods, in which he had to assess the emotional states of people in pictures. The participant made significantly more errors in this task when he was depressed, and the errors were biased towards negative emotional states.
Quantifying the fixation behavior of each of these populations can allow a finer grained understanding of their differences and further our understanding of these diseases. This disorder provides another avenue upon which the DWOC model can be applied. Although it has not been studied, the avoidance behavior of facial features shown in bipolar participants (Loughland et al., 2002a) is likely to reduce the gaze following behavior in these individuals, in addition to potential differences in their transition probabilities to heads and saliency in general. Further, the learned heat map parameterized by head pose angle that is learned by the model is likely to be different in this population. This represents another avenue through which the DWOC model can elucidate cue preferences. 6.1.2 Refinement as an ASD screening technique The work here has established that incorporating gaze following, saliency, and head cues in a competitive cue model provides good dissociation of ASD/TD participants who were mostly in their teens. The utility of this feature for screening purposes would be for children as young as possible, since early intervention has been shown to be highly effective (Rogers et al., 2014), and fixation differences appear as young as 2 to 6 months (W. Jones & Klin, 2013). Further, all possible 86 features should ideally be included in this classification, as it would be imperative for a screening technique to have a very high hit rate along with a very low false positive rate. 87 References Aggleton, J. P., Burton, M. J., & Passingham, R. E. (1980, May). Cortical and subcortical afferents to the amygdala of the rhesus monkey (macaca mulatta). Brain research, 190(2), 347–368. Akiyama, T., Kato, M., Muramatsu, T., Saito, F., Nakachi, R., & Kashima, H. (2006). A deficit in discriminating gaze direction in a case with right superior temporal gyrus lesion. Neuropsy- chologia, 44(2), 161–170. Alghowinem, S., Goecke, R., Wagner, M., Parkerx, G., & Breakspear, M. (2013). Head pose and movement analysis as an indicator of depression. In Affective computing and intelligent in- teraction (acii), 2013 humaine association conference on (pp. 283–288). IEEE. Anderson, I. M., Haddad, P. M., & Scott, J. (2012). Bipolar disorder. BMJ: British Medical Journal, 345. Asteriadis, S., Karpouzis, K., & Kollias, S. (2013). Visual focus of attention in non-calibrated environments using gaze estimation. International Journal of Computer Vision, 1–24. Bakeman, R. & Adamson, L. B. (1984). Coordinating attention to people and objects in mother– infant and peer–infant interaction. Child development. Ballard, D. H., Hayhoe, M. [M.], & Pelz, J. (1995). Memory representations in natural tasks. Journal of Cognitive Neuroscience. 7(1), 66–80. Baron-Cohen, S., Campbell, R., Karmiloff-Smith, A., Grant, J., & Walker, J. (1995). Are children with autism blind to the mentalistic significance of the eyes? British Journal of Developmen- tal Psychology, 13(4), 379–398. Birmingham, E., Bischof, W. F., & Kingstone, A. (2009). Saliency does not account for fixations to eyes within social scenes. Vision research, 49(24), 2992–3000. 88 Bland, J. M. & Altman, D. G. (1995). Multiple significance tests: the bonferroni method. Bmj, 310(6973), 170. Bock, S., Dicke, P., & Thier, P. (2008). How precise is gaze following in humans? Vision Research. 48, 946–957. Borji, A. (2012). Boosting bottom-up and top-down visual features for saliency estimation. In Computer vision and pattern recognition (cvpr), 2012 ieee conference on (pp. 438–445). 
Borji, A. & Itti, L. (2012). Exploiting local and global patch rarities for saliency detection. In Computer vision and pattern recognition (cvpr), 2012 ieee conference on (pp. 478–485).
Borji, A. & Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(1), 185–207.
Borji, A. & Itti, L. (2014). Defending yarbus: eye movements predict observers' task. Journal of vision.
Borji, A., Parks, D., & Itti, L. (2014). Complementary effects of gaze direction and early saliency in guiding fixations during free viewing. Journal of vision, 14(13), 3.
Borji, A., Sihite, D. N., & Itti, L. (2013). Objects do not predict fixations better than early saliency: a re-analysis of Einhäuser et al.'s data. Journal of vision, 13(10), 18.
Borji, A., Sihite, D. N., & Itti, L. (2011). Quantifying the relative influence of photographer bias and viewing strategy on scene viewing. Journal of Vision, 11(11), 166–166.
Borji, A., Sihite, D., & Itti, L. (2014). What/where to look next? modeling top-down visual attention in complex interactive environments. IEEE Transactions on Systems, Man, and Cybernetics, PART A-SYSTEMS AND HUMANS.
Borji, A., Tavakoli, H. R., Sihite, D. N., & Itti, L. (2013). Analysis of scores, datasets, and models in visual saliency prediction. In Computer vision (iccv), 2013 ieee international conference on (pp. 921–928). IEEE.
Breazeal, C. & Scassellati, B. (2002, May). Challenges in building robots that imitate people (C. L. Nehaniv & K. Dautenhahn, Eds.). MIT Press.
Butterworth, G. & Jarrett, N. (1991). What minds have in common is space: spatial mechanisms serving joint visual attention in infancy. British Journal of Developmental Psychology, 9(1), 55–72.
Call, J. & Tomasello, M. (2008, May). Does the chimpanzee have a theory of mind? 30 years later. Trends in Cognitive Sciences, 12(5), 187–192.
Campbell, R., Heywood, C., Cowey, A., Regard, M., & Landis, T. (1990). Sensitivity to eye gaze in prosopagnosic patients and monkeys with superior temporal sulcus ablation. Neuropsychologia, 28(11), 1123–1142.
Carmi, R. & Itti, L. (2006). The role of memory in guiding attention during natural vision. Journal of Vision, 6(9), 4.
Caron, A. J., Butler, S., & Brooks, R. (2002). Gaze following at 12 and 14 months: do the eyes matter? British Journal of Developmental Psychology, 20(2), 225–240.
Caronna, E. B., Milunsky, J. M., & Tager-Flusberg, H. (2008). Autism spectrum disorders: clinical and research frontiers. Archives of Disease in Childhood, 93(6), 518–523.
Castelhano, M. S., Wieth, M., & Henderson, J. M. (2007). I see what you see: eye movements in real-world scenes are affected by perceived direction of gaze. In Attention in cognitive systems. theories and systems from an interdisciplinary viewpoint (pp. 251–262). Springer.
Cerf, M., Frady, E. P., & Koch, C. (2009, November). Faces and text attract gaze independent of the task: experimental data and computer model. Journal of Vision, 9(12).
Cerf, M., Harel, J., Einhäuser, W., & Koch, C. (2008). Predicting human gaze using low-level saliency combined with face detection. In Advances in neural information processing systems (pp. 241–248).
Chua, H. F., Boland, J. E., & Nisbett, R. E. (2005). Cultural variation in eye movements during scene perception. Proceedings of the National Academy of Sciences of the United States of America, 102(35), 12629–12633.
Cline, M. G. (1967, March). The perception of where a person is looking. The American journal of psychology, 80(1), 41–50.
Compton, R. J. (2003). The interface between emotion and attention: a review of evidence from psychology and neuroscience. Behavioral and cognitive neuroscience reviews, 2(2), 115–129.
Dalton, K. M., Nacewicz, B. M., Johnstone, T., Schaefer, H. S., Gernsbacher, M. A., Goldsmith, H., . . . Davidson, R. J. (2005). Gaze fixation and the neural circuitry of face processing in autism. Nature neuroscience, 8(4), 519–526.
Damasio, A. R., Damasio, H., & Hoesen, G. W. V. (1982, April). Prosopagnosia: anatomic basis and behavioral mechanisms. Neurology, 32(4), 331–331.
DeAngelus, M. & Pelz, J. B. (2009). Top-down control of eye movements: yarbus revisited. Visual Cognition, 17(6), 790–811.
D'Entremont, B., Hains, S., & Muir, D. (1997, October). A demonstration of gaze following in 3- to 6-month-olds. Infant Behavior and Development, 20(4), 569–572.
Driver, J., Davis, G., Ricciardelli, P., Kidd, P., Maxwell, E., & Baron-Cohen, S. (1999). Gaze perception triggers reflexive visuospatial orienting. Visual Cognition, 6(5), 509–540.
Droll, J. A., Hayhoe, M. M., Triesch, J., & Sullivan, B. T. (2005). Task demands control acquisition and storage of visual information. Journal of Experimental Psychology: Human Perception and Performance, 31(6), 1416–1438.
Duchaine, B., Jenkins, R., Germine, L., & Calder, A. J. (2009, August). Normal gaze discrimination and adaptation in seven prosopagnosics. Neuropsychologia, 47(10), 2029–2036.
Ehinger, K. A., Hidalgo-Sotelo, B., Torralba, A., & Oliva, A. (2009). Modelling search for people in 900 scenes: a combined source model of eye guidance. Visual cognition, 17(6-7), 945–978.
Einhäuser, W., Rutishauser, U., & Koch, C. (2008). Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. Journal of Vision, 8(2), 2.
Emery, N. (2000). The eyes have it: the neuroethology, function and evolution of social gaze. Neuroscience & Biobehavioral Reviews, 24(6), 581–604.
Engell, A. D. & Haxby, J. V. (2007). Facial expression and gaze-direction in human superior temporal sulcus. Neuropsychologia, 45(14), 3234–3241.
Fedor, J., Lynn, A., Foran, W., DiCicco-Bloom, J., Luna, B., & O'Hearn, K. (2015). Patterns of fixation during face recognition: differences in autism across age. Journal of Autism and Developmental Disorders, 20(10), XX–XX.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE conference on computer vision and pattern recognition, 2008. CVPR 2008 (pp. 1–8).
Fletcher-Watson, S., Leekam, S. R., Benson, V., Frank, M., & Findlay, J. (2009). Eye-movements reveal attention to social information in autism spectrum disorder. Neuropsychologia, 47(1), 248–257.
Foulsham, T., Barton, J. J., Kingstone, A., Dewhurst, R., & Underwood, G. (2009). Fixation and saliency during search of natural scenes: the case of visual agnosia. Neuropsychologia, 47, 1994–2003.
Franck, N., Montoute, T., Labruyère, N., Tiberghien, G., Marie-Cardine, M., Daléry, J., . . . Georgieff, N. (2002). Gaze direction determination in schizophrenia. Schizophrenia research, 56(3), 225–234.
Freeth, M., Foulsham, T., & Chapman, P. (2011). The influence of visual saliency on fixation patterns in individuals with autism spectrum disorders. Neuropsychologia, 49(1), 156–160.
Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119–139.
Friesen, C. & Kingstone, A. (1998). The eyes have it! reflexive orienting is triggered by nonpredictive gaze. Psychonomic Bulletin & Review, 5(3), 490–495.
Garcia-Diaz, A., Fdez-Vidal, X. R., Pardo, X. M., & Dosil, R. (2012). Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing, 30(1), 51–64.
Garcia-Diaz, A., Leboran, V., Fdez-Vidal, X. R., & Pardo, X. M. (2012). On the relationship between optical variability, visual saliency, and eye fixations: a computational approach. Journal of Vision, 12(6).
George, M. S., Huggins, T., Mcdermut, W., Parekh, P. I., Rubinow, D., & Post, R. M. (1998). Abnormal facial emotion recognition in depression: serial testing in an ultra-rapid-cycling patient. Behavior Modification, 22(2), 192–204.
Gibson, J. & Pick, A. (1963). Perception of another person's looking behavior. American Journal of Psychology, 76(3), 386–394.
Gray, H. (1918). Gray's anatomy of the human body.
Grice, S. J., Halit, H., Farroni, T., Baron-Cohen, S., Bolton, P., & Johnson, M. H. (2005). Neural correlates of eye-gaze detection in young children with autism. Cortex, 41(3), 342–353.
Harries, M. H. & Perrett, D. I. (1991). Visual processing of faces in temporal cortex: physiological evidence for a modular organization and possible anatomical correlates. Journal of Cognitive Neuroscience, 3(1), 9–24.
Hayhoe, M. & Ballard, D. H. (2014). Modeling task control of eye movements. Current Biology, 24(13), R622–R628.
Hecaen, H. & Angelergues, R. (1962). Agnosia for faces (prosopagnosia). Archives of neurology, 7(2), 92–100.
Hershler, O. & Hochstein, S. (2005). At first sight: a high-level pop out effect for faces. Vision research, 45(13), 1707–1724.
Heywood, C. A., Cowey, A., & Rolls, E. T. (1992, January). The role of the 'face-cell' area in the discrimination and recognition of faces by monkeys [and discussion]. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 335(1273), 31–38.
Hietanen, J. K. (1999). Does your gaze direction and head orientation shift my visual attention? Neuroreport, 10(16), 3443–3447.
Hoffman, E. A. & Haxby, J. V. (2000). Distinct representations of eye gaze and identity in the distributed human neural system for face perception. Nature Neuroscience, 3(1), 80–84.
Hoffman, M. W., Grimes, D. B., Shon, A. P., & Rao, R. P. (2006, April). A probabilistic model of gaze imitation and shared attention. Neural Networks, 19(3), 299–310.
Horley, K., Williams, L. M., Gonsalvez, C., & Gordon, E. (2004). Face to face: visual scanpath evidence for abnormal processing of facial expressions in social phobia. Psychiatry research, 127(1), 43–53.
Itti, L. (2007). Visual salience. Scholarpedia, 2(9), 3327.
Itti, L. & Koch, C. (2001a). Computational modelling of visual attention. Nature reviews. Neuroscience, 2(3), 194–203. Retrieved from http://dx.doi.org/10.1038/35058500
Itti, L. & Koch, C. (2001b). Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10(1), 161–169.
Itti, L., Koch, C., & Niebur, E. (1998, November). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
Jones, W., Carr, K., & Klin, A. (2008). Absence of preferential looking to the eyes of approaching adults predicts level of social disability in 2-year-old toddlers with autism spectrum disorder. Archives of General Psychiatry, 65(8), 946–954.
Absence of preferential looking to the eyes of approaching adults predicts level of social disability in 2-year-old toddlers with autism spectrum disorder. Archives of General Psychiatry, 65(8), 946–954. Jones, W. & Klin, A. (2013). Attention to eyes is present but in decline in 2-6-month-old infants later diagnosed with autism. Nature. Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In Computer vision, 2009 ieee 12th international conference on (pp. 2106–2113). IEEE. Kaplan, F. & Hafner, V . V . (2006). The challenges of joint attention. Interaction Studies, 7(2), 135– 169. Kaye, K. (1982). The mental and social life of babies: how parents create persons. Harvester Press. Kienzle, W., Wichmann, F. A., Sch¨ olkopf, B., & Franz, M. O. (2007). A nonparametric approach to bottom-up visual saliency. Advances in neural information processing systems, 19, 689. King, M. & Bearman, P. (2009). Diagnostic change and the increased prevalence of autism. Inter- national journal of epidemiology, 38(5), 1224–1234. Klin, A., Jones, W., Schultz, R., V olkmar, F., & Cohen, D. (2002). Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism. Archives of general psychiatry, 59(9), 809–816. 94 Klin, A., Lin, D., Gorrindo, P., Ramsay, G., & Jones, W. (2009). Two-year-olds with autism orient to non-social contingencies rather than biological motion. Nature, 459. Kluttz, N. L., Mayes, B. R., West, R. W., & Kerby, D. S. (2009, July). The effect of head turn on the perception of gaze. Vision Research, 49(15), 1979–1993. Kobayashi, H. & Kohshima, S. (1997). Unique morphology of the human eye. Nature, 387, 767– 768. Kobayashi, H. & Kohshima, S. (2001, May). Unique morphology of the human eye and its adaptive meaning: comparative studies on external morphology of the primate eye. Journal of human evolution, 40(5), 419–435. Koch, C. & Ullman, S. (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4), 219–227. Krieger, G., Rentschler, I., Hauske, G., Schill, K., & Zetzsche, C. (2000). Object and scene analy- sis by saccadic eye-movements: an investigation with higher-order statistics. Spatial vision, 13(2-3), 201–214. Kuhn, G. & Kingstone, A. (2009). Look away! eyes and arrows engage oculomotor responses automatically. Attention, Perception, & Psychophysics, 71(2), 314–327. Kupfer, D. & Foster, F. G. (1972). Interval between onset of sleep and rapid-eye-movement sleep as an indicator of depression. The Lancet, 300(7779), 684–686. Land, M. F. & Hayhoe, M. [Mary]. (2001, December). In what ways do eye movements contribute to everyday activities? Vision research, 41(25-26), 3559–3565. Land, M. F. & Lee, D. N. (1994). Where we look when we steer. Nature, 369, 742–744. Langton, S. R. (2000). The mutual influence of gaze and head orientation in the analysis of social attention direction. The Quarterly Journal of Experimental Psychology: Section A, 53(3), 825–845. Langton, S. R. & Bruce, V . (1999). Reflexive visual orienting in response to the social attention of others. Visual Cognition, 6(5), 541–567. Langton, S. R., Honeyman, H., & Tessler, E. (2004, July). The influence of head contour and nose angle on the perception of eye-gaze direction. Perception & Psychophysics, 66(5), 752–771. 95 Leslie, A. M. & Frith, U. (1988). Autistic children’s understanding of seeing, knowing and believ- ing. 
British Journal of Developmental Psychology, 6(4), 315–324. Li, Z. (2002). A saliency map in primary visual cortex. Trends in cognitive sciences, 6(1), 9–16. Lobmaier, J. S., Fischer, M. H., & Schwaninger, A. (2006). Objects capture perceived gaze direc- tion. Experimental Psychology, 53(2), 117–122. Lord, C., Risi, S., Lambrecht, L., Cook Jr, E. H., Leventhal, B. L., DiLavore, P. C., . . . Rutter, M. (2000). The autism diagnostic observation schedule—generic: a standard measure of social and communication deficits associated with the spectrum of autism. Journal of autism and developmental disorders, 30(3), 205–223. Loughland, C. M., Williams, L. M., & Gordon, E. (2002a). Schizophrenia and affective disor- der show different visual scanning behavior for faces: a trait versus state-based distinction? Biological psychiatry, 52(4), 338–348. Loughland, C. M., Williams, L. M., & Gordon, E. (2002b). Visual scanpaths to positive and neg- ative facial emotions in an outpatient schizophrenia sample. Schizophrenia research, 55(1), 159–170. Lungarella, M., Metta, G., Pfeifer, R., & Sandini, G. (2003). Developmental robotics: a survey. Connection Science, 15(4), 151–190. Malhi, G. S., Lagopoulos, J., Sachdev, P. S., Ivanovski, B., Shnier, R., & Ketter, T. (2007). Is a lack of disgust something to fear? a functional magnetic resonance imaging facial emotion recognition study in euthymic bipolar disorder patients. Bipolar disorders, 9(4), 345–357. Mannan, S. K., Ruddock, K. H., & Wooding, D. S. (1996). The relationship between the locations of spatial features and those of fixations made during visual examination of briefly presented images. Spatial vision, 10(3), 165–188. Marin-Jimenez, M., Zisserman, A., Eichner, M., & Ferrari, V . (2014). Detecting people looking at each other in videos. International Journal of Computer Vision, 1–15. Mendelson, M. J., Haith, M. M., & Goldman-Rakic, P. S. (1982, March). Face scanning and re- sponsiveness to social cues in infant rhesus monkeys. Developmental Psychology, 18(2), 222–228. 96 Meng, Q. & Song, Y . (2012). Text detection in natural scenes with salient region. In Document analysis systems (das), 2012 10th iapr international workshop on (pp. 384–388). IEEE. Milanese, R., Wechsler, H., Gill, S., Bost, J.-M., & Pun, T. (1994). Integration of bottom-up and top-down cues for visual attention using non-linear relaxation. In Computer vision and pat- tern recognition, 1994. proceedings cvpr’94., 1994 ieee computer society conference on (pp. 781–785). IEEE. Mundy, P., Sigman, M., Ungerer, J., & Sherman, T. (1986). Defining the social deficits of autism: the contribution of non-verbal communication measures. Journal of Child Psychology and Psychiatry, 27(5), 657–669. Murphy, F., Sahakian, B., Rubinsztein, J., Michael, A., Rogers, R., Robbins, T., & Paykel, E. (1999). Emotional bias and inhibitory control processes in mania and depression. Psycho- logical medicine, 29(06), 1307–1321. Murphy-Chutorian, E., Doshi, A., & Trivedi, M. M. (2007). Head pose estimation for driver assis- tance systems: a robust algorithm and experimental evaluation. In Intelligent transportation systems conference, 2007. itsc 2007. ieee (pp. 709–714). IEEE. Murphy-Chutorian, E. & Trivedi, M. M. (2009). Head pose estimation in computer vision: a survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(4), 607–626. Nagai, Y ., Asada, M., & Hosoda, K. (2002). A developmental approach accelerates learning of joint attention. In Development and learning, 2002. proceedings. 
the 2nd international conference on (pp. 277–282). IEEE. Navalpakkam, V . & Itti, L. (2007). Search goal tunes visual features optimally. Neuron, 53(4), 605–617. Neumann, D., Spezio, M. L., Piven, J., & Adolphs, R. (2006). Looking you in the mouth: abnor- mal gaze in autism resulting from impaired top-down modulation of visual attention. Social cognitive and affective neuroscience, 1(3), 194–202. Norris, J. R. (1998). Markov chains. Cambridge university press. 97 Okamoto-Barth, S., Tomonaga, M., Tanaka, M., & Matsuzawa, T. (2008). Development of using experimenter-given cues in infant chimpanzees: longitudinal changes in behavior and cogni- tive development. Developmental Science, 11(1), 98–108. Ozuysal, M., Calonder, M., Lepetit, V ., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461. Parkhurst, D., Law, K., & Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research. 42(1), 107–123. Parks, D., Borji, A., & Itti, L. (2014). Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision research. Pelphrey, K. A., Singerman, J. D., Allison, T., & McCarthy, G. (2003). Brain activation evoked by perception of gaze shifts: the influence of context. Neuropsychologia, 41(2), 156–170. Perrett, D. I., Smith, P. a. J., Potter, D. D., Mistlin, A. J., Head, A. S., Milner, A. D., & Jeeves, M. A. (1985, January). Visual cells in the temporal cortex sensitive to face view and gaze direction. Proceedings of the Royal Society of London. Series B. Biological Sciences, 223(1232), 293– 317. Perrett, D., Hietanen, J., Oram, M., Benson, P., & Rolls, E. (1992). Organization and functions of cells responsive to faces in the temporal cortex [and discussion]. Philosophical transactions of the royal society of London. Series B: Biological sciences, 335(1273), 23–30. Peters, R. J., Iyer, A., Itti, L., & Koch, C. (2005). Components of bottom-up gaze allocation in natural images. Vision research, 45(18), 2397–2416. Poppe, R., Rienks, R., & Heylen, D. (2007). Accuracy of head orientation perception in triadic situations: experiment in a virtual environment. Perception, 36(7), 971–979. Premack, D. & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515–526. Puce, A., Allison, T., Bentin, S., Gore, J. C., & McCarthy, G. (1998). Temporal cortex activation in humans viewing eye and mouth movements. The Journal of Neuroscience, 18(6), 2188– 2199. 98 Puce, A., Allison, T., Gore, J. C., & McCarthy, G. (1995). Face-sensitive regions in human extras- triate cortex studied by functional mri. Journal of neurophysiology, 74, 1192–1192. Rabiner, L. & Juang, B.-H. (1986). An introduction to hidden markov models. ASSP Magazine, IEEE, 3(1), 4–16. Ramanathan, S., Divya, S., Nicu, S., & David, M. (2014). Emotion modulates eye movement pat- terns and subsequent memory for the gist and details of movie scenes. Journal of Vision, 14(3)(31). Reinagel, P. & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network: Com- putation in Neural Systems, 10(4), 341–350. Riby, D. M., Doherty-Sneddon, G., & Bruce, V . (2009). The eyes or the mouth? feature salience and unfamiliar face processing in williams syndrome and autism. The Quarterly Journal of Experimental Psychology, 62(1), 189–203. Ricciardelli, P., Baylis, G., & Driver, J. (2000, October). 
The positive and negative of human ex- pertise in gaze perception. Cognition, 77(1), B1–B14. Ricciardelli, P., Bricolo, E., Aglioti, S. M., & Chelazzi, L. (2002). My eyes want to look where your eyes are looking: exploring the tendency to imitate another individual’s gaze. NeuroReport: For Rapid Communication of Neuroscience Research, 13(17), 2259–2264. Rogers, S., Vismara, L., Wagner, A., McCormick, C., Young, G., & Ozonoff, S. (2014). Autism treatment in the first year of life: a pilot study of infant start, a parent-implemented interven- tion for symptomatic infants. Journal of autism and developmental disorders, 44(12), 2981– 2995. Rubinow, D. R. & Post, R. M. (1992). Impaired recognition of affect in facial expression in de- pressed patients. Biological psychiatry, 31(9), 947–953. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Bernstein, M., et al. (2014). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. Sasson, N. J., Turner-Brown, L. M., Holtzclaw, T. N., Lam, K. S., & Bodfish, J. W. (2008). Children with autism demonstrate circumscribed attention during passive viewing of complex social and nonsocial picture arrays. Autism Research, 1(1), 31–42. 99 Schwaninger, A., Lobmaier, J., & Fischer, M. (2005). The inversion effect on gaze perception reflects processing of component information. Experimental Brain Research, 167(1), 49–55. Sergent, J., Ohta, S., & Macdonald, B. (1992, February). Functional neuroanatomy of face and object processing a positron emission tomography study. Brain, 115(1), 15–36. Tatler, B. W. (2007). The central fixation bias in scene viewing: selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14), 4. Todorovic, D. (2009). The effect of face eccentricity on the perception of gaze direction. Percep- tion, 38(1), 109–132. Tomasello, M., Carpenter, M., Call, J., Behne, T., Moll, H., et al. (2005). Understanding and sharing intentions: the origins of cultural cognition. Behavioral and brain sciences, 28(5), 675–690. Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006, October). Contextual guid- ance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological review, 113(4), 766–786. Treisman, A. M. & Gelade, G. (1980). A feature integration theory of attention. Cognitive Psy- chology. 12, 97–136. Trevarthen, C. (1979, September). Before speech: the beginning of interpersonal communication (M. Bullowa, Ed.). CUP Archive. Triesch, J., Ballard, D. H., Hayhoe, M., & Sullivan, B. (2003). What you see is what you need. Journal of Vision. 3, 86–94. Tronick, E. Z. (1989). Emotions and emotional communication in infants. American Psychologist, 44(2), 112–119. Tseng, P.-H., Cameron, I. G. M., Pari, G., Reynolds, J. N., Munoz, D. P., & Itti, L. (2012). High- throughput classification of clinical populations from natural viewing eye movements. Jour- nal of Neurology. Underwood, G., Foulsham, T. [Tom], & Humphrey, K. (2009). Saliency and scan patterns in the inspection of real-world scenes: eye movements during encoding and recognition. Visual Cognition, 17(6-7), 812–834. 100 VanRullen, R. (2006). On second glance: still no high-level pop-out effect for faces. Vision re- search, 46(18), 3017–3027. Vincent, B. T., Baddeley, R. J., Troscianko, T., & Gilchrist, I. D. (2009). Optimal feature integration in visual search. Journal of Vision, 9(5), 15. Viola, P. & Jones, M. (2001). 
Rapid object detection using a boosted cascade of simple features. In Computer vision and pattern recognition, 2001. cvpr 2001. proceedings of the 2001 ieee computer society conference on (V ol. 1, pp. I–511). IEEE. Wang, H.-C. & Pomplun, M. (2012). The attraction of visual attention to texts in real-world scenes. Journal of vision, 12(6), 26. Wechsler, D. (2008). Wechsler adult intelligence scale–fourth edition (wais–iv). San Antonio, TX: NCS Pearson. Williams, J. M. G., Mathews, A., & MacLeod, C. (1996). The emotional stroop task and psy- chopathology. Psychological bulletin, 120(1), 3. Wilson, H. R., Wilkinson, F., Lin, L.-M., & Castillo, M. (2000, March). Perception of head orien- tation. Vision Research, 40(5), 459–472. Wimmer, H. & Perner, J. (1983, January). Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1), 103–128. Yarbus, A. (1967). Eye movements and vision. New York: Plenum. Ye, Z., Li, Y ., Fathi, A., Han, Y ., Rozga, A., Abowd, G. D., & Rehg, J. M. (2012). Detecting eye contact using wearable eye-tracking glasses. In Proceedings of the 2012 acm conference on ubiquitous computing (pp. 699–704). ACM. Yin, R. K. (1969). Looking at upside-down faces. Journal of Experimental Psychology, 81(1), 141–145. Young, A. W., Aggleton, J. P., Hellawell, D. J., Johnson, M., Broks, P., & Hanley, J. R. (1995, February). Face processing impairments after amygdalotomy. Brain: a journal of neurology, 118 ( Pt 1), 15–24. 101 Yun, K., Peng, Y ., Samaras, D., Zelinsky, G. J., & Berg, T. L. (2013). Exploring the role of gaze behavior and object detection in scene understanding. Frontiers in Psychology. 4. Zhao, Q. & Koch, C. (2011). Learning a saliency map using fixated locations in natural scenes. Journal of vision, 11(3), 9. Zhu, X. & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2879–2886). 102 Bibliography Adolphs, R., Sears, L., & Piven, J. (2001). Abnormal processing of social information from faces in autism. Journal of cognitive neuroscience, 13(2), 232–240. Allison, T., Puce, A., & McCarthy, G. (2000, July 1). Social perception from visual cues: role of the STS region. Trends in Cognitive Sciences, 4(7), 267–278. Anstis, S. M., Mayhew, J. W., & Morley, T. (1969, December). The perception of where a face or television ’portrait’ is looking. The American Journal of Psychology, 82(4), 474–489. Arti- cleType: research-article / Full publication date: Dec., 1969 / Copyright © 1969 University of Illinois Press. Arbib, M. A. & Lee, J. (2007, January 1). Vision and action in the language-ready brain: from mir- ror neurons to SemRep. In F. Mele, G. Ramella, S. Santillo, & F. Ventriglia (Eds.), Advances in brain, vision, and artificial intelligence (4729, pp. 104–123). Lecture Notes in Computer Science. Springer Berlin Heidelberg. Retrieved from http://link.springer.com/chapter/10. 1007/978-3-540-75555-5 11 Arbib, M. A. & Lee, J. (2008, August 15). Describing visual scenes: towards a neurolinguistics based on construction grammar. Brain Research, 1225, 146–162. Budanitsky, A. & Hirst, G. (2001). Semantic distance in WordNet: an experimental, application- oriented evaluation of five measures. In Workshop on WordNet and other lexical resources (V ol. 2). Butterworth, G. & Grover, L. (1988). The origins of referential communication in human infancy. In Thought without language (pp. 5–24). 
A Fyssen Foundation symposium. New York, NY, US: Clarendon Press/Oxford University Press. 103 Caruana, R. & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algo- rithms. In Proceedings of the 23rd international conference on machine learning (pp. 161– 168). ICML ’06. New York, NY, USA: ACM. Cilibrasi, R. & Vitanyi, P. (2007). The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383. Dalal, N. & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005 (V ol. 1, 886–893 vol. 1). Desimone, R. & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual review of neuroscience, 18(1), 193–222. Einh¨ auser, W., Spain, M., & Perona, P. (2008). Objects predict fixations better than early saliency. Journal of Vision, 8(14), 18. Fathi, A., Hodgins, J., & Rehg, J. M. (2012). Social interactions: a first-person perspective. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1226–1233). Fathi, A., Li, Y ., & Rehg, J. M. (2012, January). Learning to recognize daily actions using gaze. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y . Sato, & C. Schmid (Eds.), Computer vision – ECCV 2012 (7572, pp. 314–327). Lecture Notes in Computer Science. Springer Berlin Heidelberg. Fathi, A., Ren, X., & Rehg, J. M. (2011). Learning to recognize objects in egocentric activities. In 2011 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3281–3288). Gale, C. & Monk, A. (2000). Where am i looking? the accuracy of video-mediated gaze awareness. Attention, Perception, & Psychophysics, 62(3), 586–595. doi:10.3758/BF03212110 Itti, L. & Arbib, M. A. (2006). Attention and the minimal subscene. Action to language via the mirror neuron system, 289–346. Kalal, Z., Matas, J., & Mikolajczyk, K. (2010). P-n learning: bootstrapping binary classifiers by structural constraints. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 49–56). 104 Karnath, H.-O. (2001, August). New insights into the functions of the superior temporal cortex. Nature Reviews Neuroscience, 2(8), 568–576. Retrieved from http://www.nature.com/nrn/ journal/v2/n8/full/nrn0801 568a.html Kita, S. (2003). Pointing: where language, culture, and cognition meet. Retrieved September 14, 2012, from http://edoc.mpg.de/127495 Kschischang, F., Frey, B., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519. Lafferty, J., McCallum, A., & Pereira, F. (2001, June). Conditional random fields: probabilistic models for segmenting and labeling sequence data. Departmental Papers (CIS). Langton, S. R., Watt, R. J., & Bruce, V . (2000, February). Do the eyes have it? cues to the direction of social attention. Trends in Cognitive Sciences, 4(2), 50–59. Lawrence, N. S., Williams, A. M., Surguladze, S., Giampietro, V ., Brammer, M. J., Andrew, C., . . . Phillips, M. L. (2004). Subcortical and ventral prefrontal cortical neural responses to facial expressions distinguish patients with bipolar disorder and major depression. Biological psychiatry, 55(6), 578–587. Lempers, J. D. (1979). Young children’s production and comprehension of nonverbal deictic be- haviors. The Journal of Genetic Psychology: Research and Theory on Human Development, 135(1), 93–102. Leung, E. H. & Rheingold, H. L. (1981). Development of pointing as a social gesture. 
Develop- mental Psychology, 17(2), 215–220. Li, L.-J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, anno- tation and segmentation in an automatic framework. In IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009 (pp. 2036–2043). Moore, C., Dunham, P. J., & Dunham, P. (1995, March 1). Joint attention: its origins and role in development. Taylor & Francis. Morissette, P., Ricard, M., & D´ ecarie, T. G. (1995). Joint visual attention and pointing in infancy: a longitudinal study of comprehension. British Journal of Developmental Psychology, 13(2), 163–175. 105 Nagai, Y ., Hosoda, K., Morita, A., & Asada, M. (2003). A constructive model for the development of joint attention. Connection Science, 15(4), 211–229. O’Hearn, K., Lakusta, L., Schroer, E., Minshew, N., & Luna, B. (2011). Deficits in adults with autism spectrum disorders when processing multiple objects in dynamic scenes. Autism Re- search, 4(2), 132–142. Oshin, O., Gilbert, A., Illingworth, J., & Bowden, R. (2009). Action recognition using randomised ferns. In 2009 IEEE 12th international conference on computer vision workshops (ICCV workshops) (pp. 530–537). Pashler, H. E. (1998). Attention. Psychology Press. Pelphrey, K. A., Morris, J. P., & McCarthy, G. (n.d.). Grasping the intentions of others: the per- ceived intentionality of an action influences activity in the superior temporal sulcus dur- ing social perception. Journal of Cognitive Neuroscience, 16(10), 1706–1716. doi:10.1162/ 0898929042947900 Pelphrey, K. A., Morris, J. P., & McCarthy, G. (2005, May 1). Neural basis of eye gaze processing deficits in autism. Brain, 128(5), 1038–1048. Retrieved from http://brain.oxfordjournals. org/content/128/5/1038 Rosenthal, R. (1979). Sensitivity to nonverbal communication: the PONS test. Johns Hopkins Univ Pr. Sato, W., Kochiyama, T., Uono, S., & Yoshikawa, S. (2008, September 1). Time course of superior temporal sulcus activity in response to eye gaze: a combined fMRI and MEG study. Social Cognitive and Affective Neuroscience, 3(3), 224–232. Sumioka, H., Hosoda, K., Yoshikawa, Y ., & Asada, M. (2007). Acquisition of joint attention through natural interaction utilizing motion cues. Advanced Robotics, 21(9), 983–999. Sutton, C. & McCallum, A. (2010, November). An introduction to conditional random fields (arXiv e-print No. 1011.4088). Tomasello, M., Hare, B., Lehmann, H., & Call, J. (2007, March). Reliance on head versus eyes in the gaze following of great apes and human infants: the cooperative eye hypothesis. Journal of Human Evolution, 52(3), 314–320. 106 V ogel, J. & Schiele, B. (2007, April). Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, 72(2), 133–157. Wang, X. & Jin, J. (2001). A quantitative analysis for decomposing visual signal of the gaze dis- placement. In Proceedings of the pan-sydney area workshop on visual information process- ing - volume 11 (pp. 153–159). VIP ’01. Darlinghurst, Australia, Australia: Australian Com- puter Society, Inc. West, R., Pineau, J., & Precup, D. (2009). Wikispeedia: an online game for inferring semantic distances between concepts. In IJCAI (pp. 1598–1603). Wu, B. & Nevatia, R. (2005). Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In Tenth IEEE international conference on computer vision, 2005. ICCV 2005 (V ol. 1, 90–97 V ol. 1). Wu, Y . & Toyama, K. (2000). 
List of Figures

2.1 Eye gaze and head pose vectors . . . 5
2.2 Sclera of a human vs a chimpanzee . . . 6
2.3 Superior temporal sulcus . . . 8
2.4 Cartoon stimulus used to distinguish subjects with ASD . . . 12
2.5 "Pop out" effect in a single feature dimension . . . 14
3.1 Experiment 1: Pairs of stimulus images with saccade probability map and saliency map from AWS model . . . 18
3.2 Polar plots of saccade vector distribution leaving a head . . . 21
3.3 Results of experiment 1: Histogram of saccade direction values for gaze following, saliency, and naive Bayes models . . . 23
3.4 Experiment 2: Sample images with annotated faces and gaze ground truth . . . 24
3.5 Sample images and fixation maps . . . 25
3.6 Distribution of saccade vectors leaving a particular head . . . 28
3.7 Results of experiment 2: Histogram of saccade direction values for gaze following, saliency, and naive Bayes models . . . 31
3.8 Temporal analysis of gaze following and saliency strength . . . 32
3.9 Fixation prediction results for gaze map and AWS saliency model over both experiments . . . 34
3.10 Gaze and saliency model fixation prediction results . . . 36
4.1 Learned fixation probability of leaving a head and landing on a head, normalized by head size and head pose angle . . . 41
4.2 Markov chain formulation of saccade fixations in DWOC model . . . 43
4.3 Learned transition probabilities . . . 46
4.4 Saliency, gaze, and head pose maps for sample images . . . 47
4.5 Performance of each DWOC component and their combinations . . . 50
4.6 Scatter plot of per-image AUC for from-head saccades . . . 51
4.7 Proposed head pose estimation model . . . 53
4.8 Sample head pose detections . . . 55
4.9 Performance of each DWOC component and their combinations using head pose detections instead of ground truth . . . 57
5.1 ASD/TD saccade map comparison using KL divergence . . . 66
5.2 Learned saccade probabilities for leaving a head for ASD/TD subjects . . . 68
5.3 Learned saccade probabilities for landing on a head for ASD/TD subjects . . . 69
5.4 Learned transition probabilities for DWOC model for ASD/TD subjects . . . 71
5.5 Fixation prediction performance of DWOC model for ASD/TD subjects . . . 72
5.6 Social video dataset sample frames . . . 73
5.7 Learned transition probabilities of ASD/TD subjects on social video dataset . . . 75
5.8 Fixation prediction performance of ASD/TD subjects on social video dataset . . . 76
5.9 Comparison of learned transition probabilities of age/IQ/gender-matched ASD/TD pairs for DWOC model . . . 77
5.10 Combining head and gaze transitions into social transition probabilities . . . 78
5.11 Comparison of learned transition probabilities of age/IQ/gender-matched ASD/TD pairs for DWOC model without gaze . . . 79
5.12 Comparison of fixation prediction of age/IQ/gender-matched ASD/TD pairs for the AWS saliency model . . . 80
5.13 Illustration of gaze, head, and saliency cue competition, and the removal of saliency as a cue . . . 80
5.14 Comparison of learned transition probabilities of age/IQ/gender-matched ASD/TD pairs for DWOC model with no saliency . . . 81

List of Tables

3.1 Summary of results for experiments 1 & 2 . . . 37
4.1 Head pose detection performance . . . 54
Abstract
The direction of gaze is an important means of communication among humans and apes. Although several studies have examined gaze direction in the context of artificial scenes, few have looked at natural scenes, and none have addressed saliency as a confound. Here, gaze direction is shown to be more important than saliency in modulating saccade direction, though not saccade endpoint, establishing that the gaze of actors in a scene biases the viewing behavior of observers. With this causal role established, gaze direction is integrated with other cues (head pose and saliency) into a Markov chain model that conditions on the current fixation location state (head or nonhead) and uses learned cue maps and transition probabilities to predict the next fixation. This model, named the Dynamic Weighting of Cues (DWOC) model, achieves state-of-the-art fixation prediction performance on a dataset of natural images and reveals a strong tendency for observers to remain in the same state from one fixation to the next. The DWOC model is then used to quantify differences between autism spectrum disorder (ASD) participants and a baseline population. Finally, the model is used to distinguish individual ASD participants from typically developing (TD) participants using a small amount of eye-tracking data (under 3 minutes), as might be collected in a screening procedure. This analysis establishes a differential preference between the two populations for "social" (head and gaze) cues over salient cues.
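To make the Markov chain formulation summarized above concrete, the following minimal Python sketch combines head pose, gaze following, and saliency cue maps using a transition table conditioned on whether the current fixation lies on a head. The function name, the equal weighting of the head and gaze maps, and the transition values are illustrative placeholders only, not the dissertation's actual DWOC parameters or implementation.

```python
import numpy as np

# Hypothetical learned transition probabilities P(next state | current state).
# The numbers below are placeholders, not values estimated in this work.
TRANSITIONS = {
    "head":    {"head": 0.6, "nonhead": 0.4},
    "nonhead": {"head": 0.3, "nonhead": 0.7},
}

def predict_next_fixation_map(current_state, head_map, gaze_map, saliency_map):
    """Combine per-pixel cue maps into a single next-fixation probability map.

    Each input map is a 2-D array normalized to sum to 1. The head and gaze
    maps vote for head / gazed-at regions ("head" state); the saliency map
    votes for the rest of the scene ("nonhead" state).
    """
    assert current_state in TRANSITIONS
    p = TRANSITIONS[current_state]
    social = 0.5 * (head_map + gaze_map)  # equal head/gaze weighting (assumption)
    combined = p["head"] * social + p["nonhead"] * saliency_map
    return combined / combined.sum()

# Example: observer currently fixating a head in a 40 x 60 image grid,
# with random cue maps standing in for real model outputs.
h, w = 40, 60
rng = np.random.default_rng(0)
head_map, gaze_map, saliency_map = (m / m.sum() for m in (rng.random((h, w)) for _ in range(3)))
pred = predict_next_fixation_map("head", head_map, gaze_map, saliency_map)
print(pred.shape, np.unravel_index(pred.argmax(), pred.shape))
```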