Facial Gesture Analysis In An Interactive Environment

by Wei-Kai Liao

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2008
Copyright 2008 Wei-Kai Liao

Dedication

To my family and in memory of my loving mother, Su-Chen Wang, who passed away during my years at USC.

Acknowledgements

"The acknowledgement is the most important part of a thesis." Many years ago, someone told me this and I thought he was joking. After years of doing research, suffering the frustrations that most PhD students will experience, and writing this thesis, I realize that without the help of many people, completing this thesis would have been impossible, or at least would have taken me much more effort and time.

First of all, I would like to thank my advisor, Professor Gérard Medioni. He taught me what substantial engineering research should be, and I will keep this in mind for my entire career. His insight into computer vision has inspired me in numerous ways and transformed my view of this field. I also thank my previous advisor, Dr. Isaac Cohen. Before he left USC for Honeywell, I had great days working with him. It was he who led me into the topic of this thesis and taught me the foundations of doing research. I deeply appreciate Professor Ram Nevatia. His perspectives on computer vision extended my view of this field, and his advice and suggestions have shaped my thinking and work. I am very grateful to my committee member, Professor C.-C. Jay Kuo. I benefited a lot from his expertise. I thank Professor Karen Liu and Professor Alexander Tartakovsky for their time and effort serving on my qualifying exam committee.

This thesis contains several materials from cooperation with other people. Among them, I give special thanks to Dr. Douglas Fidaleo. I really enjoyed working with him and learned a lot from his expertise in vision and graphics. I have worked with Chi-Wei Chu on the ICT project for several years and appreciate his company. I have to mention Imane Idbihi and Anustup Kumar Choudhury. They made indelible contributions to the project with the California Science Center. I thank Dr. Kwangsu Kim and Dr. Alexandre Francois for pleasant cooperation on the ETRI project. I also thank Dr. Philippos Mordohai and Adit Sahasrabudhe for valuable conversations about Tensor Voting and manifold learning.

During my PhD study, I was fortunate to be surrounded by extraordinary people in the USC vision lab. Matheen Siddiqui has been my friend since my first day in the PhD program. He is such a nice person and a knowledge source for vision and other questions. I appreciate YuPing Lin's friendship, as well as our numerous conversations about research. I am grateful for working with, and for the companionship of, current and past members of our lab: Dr. Jinman Kang, Dr. Mun Wai Lee, Dr. Fengjun Lv, Dr. Chang Yuan, Dr. Changki Min, Dr. Tae Eun Choe, Elena Dotsenko, Dr. Sung Chun Lee, Dr. Jongmoo Choi, Cheng-Hao Kuo, Bo Wu, Qing Yu, Pradeep Natarajan, Paul Hsiung, Vivek Kumar Singh, Li Zhang, Yuan Li, Eunyoung Kim, Thang Ba Dinh, Derya Ozkan, and Nancy Levien.

Last but not least, I would like to thank my family and friends, especially my parents. Without their endless support, pursuing my dream would not have been possible. It is a pity that my mother passed away during my years at USC and cannot share this moment with me. Her love and legacy will always be remembered by me and my family. Finally, my deepest gratitude goes to my girlfriend, Hui-Ching Chuang.
Her encouragement has been my warmest support in these years.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Issues and Difficulties
  1.2 Organization of the Document
  1.3 Literature Review
    1.3.1 Early Work
    1.3.2 Recent Work
    1.3.3 Psychological Findings
    1.3.4 Limitation
  1.4 Overview of the Framework

Chapter 2: 3D Rigid Face Tracking
  2.1 Introduction
  2.2 Previous Work
  2.3 Intensity- vs. Feature-Based Trackers
    2.3.1 Intensity-Based Trackers
    2.3.2 Feature-Based Trackers
    2.3.3 Comparison
  2.4 Our Hybrid Tracker
    2.4.1 Integrating Multiple Visual Cues
    2.4.2 Efficient Solution
    2.4.3 Improving the Robustness
    2.4.4 Local Features
    2.4.5 Automatic Initialization and Reacquisition
  2.5 Experiments
    2.5.1 Evaluation with Synthetic Sequences
    2.5.2 Evaluation with Real Sequences
    2.5.3 Infrared Sequences and Application
    2.5.4 Automatic Initialization and Reacquisition
  2.6 Summary

Chapter 3: Automatic Classification of Facial Gestures
  3.1 Introduction
  3.2 Previous Work
  3.3 The Region-Based Face Model
  3.4 Modeling Facial Deformations with a Graphical Model
    3.4.1 Modeling Intra-Region Dynamics with the Affine Motion Model
    3.4.2 Modeling Inter-Region Dependency with the Gaussian Mixture
    3.4.3 Learning the Graphical Model
  3.5 The Classification Framework
    3.5.1 The Complete Observation Case
    3.5.2 The Belief Propagation Based Inference Mechanism
    3.5.3 The Incomplete Observation Case
  3.6 Experiments
    3.6.1 The Inference Mechanism
    3.6.2 Classification Results
    3.6.3 More Challenging Cases
  3.7 Application to Human-Robot Interaction
  3.8 National Traveling Exhibition
  3.9 Summary
    3.9.1 Limitations

Chapter 4: 3D Face Tracking and Expression Inference Using Manifold Learning
  4.1 Introduction
  4.2 Previous Work
    4.2.1 Deformable Face Model
    4.2.2 Nonlinear Manifold Learning
  4.3 Manifolds of 3D Facial Deformations
  4.4 Tensor Voting and Nonlinear Manifold Learning
    4.4.1 Review of N-D Tensor Voting
    4.4.2 Traversing the Manifold with Tangent Vectors
  4.5 Modeling the Nonrigid Facial Deformation with a Set of 1D Manifolds
  4.6 Tracking the Nonrigid Facial Deformation with Head Pose
  4.7 Experiments
    4.7.1 Evaluation of the Proposed Method
    4.7.2 Tracking
    4.7.3 Synthesis
  4.8 Summary

Chapter 5: Conclusions and Future Directions
  5.1 Summary of Contributions
  5.2 Future Directions

References

List of Tables

3.1 Number of Gaussians for GMM
3.2 The confusion matrix of the proposed classification framework
4.1 Estimated dimensionality of 3D facial deformations using Tensor Voting

List of Figures

1.1 Example of motion on the face
1.2 Overview of processing facial motions
2.1 Comparison between the feature- and intensity-based trackers
2.2 Example of synthetic sequences used for experiments
2.3 The estimated pose for synthetic sequences
2.4 Average error for synthetic sequences
2.5 Evaluation on the BU database
2.6 Evaluation on collected real sequences
2.7 Comparison of intensity-based tracker and hybrid tracker
2.8 Setting of IR camera environment
2.9 Tracking results of IR sequences
2.10 Automatic initialization of 3D head tracker
2.11 Automatic reacquisition of 3D head tracker
3.1 System flow chart of the first generation system
3.2 The region-based face model
3.3 Graphical model for the region-based face model
3.4 Empirical joint density
3.5 The information criteria for selecting the number of Gaussians
3.6 KL distance for message m_ij
3.7 KL distance for marginal distributions
3.8 KL distance for messages m_21, m_41, and m_45
3.9 KL distance for marginal distributions P(x_i | Z_obs) of regions 1, 2, and 4
3.10 Classification result
3.11 Classification result for a partial occlusion case
3.12 Classification result for a spontaneous case
3.13 Personal service robot
3.14 Setup of the expression recognition system
3.15 Examples of stimuli
3.16 Examples of recorded spontaneous expressions
4.1 System flow chart of the second generation system
4.2 Shape model. The red points indicate the landmarks
4.3 Learned manifolds of 3D facial deformations
4.4 Vote generation of stick and ball tensor
4.5 Generic voting
4.6 Example of Tensor Voting result in 2D
4.7 Traversing the manifold with tangent vectors
4.8 The graphical model of facial deformations
4.9 The graphical model of facial motions
4.10 Quantitative evaluation of our proposed method
4.11 Tracking results of the deformation manifold based approach
4.12 Probability and manifold coordinate for "surprise"
4.13 Probability and manifold coordinate for "smile"
4.14 Synthesized shape for left-eye blinking
4.15 Synthesized shape for surprise

Abstract

This research focuses on tracking, modeling, quantifying, and analyzing facial motions for gesture understanding. Facial gesture analysis is an important problem in computer vision, since facial gestures carry signals besides words and are critical for nonverbal communication. The difficulty of automatic facial gesture recognition lies in the complexity of face motions. These motions can be categorized into two classes: global, rigid head motion, and local, nonrigid facial deformations. In reality, observed facial motions are a mixture of these two components. In this work, we propose a framework that takes both of these motions into account. The whole framework consists of three components: 3D head pose estimation, modeling local deformations, and expression classification.

We propose a novel hybrid 3D head tracking algorithm to differentiate these two motions. The hybrid tracker integrates both intensity and feature correspondence information for robust real-time head pose estimation.
Based on this tracker, we classify video segments into expressions by learning a graphical representation for nonrigid facial motions. The graphical model characterizes each face region by dense motion fields, and encodes inter-region dependencies in joint density functions. This graphical model is learned empirically using the EM algorithm, and expression recognition is performed by Bayesian MAP estimation.

In addition, rigid and nonrigid motions are analyzed simultaneously in 3D by manifold learning techniques. We decompose nonrigid facial deformations on a basis of 1D manifolds. Each 1D manifold is learned offline from sequences of labeled basic expressions, such as smile, surprise, etc. Any expression is then a linear combination of values along these axes, with coefficients representing the level of activation. We experimentally verify that expressions can indeed be represented this way, and that individual manifolds are indeed 1D. The manifold learning and dimensionality estimation are all implemented in the N-D Tensor Voting framework. The output of our system is a rich representation of the face, including the 3D pose, 3D shape, expression label with probability, and the activation level.

Chapter 1: Introduction

Faces convey rich information about humans' mental status. People reveal their intentions, concerns, reactions, and feelings through facial gestures. Facial gesture is a fundamental element of our daily social interaction; it carries important signals behind the words. This is critical for nonverbal communication. Thus, it elicits widespread interest in diverse fields, ranging from emotion analysis and social behavior to humanoid robots, facial animation, perceptual interfaces, and multi-modal human-computer interaction (HCI).

The goal of this research is to track, model, quantify, and analyze facial motions for automatic gesture understanding from images and video sequences. Facial gestures are defined as visual appearance changes on the face, and such appearance changes cause observable motion on the face. Facial motion thus characterizes the spatio-temporal behavior of the face, and is a descriptive measure for analyzing facial gestures. By tracking and modeling facial motions, we extract meaningful parameters and interpret them as facial gestures or expressions.

Figure 1.1: An example of the motion on the face. (a) Neutral face in frontal pose. (b) Facial expression with head motion.

1.1 Issues and Difficulties

The difficulty here is the complexity of the motion presented on the face. Face motion can be characterized in two classes:

1. The global, rigid head motion
2. The local, nonrigid facial deformations

Figure 1.1 illustrates a sample motion of a face. Looking at these two images, we observe two different motions: the out-of-plane rotation of the head, and the nonrigid soft-tissue deformation of the face. The global motion comes from the change of 3D head pose. People adjust their heads to face an interesting object, move and/or rotate the head because of shock and reaction, or even unconsciously. On the other hand, the local motion comes from the change of the face appearance. For each expression, facial muscles are activated to pull or stretch the soft tissues and the skin, and to move the eyes and mouth. This changes the appearance of skin patches and results in detectable nonrigid motion on the face.

The challenging issues are:

- These two motions are usually coupled together. A facial expression usually comes with a natural head motion.
Most existing literature focuses on only one side: either estimating the head pose, or tracking and classifying the local deformation. However, in the context of real-world applications such as HCI, the observed motion is a mixture of these two, and considering only one component results in an inaccurate understanding of facial gestures.

- Global head motion may cause occlusion. Since the head can move in the environment freely, there may be occlusions, such as self-occlusion in the profile or side-view pose, or occlusion by the hand. In this case, we have only a partial observation of the face. Previous work in the literature assumes either that the whole face is observable, or that all considered measurements of facial features are reliable. Such assumptions restrict the usage of existing approaches in the incomplete measurement case, and hence prevent applying them in real-world applications such as computer-aided tutoring.

- Local facial deformations have many degrees of freedom. The human face is a deformable object, and thus the nonrigid facial deformation has many degrees of freedom. Depending on the parameterization, the nonrigid face shape and deformation is represented by a complex model or a very high dimensional vector. Finding an appropriate model of deformable objects is not trivial, since it is a tradeoff among accuracy, robustness, and computational efficiency. Besides, inference and classification with such parameterizations are problematic, due to the "curse of dimensionality" and computational complexity issues.

In this work, the proposed approaches take both of these motions into account. We differentiate these two motions, and model and track them from monocular image sequences. Facial gesture recognition is performed by combining the understanding of these two components. The partial observation issue is also addressed. The resulting system thus scales to real-world, unconstrained facial gesture classification scenarios.

1.2 Organization of the Document

This thesis is organized as follows. We start with a review of related work in section 1.3. Section 1.4 introduces the overview of our approach: we present a brief overview of the three important components of our framework, 3D head pose estimation, modeling nonrigid facial deformation, and expression classification. The following chapters are devoted to the details of each component and related experiments. Chapter 2 presents the 3D face tracker; chapter 3 presents a region-based approach to model nonrigid facial motion and a Bayesian classifier for facial expressions. In addition, chapter 4 presents a novel framework to model the nonrigid facial deformations by manifold learning techniques. Based on this framework, the head pose, nonrigid face shape, expression label with probability, and activation level are tracked simultaneously in 3D. Finally, conclusions and future directions are discussed in chapter 5.

1.3 Literature Review

In the computer vision literature, many research efforts focus on automatic facial gesture analysis (see [29, 62, 75] for detailed reviews). In this section, we present a brief overview of related work on facial gesture recognition. For 3D face tracking, modeling nonrigid facial deformations, and manifold learning based approaches, reviews of previous work are presented in sections 2.2, 3.2, and 4.2.

1.3.1 Early Work

Black and Yacoob [7] used local parameterized models to recover the nonrigid motion of facial features and derived mid-level predicates from the local parameters. These predicates are the inputs of their rule-based classification system.
Essa et al. [27] used optical-flow-based spatio-temporal motion energy templates for expression recognition. In [20], Donato et al. presented a comparison between different approaches to represent facial gestures, including optical flow analysis, holistic spatial analysis, and local representation.

1.3.2 Recent Work

More recently, Cohen et al. presented a system for facial expression recognition from videos in a series of papers [14, 13]. In [14], they used Naive Bayes and Tree-Augmented Naive Bayes classifiers for expression recognition. They also proposed a new multi-level HMM architecture to capture the temporal pattern of expressions and segment the video automatically. In [13], they presented a classification-driven stochastic structure search algorithm to learn the dependence structure of a Bayesian network, and hence applied a generative Bayesian network classifier for classification. In [12], Chang et al. proposed a probabilistic model for the expression manifold. The idea of the expression manifold comes from the observation that facial expressions form a smooth manifold in a very high dimensional image space, and similar expressions are points in a local neighborhood on the manifold. An expression sequence becomes a patch on the manifold, and they build a probabilistic transition model to determine the likelihood. In [84], Zalewski and Gong proposed a hierarchical decomposition of the human face into three components: mouth, left eye, and right eye. To classify the expression, they first inferred the status of each component, and then combined the estimated statuses of the components for recognizing the facial expression. In [4], Bartlett et al. presented an evaluation of different machine learning techniques for expression recognition. They compared AdaBoost, SVM, and Linear Discriminant Analysis. The best results were obtained by using AdaBoost to select a feature subset from Gabor filter responses and then classifying the facial expressions with an SVM. In [22], Dornaika and Davoine proposed a particle filter based approach for tracking the facial features on a moving head. The proposed method first estimates the 3D head pose using Online Appearance Models, and then tracks the facial actions and estimates the expression simultaneously.

1.3.3 Psychological Findings

To interpret expressions and facial gestures, the most famous approach in psychology and behavioral science is the Facial Action Coding System (FACS) [24]. This system defines 44 action units (AUs) on the human face and interprets human expressions as different combinations of AUs. However, training and coding of human expressions are performed manually and are very time-consuming. Another way of interpretation is based on primitive expressions. Ekman and Friesen claimed there are 6 basic "universal emotions": happiness, sadness, fear, anger, disgust, and surprise [23].

1.3.4 Limitation

As Pantic and Rothkrantz pointed out, the most significant limitation of these systems is that they usually rely on a frontal view of face images [62]. Varying head poses and non-frontal faces decrease system performance. Moreover, if the user is in an interactive environment, head motion is a natural component of the interaction that cannot be ignored, and solutions proposed in the literature do not apply. Recent work tries to overcome this limitation. In [76], Tong et al. presented a dynamic Bayesian network (DBN) approach to model the head pose, 3D shape, 2D shape and face components, muscle movements, AU relations and interactions.
Multiple image measurements are used, including the detected face and eyes, responses of Gabor filters, an active shape model, etc. Using probabilistic inference, it shows improvement for inferring facial activities, such as head pose estimation and expression recognition.

1.4 Overview of the Framework

The goal of this work is to analyze facial gestures by studying facial motion. We propose a framework to understand the rigid global head motion and the nonrigid local facial deformation, and to combine the understanding of these two components for interpreting expressions and facial gestures. The whole framework consists of three modules: 3D head pose estimation, modeling and tracking local deformation, and classification. Figure 1.2 shows the overview of our approach:

Figure 1.2: Overview of processing facial motions

- 3D head pose estimation (see chapter 2, published in [44]):

The first part is a 3D face tracker. The tracker estimates the 3D head pose from a monocular video. In order to satisfy the real-time and robustness constraints, we propose a new tracking algorithm based on the integration of multiple visual cues. Instead of relying on any single source of information, we reformulate the 3D pose estimation problem as a hybrid of intensity- and feature-based constraints. An efficient algorithm is proposed, and the implementation has proven fast enough for many interactive applications. The hybrid tracker is also demonstrated to be more robust than existing tracking algorithms. Through a detailed evaluation with synthetic and real video sequences, the hybrid tracker is shown to be superior in both the varying illumination and the expression change cases. Thus it is suitable for tracking the global head motion for facial gesture analysis. Besides, we also test it in an infrared (IR) camera setting. This is a real-world HCI application in a theater environment for training and mission rehearsal. All existing face trackers perform poorly in this setting, due to the high noise level, low contrast, and low image quality, whereas our face tracker can track the head pose to identify the subject's focus successfully, and in real time.

- Modeling nonrigid facial deformations and classifying expressions (see chapter 3, published in [42, 43]):

Once the face is tracked in 3D, it can be registered and warped back to the reference frame, based on the estimated 3D pose. This cancels the coupled global head motion, and the residual motion on the face is the local nonrigid component. We proposed the first generation expression recognition system based on this idea:

  - Modeling nonrigid deformation using a region-based representation and a graphical model. To model local deformations, we divide the face into 9 regions, each characterizing a part of the face. The observed motion inside each region is thus more homogeneous, and we model it by the affine motion model. This parametric representation of the motion provides a robust description of the intra-region facial deformations. However, an expression is a holistic behavior of the face, and considering only the local description is not sufficient, since there exist interdependencies between face regions. Hence, we construct a latent variable graphical model to formulate such inter-region dynamics. The region-based face model characterizes the human face, and it has a natural connection to the graphical model. The graphical model is a generative model for structured data as well as a powerful tool for facial gesture classification.
The joint density functions are modeled as Gaussian mixtures, and they are learned empirically from collected data using the EM (Expectation Maximization) algorithm.

  - Classification. After the first two stages, a set of feature vectors is extracted to represent the spatio-temporal deformation of the face. A facial gesture is an interpretation of this phenomenon. We build a Bayesian classifier to recognize the gesture based on our training database. In this work, we consider the six "universal" expressions: anger, disgust, fear, happiness, sadness, and surprise [23]. The Gaussian mixture modeling (GMM) approach is applied to model the densities of the graphical model, and a modified Expectation-Maximization (EM) algorithm is used to learn these densities empirically. For classification, we use maximum a posteriori (MAP) estimation based on the trained graphical model and the extracted feature vectors (see the sketch following this list of components). Handling occlusion is another issue. Classifying the expressions of a partially occluded face is challenging, as it requires inference of the emotional state from incomplete observations. We augment the latent variable graphical model formalism with a belief propagation algorithm to infer missing data, as well as to correct erroneous estimations of the local deformations. Under this situation, MAP estimation is still applicable for gesture recognition.

- Modeling nonrigid deformations using manifold learning (see chapter 4, published in [45]):

The first generation facial expression recognition system has some limitations. Its performance can be further improved if we can incorporate semantic facial features, such as the eyes and mouth, and better characterize the nonrigid facial deformations. Besides, the interaction between the rigid and nonrigid motions should be taken into account. Conditioned on one motion component, the inference of the other can be further refined by Bayesian estimation. For example, if the estimated 3D head pose has an error due to a facial expression change, this error may not be corrected in the first generation system, and thus introduces a bias for classification. Modeling the relation between these two motions can overcome this limitation. Thus, the second generation system is developed to better characterize the nonrigid facial deformations and the relation between these two motions.

We propose a person-dependent, manifold-based approach for modeling and tracking rigid and nonrigid 3D facial deformations from a monocular video sequence. The rigid and nonrigid motions are analyzed simultaneously in 3D, by automatically fitting and tracking a set of landmarks. We do not represent all nonrigid facial deformations as a single complex manifold, but instead decompose them on a basis of eight 1D manifolds. Each 1D manifold is learned offline from sequences of labeled expressions, such as smile, surprise, etc. Any expression is then a linear combination of values along these 8 axes, with a coefficient representing the level of activation. We experimentally verify that expressions can indeed be represented this way, and that individual manifolds are indeed 1D. The manifold dimensionality estimation, manifold learning, and manifold traversal operations are all implemented in the N-D Tensor Voting framework. Using simple local operations, this framework gives an estimate of the tangent and normal spaces at every sample, and provides excellent robustness to noise and outliers. The output of our system, besides the tracked landmarks in 3D, is a labeled annotation of the expression. We demonstrate results on a number of challenging sequences.
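To make the classification stage referenced above concrete, the following minimal Python sketch illustrates MAP expression classification with class-conditional Gaussian mixtures fitted by EM. It is illustrative only and not the thesis implementation (which is described in chapter 3 and implemented in C++): the feature dimensionality, the number of mixture components, the uniform priors, and the random placeholder data are assumptions, and scikit-learn's GaussianMixture stands in for the modified EM algorithm used in this work.

import numpy as np
from sklearn.mixture import GaussianMixture

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def train_class_densities(features_by_class, n_components=3):
    # Fit one GMM per expression with EM. Each row of X is a feature vector
    # describing the spatio-temporal deformation of the face regions.
    models = {}
    for label, X in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(X)
        models[label] = gmm
    return models

def classify_map(models, x, priors=None):
    # MAP decision: argmax over classes of log p(x | class) + log P(class).
    if priors is None:
        priors = {c: 1.0 / len(models) for c in models}
    scores = {c: m.score_samples(x[None, :])[0] + np.log(priors[c])
              for c, m in models.items()}
    return max(scores, key=scores.get)

# Usage with hypothetical placeholder data (12-D feature vectors):
rng = np.random.default_rng(0)
fake_training = {c: rng.normal(size=(50, 12)) for c in EXPRESSIONS}
models = train_class_densities(fake_training)
print(classify_map(models, rng.normal(size=12)))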
The proposed approach is a unified formalism that handles the complete observation and partial occlusion cases, and allows for the recognition of facial gestures in the presence of head motion and partial occlusions of the face. Based on this system, each component of the motion can be tracked and modeled. The global head motion can be interpreted as the user's attention and reaction, and the local facial deformations define the expression.

Chapter 2: 3D Rigid Face Tracking

2.1 Introduction

3D face tracking is a fundamental component for solving many computer vision problems. The estimated 3D pose is useful for various face-related applications. For example, in human-computer interaction, the 3D pose can be used to determine the user's attention and mental status. For expression analysis and face recognition, the 3D head pose can be used to stabilize the face as a preprocessing step. The pose estimate can also assist in 3D face reconstruction from a monocular video.

For real-world applications, there are several constraints besides tracking accuracy, including the computational efficiency and the robustness of the tracker. For real-time or interactive applications, the tracker must be computationally efficient. Robustness can be defined in several ways, including robustness to noise, stability on textureless video, insensitivity to illumination changes, and resistance to expression changes or other local nonrigid deformation. The tracker should be able to run continuously for long sequences, requiring a mechanism to prevent drift and error accumulation.

In this chapter, we propose a hybrid tracker for 3D face tracking. Instead of relying on any single channel of information, the hybrid tracker integrates different visual cues for face tracking. This idea is inspired by detailed comparisons between existing state-of-the-art head trackers [77, 83]. Feature-based methods such as [70, 77] depend on the ability to detect and match the same features in subsequent frames and keyframes. The quantity, accuracy, and face coverage of the matches fully determine the recovered pose quality. In contrast, intensity-based methods such as [83] do not explicitly require feature matching, but expect brightness consistency between the same image patches in different images to compute the implicit flow of pixels. These two methods are extensively examined in our experiments. Empirical observation suggests that there is no dominant one among the existing face tracking algorithms; each tracker has its own strengths but also comes with its weaknesses. Thus, by design, the hybrid tracker is expected to overcome the flaws of the single channel trackers while retaining their strengths. This is clearly demonstrated in our experiments.

The rest of this chapter is organized as follows. We start with a literature review of related work in section 2.2. Next, section 2.3 discusses the intensity- and feature-based 3D head tracking approaches and compares their differences. Based on empirical observation, a hybrid tracking algorithm is proposed. The details of this algorithm are given in section 2.4. The proposed hybrid tracker, along with the intensity- and feature-based trackers, has been examined thoroughly in various experiments. These results are presented in section 2.5. Finally, the summary and conclusion are given in section 2.6.

2.2 Previous Work

The performance of a face tracker is affected by many factors.
While higher level choices, such as whether or not to use keyframes, how many to use, and whether to update them online, can alter the accuracy (and speed) of the tracker, a more fundamental issue is the optimization algorithm and the related objective functional. Most state-of-the-art face tracking algorithms are affected by these three factors:

- Prior knowledge of the approximate 3D structure of the subject's face. In [30], Fidaleo et al. have shown that the accuracy of the underlying 3D model can dramatically affect the tracking accuracy of a feature driven tracker. Much of the performance difference between tracking methods can be attributed to the choice of model: planar [7], ellipse [5, 60, 61], cylinder [11, 83], and generic face or precise geometry [77]. The 2D plane is very simple, but its lack of 3D structure introduces error in the out-of-plane rotation case. The 3D ellipse and cylinder are considered good 3D approximations to the human head. One advantage of using an ellipse or cylinder head model is the ease of alignment, due to the simplicity of the model. Precise geometry with good initial alignment attains the best performance, but precise geometry and a high quality initial alignment may not be available in every case. When the alignment becomes bad, the tracking accuracy drops. The same is observed with a generic head model, and it becomes worse since there is also an inconsistency between the model and the real head.

- Observed data in the 2D image. The tracker relies on this information to estimate the head pose. This includes feature locations [70, 77], intensity values in a region [11, 60, 61, 69, 83], or estimated motion flow fields [5, 17].

- The computational framework, which can be roughly divided into deterministic methods and stochastic methods [47]. For deterministic methods, an error function is defined using the observed 2D data and the corresponding estimated 2D data. Pose parameters are adjusted to minimize this error function. Most of the deterministic methods use a non-linear optimization approach, which relies on gradient based methods such as Gauss-Newton or Levenberg-Marquardt. The scheme adopted (line search or trust region) and the way the first- and second-order derivatives are computed strongly affect the convergence, efficiency, and accuracy of the method. On the other hand, stochastic estimation methods such as particle filtering (sequential Monte Carlo) and Markov Chain Monte Carlo define observation and transition models for tracking. Model fitness and the quality of the estimated model parameters determine the tracking accuracy, and the efficiency depends on model complexity and the filtering algorithm. In short, deterministic methods are typically more computationally efficient, while stochastic methods are more resistant to local minima.

2.3 Intensity- vs. Feature-Based Trackers

This section compares the intensity- and feature-based trackers. To prepare the reader, we first review the individual algorithms. The selected representative algorithms are [83] and [77] for the intensity- and feature-based methods, respectively. The fundamental concepts of these trackers are summarized, and the reader is referred to the original papers for the specific details.

2.3.1 Intensity-Based Trackers

The intensity-based tracker performs optimization based on the brightness constraint. To be more specific, let $\mu = (t_x, t_y, t_z, \omega_x, \omega_y, \omega_z)^T$ be the motion vector specifying the 3D head pose.
Given the pose in frame $t-1$, $\mu_{t-1}$, we define an error function $E_t(\Delta\mu)$ for $\Delta\mu$, the incremental pose change between frames $t-1$ and $t$, as

E_t(\Delta\mu; \mu_{t-1}) = \sum_{p \in \Omega} \| I_{t-1}(F(p, 0; \mu_{t-1})) - I_t(F(p, \Delta\mu; \mu_{t-1})) \|_2^2   (2.1)

where $\Omega$ is the face region and $p$ is the 3D position of a point on the face. $F = P \circ M$, where $M(p; \Delta\mu)$ transforms the 3D position of $p$ as specified by $\Delta\mu$, and $P$ is a weak perspective projection. $I_t(\cdot)$ and $I_{t-1}(\cdot)$ are frames $t$ and $t-1$, respectively.

This error function measures the intensity difference between the previous frame and the transformed current frame. If intensity consistency is maintained and the intensity noise is Gaussian distributed, the minimum of this 2-norm error function is guaranteed to be the optimal solution. Thus, by minimizing this error function with respect to the 3D pose, we can estimate the change of 3D pose and recover the current pose.

Off-line information can also be integrated into the optimization, similar to Vacchetti et al. [77]. The error function $E_k(\Delta\mu)$:

E_k(\Delta\mu; \mu_{t-1}) = \sum_{i=1}^{N_k} \alpha_i \left[ \sum_{p \in \Omega} \| I_i(F(p, 0; \mu_i)) - I_t(F(p, \Delta\mu; \mu_{t-1})) \|_2^2 \right]   (2.2)

is defined between the current frame and the keyframes. $N_k$ is the number of keyframes. $I_i(\cdot)$ and $\mu_i$ are the frame and pose of the $i$-th keyframe. This error function can use both off-line and on-line generated keyframes for estimating the head pose. A regularization term

E_r(\Delta\mu; \mu_{t-1}) = \sum_{p \in \Omega} \| F(p, 0; \mu_{t-1}) - F(p, \Delta\mu; \mu_{t-1}) \|_2^2   (2.3)

can also be included to impose a smoothness constraint on the estimated motion vector. The final error function for optimization is a combination of (2.1), (2.2), and (2.3):

E_{int} = E_t + \lambda_k E_k + \lambda_r E_r   (2.4)

where $\lambda_k$ and $\lambda_r$ are weighting constants. This is a nonlinear optimization problem, and iteratively reweighted least squares is applied.

2.3.2 Feature-Based Trackers

The feature-based tracker minimizes the reprojection error of a set of 2D and 3D points matched between frames. A keyframe in [77] consists of a set of 2D feature locations detected on the face with a Harris corner detector and their 3D positions estimated by back-projecting onto a registered 3D tracking model. The keyframe accuracy depends on both the model alignment in the keyframe image and the geometric structure of the tracking mesh. These points are matched to patches in the previous frame and combined with keyframe points for pose estimation. The reprojection error for the keyframe feature points is defined as:

E_{k,t} = \sum_{p \in \kappa} \| m^p_t - F(p; \mu_t) \|_2^2   (2.5)

where $\kappa$ is the set of keyframe feature points, $m^p_t$ is the measured 2D feature point corresponding to the keyframe feature point $p$ at frame $t$, and $F(p; \mu_t)$ is the projection of $p$'s 3D position using pose parameters $\mu_t$.

To reduce the jitter associated with single keyframe optimization, additional correspondences between the current and previous frame are added to the error term:

E_t = \sum_{p \in \kappa} \left( \| n^p_t - F(p; \mu_t) \|_2^2 + \| n^p_{t-1} - F(p; \mu_{t-1}) \|_2^2 \right)   (2.6)

where the 3D position for the new points is estimated by back-projection to the 3D model at the current pose estimate. The two terms are combined into the final error functional:

E_{fpt} = E_{k,t} + E_{k,t-1} + E_t   (2.7)

which is minimized using nonlinear optimization.

2.3.3 Comparison

Both tracking methods are model based, using an estimate of the 3D shape of the face and its projection onto the 2D image plane to define a reprojection error functional that is minimized using a nonlinear optimization scheme. The forms of the error functionals are nearly identical, differing only in the input feature space on which the distance function operates.
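To make this contrast concrete, the sketch below shows the two kinds of residuals side by side in Python/NumPy. It is a simplified illustration, not the thesis's C++ tracker: the project and sample_intensity helpers are hypothetical stand-ins for the composition of the projection P with the rigid transform M and for image sampling, pose increments are composed additively for brevity, and nearest-neighbour sampling replaces proper interpolation.

import numpy as np

def project(points3d, pose):
    # Simplified weak-perspective projection; pose = (tx, ty, tz, wx, wy, wz).
    wx, wy, wz = pose[3:]
    Rx = np.array([[1, 0, 0], [0, np.cos(wx), -np.sin(wx)], [0, np.sin(wx), np.cos(wx)]])
    Ry = np.array([[np.cos(wy), 0, np.sin(wy)], [0, 1, 0], [-np.sin(wy), 0, np.cos(wy)]])
    Rz = np.array([[np.cos(wz), -np.sin(wz), 0], [np.sin(wz), np.cos(wz), 0], [0, 0, 1]])
    p = points3d @ (Rz @ Ry @ Rx).T + pose[:3]
    scale = 1.0 / max(p[:, 2].mean(), 1e-6)   # one common depth for the whole face
    return scale * p[:, :2]

def sample_intensity(image, uv):
    # Nearest-neighbour sampling; bilinear interpolation would be used in practice.
    h, w = image.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return image[v, u].astype(float)

def intensity_residuals(I_prev, I_cur, face_points3d, pose_prev, delta_pose):
    # Residuals of eq. (2.1): brightness difference between the previous frame
    # and the current frame warped by the candidate pose increment.
    # (Additive pose composition is a simplification for this sketch.)
    ref = sample_intensity(I_prev, project(face_points3d, pose_prev))
    cur = sample_intensity(I_cur, project(face_points3d, pose_prev + delta_pose))
    return cur - ref

def feature_residuals(keyframe_points3d, measured_uv, pose):
    # Residuals of eq. (2.5): reprojection distance of matched keyframe features.
    return (project(keyframe_points3d, pose) - measured_uv).ravel()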
Figure 2.1 illustrates the difference between these two trackers.

Figure 2.1: Difference in optimization source data for the feature-based tracker, T_F, and the intensity-based tracker, T_I. Given a set of key feature points defined on a 3D model, and their projection, T_F minimizes the total distance to matched feature points in pixel space. T_I computes the pose that minimizes the total intensity difference of pixels under the feature points.

For the feature-based tracker, the reprojection error is measured as the feature distance between a set of key 2D features and their matched points in the new image. The tracker relies on robust correspondence between 2D features in successive frames and keyframes, and thus the effectiveness of the feature detector and the matching algorithm is critical for the success of the tracker. In [77], Vacchetti et al. used the standard eigenvalue-based Harris corner detector. Using a more efficient and robust detector should improve the feature-based tracker.

In contrast, the intensity-based tracker uses the brightness constraint between similar patches in successive images and defines the error functional in terms of intensity differences at sample points.

To determine the role of this input space on tracking accuracy, we perform a set of controlled experiments on synthesized motion sequences (see section 2.5 for details). Feature-based methods are generally chosen for their stability under changing lighting and other conditions, with the assumption that feature locations remain constant despite these changes. For cases where there is insufficient texture on the face (low resolution, poor focus, etc.), the accuracy of feature methods quickly degrades. Intensity-based methods are more widely applicable and can perform well in low- or high-texture cases; however, they are clearly sensitive to environmental changes. This is demonstrated empirically by testing on the near-infrared sequence.

2.4 Our Hybrid Tracker

The empirical and theoretical comparison of intensity- and feature-based trackers inspires the design of our hybrid tracking algorithm. In this section, we reformulate the 3D face tracking problem as a multi-objective optimization problem and present an efficient method to solve it. The robustness of the tracker is also discussed.

2.4.1 Integrating Multiple Visual Cues

Integrating multiple visual cues for face tracking can be interpreted as adjusting the 3D pose to fit multiple constraints. The hybrid tracker has two objective functions with different constraints to satisfy simultaneously: equations 2.4 and 2.7. This becomes a multi-objective optimization problem. Scalarization is a common technique for solving multi-objective optimization problems. The final error function is a weighted combination of the individual error functions 2.4 and 2.7:

E = a_i E_{int} + a_f E_{fpt}   (2.8)

where $a_i$ and $a_f$ are weighting constants.

The hybrid tracker searches for the solution that minimizes equation 2.8. The process can be interpreted as a nonlinear optimization based on brightness constraints, but regularized
The feature point correspondences restrict the space of feasible solutionsfortheintensity-basedoptimizationandhelpstheoptimizertoescapefromlocal minima. The brightness constraint re¯nes and stabilizes the feature-based optimization. When there are not su±cient high quality feature matches, the intensity constraint still provides adequate reliable measurement for optimization. The convergence of feature-based optimization is much faster than intensity-based methods due to the high dimensionality of the image data and the nature of the asso- ciated imaging function. However, when E fpt is close to its optimum, E int still provides information to re¯ne the registration. Therefore, an adaptive scheme is applied to choose the weights a i and a f . At the beginning of the optimization, E fpt has higher weight and decreaseswhenitapproachesitsoptimum. Atthesametime, theweightof E int increases when the optimization proceed. The overall distribution of the weights is also a®ected by the number of matched features. In the case of few feature correspondences, the tracker reduces the weight of E fpt . 2.4.2 E±cient Solution The computational cost of the feature-based tracker is low due to the relatively small number of matched features and the fast convergence of the optimization. On the other hand, the intensity-based tracker is notorious for its high computational cost. The stan- dardalgorithmforsolvingthisiterativeleast-squareproblemisslow,duetotheevaluation of a large Jacobian matrix F ¹ = @F=@¹ and Hessian matrix (I u F ¹ ) T (I u F ¹ ), where I u is 23 the gradient of the frame I. This can be accelerated using the (forward) compositional algorithm, but the evaluation of Hessian is still required at each iteration. The speed of the algorithm can be further improved using the inverse compositional algorithm. In [2], Baker and Matthews proposed the inverse compositional algorithm to solve the image alignment problem e±ciently. The same modi¯cations to the solver can be made for this problem. In the inverse compositional algorithm, the Jacobian and Hessian matrices are evaluated in a preprocessing step; only the error term is computed during the optimization. To do this, the image is warped at each iteration, and the computed transform is inverted to compose with previous transform. Here, warping the image is equivalent to model projection. Since we know the 2D-3D correspondence in I t¡1 , warpingI t forintensitydi®erenceevaluationisachievedbyprojectingthe3Dmodel and sampling to get the intensity in I t . The inverse compositional version of the algorithm is: ² Preprocess { For E int : Compute the gradient image, the Jacobian, and the Hessian matrix. { For E fpt : Perform feature detection on I t , and feature matching between I t , I t¡1 , and keyframes. ² Optimization At each iteration: 1. For E int : 24 1.1. Warp the face region of I t to get the intensity. 1.2. Compute the intensity di®erence and the weight. 2. For E fpt : 2.1. Project the feature points to get the 2D position. 2.2. Compute the reprojection error and weights. 3. Solve the linear system. 4. Update the pose. ² Postprocess Back-project the face region and feature points of I t into the 3D face model. In our experiments, for fast convergence and small face region cases, the speed of forward and the inverse compositional algorithm is similar. This is true because the preprocess of the inverse compositional algorithm takes more time. 
In our experiments, for fast convergence and small face region cases, the speeds of the forward and inverse compositional algorithms are similar, because the preprocessing of the inverse compositional algorithm takes more time. However, as the face region or the iteration count increases, the benefit of the inverse compositional algorithm becomes clear, since each iteration takes less time. Besides, this direct extension of the inverse compositional algorithm to 3D-2D alignment is not mathematically equivalent to the forward compositional algorithm, as discussed in [3]. Nevertheless, in our experiments it still shows good performance for estimating the 3D head pose.

2.4.3 Improving the Robustness

Robustness is an important issue for 3D face tracking. We employ the M-estimator [38] technique for optimization, which improves the robustness against outliers and noise.

Combining the feature correspondence constraint with the brightness constraint for face tracking intrinsically improves the robustness. With proper weighting, we overcome the instability of the feature-based optimization due to insufficient or poor feature matching. The sensitivity of the intensity-based optimization is also reduced, as many plausible solutions are ruled out by the feature correspondence constraints. This is especially useful under lighting variation. Lighting changes affect the intensity on the face and violate the underlying brightness consistency assumption of the intensity-based tracker. Moreover, we use SIFT (scale invariant feature transform) [46] as the local features (see section 2.4.4 for the details of local features in the hybrid tracker). In the literature and in our experiments, SIFT is demonstrated to be superior to most local features. Hence, the extracted feature correspondences and the resulting hybrid tracker are more resistant to illumination change.

Robustness to non-rigid deformation is another issue. Since we only focus on the rigid motion of the head, the local non-rigid motion should be regarded as noise in this framework. It has been shown that better results are achieved by using feature-based methods. However, it turns out that this performance gain is not strictly due to the use of features over intensity.

A fundamental part of the feature-based tracker is the feature matching stage. During feature matching, candidates with low region-correlation are rejected as outliers and are therefore not included in the optimization stage. The effect of this is that the majority of feature points used in the optimization belong to rigid areas of the face. On the other hand, the weighting scheme of the intensity-based method only considers the pixel-wise intensity difference. This difference will be near zero under deformation, as deformation does not alter the intensity of a single pixel; instead, the deformation alters the composition of the local patch. This suggests the use of region-wise intensity differences instead of pixel-wise intensity differences. The intensity of each pixel is modified to be the weighted average of the intensities of its neighbors. The idea is that if a point is located in a highly deformable area, the composition of the region changes significantly, and thus the weighted average changes. Combined with the M-estimator technique, the proposed region-based intensity difference improves robustness by implicitly decreasing the weight of pixels in highly deformable areas.
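A minimal sketch of this region-based difference is given below. The thesis specifies a weighted average over the neighborhood without giving the weights, so the simple box filter and radius used here are assumptions, and the code is illustrative Python rather than the tracker's C++ implementation.

import numpy as np

def neighborhood_average(image, radius=2):
    # Replace each pixel by the mean of its (2r+1) x (2r+1) neighborhood.
    # A box filter stands in for the (unspecified) weighted average.
    padded = np.pad(image.astype(float), radius, mode="edge")
    h, w = image.shape
    acc = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            acc += padded[radius + dy: radius + dy + h,
                          radius + dx: radius + dx + w]
    return acc / (2 * radius + 1) ** 2

def region_based_difference(patch_prev, patch_cur, radius=2):
    # Region-wise residual: compare neighborhood averages rather than raw
    # pixels, so points inside deforming areas produce larger residuals and
    # get down-weighted by the M-estimator.
    return neighborhood_average(patch_cur, radius) - neighborhood_average(patch_prev, radius)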
2.4.4 Local Features

We adapt the SIFT [46] detector to extract 2D local features for the hybrid tracker.

- Scale-space extrema detection. The SIFT detector uses the DoG (difference of Gaussians) function to approximate the LoG (Laplacian of Gaussian) for extracting candidate keypoint positions in scale-space. In the original paper [46], the scale-space contains several octaves, and each octave is an image pyramid representing a certain spatial scale of the input image. Using several octaves, the SIFT detector efficiently searches over many scales and extracts scale invariant feature points. However, in the 3D head tracking scenario, the size variation between two face images is not significant. Detecting scale invariant keypoints is not critical for head tracking, and using only a few octaves is sufficient to detect reliable features. Therefore, for computational efficiency, we use just one octave.

- Keypoint localization. The SIFT detector adopts a coarse-to-fine search strategy to locate keypoints. After the previous step, there is a set of sparse keypoint candidates, and SIFT fits a quadratic function to refine the location and scale of each keypoint. Besides, unstable keypoints, those with low contrast or strong edge responses, are eliminated.

- Orientation assignment. After identifying a keypoint, its orientation is computed from the image. This orientation is useful for computing 2D transform invariant feature descriptors.

- Computing the keypoint descriptor. The feature descriptor contains the statistics of a 16x16 local patch. The statistic is the orientation histogram, with the orientation quantized into 8 bins. The extracted feature descriptor is a 128-D vector for the underlying keypoint.

Feature matching is performed by comparing the closest candidate to the second-closest candidate, with the distance metric being the 2-norm of the feature descriptors [46]. This implicitly assumes that there is at most one correct match, and that false matches are all ambiguous with each other. The second-closest point is considered a false match, and we require the distance to the closest candidate to be significantly smaller than the distance to the second-closest candidate:

\| x - x_1 \| < \alpha \cdot \| x - x_2 \|   (2.9)

where $x$ is the input feature point, $x_1$ and $x_2$ are the closest and second-closest keypoint candidates, respectively, and $\alpha$ is the parameter defining the distance ratio for rejection.

The correspondence pair is determined using the presented feature matching approach. Given two sets of SIFT features, $\{x_m\}$ and $\{y_n\}$, and two keypoints $x_i \in \{x_m\}$ and $y_j \in \{y_n\}$, $(x_i, y_j)$ is a correspondence pair if:

y_j = FeatureMatch(x_i, \{y_n\})
x_i = FeatureMatch(y_j, \{x_m\})
\| P_{x_i} - P_{y_j} \|_2 < d   (2.10)

where FeatureMatch is the matching approach presented above, $P_{x_i}$ and $P_{y_j}$ are the 2D coordinates of $x_i$ and $y_j$ in the image plane, and $d$ is a real-valued threshold used to reject two far-away points from being a correspondence pair.
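The ratio test of equation 2.9 and the mutual-consistency check of equation 2.10 can be sketched as follows. This is an illustrative Python fragment, not the thesis code; the ratio alpha = 0.8 and the 30-pixel distance threshold are assumptions, since the text does not specify their values.

import numpy as np

def feature_match(query, candidates, alpha=0.8):
    # Ratio test of eq. (2.9): accept the nearest descriptor only if it is
    # clearly closer than the second nearest. Returns an index or None.
    d = np.linalg.norm(candidates - query, axis=1)
    order = np.argsort(d)
    if len(order) >= 2 and d[order[0]] < alpha * d[order[1]]:
        return int(order[0])
    return None

def correspondence_pairs(desc_a, pos_a, desc_b, pos_b, alpha=0.8, max_dist=30.0):
    # Mutual-consistency check of eq. (2.10): x_i and y_j must select each
    # other, and their image positions must lie within max_dist pixels.
    pairs = []
    for i, (qa, pa) in enumerate(zip(desc_a, pos_a)):
        j = feature_match(qa, desc_b, alpha)
        if j is None:
            continue
        if feature_match(desc_b[j], desc_a, alpha) == i and \
           np.linalg.norm(pa - pos_b[j]) < max_dist:
            pairs.append((i, j))
    return pairs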
2.4.5 Automatic Initialization and Reacquisition

Automatic initialization and reacquisition are critical for using a 3D head tracker in real-world applications. The initialization affects the performance of the tracker, and the automatic reacquisition module enables the tracker to recover from failure situations. In this work, we make the assumption that the tracker only initializes and reacquires in the frontal pose. We use the face detector from [78] to locate the frontal face. The detector is trained using the AdaBoost methodology [32] and its structure is a cascade. When the tracker starts, we ask users to look at the camera, and the detector scans the entire image to search for a frontal face. If the detector consistently detects a face near some position, a 3D head model is fitted to the detected face area by changing its 3D position. To improve the accuracy of initialization, we also use a 2D Active Shape Model [85] to validate the existence of a frontal face, locate the semantic facial features, such as the eye and mouth corners, and align the 3D model to fit these features. Once we initialize the tracker, the proposed hybrid tracking algorithm is used to estimate the 3D head pose. The system assesses the tracker's reliability by examining the residual error of the optimization or by detecting an impossible head pose. If the estimation is not reliable, the system turns on the face detector for reacquisition. Once a frontal face is located and the current estimate is far from frontal, the tracker is re-initialized using only the keyframe of the frontal pose, which is the first keyframe in our implementation.

Different 3D head models have been evaluated for automatic fitting and tracking. The detailed face geometry is found to be the best for head tracking, but it requires a very accurate initialization. This is also true for the generic face model: its 3D geometry is close to the true face, but an inaccurate initialization results in significant tracking bias. On the other hand, although an approximate shape, such as a 3D cylinder or ellipsoid, is inconsistent with the true 3D shape of a face, it is more robust to initialization error.

2.5 Experiments

A series of face tracking evaluations are performed. The first set of experiments uses synthetic sequences. Using synthetic sequences guarantees that exact ground truth is available; we have full control over sequence generation, and thus can isolate each factor and test the tracker's response. The next experiment tests the performance of the tracker on real video sequences. Collected video sequences and one public benchmark database are used for evaluation. In a third experiment, we test the performance on textureless videos. We have a real-world application that demands the use of a near-infrared camera, where the face tracker is used to extract head pose for human-computer interaction. We present tracking results of the proposed hybrid tracker in this challenging setting. In these experiments, the proposed tracker and the existing state-of-the-art tracking algorithms are evaluated and compared. The feature-based tracker is an implementation of [77]. The intensity-based and hybrid trackers are C++ implementations of the methods presented in sections 2.3 and 2.4.

2.5.1 Evaluation with Synthetic Sequences

The evaluation sequences are generated using textured 3D face models of four subjects. These models are acquired by the FaceVision modeling system [28]. For each model, three independent sequences of images are rendered. The first consists of pure rotation about the X- (horizontal) axis, the second is rotation about the Y- (vertical) axis, and the third is rotation about the Z-axis. In each case, the sequences begin with the subject facing the camera and proceed to -15 degrees, then to 15 degrees, and return to neutral in increments of 1 degree. A total of 60 frames is acquired for each sequence. The image size is 640x480. Synthetic perturbations are applied to the sequences to mimic variations occurring due to lighting and facial deformation changes. The following test configurations are used to evaluate the tracking performance:

Static: In this case the sequences are rendered with constant ambient lighting. This removes all factors influencing the tracking accuracy.

Figure 2.2: Example synthetic sequences used for experiments. Top: static sequence. Middle: single directional light source. Bottom: deformation with face muscle system.
Bottom: deformation with the face muscle system.

Lighting We explore the robustness of the trackers in the presence of subtle lighting changes. The models are rendered with a single directional light source.

Deformation We explore the robustness of the tracker in response to facial deformation. A synthetic muscle system is used to deform the face mesh over the course of the sequence. The muscles are contracted at a constant rate over the duration of the sequence, inducing deformation in the mouth and eyebrow regions (two high-texture areas on the face).

Figure 2.2 shows some examples from the synthetic sequences. The faces in the rendered sequences have a large amount of surface texture and are therefore amenable to feature-based tracking.

Figure 2.3: The estimated pose for synthetic sequences. For the rows, Top: the static sequences. Middle: the lighting sequences. Bottom: the deformation sequences. For the columns, Left: the estimated rotation along the x-axis for pure x-axis rotation. Center: the estimated rotation along the y-axis for pure y-axis rotation. Right: the estimated rotation along the z-axis for pure z-axis rotation. The angle is averaged over all subjects, and the unit is degrees.

The proposed hybrid tracker and the intensity- and feature-based trackers are evaluated. All trackers use the precise 3D face model to rule out the effect of model misalignment. Figure 2.3 shows the averaged estimated pose compared to the ground truth, and figure 2.4 shows the averaged error per frame. This error measures the absolute difference between the estimated angle and the true angle. In this evaluation, the averaged speed of the proposed tracker is 30 frames per second (FPS) on a normal desktop with one Intel XEON 2.4 GHz processor.

Figure 2.4: Average error for synthetic sequences. For the rows, Top: the static sequences. Middle: the lighting sequences. Bottom: the deformation sequences. For the columns, Left: rotation around x-axis sequences. Center: rotation around y-axis sequences. Right: rotation around z-axis sequences. Each figure plots the averaged error per frame for the x-, y-, and z-axis angles.

From this evaluation, we can see that all three trackers are comparable. In most cases, the hybrid tracker is consistently better than the other two, especially on the rotation axis. In some cases, the hybrid tracker is worse than the other two, but the difference is marginal and not statistically significant.

Static All trackers perform very well, despite the different optimization functionals.

Lighting The result is somewhat unintuitive, as we would expect the intensity-based tracker's performance to degrade. However, the performance difference is very marginal, since points in high-gradient regions are weighted highly.
Deformation All trackers perform worse than in the optimal cases, but the accuracy is still acceptable. From figure 2.3, as deformation increases with time, the accuracy of all methods declines. The intensity-based method is only slightly worse than the feature-based method, since the use of the region-based difference compensates for outliers and improves robustness.

2.5.2 Evaluation with Real Sequences

The proposed tracker is also evaluated with many real sequences. One problem of evaluating with real sequences is the lack of ground truth. Only "estimated ground truth" is available. In the literature, several methods are used to estimate the ground truth, such as a magnetic tracker or off-line bundle adjustment. We perform the evaluation with two different sets of sequences. One is collected in our lab, and the other is from the Boston University (BU) database [11].

The BU database contains 2 sets of sequences: uniform lighting and varying lighting. The uniform lighting class includes 5 subjects, totalling 45 sequences. Figure 2.5 shows the tracking result of the "jam5.avi" sequence in the uniform lighting class. Overall, the estimated pose is close to the ground truth, despite the fact that there is some jitter from the magnetic tracker.

Our sequence is captured in an indoor environment. The ground truth is estimated by commercial bundle adjustment software [64]. These sequences contain large rotations with a maximum angle near 40 degrees. The hybrid tracker tracks the 3D pose reliably. Figure 2.6 shows the tracking result of one sequence.

Figure 2.5: Evaluation on the BU database. The top rows show some examples from the tracker and the last row shows the estimated roll, yaw, and pitch compared with the ground truth from the magnetic tracker. The result is for the "jam5.avi" sequence in the uniform lighting class of the BU database.

Figure 2.6: The estimated rotation around the x-, y-, and z-axis for our sequences. The top rows show some results of tracked sequences. The bottom row is the estimated rotation.

Figure 2.7: Comparison of the intensity-based tracker and the hybrid tracker. The top row is the intensity tracker and the bottom is the hybrid tracker for the same sequence. The intensity-based tracker is more sensitive to strong reflection.

Figure 2.7 shows the comparison of the hybrid tracker and the intensity-based tracker in a strong reflection case. The intensity-based tracker is sensitive to lighting changes, since they violate the brightness consistency assumption. In figure 2.7, there is a strong reflection on the subject's forehead, and it moves as the subject turns his head.
As shown in the figure, the drift of the intensity-based tracker is much larger than that of the hybrid tracker, especially when the pose is far from the frontal view (see the third and fourth columns of figure 2.7). This example clearly demonstrates the robustness of the hybrid tracker.

2.5.3 Infrared Sequences and Application

Infrared (IR) images are commonly used in vision applications in environments where visible light is either non-existent, highly variable, or difficult to control. Our test sequences are recorded in a dark, theater-like interactive virtual simulation training environment. In this environment the only visible light comes from the reflection of a projector image off a cylindrical screen. This illumination is generally insufficient for a visible-light camera and/or is highly variable. The tracker estimates the head pose, indicating the user's attention, and is used in a multi-modal HCI application. The theater environment and sample IR video frames are shown in figure 2.8. Ground truth is not available for this data, therefore only a qualitative evaluation is made.

IR light is scattered more readily under the surface of the skin than visible light. Micro-texture on the face is therefore lost (especially at lower resolution), making the identification of stable features more difficult and error prone. Due to varying absorption properties at different locations of the face, however, low-frequency color variations persist which satisfy the brightness constraint.

Figure 2.8: Left: theater environment for the head tracking application. The subject is in nearly complete darkness except for the illumination from the screen. Image courtesy of USC's Institute for Creative Technologies. Right: images from a high-resolution IR camera placed below the screen.

Figure 2.9 shows the tracking results in this environment. It shows multiple frames across a several-minute sequence. The video is recorded at 15 FPS and its frame size is 1024×768. In most cases, the face size is around 110×110. The subject's head moves in both translation and rotation. There are also some mild expression changes (mouth opening and closing), and strong reflection in some frames. In this experiment, the user is assumed to begin in a frontal view. The tracker uses only one keyframe, the first frame. No off-line training is involved. The proposed hybrid tracker reliably tracks the pose in real time with large head motion, while the feature-based tracker loses track completely after only 3 frames. Probing deeper, we see that when the feature-based tracker is lost, only a few features (1-4) are reliably matched in each frame. This exemplifies the problem with feature-based methods on low-texture images.

Another interesting observation is related to error accumulation. In figure 2.9, the center column shows a frame with strong reflection coming from the subject's glasses.

Figure 2.9: The top row shows some example frames and the bottom row shows the estimation of the proposed tracker. The arrow indicates the direction that the user is facing. The feature-based tracker loses track completely in only 3 frames.

At that frame, the tracking accuracy degrades, due to the insufficient number of features matched in this environment. However, after the reflection disappears, the tracker recovers. This demonstrates how the use of keyframes prevents error accumulation.

2.5.4 Automatic Initialization and Reacquisition

The automatic initialization and reacquisition module is also evaluated. Figure 2.10 shows the process of the automatic initialization module.
As presented in section 2.4.5, a face detector is used to detect and locate the frontal face, and the detected face region is annotated by a red rectangle in figure 2.10. The 3D head tracker is initialized only after the detector can consistently locate a face at some position for several consecutive frames.

Figure 2.10: Automatic initialization of the 3D head tracker. It shows 4 consecutive frames of automatic 3D head tracker initialization. The red rectangle indicates the detected face region.

Figure 2.11 shows the reacquisition of the 3D head tracker. As shown in previous sections, the current 3D head tracker is reliable near the frontal pose, but the accuracy decreases when the head approaches an extreme pose far away from the frontal view. In figure 2.11, the accuracy of the head tracker decreases from frame 710, as the subject turns her head to a profile view. The tracker is considered to have "lost track" in frame 765, since the estimated head pose is very different from the actual pose and the residual error becomes high. In frame 766, the reacquisition module detects a frontal face and uses it to re-initialize the 3D head tracker.

Figure 2.11: Automatic reacquisition of the 3D head tracker. The number indicates the frame index.

2.6 Summary

We have proposed a hybrid tracking algorithm for robust real-time 3D face tracking. Built on a nonlinear optimization framework, the tracker seamlessly integrates intensity information and feature correspondences for 3D tracking. To improve the robustness, we have adopted an m-estimator type scheme for optimization. Patch-based differencing has been used to define the objective function. The inverse compositional algorithm is presented to solve this problem efficiently. We also addressed the issues of automatic initialization and reacquisition. The proposed tracker tracks the 3D head pose reliably in various environments. An extensive empirical validation and comparison with state-of-the-art trackers conclusively demonstrates this.

Chapter 3 Automatic Classification of Facial Gestures

3.1 Introduction

Nonrigid local deformations can be detected and interpreted by eliminating the global head motion. The hybrid tracker presented in chapter 2 can track the 3D head pose in the presence of expression changes. The estimated rotation and translation compensate for the global head motion. For this purpose, we track the 3D head pose using a 3D cylinder approximation of the head. After this step, the remaining motion accounts for the local deformations and corresponds to changes in facial expressions, mouth, and eye movements.

In this chapter, we propose a new approach using a region-based description of the face depicted in figure 3.2. The expression can be compactly represented by this region-based face model. These regions locate the key features of human faces, and the motions inside each region are smoother. For different regions, we can also observe that there are some correlations between their motions. Thus, the proposed approach consists of a motion model representing local face deformations corresponding to the relative motion in each region as well as the interrelations between these regions' features. The observation considered by the proposed framework is a dense optical flow in the area of the face.

Figure 3.1: System flow chart of the first generation system.
Different expressions generate distinct motion field patterns, and we propose to recognize facial expressions based on a local affine approximation of the observed optical flow. Based on this idea, the facial gestures can be represented as the combination of the relative motion in each region with the interactions between different regions. The flow chart of the proposed system is outlined in figure 3.1.

The rest of this chapter is organized as follows: In section 3.2, we review the related work in the literature. Section 3.3 illustrates the region-based face model for extracting features from facial expressions. A graphical model is built to characterize the interdependency between face regions. The formulation, construction, and learning are presented in section 3.4. In section 3.5, we introduce a BP (belief propagation) method to infer the hidden variables as well as correct erroneous observations, and classify the expression by Bayesian MAP (Maximum A Posteriori) estimation. The experimental results are presented and discussed in section 3.6. This system has been used for some applications, and section 3.7 and section 3.8 show these applications. Finally, the conclusion and limitations are addressed in section 3.9.

3.2 Previous Work

A large number of approaches have focused on extracting facial deformation information [20, 29, 62]. These approaches can be roughly categorized as model-based and descriptor-based.

• Model-based approach. A model-based method builds a model, template, or holistic representation for the face, and facial deformations are described by model parameters or the model itself. For example, the 2D active shape model (ASM) [16] builds a deformable model for 2D shape. The face shape is encoded by a set of sparse landmark points. The deformation is modeled by linear subspace analysis, and each shape instance is a weighted linear combination of basis shapes. The 2D active appearance model (AAM) [15, 50], considered a descendant of ASM, uses a similar idea to model both the deformable shape and the facial texture. 3D deformable face models have also been explored, such as the 3D morphable model (3DMM) [8, 9], ratio images [80], a 3D extension of AAM [81], a 3D shape and appearance model [21], and a 3D deformable model [17].

• Descriptor-based approach. The descriptor-based approach finds a set of descriptors for the facial deformation, such as optical flow [7, 27], appearance [12], filter responses [4], or predefined features [14, 13, 84]. Facial expressions are encoded and identified by these descriptors. Usually, based on the target application, people search for the "best discriminative" descriptors, and thus this can be considered a discriminative approach.

Figure 3.2: The region-based face model. (a) The nine face regions. (b) Fitting into a human face.

3.3 The Region-based Face Model

The human face is divided into nine regions [31]. These regions are related to the characteristics of the human face. Roughly speaking, these regions are the forehead, the eyes, the nose, the left and right cheeks, and the chin. Figure 3.2(a) shows the predefined face regions and figure 3.2(b) shows the fitting into a human face, using the estimated head pose with a 3D cylindrical head model [42, 83].

The idea behind this face model is that when people perform expressions, the local deformations of the whole face are not homogeneous. However, within each of the regions defined by the model, the observed deformations are more homogeneous.
Therefore, we propose to divide the face into several regions and focus on characterizing the intra-region dynamic patterns and the inter-region dependency. Inside each region, we use an affine motion model to capture the underlying dynamic pattern. These regions can provide multiple cues to determine the expression type. Also, some facial gestures are characterized by symmetry constraints, while others correspond to combinations of local deformations. Modeling these joint dependencies builds a mathematical model for representing the associated relations among face regions. Here, we propose to use a graphical model to model the interrelation between different face regions. Along with the local affine motion model, this approach captures the characteristics of facial gestures in the human face.

3.4 Modeling Facial Deformations with a Graphical Model

For expression analysis, we use a latent variable model. A graphical model is defined by G = (X, Z, E), where X and Z are the sets of vertices and E is the set of edges. Each vertex of Z represents an observation, and each vertex of X is a state vector. X contains the hidden nodes or latent state vectors, which are not observable to the user; only Z is observable. However, Z is governed by X, and by using a probabilistic inference procedure, one can infer the state vectors X.

Figure 3.3: The topology of the graphical model associated to the 9 face regions for facial gesture analysis. Each z_i and x_i represent the i-th face region.

Figure 3.3 shows the topology of the graph. This topology preserves the spatial structure and the symmetry properties of human faces. In this graph, each z_i is the observation in the i-th face region, and x_i is its underlying motion model parameters. The edge between two x_i's represents the interdependency between two regions, and the edge connecting x_i and z_i denotes the observation model.

Under this graphical model, the joint probability of all observations Z = (z_1, z_2, ..., z_9)^T and the full state vector X = (x_1, x_2, ..., x_9)^T is:

P(Z, X) = \frac{1}{K} \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j) \prod_{i \in V} \phi_i(z_i, x_i)   (3.1)

where K is the normalization constant, and ψ and φ are nonnegative potential functions. The function φ describes the observation model and measures the compatibility between the state vector and the observation, while the function ψ models the interdependency between neighboring nodes.

3.4.1 Modeling Intra-region Dynamics with the Affine Motion Model

The observation z_i is the set of optical flow vectors and the state vector x_i is the affine motion parameters in region i. Let P_t and P_{t+1} be the 2D positions of a point at time t and t+1, respectively. After compensating for variations in the head pose, we represent the local deformations using an affine motion model within each region:

P_{t+1} = A P_t + B, \quad A = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}, \quad B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}   (3.2)

and P_{t+1} = P_t + V_t, where V_t = (u_t, v_t)^T is the optical flow at this point. The matrix A can be further decomposed into rotation and scaling using a singular value decomposition (SVD):

A = U S V^T = (U V^T) V S V^T = R(\theta) R(-\varphi) \begin{bmatrix} s_1 & 0 \\ 0 & s_2 \end{bmatrix} R(\varphi)

The resulting state vector of vertex i is x_i = (θ, φ, s_1, s_2, b_1, b_2)^T. Each component of this state vector has a geometric meaning for the motion in this region and thus encodes the underlying dynamic pattern.
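As an illustration, the sketch below estimates the affine parameters of one region from its optical flow by least squares and decomposes A with an SVD to obtain the state vector (θ, φ, s_1, s_2, b_1, b_2). It is a simplified sketch under the assumption that A is a proper similarity-like transform (reflections, det(A) < 0, are not handled), not the exact implementation.

    import numpy as np

    def affine_state_vector(pts_t, flow):
        # pts_t: (N,2) point positions at time t; flow: (N,2) optical flow V_t.
        pts_t1 = pts_t + flow
        # Least-squares fit of P_{t+1} = A P_t + B (Eq. 3.2).
        M = np.hstack([pts_t, np.ones((len(pts_t), 1))])        # (N,3)
        sol, _, _, _ = np.linalg.lstsq(M, pts_t1, rcond=None)   # (3,2)
        A, B = sol[:2].T, sol[2]
        # Decompose A = R(theta) R(-phi) diag(s1,s2) R(phi) via SVD.
        U, s, Vt = np.linalg.svd(A)
        R1 = U @ Vt                                     # rotation R(theta)
        theta = np.arctan2(R1[1, 0], R1[0, 0])
        V = Vt.T                                        # V = R(-phi)
        phi = -np.arctan2(V[1, 0], V[0, 0])
        return np.array([theta, phi, s[0], s[1], B[0], B[1]])   # state vector x_i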
In the context of the proposed latent variable model, φ measures the error between the observed optical flow and the underlying motion parameters. Therefore, we design φ as a Gaussian on the averaged residual between the optical flow and the motion estimated from the motion parameters:

\phi_i(x_i, z_i) = P_N(r \mid 0, \sigma)   (3.3)

where r = \frac{1}{N} \sum_{j=1}^{N} \| \hat{V}^i_j - V^i_j \|_2. Here V^i_j is the optical flow of point j and \hat{V}^i_j is the estimated motion based on the motion parameters x_i, both for vertex i. \| \cdot \|_2 stands for the Euclidean norm and P_N(\cdot \mid 0, \sigma) is the Gaussian density function with zero mean and standard deviation σ.

3.4.2 Modeling Inter-region Dependency with the Gaussian Mixture

The function ψ models the joint density between two neighboring nodes. Figure 3.4 plots the empirical distribution of one of these densities. Clearly, these densities are non-Gaussian and multi-modal, and therefore difficult to capture with any parametric distribution. Consequently, instead of using parametric density estimation, we propose to approximate the empirical joint distribution by a Gaussian mixture model (GMM):

\psi_{i,j}(x_i, x_j) = \sum_m \alpha_m P_N(x_i, x_j \mid \mu_m, \Sigma_m)   (3.4)

where the subscript m is the index of each Gaussian in the mixture, (α_m, μ_m, Σ_m) are the weight, mean vector, and covariance matrix of the m-th Gaussian, and \sum_m \alpha_m = 1.

Figure 3.4: This figure plots the empirical joint pdf of the b_2 of V_1 and the b_2 of V_2 in a surprise expression.

The GMM has become a very common method in computer vision. We use it here since it balances estimation accuracy and computational efficiency [53]. Using a sufficient number of Gaussians, it can approximate the true density very well. Unlike nonparametric kernel density estimation, a Gaussian mixture reduces the complexity of the model and makes learning more efficient. Moreover, we can rely on a very well studied statistical tool, the EM algorithm, to estimate the parameters of the Gaussians [52, 66].

3.4.3 Learning the Graphical Model

We adopt supervised learning to train the proposed graphical model. Learning the observation model is straightforward since it consists of only one Gaussian. For the GMM, the well-known EM algorithm can be used to estimate the model parameters [52, 53, 66, 73]. However, even in the training stage, the collected data may not be perfect. Part of the human face may be occluded when the person is performing the expression. This results in an incomplete observation data set. We use a modified EM algorithm to learn the parameters of the GMM [33]. It measures the probability on the observable dimensions in the expectation stage, and then performs the maximization to update the estimation.

Estimating the number of Gaussians needed is an important issue for the GMM. The number of Gaussians is a tradeoff between fitting accuracy and model complexity. There are many approaches for model selection in the literature, such as statistical hypothesis tests and information criteria. Here we adopt an information-theoretic point of view and use the AIC (Akaike's Information Criterion) [1, 40]. AIC comes from optimizing the Kullback-Leibler information measure of the true density with respect to the density fitted by maximum likelihood estimation:

AIC = -2L + 2b   (3.5)

where L is the log-likelihood and b is the number of parameters in the model, which is:

b = n \times (1 + d + d \times d)

where d is the dimension of the joint pdf. AIC can be regarded as an asymptotically bias-corrected log-likelihood, and 2b is the bias correction term. AIC has several attractive properties in practice. Since its bias correction term is very simple and does not require further derivation, it is suitable for the automatic selection of the number of Gaussians. The optimal number of Gaussians is automatically selected based on AIC:

n_{optimal} = \arg\min_n AIC(n)   (3.6)
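As a rough illustration of this selection rule, the sketch below fits GMMs with an increasing number of components and keeps the one with the lowest AIC. It uses scikit-learn's GaussianMixture for the EM step; the actual system uses a modified EM that handles missing dimensions, which is not reproduced here, and the parameter count follows equation 3.5 rather than scikit-learn's own AIC.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_pairwise_potential(samples, max_n=6):
        # samples: (M, d) stacked state-vector pairs (x_i, x_j) for one edge of the graph.
        d = samples.shape[1]
        best_model, best_aic = None, np.inf
        for n in range(1, max_n + 1):
            gmm = GaussianMixture(n_components=n, covariance_type='full').fit(samples)
            # AIC = -2L + 2b with b = n * (1 + d + d*d), as in Eq. 3.5.
            L = gmm.score(samples) * len(samples)      # total log-likelihood
            aic = -2.0 * L + 2.0 * n * (1 + d + d * d)
            if aic < best_aic:
                best_model, best_aic = gmm, aic
        return best_model    # approximates psi_ij, evaluated as in Eq. 3.4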
Figure 3.5: The information criteria for selecting the number of Gaussians. This figure shows the value of AIC and the log-likelihood versus the number of Gaussians. As the number of Gaussians increases, the log-likelihood increases while AIC varies, since the formula of AIC takes the model complexity into account.

In figure 3.5, we plot the value of AIC and the log-likelihood versus the number of Gaussians. It shows that AIC compensates for the model complexity and prevents over-fitting. Thus, we learn the graphical model empirically from the training data set [42]. Table 3.1 shows the optimal number selected by AIC for a smiling expression.

Table 3.1: Number of Gaussians for Gaussian Mixture Modeling.

    Edge  ψ_{1,2} ψ_{2,3} ψ_{1,4} ψ_{2,5} ψ_{3,6} ψ_{4,5} ψ_{5,6} ψ_{4,7} ψ_{5,7} ψ_{5,8} ψ_{6,8} ψ_{7,8} ψ_{7,9} ψ_{8,9}
    N     6       7       6       5       4       5       5       6       4       5       7       6       4       5

3.5 The Classification Framework

In this section, we introduce our classification framework. We begin with the complete observation case and present the whole framework in sections 3.5.1 and 3.5.2. The extension to the case of missing observations is addressed in section 3.5.3.

3.5.1 The Complete Observation Case

Classifying the expression based on the observation is formulated as a Maximum A Posteriori (MAP) estimation. Let c denote the expression indicator variable and \hat{c} the estimated expression. The MAP estimation is:

\hat{c} = \arg\max_c P(c \mid Z)   (3.7)

Based on the latent variable model, the probability can be written as:

P(c \mid Z) = \int P(c, X \mid Z) \, dX = \int P(c \mid X, Z) P(X \mid Z) \, dX   (3.8)

and similarly:

P(X \mid Z) = \sum_c P(X, c \mid Z) = \sum_c P(X \mid Z, c) P(c \mid Z)   (3.9)

Equations 3.8 and 3.9 are interdependent, and to compute one of these two functions, we need the other. Thus, we propose to use the following two-stage Gibbs sampling approach [73]:

1. Initialize a set of expression indicator variables \{c_{n,0}\}_{n=1}^{N}, and set the iteration index t = 1.
2. Apply the iterative updating approach:
   2a. Generate \{X_{n,t}\}_{n=1}^{N} from P(X \mid Z, c_{n,t-1}).
   2b. Generate \{c_{n,t}\}_{n=1}^{N} from P(c \mid X_{n,t}, Z).
3. t \leftarrow t + 1. Repeat steps 2a and 2b until convergence.

In step 2a, sampling X_n from P(X \mid Z, c_n) is feasible, since P(X \mid Z, c) \propto P(X, Z \mid c) and, by equation 3.1,

P(Z, X \mid c) = \frac{1}{K} \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j \mid c) \prod_{i \in V} \phi_i(z_i, x_i \mid c)   (3.10)

where K is the normalization constant. Hence, sampling from P(X \mid Z, c_n) can be interpreted as, given the observation Z and the current estimate of the expression c_n, inferring the most likely latent state vector X based on the proposed graphical model. There are several approaches for probabilistic inference on a graphical model, and we adopt the belief propagation algorithm. Section 3.5.2 addresses this inference problem in detail.

In step 2b, the key component is computing the complete-data posterior P(c \mid X_n, Z):

P(c \mid X_n, Z) = \frac{P(X_n, Z \mid c) P(c)}{\sum_c P(X_n, Z \mid c) P(c)}   (3.11)

The joint likelihood P(X_n, Z \mid c) can be computed from the graphical model using equation 3.10.

The prior P(c) reflects our initial belief about the expression. Under the equal prior assumption, the complete-data posterior (3.11), P(c \mid X_n, Z), is proportional to P(X_n, Z \mid c), the joint likelihood. Alternatively, we can impose some stochastic structure on the prior. For example, we can build a Markov model for the expressions in the time domain. At time t, the prior of c_t is conditioned on time t-1 and becomes P(c_t \mid c_{t-1}). When we classify the expressions in a video sequence, this setting incorporates the temporal dependency into the classification.
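The two-stage sampler can be summarized by the following control-flow sketch. The callables sample_states (the BP-based inference of section 3.5.2) and joint_likelihood (equation 3.10) are supplied by the caller, so the snippet only fixes the loop structure; it is an illustration under the equal-prior assumption, not the exact implementation.

    import numpy as np

    def gibbs_classify(Z, n_classes, sample_states, joint_likelihood,
                       n_chains=20, n_iters=10, prior=None):
        # prior: P(c); defaults to the equal-prior assumption.
        prior = np.full(n_classes, 1.0 / n_classes) if prior is None else prior
        c = np.random.randint(n_classes, size=n_chains)      # step 1: initialize {c_n}
        for _ in range(n_iters):                              # step 3: iterate
            for n in range(n_chains):
                X_n = sample_states(Z, c[n])                  # step 2a: X_n ~ P(X | Z, c_n)
                lik = np.array([joint_likelihood(X_n, Z, k) for k in range(n_classes)])
                post = lik * prior + 1e-12
                post /= post.sum()                            # Eq. 3.11
                c[n] = np.random.choice(n_classes, p=post)    # step 2b: c_n ~ P(c | X_n, Z)
        # Report the empirical posterior over expressions from the final indicators.
        return np.bincount(c, minlength=n_classes) / n_chains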
3.5.2 The Belief Propagation Based Inference Mechanism

Belief Propagation (BP) is an algorithm for inference in a graphical model. It is best suited to cases where exact inference is infeasible. In the BP framework, a message m_{ij}(x_j) is a function of x_j representing node i's belief about x_j. BP uses a local message passing process to integrate messages between neighboring nodes. The message updating equation is:

m^t_{ij}(x_j) \propto \int \psi_{ij}(x_i, x_j) \, \phi_i(z_i, x_i) \prod_{k \in N(i) \setminus j} m^{t-1}_{ki}(x_i) \, dx_i   (3.12)

where the superscript t denotes the iteration and N(i) denotes the neighbors of node i. After T iterations, the marginal distribution P(x_i \mid Z) can be computed from the updated messages:

P(x_i \mid Z) \propto \phi_i(z_i, x_i) \prod_{k \in N(i)} m^T_{ki}(x_i)   (3.13)

The message updating process can be divided into 2 steps: evaluating the incoming messages with the local observation z_i, and then integrating the potential function ψ_{ij}. With a slight abuse of the term, the resulting function of the first step can be considered as the conditional density of x_i, denoted f(·). From this point of view, the Monte Carlo approximation of equation 3.12 can be derived:

m^t_{ij}(x_j) = \int \psi_{ij}(x_i, x_j) f(x_i) \, dx_i = E_{f(x)}[\psi_{ij}(x_i, x_j)] \approx \frac{1}{N} \sum_{n=1}^{N} \psi_{ij}(x^n_i, x_j), \quad x^n_i \sim f(x)   (3.14)

where f(x) = \phi_i(z_i, x_i) \prod_{k \in N(i) \setminus j} m^{t-1}_{ki}(x_i). One problem with this formulation is that, for a continuous-valued non-Gaussian graphical model, it is usually difficult to sample from an arbitrary f(·). Recently, several variations of the belief propagation algorithm were proposed to address this limitation, such as Nonparametric Belief Propagation (NBP) [72], Particle Message Passing (PAMPAS) [39, 71], and Belief Propagation Monte Carlo (BPMC) [35, 36]. In [35], the proposed BPMC uses the importance sampling technique for this problem; instead of sampling from f(·), a set of weighted samples is drawn from a proposal function g(·). Although this approach should work theoretically, in practice the choice of an appropriate proposal function is an issue. In [36], the authors proposed a data-driven BPMC (DDBPMC) algorithm which constructs data-driven proposal functions from bottom-up image cues.

On the other hand, if we make an assumption on the functional form of the potential functions, we can approximate each potential as a Gaussian mixture. The NBP and PAMPAS approaches approximate the potential functions as Gaussian mixtures. PAMPAS focuses on the case of a small number of Gaussians, while NBP looks at more general settings and considers a more complicated Gaussian mixture. In this work, we consider ψ to be a mixture of a small number of Gaussians, and we basically follow the same approach as the one proposed in PAMPAS.

To update the message m_{ij}, a weighted particle set \{x^n_i, w^n_i\}_{n=1}^{N} is sampled as follows:

x^n_i \sim \prod_{k \in N(i) \setminus j} m^{t-1}_{ki}(x_i), \quad w^n = \phi(x^n_i, z_i)   (3.15)

and since ψ_{ij}(x_i, x_j) is a Gaussian mixture, ψ_{ij}(x_j; x_i) is a mixture of conditional Gaussians. Thus, each message is represented as a Gaussian mixture. The sampling strategy in (3.15) can be interpreted as first integrating the belief from the neighbors, and then weighting the integrated belief by the local observation. Another strategy uses the opposite order: drawing samples from the observation model and then weighting these samples by the incoming messages:

x^n_i \sim \phi(x_i, z_i), \quad w^n = \prod_{k \in N(i) \setminus j} m^{t-1}_{ki}(x_i)   (3.16)

Such sampling integrates the bottom-up cues from the local observations. Based on the observation model (3.3), given a set of optical flow vectors z_i, the best estimate of x_i is provided by the least-squares estimation of the affine parameters. Thus, if we add a small noise term to the real optical flow measurement, another least-squares estimate can be computed. If the noise term is generated according to equation 3.3, this new state vector can be regarded as a sample drawn from the observation model. The resulting samples are a mixture of equations 3.15 and 3.16.
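To make the message-passing mechanics explicit, the sketch below runs BP on a discretized toy version of the model, where each x_i ranges over a finite set of states so that messages are simple vectors and the integral in equation 3.12 becomes a sum. The actual system keeps continuous states and Gaussian-mixture messages in the PAMPAS style; the callables phi and psi and the data structures here are illustrative.

    import numpy as np

    def run_bp(states, neighbors, phi, psi, Z, n_iters=30):
        # states: list of candidate state values; neighbors: {node: [neighbor nodes]};
        # phi(x, z): observation model; psi(xi, xj): pairwise potential; Z: {node: observation}.
        nodes = list(neighbors.keys())
        msgs = {(i, j): np.ones(len(states)) for i in nodes for j in neighbors[i]}
        for _ in range(n_iters):
            new_msgs = {}
            for i in nodes:
                for j in neighbors[i]:
                    # Belief at node i excluding j: phi_i times the incoming messages.
                    belief = np.array([phi(x, Z[i]) for x in states])
                    for k in neighbors[i]:
                        if k != j:
                            belief *= msgs[(k, i)]
                    # Integrate the pairwise potential (Eq. 3.12, discretized).
                    m = np.array([np.sum([psi(xi, xj) * belief[a]
                                          for a, xi in enumerate(states)])
                                  for xj in states])
                    new_msgs[(i, j)] = m / m.sum()
            msgs = new_msgs
        # Marginals P(x_i | Z) from the converged messages (Eq. 3.13).
        marginals = {}
        for i in nodes:
            b = np.array([phi(x, Z[i]) for x in states])
            for k in neighbors[i]:
                b *= msgs[(k, i)]
            marginals[i] = b / b.sum()
        return marginals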
3.5.3 The Incomplete Observation Case

In the previous section, we discussed how to perform belief propagation for inference, and then classification, when the full observation is available. For the incomplete observation case, the proposed framework is still applicable with a slight modification.

Suppose some regions of the face are occluded, and let \tilde{V} \subseteq V be the set of all non-occluded vertices. The observation is Z = (Z_{obs}, Z_{mis})^T, where Z_{obs} is the set of all measurable observations and Z_{mis} denotes the missing observations. The classification rule defined by equation 3.7 now becomes:

\hat{c} = \arg\max_c P(c \mid Z_{obs})   (3.17)

and equations 3.8 and 3.9 become:

P(c \mid Z_{obs}) = \int P(c, X \mid Z_{obs}) \, dX = \int P(c \mid X, Z_{obs}) P(X \mid Z_{obs}) \, dX   (3.18)

and

P(X \mid Z_{obs}) = \sum_c P(X, c \mid Z_{obs}) = \sum_c P(X \mid Z_{obs}, c) P(c \mid Z_{obs})   (3.19)

Thus, the two-step Gibbs sampling approach is still applicable based on equations 3.18 and 3.19. The computation of the joint likelihood defined by equation 3.10 then becomes computing P(X, Z_{obs} \mid c) for the partial observation case:

P(X, Z_{obs} \mid c) = \frac{1}{K} \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j \mid c) \prod_{i \in V} \tilde{\phi}_i(z_i, x_i \mid c)

where

\tilde{\phi}_i(z_i, x_i) = \begin{cases} \phi_i(z_i, x_i), & i \in \tilde{V} \\ k, & \text{otherwise} \end{cases}

and k is some constant. To infer the latent state vector X, the message updating equation 3.12 and the marginal probability (3.13) are then reformulated as:

m^t_{ij}(x_j) \leftarrow \int \psi_{ij}(x_i, x_j) \, \tilde{\phi}_i(z_i, x_i) \prod_{k \in N(i) \setminus j} m^{t-1}_{ki}(x_i) \, dx_i

and

P(x_i \mid Z_{obs}) \propto \tilde{\phi}_i(z_i, x_i) \prod_{k \in N(i)} m^T_{ki}(x_i)

This means that the message updating process is performed as in the complete observation case, except at the missing observation nodes. For those nodes, the message updating process only integrates the belief from the neighboring nodes, and thus only equation 3.15 is used to generate samples. The resulting estimate depends only on the neighbors. This strategy enables our framework to infer the state vectors of occluded regions, and then perform the classification. Hence, it removes the limitation of incomplete observation and works for varying degrees of occlusion.
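In the discretized sketch given after section 3.5.2, this modification amounts to wrapping the observation model so that occluded regions contribute a constant instead of evidence. The helper below is illustrative; marking occluded regions by a missing observation (None) and the constant value k are assumptions of this sketch.

    def make_phi_tilde(phi, k=1.0):
        # phi_tilde behaves like phi on observed regions and returns the constant k
        # otherwise, so occluded nodes rely only on their neighbors' messages.
        def phi_tilde(x, z):
            return k if z is None else phi(x, z)
        return phi_tilde

Passing phi_tilde in place of phi, with Z[i] set to None for occluded regions, reproduces the behavior described above: occluded nodes are estimated purely from the beliefs propagated by their neighbors.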
3.6 Experiments

We have examined the proposed framework in various aspects, and the results are reported here. In this experiment, we test the proposed framework for recognizing facial gestures under natural head motion. We follow the six universal expression categorization and classify the examined expressions into one of the 6 classes. In section 3.6.1, we show the experimental results for the inference mechanism. We have collected a data set to evaluate the classification performance; the results are reported in section 3.6.2. Finally, we have tested the proposed framework on several challenging sequences, including partially occluded faces and some videos from the internet, in section 3.6.3. The results show that the proposed framework has the potential to solve these difficult problems.

Figure 3.6: KL distance for messages m_{ij}. Due to space limitations, we only show the results for 9 messages instead of all 28 messages.

Figure 3.7: KL distance for the marginal distributions P(x_i | Z_{obs}).

3.6.1 The Inference Mechanism

To validate our approach, we first examine the empirical properties of the proposed framework. We have tested the convergence behavior of the inference mechanism in various situations and show some of these results here. The most important components of the inference mechanism are the messages m_{ij}(x_j) and the marginal distributions P(x_i | Z). Since these two are distributions, we use the Kullback-Leibler (KL) distance as the metric to measure the distance and diagnose convergence. Also, since the resulting distribution has no closed form, we compute the KL distance empirically.

Figures 3.6 and 3.7 show the results for an anger expression with occlusion of regions 3 and 6 (the region number corresponds to the node label of the graphical model in figure 3.2). We run the BP algorithm from 1 to 50 iterations and compute the KL distance between the messages and marginal distributions of 2 consecutive iterations. To update each message, 100 samples are drawn. Clearly, as expected, the KL distance drops as the number of iterations increases.

A second experiment consisted of evaluating the number of samples drawn when updating the messages. Figures 3.8 and 3.9 show the results for a smile expression with occlusion of region 1. We run the BP algorithm from 1 to 30 iterations with various numbers of samples drawn when updating the messages. Here we only report the results for 60, 120, and 180 samples. From the figures, we can clearly see that the more samples are drawn, the faster the convergence.

Figure 3.8: KL distance for messages m_{21}, m_{41}, and m_{45}.

Table 3.2: The confusion matrix of the proposed classification framework. The rows indicate the true class and the columns indicate the classified results.

                 Anger   Disgust   Fear    Sadness  Happiness  Surprise
    Anger        72.67    0.85     1.70     1.53     16.98      6.28
    Disgust       1.52   67.85     4.67     3.69     14.44      7.82
    Fear          0.42    0.71    80.45     0.71      9.63      8.07
    Sadness       2.03    1.35     2.57    66.22     19.46      8.38
    Happiness     0.75    1.17     1.49     1.92     88.26      6.40
    Surprise      0.37    0.25     0.87     0.75      9.99     87.77
Figure 3.9: KL distance for the marginal distributions P(x_i | Z_{obs}) of regions 1, 2, and 4.

Figure 3.10: Classification result. These images come from a testing sequence with a surprise expression. The top row shows the tracking result. The arrow points in the facing direction and the green points denote points on the cylinder surface of our cylindrical head tracker. The bottom row shows the classification results, and the histograms report the recognition rate for each of the 6 gestures the system was trained on.

3.6.2 Classification Results

To evaluate the classification performance, we collected a data set in an indoor office environment. The data set contains 10 subjects. Each subject was instructed to perform the expressions with head motion. For each subject, we recorded 5 sequences for each expression, for a total of 300 sequences in our data set. The length of each sequence varies from 10 to 40 frames depending on the subject. In this evaluation, for each expression, we randomly selected 4 sequences from each subject, a total of 200 sequences, as the training set, and then used the remaining sequences for testing. Table 3.2 shows the confusion matrix based on the classification of each frame.

To demonstrate the performance of the proposed framework in detail, we report the classification results on a set of sequences containing various head motions and facial expressions. Figure 3.10 shows the classification result for a "Surprise" expression sequence. The top row shows the head pose tracking result (through the use of the blue arrow corresponding to the estimated head pose) and the bottom row shows the classification result. The histogram illustrates the estimated probability of each expression: 1-Anger, 2-Disgust, 3-Fear, 4-Sadness, 5-Happiness, 6-Surprise. As expected, the classification has a higher error rate during transition phases, while it is more accurate at the apex of the facial gesture. The reason is that, in the beginning, the expression is closer to neutral and hence more likely to be confused with the other expressions.

3.6.3 More Challenging Cases

We have tested the classification performance of this framework on partially occluded faces. Figure 3.11 shows the classification results on a "Surprise" expression with occlusion of the mouth by a hand. The inference mechanism is run with 100 samples and 30 iterations. 6 frames from a 23-frame sequence are presented here.

Figure 3.11: Classification result. These images come from a testing sequence and depict a surprise expression in the presence of partial occlusion of the face.
The subject starts with a neutral expression and shows a "Surprise" expression while raising a hand. In frame 12, the hand starts to occlude some parts of the face and then completely occludes the mouth. The classification in frame 11 is accurate, while in frame 12, it is confused with other expressions. Since the proposed belief propagation driven framework can infer the missing data, the classification in frames 13 and 18 is correct, though the confidence is not high. This sequence is very challenging since in the last half of the sequence the mouth is occluded by the hand, and the mouth has been considered an important feature for expression recognition in the literature.

Another challenging scenario tested here is the recognition of facial gestures of people not belonging to the training data. We tested our framework on some video clips downloaded from the internet. Figure 3.12 shows the classification result on an interview video. We manually segmented the video into subsequences containing the expression change and ran our classification method. The 3D head pose is automatically tracked and the inference mechanism is run with 100 samples and 30 iterations. 6 frames from a 33-frame subsequence are presented here. Such a sequence is challenging since the background and the environment are totally different from the training database. Besides, it is a spontaneous expression instead of the acted expression recorded in the studio. Comparing the classification results to those of the data collected in the studio, even when the classification is correct, the confidence of the classification is lower.

Figure 3.12: Classification result obtained on a video sequence not belonging to the training data. The top row corresponds to the estimated 3D head pose, the bottom row depicts the obtained recognition rates. Here again, facial gestures during transition phases are not recognized robustly, while at the gesture apex, good recognition is achieved.

3.7 Application to Human-Robot Interaction

This automatic expression recognition system has been applied to human-robot interaction. The goal is to build a personal service robot to assist a human, possibly a member of an aging population, in an intelligent home scenario. For this purpose, both the environment and the robot are required to be aware of the human's status, and natural human-robot interaction is desired. We apply the above expression recognition system to improve human-robot interaction and gesture understanding. Figure 3.13 shows images of the robot platform. The left image shows the prototype of the personal service robot in our lab, while the right one is the commercial robot. We have transferred our software to ETRI (Electronics and Telecommunications Research Institute), a Korean government-funded research organization, and it has been ported onto the actual platform.

Figure 3.13: Personal service robot.

3.8 National Traveling Exhibition

This research is also demonstrated in a public exhibition. We have implemented a real-time interactive system for automatic expression recognition and delivered this system to the California Science Center for a national traveling exhibition [10].

Figure 3.14: Setup of the automatic facial expression recognition system.
It is an exhibition exploring the science of and research on emotions and fear. The facial expression recognition system is part of this exhibition and open to the public at large. It is a showcase for our research, and it allows us to collect a database of spontaneous facial expressions. Figure 3.14 shows the setup of the interactive expression recognition system. A screen shows some information, operating instructions for the system, image and video stimuli, and feedback. The camera is mounted on top of the screen. At the beginning of each trial, the system asks the participant to adjust the pan/tilt of the camera to place his/her face in the center.

The system displays image and video stimuli and classifies the participant's spontaneous expression. By displaying these image or video stimuli with music, participants show some reaction and emotion on the face, such as a smile, surprise, or disgust. Each stimulus is associated with an expected expression, which is assigned by human experts. Figure 3.15 shows some examples of image stimuli. After displaying the stimulus and recording images, our system classifies the user's facial expression into one of the six universal expressions.

Figure 3.15: Examples of image stimuli.

1. Validate that a face is present. We use a face detector [78] to check for the existence of a human face. It searches for a human face through the whole facial expression image sequence. If there is no face in most frames, it is assumed that the participant left, and the expression recognition module is not run.

2. Classify facial expressions. After the system validates the presence of a face, our expression recognition approach is used to compute the posterior probability of each expression. The classification result is the expression with the highest posterior probability.

3. Display feedback. Our system compares the recognition result with the predefined expected label, and gives a different message and feedback to the participant depending on whether it matches the expectation or not. The recognition result may differ from the expected one, due to classification errors, or because the participant's reaction is actually different from what we expect. In that case, the system will display a different stimulus with the same expected expression again.

Figure 3.16 shows some examples of recorded facial expressions.

Figure 3.16: Examples of recorded spontaneous expressions.

3.9 Summary

In this work, we have proposed an automatic system to classify facial expressions in a highly interactive environment. The 3D head tracker is used to estimate the 3D pose and compensate for the global, rigid head motion. To model the complex, nonrigid facial deformations, we devised a region-based face representation with a probabilistic graphical model to characterize the human face. The density functions of the graph are modeled by Gaussian mixtures and learned empirically using the EM (Expectation-Maximization) algorithm. A belief propagation driven framework is proposed for classifying the expression in the presence of occlusions. The graphical model and the belief propagation based approach enable us to infer the missing data and correct the unreliable measurements. The experimental results show the quantitative evaluation of the proposed approach in various settings, and we have also used it in two real-world applications.

3.9.1 Limitations

The main limitations of the current system are the ambiguity in the onset and offset stages, and the presence of subtle expressions. So far, we have presented a complete framework to recognize facial gestures.
However, as we have shown in section 3.6, the ambiguity of the onset and offset stages is an issue for improving the classification accuracy. At the peak of each expression, our system estimates the posterior reliably, but the ambiguity in the onset and offset decreases the classification accuracy. The key difficulty is the complexity of nonrigid facial deformations. The high-degree-of-freedom nature of nonrigid facial deformations makes them difficult to model and sensitive to noise and outliers. Thus, incorporating semantic facial features, such as the eyes and mouth, and modeling the nonrigid facial deformations at a finer resolution are desired.

Another limitation is the interaction between rigid and nonrigid facial motions. The current system separates these two motions using a 3D head tracker. It cancels the rigid head motion by warping the face into the reference frame, and assumes the residual motions are nonrigid facial deformations. However, if there is an error in the estimation of the 3D head pose, it will introduce a bias into the modeling of nonrigid motions and the expression interpretation. The main issue is how to model the relation between these two motions. A better approach should analyze these two motions simultaneously. These limitations and observations inspire the development of the second generation system, using manifold learning for tracking and inferring facial expressions, which is presented in the next chapter.

Chapter 4 3D Face Tracking and Expression Inference Using Manifold Learning

4.1 Introduction

Nonrigid deformation is an important property of human faces, as it conveys information about a human's mental state. However, modeling such nonrigid deformation is a quite challenging problem in computer vision, since it has many degrees of freedom (DOF) and requires a high dimensional space for representation. Working in a high dimensional space presents several issues, such as modeling and computational efficiency. To resolve these issues, some researchers try to exploit the intrinsic structure of the nonrigid face, which is achieved by applying linear subspace analysis. Another issue is that the nonrigid facial deformation is mixed with the rigid head motion. Solely using a 3D head tracker is not sufficient to fully decouple these two motions, and a model of their relations is beneficial for expression inference.

In this chapter, we propose a new framework to model the deformable shape using nonlinear manifolds. The main contribution is two-fold. First, instead of using a linear subspace analysis, we argue that the 3D facial deformations are better modeled as a combination of several 1D manifolds. Each 1D manifold represents a mode of deformation or expression, such as smile, surprise, blinking, etc. By learning these manifolds, a 3D shape instance, usually represented by a very high dimensional vector, can be mapped onto a low-dimensional manifold. The coordinate on the manifold corresponds to the magnitude of the facial deformation along that mode. We thus call it the "level of activation". Second, we propose a novel framework for nonlinear manifold learning based on N-D Tensor Voting [57, 59].

Figure 4.1: System flow chart of the second generation system.
Tensor Voting estimates the local normal and tangent spaces of the manifold at each point. The estimated tangent vectors enable us to directly navigate on the manifold.

The proposed 3D deformable shape model is applied to nonrigid face tracking. We develop an algorithm that iteratively infers the nonrigid 3D facial deformations together with the head pose and expression, based on the proposed model. Without learning complex facial gesture dynamics, the proposed algorithm can track a rich representation of the face, including the 3D pose, 3D shape, expression label with probability, and the activation level. The flow chart of our proposed system is outlined in figure 4.1.

The rest of this chapter is organized as follows: Section 4.2 gives an overview of related work. We start to present our framework with the offline construction of the manifold-based facial deformation model. The formulation and the learned manifolds are presented in section 4.3. The manifold learning and inference are implemented in the N-D Tensor Voting framework, shown in section 4.4. Based on the proposed model and inference tool, we develop an iterative algorithm to track the nonrigid facial deformation with the 3D head pose, as detailed in section 4.6. In section 4.7, we conduct several experiments to analyze and evaluate the proposed model and algorithm. Finally, conclusions and discussions are given in section 4.8.

4.2 Previous Work

4.2.1 Deformable Face Model

A significant amount of research has been devoted to investigating deformable face models based on linear subspace analysis. The 2D Active Shape Model (ASM) and Active Appearance Model (AAM) [15, 16, 50] approximate the shape deformation as a linear combination of some 2D basis shapes. The model is learned using Principal Component Analysis (PCA). The AAM inherits the idea of the deformable shape, but also learns an appearance model for texture variation. 3D deformable models have also been proposed. In [51, 81], the authors extended the 2D AAM to a combined 2D+3D AAM. In addition, Ramnath et al. investigated multiview AAM fitting algorithms [65]. In [8, 9], Blanz and Vetter built a 3D morphable model for facial animation and face recognition. In [49], a Gaussian filter approach is proposed to track the pose, expression, and texture.

More recently, in [82], Xiao and Kanade derive a closed-form solution to infer the deformable shape, which is a weighted combination of certain linear shape bases. They proposed shape constraints to improve the decomposition of the nonrigid shape into a linear subspace. In [34], Gu and Kanade proposed a 3D deformable model consisting of a set of sparse 3D points and patches associated with each point. Based on this model, an EM-style algorithm is proposed to infer the head pose and face shapes. In [86], Zhu and Ji proposed a normalized SVD to estimate the pose and expression. Based on this, a nonlinear optimization method is also proposed to improve the tracking result. Vogler et al. [79] proposed an integration system to combine a 3D deformable model with a 2D ASM. The proposed system uses the ASM to track reliable features and the 3D deformable model to infer the face shape and pose from the tracked features. In [22], the deformable face is modeled by an Online Appearance Model, which is also a linear subspace approach. A tracking algorithm based on particle filtering is also proposed to estimate the facial action and expression simultaneously.

4.2.2 Nonlinear Manifold Learning

In the above papers, the construction of a deformable model is built on top of the linear subspace approach.
However, linear subspace methods are inadequate to represent the underlying structure of real data, and nonlinear manifold learning approaches have been proposed [67, 74]. Nonlinear dimensionality reduction techniques provide a good alternative for modeling high dimensional visual data. In [12], Chang et al. proposed a probabilistic approach based on the appearance manifold for expression analysis. In [25, 26], the author proposed a manifold-based approach for 3D body pose tracking and general tracking.

Most nonlinear manifold learning techniques characterize the intrinsic structure by recovering the low-dimensional embedding. For example, ISOMAP [74] finds a low-dimensional embedding that preserves geodesic distances in the input space. Locally linear embedding (LLE) [67] searches for a manifold based on the local linearity principle. Recent works argue that this may not be the best way to parameterize the manifold, especially for the purpose of handling noisy data and out-of-sample generalization [6, 18, 19]. They propose different algorithms to estimate the local tangent hyperplane on the manifold, and use the estimated tangent to manipulate novel points. In addition, [63] proposed computational tools for statistical inference on a Riemannian manifold.

4.3 Manifolds of 3D Facial Deformations

Using a deformable model allows us to find a compact parametrization to represent the nonrigid facial shape and motion. Linear subspace techniques approach this problem by searching for a set of optimal bases whose span covers the varying shapes. For a given shape instance, S \in R^N, the projection into its subspace is an approximation:

S = S_0 + \Phi b + \epsilon   (4.1)

where S_0 is the mean shape and ε is the noise term. Φ is an N × n matrix of bases, which defines a linear subspace of nonrigid facial shapes. The dimensionality of this linear subspace is much lower than that of the original high dimensional space, n << N. Φ can be learned efficiently by applying Principal Component Analysis (PCA) to the training data. b is a low dimensional vector that lives in this subspace, b \in R^n. Each element of b represents the weight of a principal component and controls the variation of the shape along this component. Such ideas can be found in many previous works, such as [8, 15, 16, 50, 51].
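For reference, a minimal sketch of this linear baseline is given below: the basis Φ is obtained by PCA (via an SVD of the centered training shapes) and a shape is projected to its coefficients b as in equation 4.1. This is a generic illustration of the baseline, not the code of any of the cited systems.

    import numpy as np

    def learn_linear_shape_model(shapes, n_components):
        # shapes: (M, N) matrix, each row a training shape instance S.
        S0 = shapes.mean(axis=0)                              # mean shape
        _, _, Vt = np.linalg.svd(shapes - S0, full_matrices=False)
        Phi = Vt[:n_components].T                             # N x n basis matrix
        return S0, Phi

    def project_shape(S, S0, Phi):
        b = Phi.T @ (S - S0)                                  # subspace coefficients b
        S_approx = S0 + Phi @ b                               # reconstruction per Eq. 4.1
        return b, S_approx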
To validate this deformation manifold concept, we conduct an offline learning process as follows. A face shape is represented by 42 landmark points. For a collected expression video, we use a 2D active shape model to detect the landmark points. These points are tracked across frames using the Lucas-Kanade feature tracker [48]. After tracking, we have the 2D image coordinates of each landmark point, and they represent the 2D shape variation. We project these landmark points into 3D using a previously generated 3D face model. Currently, we use a person-specific face model, but it might be sufficient to use either a generic one, or even a cylinder. Figure 4.2 illustrates the shape model. The deformation vector X is the stack of 3D point displacements, and it is the input to the manifold learning algorithm.

The dimensionality of the facial deformation manifolds is then examined. Figure 4.3 plots the learned manifolds of eight common facial deformations, and table 4.1 reports the estimated dimensionality using Tensor Voting. The eight modes of deformation are surprise, anger, joy, sadness, disgust, close and open eyes, left-eye blinking, and right-eye blinking, and each learned manifold is projected into a 2D space. It is verified experimentally that the dimensionality of each manifold is very close to 1, which confirms our assumption.

Figure 4.3: Learned manifolds of 3D facial deformations (2D projections of the surprise, anger, joy, disgust, sadness, close-eye, left-blinking, and right-blinking manifolds).

Table 4.1: Estimated dimensionality of 3D facial deformations using Tensor Voting (percentage of samples classified as 1D).
    Surprise: 90.1%
    Anger: 85.3%
    Joy: 93.4%
    Disgust: 95.3%
    Sadness: 90.1%
    Close and open eyes: 88.7%
    Left-eye blinking: 92.4%
    Right-eye blinking: 93.9%

Figure 4.4: Vote generation for stick and ball tensors (excerpt from [59]).

To visualize the meaning of the parameter L, figure 4.3 plots each sample in a different color based on its order in the original sequence. We record videos with the subject going from neutral to the apex of the expression and back to neutral. The samples at the beginning and the end are plotted in blue, and the samples in the middle of the video, assumed to be the peak of the expression, are plotted in red.

4.4 Tensor Voting and Nonlinear Manifold Learning

Tensor Voting is a computational framework to estimate geometric information. It was originally developed in 2D for perceptual grouping and figure completion [54], and later extended to 3D and N-D for other problems, such as stereo matching and motion processing [55, 58]. Since the focus of this work is not the Tensor Voting framework itself, we only introduce the fundamental concepts, particularly in the context of learning conceptual manifolds, and refer readers to [57, 59] for the complete presentation and implementation details.

4.4.1 Review of N-D Tensor Voting

Suppose we have a set of samples, {X_i}, in a high dimensional space V, and these samples lie on a manifold M of much lower dimension. Our objective is to infer the geometric structure of this manifold. Instead of directly finding the low dimensional embedding of {X_i} on M, we try to estimate the vectors that span the normal and tangent space at each point, and use them to characterize this manifold.

Figure 4.5: Vote generation for a generic tensor (excerpt from [59]).
Tensor Voting is an unsupervised approach to estimate a structure tensor T at each point. Here, T is a rank-2, symmetric tensor whose quadratic form is a symmetric, nonnegative definite matrix representing the underlying geometry. Given the training samples {X_i}, Tensor Voting encodes each sample as a ball tensor, which indicates an unoriented token. Each X_i receives a vote T_{j→i} from every X_j, and X_i sums up all incoming tensors. T_{j→i} is generated by taking X_j's tensor and the relative orientation between X_i and X_j into account, and weighting by the distance between them. Figures 4.4 and 4.5 illustrate the generation of the tensor vote for stick, ball, and generic tensors. The result of this process can be interpreted as a local, nonparametric estimate of the geometric structure at each sample position.

After accumulating all cast tensors, the local geometry can be derived by examining the eigensystem of T. Recall that a tensor can be decomposed as

    T = ∑_{i=1}^{N} λ_i e_i e_i^T = ∑_{i=1}^{N−1} [ (λ_i − λ_{i+1}) ∑_{k=1}^{i} e_k e_k^T ] + λ_N ∑_{i=1}^{N} e_i e_i^T    (4.3)

where {λ_i} are the eigenvalues arranged in descending order, {e_i} are the corresponding eigenvectors, and N is the dimensionality of the input space. Equation 4.3 provides a way to interpret the local geometry from T. The difference between two consecutive eigenvalues, λ_i − λ_{i+1}, encodes the saliency of a certain structure:

    λ_i − λ_{i+1} >> λ_j − λ_{j+1},  for all j ∈ {1, ..., N−1}, j ≠ i

means that the geometric structure whose normal space is i-D and whose tangent space is (N−i)-D is the most salient. The eigenvectors {e_1, ..., e_i} span the normal space at this point, while {e_{i+1}, ..., e_N} span the tangent space. For example, if λ_{N−1} − λ_N is the most salient gap, the point is considered to lie on a 1D manifold and e_N is its tangent vector.

Figure 4.6 shows an example of applying Tensor Voting in 2D. The red dots represent the training samples, which come from a 1D curve. Each point is also perturbed with noise, so the samples are not perfectly aligned on the conceptual manifold. After Tensor Voting, we have the tangent direction and saliency value at each position. In figure 4.6(a), arrows indicate the direction of the tangent vector, and it is clear that for positions near the conceptual manifold the estimated tangent direction is accurate. On the other hand, figure 4.6(b) is the saliency map of λ_1 − λ_2: the more salient a position, the closer its color is to blue, otherwise it is close to white. We can observe that positions close to the curve have higher saliency. Combining these two figures, we can easily identify the points that are more likely to be on the manifold, and also estimate their tangent vectors. The estimated tangents and saliency values are used to traverse the manifold, as shown in the next section.

Figure 4.6: Example of a Tensor Voting result in 2D. (a) Tangent directions e_2; (b) saliency values λ_1 − λ_2.

4.4.2 Traversing the Manifold with Tangent Vectors

The estimated tangent plane enables us to directly navigate on the manifold. The key idea is to "walk down the manifold" along the estimated tangent hyperplane; a code sketch of this basic step is given below. Figure 4.7 illustrates the process of traversing a manifold with tangent hyperplanes.

Figure 4.7: Traversing the manifold with tangent vectors.

To be more specific, we present four tasks related to 3D facial deformation modeling. The first two tasks are about learning the manifold from the training data, and the last two are inference tasks performed once the manifold has been learned.
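All four tasks iterate the same primitive: interpret the structure tensor at the current position (equation 4.3), project the desired displacement onto the estimated tangent space, and take a bounded step, anticipating the update rule of equation 4.4 below. The following Python sketch is only illustrative; in particular, structure_tensor(x) stands for a callback that returns the tensor accumulated by N-D Tensor Voting at x, and is not an actual routine of this work:

    import numpy as np

    def tangent_basis(T):
        """Interpret a structure tensor T as in equation (4.3).

        Returns (d_normal, H): the estimated normal-space dimension i,
        chosen as the largest eigenvalue gap lambda_i - lambda_{i+1},
        and the matrix H of eigenvectors e_{i+1}..e_N spanning the
        tangent space.
        """
        eigvals, eigvecs = np.linalg.eigh(T)                 # ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
        gaps = eigvals[:-1] - eigvals[1:]                    # lambda_i - lambda_{i+1}
        i = int(np.argmax(gaps)) + 1                         # normal space is i-D
        return i, eigvecs[:, i:]                             # H is N x (N - i)

    def step_on_manifold(x, target, structure_tensor, alpha=0.1):
        """One step of 'walking down the manifold' toward `target`.

        The desired displacement is projected onto the tangent space,
        D = H H^T (target - x), and a step of length alpha * ||D|| is taken.
        Returns the new position and the length of the step taken.
        """
        _, H = tangent_basis(structure_tensor(x))
        D = H @ (H.T @ (target - x))
        return x + alpha * D, float(np.linalg.norm(alpha * D))

In the tasks below, Algorithm 1 applies this kind of step jointly in L and X, Algorithm 2 applies it in X only, moving toward the observation, and Training Task 1 accumulates the step lengths to approximate a geodesic distance.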
Training Task 1. Estimate d_Geo(X_i, X_j), the geodesic distance between X_i and X_j.

d_Geo(X_i, X_j) can be estimated as the minimum traveling distance between X_i and X_j along the manifold. Let X_i be the starting point and X_j the destination, and set X^0 = X_i. At each X^k, we move toward the destination under the guidance of the moving direction D^k:

    D^k = H^k H^{kT} (X_j − X^k)    (4.4)

where H^k is the matrix of tangent vectors estimated by Tensor Voting. We then move iteratively and accumulate the geodesic distance:

    X^{k+1} = X^k + α^k D^k
    d_Geo = d_Geo + ||α^k D^k||    (4.5)

until X^k reaches an ε-neighborhood of X_j. The step length α is chosen to ensure that we stay within the manifold. In other works [6, 18, 19], the authors have to use a very small step to avoid leaving the support of the manifold. In contrast, Tensor Voting provides a mechanism to prevent this undesirable situation, as saliency indicates the underlying structure: if we reach a point outside the manifold, the saliency value is low and we switch to a smaller step. This is clearly an advantage over other approaches.

Training Task 2. Recover the manifold coordinate L of X.

Given two samples X_i and X_j, the geodesic distance d_Geo(X_i, X_j) is assumed to be the same as ||L_i − L_j||_manifold, the distance in the embedded manifold. Therefore, if L_i is known, we can recover L_j as a byproduct of estimating the geodesic distance. To recover the manifold coordinates of a given training set {X_i}, we first identify a sample as the starting point. In practice, we include the vector 0 in the training data; its L is defined as 0, since it means zero deformation and represents the neutral shape. Starting from this point, we recursively move outward to estimate the geodesic distances and manifold coordinates.

Inference Task 1. Given L, find its mapping X̂ in the input space.

This is a nonlinear function approximation problem, as the objective is to learn the function f_TV:

    X̂ = f_TV(L)    (4.6)

However, we can solve it as a search problem using the estimated tangent hyperplane. First, we find a point L^0 whose coordinate in the input space is X^0; in practice, L^0 is the nearest point to L in the training set. We then iteratively move X^k and L^k following the estimated tangent vectors, until we reach L. The procedure is outlined in algorithm 1.

    input:  L
    output: X̂
    initialize L^0 and X^0
    while ||L − L^k|| ≥ ε do
        H^k ← Tensor Voting at X^k
        D^k ← H^k (L − L^k)
        X^{k+1} ← X^k + α^k D^k
        L^{k+1} ← L^k + α^k (L − L^k)
    end
    X̂ ← X^k
Algorithm 1: Map L to X̂

Inference Task 2. Given an arbitrary X in the input space, find its optimal projection X̂ on the manifold.

In 3D facial deformation modeling, this is the most important interpretation task, since we want to recover the "true" position from a noisy observation using the deformation model. Given an observation X, it is represented as X = f(L) + ε, where ε is the noise. Let X̂ = f(L) be the optimal projection of X on the manifold. This "projection" operation can be called filtering, since it removes the noise and restores the position on the manifold. If we further assume an equal prior on L, X̂ is:

    L̂ = argmin_L ||X − f_TV(L)||_input
    X̂ = f_TV(L̂)    (4.7)

Therefore, X̂ is the point nearest to X on the manifold. In practice, searching for X̂ is formulated as an optimization problem and solved by algorithm 2.
    input:  X
    output: X̂
    1: initialize X^0
    2: repeat
    3:     H^k ← Tensor Voting at X^k
    4:     D^k ← H^k H^{kT} (X − X^k)
    5:     X^{k+1} ← X^k + α^k D^k
    6: until convergence
    7: X̂ ← X^k
Algorithm 2: Search for the optimal projection of X

In line 1, we choose to initialize X^0 as the nearest point to X among the training samples. The convergence criterion is either ||X − X^k|| < ε or ||X − X^k|| reaching a local minimum.

Figure 4.8: The graphical model of the 3D facial deformation X, the facial expression c, and the activation level L.

4.5 Modeling the Nonrigid Facial Deformation with a Set of 1D Manifolds

We build a statistical model for 3D nonrigid facial deformations, expression, and activation level based on the inference methods developed in the previous section. Recall equation 4.2: the 3D nonrigid facial deformation, X, is a mapping from embedded low dimensional manifolds:

    X = f_TV(L, c) + ε

Here, c ∈ {1, 2, ..., K} denotes a facial deformation manifold. There are several manifolds in the input space, and these manifolds may have the same dimensionality, as we have shown that there are multiple 1D deformation manifolds of a human face. An instance of facial deformation is jointly governed by c and L. This relation can be illustrated by a graphical model (see figure 4.8).

Under this setting, a natural expression is represented by a set of 1D manifolds. As demonstrated in section 4.3, there exists a set of 1D manifolds in the high dimensional input space, and each manifold corresponds to a basic expression. These manifolds are learned using the procedures presented in section 4.3 with the training tasks of section 4.4.2. On the other hand, a natural expression is usually complicated and is modeled as a mixture of these basic expressions. Given an input sample X, its "projection" on these manifolds is:

    X̂ = ∑_c X̂_c P(c | X, L̂_c)    (4.8)

where L̂_c and X̂_c are computed using equation 4.7.

Equation 4.8 indicates that, for a given set of facial deformation manifolds, the estimate from the observation X is an expectation over the manifolds. For each manifold c, we find the "optimal projection" X̂_c of X on this manifold, since X̂_c is the most likely true position if X was generated from manifold c. Thus, we apply algorithm 2 to find the mode of each manifold. However, the optimality of X̂_c only holds for a specific manifold c, while there are several manifolds for different expressions in the input space. We incorporate the posterior of c to address this issue: intuitively, the posterior of c gives low probability to unlikely expressions and associates a higher weight with the correct ones.

To compute the posterior probability of c, Bayes' rule is applied:

    P(c | X, L) ∝ P(X, L | c) P(c)    (4.9)

In equation 4.9, P(c) defines the prior and P(X, L | c) is the likelihood of expression c. For a given sample X and recovered manifold coordinate L, P(X, L | c) measures the likelihood of X belonging to each manifold c. Given a sample X and the recovered low dimensional embedding L for manifold c, the joint probability of X and L is defined using a Gaussian mixture model (GMM):

    P(X, L | c) = ∑_m w_m P_m(X | L, c) P_m(L | c)
                = ∑_m w^c_m P_G(X | f_TV(L, c), Σ^c_m) P_G(L | μ^c_m, σ^c_m)    (4.10)

where m is the indicator variable of the Gaussian component, and f_TV(·, c) is the mapping for manifold c. P_G(X | f_TV(L, c), Σ^c_m) is the Gaussian density with mean f_TV(L, c) and covariance matrix Σ^c_m; for simplicity, we assume Σ^c_m is diagonal. P_G(L | μ^c_m, σ^c_m) is the Gaussian pdf with mean μ^c_m and standard deviation σ^c_m.
w_m is the weight of the m-th Gaussian component, with ∑_m w_m = 1. These parameters are all learned from the training data using the EM (Expectation-Maximization) algorithm [52, 53, 66, 73].

Equation 4.10 relates the relative position between X and its optimal projection X̂ to its manifold coordinate L. Using this formulation, one can discriminate between the X̂_c of different manifolds c by measuring the likelihood, and thus compute the posterior of each expression. Note that, under this setting and the equal prior assumption on L, for a given sample X the optimal projection of X onto manifold c, X̂_c, is the one presented in equation 4.7, even though equation 4.10 is a Gaussian mixture.

Figure 4.9: The graphical model of the 2D shape u, the 3D head pose θ, the 3D deformation X, the facial expression c, and the activation level L.

4.6 Tracking the Nonrigid Facial Deformation with Head Pose

The proposed deformation manifold method can be used to track the 3D head pose together with the nonrigid facial deformation. The 3D shape, the 2D observation, and the head pose can be represented by the graphical model in figure 4.9. Here, θ = {ω_x, ω_y, ω_z, t_x, t_y, t_z} is the 3D head pose, and u is the 2D observed positions of the landmark points in the image. To extract u from images, we train a feature detector for each landmark point. The detector is learned with the real-AdaBoost framework [68] and its structure is a nested cascade [37]. For each incoming frame, we first use a Lucas-Kanade feature tracker [48] to initialize the 2D landmarks, and then apply the feature detectors to update their positions.

Based on this graphical model, the head pose θ and the expression variables c and L are conditionally independent given X:

    θ ⊥ (c, L) | X, u    (4.11)

which means that, if X is given, the inference of θ and of (c, L) can be performed separately:

    P(θ, c, L | X, u) = P(θ | X, u) P(c, L | X)    (4.12)

On the other hand, the estimation of X has to take θ, c, L, and u into account. Given u and θ, an initial guess of the 3D facial deformation, X̃, can be computed by back-projecting u into 3D using a 3D face model. X̃ is a noisy observation of the true deformation, and we rely on the offline learned manifold-based facial deformation model to refine the estimate. Thus, equation 4.8 is applied to infer X̂, the estimate of the facial deformation from u and θ, as an expectation over the posterior probability.

The inference of the head pose θ and the nonrigid deformation X is divided into two steps:

• Estimate θ.
  By introducing X, estimating θ is performed solely based on P(θ | X, u). If we further assume a Gaussian model for P(θ | X, u), the estimate is computed by minimizing the reprojection error between the 2D-3D correspondences u and X with respect to the 3D head pose θ:

      θ̂ = argmin_θ ∑_i ||F(P_i, θ) − p_i||_2^2    (4.13)

  where P_i and p_i are the 3D and 2D positions of the i-th landmark point, respectively. The 3D position of a landmark point, P_i, is the composition of the neutral shape and the 3D facial deformation X.
  In equation 4.13, θ̂ is the optimum of the reprojection error over the landmark points. It can be found using iterative optimization algorithms, such as Gauss-Newton or Levenberg-Marquardt. A reweighting scheme, such as m-estimator type weighting, can also be used to improve the robustness.

• Estimate c and L.
  Conditioned on X, estimating c and L is performed using P(c, L | X).
Since c is an integer and there is only a finite set of facial deformation manifolds, we compute the optimal L for each possible c: given X, we apply algorithm 2 to estimate L̂_c and X̂_c, and compute the posterior probability using equation 4.9 for each c, as discussed in section 4.5.

Algorithm 3 outlines the tracking procedure. Currently, the implemented system runs at about 1 frame per second, and the bottleneck is the traversal of the manifold (algorithm 2), as it requires N-D Tensor Voting in the online tracking stage. However, this operation can be further sped up by using the power of the GPU [56], and we believe real-time performance can be achieved.

    initialize tracker
    foreach frame do
        θ̂ ← 3D rigid head tracker
        ũ ← Lucas-Kanade feature tracker
        repeat
            û ← apply the feature detectors at ũ
            X̃ ← back-project û using θ̂ and the 3D face model
            foreach c do
                L̂_c, X̂_c ← algorithm 2 with X̃
                P(c | X̃, L̂_c) ← P(X̃, L̂_c | c) P(c) / ∑_c P(X̃, L̂_c | c) P(c)
            end
            X̂ ← ∑_c X̂_c P(c | X̃, L̂_c)
            θ̂ ← argmin_θ ∑_i ||F(P_i, θ) − p_i||_2^2
            ũ ← project X̂ using θ̂
        until convergence
    end
Algorithm 3: Tracking algorithm

4.7 Experiments

4.7.1 Evaluation of the Proposed Method

The objective of this evaluation is to compare the performance of our proposed nonlinear manifold approach with other approaches for recovering the correct face shape under noise. We manually label the ground truth positions of all landmark points, and their 3D coordinates are reconstructed using the 3D face model, as described in section 4.3. For each true 3D shape X, we perturb the 3D positions of the landmark points independently with Gaussian noise. This step is repeated 10 times to produce 10 test samples. We have 20 ground truth shapes for each expression, in total 20 × 8 = 160 true shapes and 1600 test samples. The noisy shape, X_0, is considered as the initial observation of the 3D shape. We then use different methods to estimate X̂ from X_0. This evaluation can be interpreted as examining the "denoising" ability, as we try to filter out the noise and estimate the true shape.

For our proposed method, we estimate X̂_TV using algorithm 2. For the linear subspace approach, we use PCA to learn a person-specific model. We compare with two types of PCA estimation, naive PCA estimation and shrinkage PCA estimation. Recall that PCA finds the optimal projection of X in the linear subspace:

    X = X_mean + Φb

where Φ is the matrix of eigenvectors of the principal components. The naive PCA estimate X̂_NPCA of an observation X is:

    X̂_NPCA = X_mean + ΦΦ^T (X − X_mean)    (4.14)

Although naive PCA enjoys the advantages of easy implementation and inexpensive computation, it has several drawbacks. One is that it may generate "unallowable" shapes. This has been investigated in the literature: for example, in [41], Li and Ito proposed a shape parameter space optimization technique for the 2D active shape model, in which a set of complicated rules is devised to eliminate unallowable configurations. For 3D shapes, [34] proposed a shrinkage process to restrict the variation in the parameter space. The shrinkage PCA estimate X̂_SPCA is:

    b̂_i = β_i b_i,   β_i = λ_i / (λ_i + σ)
    X̂_SPCA = X_mean + Φb̂    (4.15)

where b_i is the i-th element of b, λ_i is the eigenvalue of the i-th principal component, and σ is the sum of the residual energy. β_i is the shrinkage coefficient that controls the feasible range of the i-th component; a significant component has a larger variation range while a less significant one has a smaller range.
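Both PCA estimators compared against our method are straightforward to write down. The following Python sketch is ours and only illustrative of equations 4.14 and 4.15 (the names are hypothetical, and the inputs are assumed to come from a PCA fit such as the one sketched earlier in section 4.3):

    import numpy as np

    def naive_pca_estimate(X, X_mean, Phi):
        """Equation (4.14): project the noisy shape onto the PCA subspace."""
        return X_mean + Phi @ (Phi.T @ (X - X_mean))

    def shrinkage_pca_estimate(X, X_mean, Phi, eigvals, residual_energy):
        """Equation (4.15): shrink each coefficient by beta_i = lambda_i / (lambda_i + sigma).

        `residual_energy` plays the role of sigma, the energy left outside
        the retained principal components.
        """
        b = Phi.T @ (X - X_mean)                       # raw PCA coefficients
        beta = eigvals / (eigvals + residual_energy)   # per-component shrinkage
        return X_mean + Phi @ (beta * b)

In the evaluation below, these two estimators and the Tensor Voting projection of algorithm 2 are applied to the same noisy shapes.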
To measure the error, X, X_0, and the estimates X̂ are all projected onto a 640×480 image. Three performance measures are computed:

• Mean error. It measures the average pixel displacement per estimated landmark point with respect to the ground truth:

      (1/N) ∑_{i=1}^{N} [ (1/42) ∑_{k=1}^{42} ||p̂^i_k − p^i_k||_2 ]    (4.16)

  where N is the total number of test samples, and p̂^i_k and p^i_k are the 2D positions of the k-th landmark point of the i-th test sample for the estimated shape and the true shape, respectively.

• Max error. It measures the largest pixel distance among all landmark points with respect to the projected position of the ground truth:

      (1/N) ∑_{i=1}^{N} max_k ||p̂^i_k − p^i_k||_2    (4.17)

• Percentage. It computes the percentage of estimated landmark points within r pixels of the projected position of the ground truth:

      (1/N) ∑_{i=1}^{N} [ (100/42) ∑_{k=1}^{42} I(||p̂^i_k − p^i_k||_2 < r) ]    (4.18)

  where I(·) is the indicator function, and we set r = 2. It measures the percentage of "accurate" landmark points, where accuracy is defined as lying within an r-neighborhood of the ground truth.

Figure 4.10: Quantitative evaluation of our proposed method. (a) Average error; (b) max error; (c) percentage of points within 2 pixels.

The quantitative evaluation is reported in figure 4.10. The reported statistics are:

• Initial. The error of the initial observation X_0, included here as the evaluation baseline.
• Tensor Voting. The proposed nonlinear manifold approach based on Tensor Voting.
• NPCA. The naive PCA estimate of equation 4.14.
• SPCA. The shrinkage PCA estimate of equation 4.15. The attached numbers, 90%, 85%, and 75%, denote the energy of the selected principal components.

From figure 4.10, the low-energy PCA has higher error than the high-energy PCA at low noise levels, but lower error at high noise levels. This confirms our understanding of PCA: decreasing the energy means sacrificing details, but the model becomes less sensitive to noise and outliers. Note that SPCA is slightly better than NPCA in this evaluation, regardless of the energy. Among all cases, the proposed approach is consistently better than the other approaches, especially at high noise levels. This clearly demonstrates the advantage of using nonlinear manifold learning for facial deformations: since it better characterizes the intrinsic structure of the high dimensional space, it resolves the issues of unallowable shapes and outliers.

4.7.2 Tracking

Figure 4.11 shows tracking results from our proposed method. In the test sequences, the subject starts from a neutral expression at frontal pose and performs multiple expressions. The tracker is initialized by an automatic 2D ASM fitting. In figure 4.11, the first row shows the tracking results: the green arrow indicates the estimated head pose and the red points represent the tracked landmark points. We can see that even with out-of-plane rotation, the proposed method still tracks well, and some subtle deformations, such as blinking, can be identified by the proposed algorithm. The second row shows the reprojection of the estimated 3D shapes in frontal pose: since the tracker infers rigid and nonrigid motions in 3D, the inferred 3D landmark points can be projected into an arbitrary pose by changing the 3D head pose. The last row shows the estimated probability of each expression. There are 8 modes of expression; from left to right they are surprise, anger, joy, disgust, sadness, close and open eyes, left-eye blinking, and right-eye blinking.

Probing deeper, figure 4.12 shows the interplay between the probability and the manifold coordinate.
"Prob." and "L" are the estimated probability and manifold coordinate of "Surprise", respectively. The bottom row shows the tracked landmarks and example frames of this expression, with indices 0, 5, 10, ..., 35, 40. All of them are outputs of the proposed tracker. "L" is also scaled into the [0, 1] interval by the maximum value in the training set. It is clear that L can represent the activation level of this expression, which agrees with our interpretation. Moreover, the tracker assigns a low probability to the correct expression at the onset and offset, since the expression is then close to neutral; as the expression starts to activate, the ambiguity decreases and the probability of "Surprise" increases.

Figure 4.11: Tracking and interpretation results.

Figure 4.12: Probability and manifold coordinate for "surprise".

Figure 4.13 shows another result, from a segment of smile. The last row shows every 5th frame of this sequence. "Prob." and "L" are the estimated probability and manifold coordinate of "Smile", respectively. We can observe a similar phenomenon: the activation and the probability increase together.

Figure 4.13: Probability and manifold coordinate for "smile".

4.7.3 Synthesis

The proposed method can also be used to generate expression sequences for each basic expression. For each known basic expression, we synthesize the expression sequence by changing the value of the activation level L. Figure 4.14 shows a synthesized sequence of a left-eye blinking expression. The sequence is generated from neutral to the apex of this mode, as we control the manifold coordinate L. We use algorithm 1 to map L to X̂ and project the result into the 2D image plane. Figure 4.15 shows another synthetic sequence, of a surprise expression.

Figure 4.14: Synthesized shapes for left-eye blinking.

Figure 4.15: Synthesized shapes for a surprise expression.

4.8 Summary

We have proposed a new deformable face model based on nonlinear manifolds for 3D facial expressions. The 3D facial deformation model is a combination of several 1D manifolds, and each manifold represents a mode of expression. We apply Tensor Voting to learn these nonlinear deformation manifolds: Tensor Voting estimates the local tangent hyperplane and provides a robust estimation tool for unseen, noisy input data. An iterative algorithm is proposed to infer the 3D head pose, the 3D deformations, the expression class, and the manifold coordinate, which indicates the activation level of an expression. Thus, the output of our tracker is a rich representation of the current status of the face.

Chapter 5

Conclusions and Future Directions

In this work, we have addressed the problem of automatic understanding of facial gestures. Given images or video sequences, we quantify and analyze the facial motions.
The facial motions are decomposed into two components: the global, rigid head motion, and the local, nonrigid facial deformations. After modeling and tracking these two components, we combine the two interpretations and read them as facial expressions or gestures. This thesis focuses on exploring and developing novel approaches to achieve this objective, extending the capability or improving the performance of current state-of-the-art methods. These approaches can be used in a wide variety of applications, such as face reconstruction, animation, behavior understanding, human-computer interaction, and perceptual user interfaces.

5.1 Summary of Contributions

We summarize the contributions of this work:

• 3D head tracker (chapter 2):
  We developed a novel hybrid tracker to estimate the 3D head pose, and used it to differentiate the rigid and nonrigid facial motions. The hybrid tracker integrates both feature- and intensity-based information to achieve real-time, robust 3D head pose estimation. Extensive experiments with synthetic, real, and near-IR sequences have been conducted to demonstrate the superior performance of our hybrid tracker.

• Facial expression recognition system, the first generation (chapter 3):
  Based on the 3D head tracker, we proposed and implemented the first generation of the expression recognition system:
  - use the 3D head tracker to estimate and cancel the global 3D head motion;
  - devise a region-based representation for the human face;
  - construct a graphical model to characterize the interdependency between face regions and learn this graphical model empirically via the EM algorithm;
  - use the belief propagation algorithm to infer the hidden variables of the graphical model, and Bayesian MAP estimation to classify the expressions.
  The implemented system has been used in several applications, such as human-robot interaction. We also built an interactive expression recognition system and delivered it to the California Science Center for a national traveling exhibition.

• Facial expression recognition system, the second generation (chapter 4):
  To improve the performance of the first generation system, we proposed and implemented the second generation system by learning submanifolds:
  - Propose a novel formulation of 3D nonrigid facial deformations based on nonlinear manifolds. Instead of using conventional linear subspace analysis, we argue that nonrigid facial deformation is better modeled by a set of 1D manifolds. Each 1D manifold represents a basic/primitive expression, and a natural expression is a composition of these manifolds.
  - Develop novel manifold learning algorithms based on N-D Tensor Voting. This manifold learning approach uses the estimated tangent hyperplane to characterize and traverse the manifold, and is shown to be robust to noise and outliers.
  - Estimate the rigid head motion and nonrigid facial deformations simultaneously. Both rigid and nonrigid facial motions are analyzed at the same time: we decouple the facial motions by building a graphical model for the interaction between these two components. Based on this graphical model, an iterative algorithm is developed to infer the 3D head pose, the 3D nonrigid facial deformation, the expression label with its probability, and the activation level. The implemented system can track a rich representation of the human face.

5.2 Future Directions

Some interesting future directions can be explored based on this work:

• Person-independent manifolds.
  Currently, we are learning person-dependent manifolds.
However, facial expressions exhibit strong similarities across different subjects, and we believe this property can be extracted from empirical data by some nonlinear manifolds. Investigating this direction would lead the current formulation to a person-independent setting. A generic face model can be used to extract the 3D facial deformations across different subjects for deformation manifold learning. Moreover, an ideal person-independent manifold captures the "type" of facial expressions; thus, it differentiates the "personal signature" from the universal components of facial expressions.

• How many manifolds are required to model all possible facial deformations?
  Follow-up research should address this question, searching for a compact set of manifolds that covers all possible facial deformations. It is interesting to see whether these manifolds correspond to prototype expressions. In the psychology literature, 6 prototype expressions have been identified [23]. Examining the relationship between the vision-based deformation manifolds and the psychology-based prototype expressions should be an interesting future research direction.

• Orthogonality of deformation manifolds.
  The orthogonality of these deformation manifolds is worth investigating further. A conjugate question to the previous one is: are these manifolds independent of each other? As shown in our experiments and in the literature, some expressions are more likely to be confused with each other, and this may be due to the orthogonality of the different deformation manifolds. It would be interesting to quantify the orthogonality between different deformation manifolds, and compare it with the ambiguity and confusion matrices from the recognition evaluation. It would also be interesting to compare the classification results of the computer versus human perception; with such a user study, we could explore the difference between human visual perception and computer modeling.

• Synthesis and blending of expressions.
  In section 4.7.3, we showed that the proposed facial deformation manifolds can be used to synthesize expressions. However, that is synthesis of a single, known expression. It would be an interesting question to explore ways to manufacture an artificial or blended expression, such as half smile and half surprise.

• Application to human-computer/robot/machine interaction.
  In many HCI applications, a perceptual user interface is desired to achieve a natural, convenient, and intuitive way of communication. In these applications, knowing the user's mental status is critical, and the face plays an important role in deriving this information. Thus, applying our developed systems to HCI applications would be an interesting direction.

References

[1] Hirotugu Akaike. A new look at the statistical model identification. IEEE Trans. on Automatic Control, 19(6):716-723, December 1974.
[2] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. IJCV, 56(3):221-255, March 2004.
[3] Simon Baker, Raju Patil, Kong Man Cheung, and Iain Matthews. Lucas-Kanade 20 years on: Part 5. Technical Report CMU-RI-TR-04-64, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, November 2004.
[4] Marian Stewart Bartlett, Gwen Littlewort, Mark Frank, Claudia Lainscsek, Ian Fasel, and Javier Movellan. Recognizing facial expression: Machine learning and application to spontaneous behavior. In CVPR 2005, volume 2, pages 568-573.
[5] Sumit Basu, Irfan Essa, and Alex Pentland. Motion regularization for model-based head tracking. In ICPR 1996, volume 3, pages 611-616.
[6] Yoshua Bengio, Martin Monperrus, and Hugo Larochelle.
Nonlocal estimation of manifold structure. Neural Computation, 18(10):2509-2528, October 2006.
[7] Michael J. Black and Yaser Yacoob. Recognizing facial expressions in image sequences using local parameterized models of image motion. IJCV, 25(1):23-48, October 1997.
[8] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH 1999, pages 187-194.
[9] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3D morphable model. PAMI, 25(9):1063-1074, September 2003.
[10] Goose Bumps! http://www.fearexhibit.org/.
[11] Marco La Cascia, Stan Sclaroff, and Vassilis Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. PAMI, 22(4):322-336, April 2000.
[12] Ya Chang, Changbo Hu, and Matthew Turk. Probabilistic expression analysis on manifolds. In CVPR 2004, volume 2, pages 520-527.
[13] Ira Cohen, Nicu Sebe, Fabio G. Cozman, Marcelo C. Cirelo, and Thomas S. Huang. Learning Bayesian network classifiers for facial expression recognition using both labeled and unlabeled data. In CVPR 2003, volume 1, pages 595-601.
[14] Ira Cohen, Nicu Sebe, Ashutosh Garg, Lawrence S. Chen, and Thomas S. Huang. Facial expression recognition from video sequences: Temporal and static modeling. CVIU, 91(1-2):160-187, July 2003.
[15] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. PAMI, 23(6):681-685, June 2001.
[16] Timothy F. Cootes, Christopher J. Taylor, David H. Cooper, and Jim Graham. Active shape models: Their training and application. CVIU, 61(1):38-59, January 1995.
[17] Douglas DeCarlo and Dimitris Metaxas. The integration of optical flow and deformable models with applications to human face shape and motion estimation. In CVPR 1996, pages 231-238.
[18] Piotr Dollar, Vincent Rabaud, and Serge Belongie. Learning to traverse image manifolds. In NIPS 2006, pages 361-368.
[19] Piotr Dollar, Vincent Rabaud, and Serge Belongie. Non-isometric manifold learning: Analysis and an algorithm. In ICML 2007, pages 241-248.
[20] Gianluca Donato, Marian Stewart Bartlett, Joseph C. Hager, Paul Ekman, and Terrence J. Sejnowski. Classifying facial actions. PAMI, 21(10):974-989, October 1999.
[21] Fadi Dornaika and Franck Davoine. Simultaneous facial action tracking and expression recognition using a particle filter. In ICCV 2005, volume 2, pages 1733-1738.
[22] Fadi Dornaika and Franck Davoine. Simultaneous facial action tracking and expression recognition in the presence of head motion. IJCV, 76(3):257-281, March 2008.
[23] Paul Ekman and Wallace V. Friesen. Unmasking the Face. Prentice Hall, New Jersey, 1975.
[24] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, Palo Alto, CA, 1978.
[25] Ahmed Elgammal. Learning to track: Conceptual manifold map for closed-form tracking. In CVPR 2005, volume 1, pages 724-730.
[26] Ahmed Elgammal and Chan-Su Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In CVPR 2004, volume 2, pages 681-688.
[27] Irfan Essa and Alex Pentland. Coding, analysis, interpretation, and recognition of facial expressions. PAMI, 19(7):757-763, July 1997.
[28] FaceVision200. Geometrix, http://www.geometrix.com.
[29] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: A survey. Pattern Recognition, 36(1):259-275, January 2003.
[30] Douglas Fidaleo, Gérard Medioni, Pascal Fua, and Vincent Lepetit.
An investigation of model bias in 3D face tracking. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pages 125-139, 2005.
[31] Douglas Fidaleo and Ulrich Neumann. CoArt: Co-articulation region analysis for control of 2D characters. In Computer Animation 2002, pages 17-22.
[32] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[33] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In NIPS 1993, volume 6, pages 120-127.
[34] Lie Gu and Takeo Kanade. 3D alignment of face in a single image. In CVPR 2006, volume 1, pages 1305-1312.
[35] Gang Hua and Ying Wu. Multi-scale visual tracking by sequential belief propagation. In CVPR 2004, volume 1, pages 826-833.
[36] Gang Hua, Ming-Hsuan Yang, and Ying Wu. Learning to estimate human pose with data driven belief propagation. In CVPR 2005, volume 2, pages 747-754.
[37] C. Huang, H. Ai, B. Wu, and S. Lao. Boosting nested cascade detectors for multi-view face detection. In ICPR 2004, pages 415-418.
[38] Peter J. Huber. Robust Statistics. Wiley, New York, 1981.
[39] Michael Isard. Pampas: Real-valued graphical models for computer vision. In CVPR 2003, volume 1, pages 613-620.
[40] Sadanori Konishi and Genshiro Kitagawa. Generalized information criteria in model selection. Biometrika, 83(4):875-890, December 1996.
[41] Yuanzhong Li and Wataru Ito. Shape parameter optimization for AdaBoosted active shape model. In ICCV 2005, volume 1, pages 251-258.
[42] Wei-Kai Liao and Isaac Cohen. Classifying facial gestures in presence of head motion. In Vision for Human Computer Interaction, June 2005.
[43] Wei-Kai Liao and Isaac Cohen. Belief-propagation driven method for classifying facial gestures in presence of occlusions. In Vision for Human Computer Interaction, June 2006.
[44] Wei-Kai Liao, Douglas Fidaleo, and Gérard Medioni. Integrating multiple visual cues for robust real-time 3D face tracking. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pages 109-123, 2007.
[45] Wei-Kai Liao and Gérard Medioni. 3D face tracking and expression inference from a 2D sequence using manifold learning. In CVPR 2008.
[46] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, November 2004.
[47] Le Lu, Xiangtian Dai, and Gregory Hager. Efficient particle filtering using RANSAC with application to 3D face tracking. IVC, 24(6):581-592, June 2006.
[48] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674-679, 1981.
[49] Tim K. Marks, John Hershey, J. Cooper Roddey, and Javier R. Movellan. Joint tracking of pose, expression, and texture using conditionally Gaussian filters. In NIPS 2004, pages 889-896.
[50] Iain Matthews and Simon Baker. Active appearance models revisited. IJCV, 60(2):135-164, November 2004.
[51] Iain Matthews, Jing Xiao, and Simon Baker. 2D vs. 3D deformable face models: Representational power, construction, and real-time fitting. IJCV, 75(1):93-113, October 2007.
[52] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1996.
[53] Geoffrey J. McLachlan and David Peel. Finite Mixture Models. Wiley, New York, 2001.
[54] Gérard Medioni, Mi-Suen Lee, and Chi-Keung Tang. A Computational Framework for Segmentation and Grouping.
Elsevier Science, New York, 2000.
[55] Changki Min. Spatiotemporal Motion Analysis Using 5D Tensor Voting. PhD thesis, University of Southern California, August 2006.
[56] Changki Min and Gérard Medioni. Tensor voting accelerated by graphics processing units (GPU). In ICPR 2006, volume 3, pages 1103-1106.
[57] Philippos Mordohai and Gérard Medioni. Unsupervised dimensionality estimation and manifold learning in high-dimensional spaces by tensor voting. In IJCAI 2005, pages 798-803.
[58] Philippos Mordohai and Gérard Medioni. Stereo using monocular cues within the tensor voting framework. PAMI, 28(6):968-982, June 2006.
[59] Philippos Mordohai and Gérard Medioni. Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning. Morgan and Claypool Publishers, 2007.
[60] Louis-Philippe Morency, Ali Rahimi, Neal Checka, and Trevor Darrell. Fast stereo-based head tracking for interactive environment. In FGR 2002, pages 375-380.
[61] Louis-Philippe Morency, Ali Rahimi, and Trevor Darrell. Adaptive view-based appearance model. In CVPR 2003, volume 1, pages 803-810.
[62] Maja Pantic and Leon J.M. Rothkrantz. Automatic analysis of facial expressions: The state of the art. PAMI, 22(12):1424-1445, December 2000.
[63] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127-154, July 2006.
[64] Photomodeler. http://www.photomodeler.com.
[65] Krishnan Ramnath, Seth Koterba, Jing Xiao, Changbo Hu, Iain Matthews, Simon Baker, Jeffrey Cohn, and Takeo Kanade. Multi-view AAM fitting and construction. IJCV, 76(2):183-204, February 2008.
[66] Richard A. Redner and Homer F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239, April 1984.
[67] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 2000.
[68] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, December 1999.
[69] Arno Schodl, Antonio Haro, and Irfan Essa. Head tracking using a textured polygonal model. In Proceedings of Perceptual User Interfaces Workshop (held in conjunction with ACM UIST 1998), 1998.
[70] Ying Shan, Zicheng Liu, and Zhengyou Zhang. Model-based bundle adjustment with application to face modeling. In ICCV 2001, volume 2, pages 644-651.
[71] Leonid Sigal, Sidharth Bhatia, Stefan Roth, Michael J. Black, and Michael Isard. Tracking loose-limbed people. In CVPR 2004, volume 1, pages 421-428.
[72] Erik B. Sudderth, Alexander T. Ihler, William T. Freeman, and Alan S. Willsky. Nonparametric belief propagation. In CVPR 2003, volume 1, pages 605-612.
[73] Martine A. Tanner. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, 1996.
[74] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000.
[75] Ying-Li Tian, Takeo Kanade, and Jeffrey Cohn. Facial expression analysis. In Stan Z. Li and Anil K. Jain, editors, Handbook of Face Recognition. Springer, March 2005.
[76] Yang Tong, Wenhui Liao, Zheng Xue, and Qiang Ji. A unified probabilistic framework for facial activity modeling and understanding. In CVPR 2007.
[77] Luca Vacchetti, Vincent Lepetit, and Pascal Fua. Stable real-time 3D tracking using online and offline information.
PAMI, 26(10):1385-1391, October 2004.
[78] Paul Viola and Michael J. Jones. Robust real-time face detection. IJCV, 57(2):137-154, May 2004.
[79] Christian Vogler, Zhiguo Li, Atul Kanaujia, Siome Goldenstein, and Dimitris Metaxas. The best of both worlds: Combining 3D deformable models with active shape models. In ICCV 2007.
[80] Zhen Wen and Thomas S. Huang. Capturing subtle facial motions in 3D face tracking. In ICCV 2003, volume 2, pages 1343-1350.
[81] Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade. Real-time combined 2D+3D active appearance models. In CVPR 2004, volume 2, pages 535-542.
[82] Jing Xiao, Jinxiang Chai, and Takeo Kanade. A closed-form solution to non-rigid shape and motion recovery. IJCV, 67(2):233-246, April 2006.
[83] Jing Xiao, Tsuyoshi Moriyama, Takeo Kanade, and Jeffrey Cohn. Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology, 13:85-94, September 2003.
[84] Lukasz Zalewski and Shaogang Gong. 2D statistical models of facial expressions for realistic 3D avatar animation. In CVPR 2005, volume 2, pages 217-222.
[85] Li Zhang, Haizhou Ai, and Shihong Lao. Robust face alignment based on hierarchical classifier network. In ECCV Workshop on HCI, 2006.
[86] Zhiwei Zhu and Qiang Ji. Robust real-time face pose and facial expression recovery. In CVPR 2006, volume 1, pages 681-688.
Abstract
This research focuses on tracking, modeling, quantifying and analyzing facial motions for gesture understanding. Facial gesture analysis is an important problem in computer vision since facial gestures carry signals besides words and are critical for nonverbal communication. The difficulty of automatic facial gesture recognition lies in the complexity of face motions. These motions can be categorized into two classes: global, rigid head motion, and local, nonrigid facial deformations. In reality, observed facial motions are a mixture of these two components.