G-FOLDS: AN APPEARANCE-BASED MODEL OF FACIAL GESTURES FOR PERFORMANCE DRIVEN FACIAL ANIMATION

by

Douglas Alexander Fidaleo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2003

Copyright 2003 Douglas Alexander Fidaleo

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-1695

This dissertation, written by Douglas Alexander Fidaleo under the direction of his dissertation committee, and approved by all its members, has been presented to and accepted by the Director of Graduate and Professional Programs, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: August 12, 2003

Dissertation Committee Chair

Dedication

This thesis is dedicated to my family, without whose endless supply of unconditional love, support, and encouragement I would still be scooping ice cream at Baskin Robbins. And to my fiancée, Karen, for selflessly enduring the long years (and late hours). I promise you won't regret it.

Acknowledgments

I thank my advisor Ulrich Neumann for his financial, moral, and intellectual support during the past several years of my PhD. It was through Ulrich that I learned independent exploration and thought in a research environment and developed intellectual confidence. I thank Ulrich for his confidence in my ability and his willingness to let me explore seemingly off-topic genres for my own personal growth.

J.P. Lewis was always eager to share his time, advice, knowledge and ideas with incredible humility. I am grateful for the sacrifices he has made for me and for the development of the CGIT lab.

I thank Kazunori Okada for his friendship and for turning me on to appearance manifolds. Kaz was instrumental in helping me formulate the ideas for the G-Folds section of this thesis.

I thank my thesis and quals committee members for their time and effort.
Thanks to Vibeke Sorensen for expanding my mind to the broader social and cultural ramifications of my work, and to Shrikanth Narayanan, Isaac Cohen, and Cyrus Shahabi for their numerous comments and suggestions.

One of the most rewarding experiences during my PhD was collaborating with Ann Page, Brian Cooper, and Tomoyuki Isoyama in producing the Comfort Control art installation. I am grateful to each of them for helping me flesh out my ideas and providing a stimulating collaborative environment, and to Jerry Mendel for giving me the opportunity to develop the piece.

I thank Lisette Garcia for helping me through many stressful times with her openness and honesty. And my other lab-mates over the years: Clint, Reyes, Jun, Taeyong, Junyong, Mo, Deng, Ilmi, Bolan, Jong Weon and Suya. I wouldn't have wanted to share the PhD experience with anyone else.

Thanks to the IMSC and USC Computer Science department staff for their effort on my behalf, specifically Amy Yung, Cheryl Weinberger, Linda Wright, Ann Spurgeon, Nicole Phillips, and Isaac Maya. Thanks to Dave Schrader for the yearly graduation reminder, and to Tom Malzbender for his comments and encouragement.

From my pre-doctoral work I thank Dr. Ran Libeskind-Hadas and Dr. Shariari, two of the most inspirational professors I have ever had; my undergraduate advisor Dr. Elderkin, who encouraged me to pursue my PhD; and the late Dr. Wong who, through horse racing and craps, inspired me to pursue computer science.

I am grateful to my friends who stuck with me despite my lack of contribution to our relationships over the past few years: Esther, Brian, Bill, Ryan, Dan, Eileen, Madeleine, Reyes, Didi, and Clint.

Once again, I am eternally indebted to my family and fiancée. I thank my dad for teaching me how to think, my mom for teaching me how to feel, and my family for giving me the opportunity to merge the two. I cannot thank them enough for the love and support they have given me from day one. None of this would have been possible without them.

Contents

Dedication
Acknowledgments
List Of Tables
List Of Figures
Abstract
Preface

1 Introduction
  1.1 Problem Statement
  1.2 On Appearance Based Representation: Image Intensity as Basic Features
  1.3 Co-articulation Regions, G-Folds and the GPR
  1.4 Thesis Overview

2 Background and Related Work
  2.1 Facial Analysis
    2.1.1 Appearance Based Facial Analysis
    2.1.2 Face Recognition
    2.1.3 Gesture Analysis
    2.1.4 Universal Expression Analysis
  2.2 Performance Driven Facial Animation
  2.3 Facial Analysis for Animation
    2.3.1 2D Motion Capture and Optical Flow
    2.3.2 Motion Processing for Animator Control
    2.3.3 Feature Templates
    2.3.4 Data Driven Approaches for PDFA
  2.4 Summary of Benefits and Novelty of the Approach

3 Co-Articulation Regions
  3.1 Facial Gesture Set and Partitioning Motivation
  3.2 Co-articulation Regions
    3.2.1 The Canonical CR Template
  3.3 Gesture Sample Acquisition

4 Normalization
  4.1 Overview
  4.2 Reducing the DOFs
  4.3 Head Pose
    4.3.1 LED Tracker
    4.3.2 Pose Normalization with the LED Tracker
    4.3.3 Camera Hat
  4.4 Geometric Normalization
    4.4.1 Feature Point Selection
    4.4.2 RBF Image Warping

5 CoArt
  5.1 Unsupervised Statistical Methods for CR Analysis
    5.1.1 Principal Component Analysis (PCA) and Eigenfaces
      5.1.1.1 Eigenanalysis of Co-articulation Region Data
      5.1.1.2 Issues with PCA for Co-articulation Region Analysis
    5.1.2 Independent Component Analysis
      5.1.2.1 Co-articulation Region Analysis using ICA
    5.1.3 Discussion of Methodology
      5.1.3.1 Nonlinearity of Gesture Profiles
  5.2 PCA and ICA Quantitative Results
    5.2.1 On Gesture Classifier Assessment
      5.2.1.1 Quantitative Measurements
      5.2.1.2 Perturbed Samples
      5.2.1.3 Indirect/Qualitative Assessment
    5.2.2 Binary Gesture Classification
      5.2.2.1 Classification Rate
      5.2.2.2 False Alarm Rate
    5.2.3 Results
  5.3 Region Based Flipbook Animation
    5.3.1 Method
    5.3.2 Results
  5.4 Discussion of Limitations

6 G-Folds and Gesture Polynomial Reduction
  6.1 Appearance Manifolds
    6.1.1 What Causes Appearance Manifolds?
    6.1.2 Dimensionality Reduction and Appearance Manifolds
    6.1.3 Appearance Manifolds and Gesture Polynomials
  6.2 G-Folds and GPR Modeling
    6.2.1 Testing for Gesture Manifolds
    6.2.2 Subspace Parameterization by Least Squares Polynomial Regression
    6.2.3 Reconstruction
    6.2.4 GPR Results
    6.2.5 Gesture Sign Resolution
  6.3 Applying the GPR
    6.3.1 Intensity Labeling
    6.3.2 Quantized Gesture Intensity Classification
  6.4 Discussion

7 GPR Exploration and Intuition
  7.1 Intuitive Interpretation of Points Projected into Gesture Space
    7.1.1 Gesture Samples
    7.1.2 Unseen Gesture Samples
    7.1.3 Arbitrary Expressions
    7.1.4 Speech
    7.1.5 Combined Gestures
  7.2 Generalization
    7.2.1 Cross Subject Classification
  7.3 Discussion

8 Muscle Morphing
  8.1 Overview
  8.2 Character Creation
  8.3 Animation Process
  8.4 Mass Spring Musculature
    8.4.1 Muscle Assembly
  8.5 Results

9 Summary

10 Future Work

Reference List

Appendix A
  GPR Results

Appendix B
  Expression Analysis
  B.1 Introduction
  B.2 Holistic Expression Classifier
    B.2.1 Expression Training Data Acquisition and Preprocessing
    B.2.2 Results
  B.3 Gesture Space Representation for Expressions

Appendix C
  Comfort Control Art Installation
  C.1 Introduction
  C.2 Artistic Statement
  C.3 Game Overview
    C.3.1 Hardware
    C.3.2 Software
  C.4 Expression Classification
    C.4.1 Training and Expression Classification
  C.5 Discussion

List Of Tables

1.1 Comparison of properties of raw pixel, dense optical flow, and sparse feature point data as basic features for facial analysis.
5.1 Classification results for each gesture. Classification rate is averaged over 3 subjects. Average number of test frames per gesture is 180.
5.2 False alarm rate for each co-articulation region. The quantity shown indicates the percent correct for each region. Decrease in performance in CR5, CR6, and CR7 can be attributed to cross-talk between these regions. Average number of test frames per gesture is 1200.
5.3 Overall classification accuracy for each CR.
7.1 Phonemes and example words used in speech tests and figures 7.7 and 7.8.
C.1 Example image/expression label pairs for each game level.

List Of Figures

1.1 Minute feature motion can change the emotional content of the face. Though subtle, the above images each communicate different emotional states to the observer.
3.1 (left) List of analyzed muscle groups and their locations and directions of contraction. (right) Labeled co-articulation regions and list of contributing muscles.
3.2 (top) Expressions designed to actuate the set of defined muscles. (bottom) Difference images between gesture and neutral face image that assist in identifying co-articulation regions.
3.3 State space defined for three co-articulation regions in the forehead area. The two inward pointing muscles are the corrugator and are always contracted in unison, hence the 2D parameterization of CR1.
3.4 Identification numbers and images of subjects analyzed.
3.5 A gesture image sequence is broken into three segments for training. Neutral images are independently labeled.
3.6 Spontaneous gestures and the muscles for which they provide data.
3.7 Gesture samples assembled into the data matrix used for training CR0. Actual samples are vectorized and comprise the columns of the matrix.
4.1 LED eyeglasses used for head tracking.
4.2 Rigid camera hat used to fix the subject's head pose relative to the camera.
4.3 Labeled feature points we identify on the subject's face. Circled points are those that can be identified with high reliability across different subjects. Circled points are used to define the boundary points.
4.4 RBF image warping from source image to match the geometry of the feature point targets defined on the generic face.
5.1 Example of projection axes computed using PCA on a hypothetical data set.
5.2 Problems with PCA for classification purposes. Projection on the first principal component completely obscures the class structure.
5.3 Mutual contraction of the frontalis and corrugator muscles occurs frequently in expressions of sadness and fear.
5.4 Reconstruction samples for CR8 of Maggie (left) and those extracted from the training video (right). Locations of samples in CR parameter space (center).
5.5 (top) Reconstructed video frames. (bottom) Original normalized video frames.
5.6 Animation frames using gesture analysis to control a hand-drawn character.
6.1 Conceptual diagram of projection from image space to and from original and truncated PCA spaces.
6.2 Projection of gesture data for CR0 onto the first 3 principal components.
6.3 Examples of G-Folds for different subjects and all regions.
6.4 Misbehaved G-Fold structure in a subject with poor mouth muscle control.
6.5 The projected data from each gesture is reprojected into a 2D PCA space and modeled with quadratic polynomials.
6.6 GPR transformation applied to X = CR5.
6.7 GPR transformation applied to X = CR3.
6.8 A new sample is projected onto the polynomial by selecting the bin with the closest centroid.
6.9 Gesture intensity classification using the GPR representation.
7.1 Frown-smile manifold.
7.2 High similarity between low intensity states in gestures for region CR0. (top) Smile, (middle) grimace, (bottom) frown gesture. In all cases, there is significant correlation along the left edge of the sample even at different intensities.
7.3 Appearance similarities between grimace, smile, and frown gestures.
7.4 Grimace gesture in smile-frown space. Variations not correlated with the existing training data are lost. Similarity to frown and smile gestures is indicated by deviation from neutral along the gesture axes. The remaining correlation is with the neutral state. Most samples project close to neutral.
7.5 Trajectory of mouth opening gesture in CR0 space.
7.6 Trajectory of open mouth smile shown as a deviation from the closed mouth smile.
7.7 Trajectory of dee (/D/ /IY/) in CR0 space. Note the correlation to the smile gesture in the /IY/ viseme.
7.8 Trajectories of various phonemes in CR0 space.
7.9 Trajectory of combined frontalis and corrugator contraction in CR1 space.
7.10 Merged data generalization test on CR0 for subjects 1, 2, 3 and 4. Individual structures are shown above the merged case and are oriented to emphasize the 3D structure of the data. In the merged case, the data flattens as some of the structure is lost.
7.11 Merged data for CR1 and subjects 1, 2, and 3. Low variance structure collapses in subjects 2 and 3, while higher variance structure is retained.
7.12 Classification of SID-0007 using SID-0009 for training on CR0.
7.13 Classification of SID-0007 using SID-0009 for training on CR1.
8.1 Skin, muscle insertion, and CR boundary points defined on the Maggie character.
8.2 Example muscle deformations on the Maggie character.
8.3 Complete mass-spring muscle system defined on a human character.
8.4 Muscle Morphing results.
A.1 GPR transformation applied to X = CR0.
A.2 GPR transformation applied to X = CR1.
A.3 GPR transformation applied to X = CR2.
A.4 GPR transformation applied to X = CR3.
A.5 GPR transformation applied to X = CR4.
A.6 GPR transformation applied to X = CR5.
A.7 GPR transformation applied to X = CR6.
A.8 GPR transformation applied to X = CR7.
B.1 Holistic facial analysis using five prototypical facial expressions. Intensity frames are acquired from a neutral state. Face images are masked prior to analysis.
B.2 Holistic expression analysis results.
B.3 Expressions represented as clusters in CR space.
C.1 The Comfort Control art installation. Exterior and interior of the cube. J.P. conjures his best fear expression in response to an image of a pair of capybaras.
C.2 Comfort Control system diagram.

Abstract

In performance-driven facial animation (PDFA), an animated character is driven by the contraction of facial muscles of a performer.
Most existing work maps literal skin motion from the performer to the character, resulting in "human-like" deformation of the face. This is desirable for realistic human animation, but it is unintuitive, difficult to edit, and allows little or no flexibility in re-mapping the resulting animation to a non-human character.

Facial gestures are an abstraction of facial motion, defined as the visual manifestation of the contraction of one or more facial muscles. By mapping face state through this set of abstract parameters, the animator is free to choose the resulting output and is not constrained by the literal interpretation of facial motion. Extracting these parameters, however, has proved to be much more difficult than traditional motion capture and, as such, the majority of facial parameter extraction work has been performed in the area of facial analysis and not applied to PDFA. In PDFA, the intensities of gestures are essential to produce continuous and subtly varying expressions, but existing facial analysis work focuses only on the presence or absence of a gesture (binary analysis).

Though practical to obtain, images of face state are an unnecessarily high dimensional representation for a fairly low degree of freedom phenomenon. To address this problem in the context of facial gesture analysis, in this work the face is partitioned into local regions of change called co-articulation regions (CRs) to constrain the number of muscle degrees of freedom. By reducing the dimensionality of gesture data in each CR with principal component analysis, a coherent low dimensional appearance structure to gesture intensities is uncovered. The structures induced by independent gesture actuations are termed gesture manifolds, or G-Folds. This structure is modeled with quadratic polynomials in Gesture Polynomial Reduction. The continuous G-Fold representation is a fundamental improvement over discrete template based approaches and heuristic models of facial action. The utility of the model is demonstrated by classifying fine levels of gesture intensity and applying the results to a novel 2D animation technique called Muscle Morphing, which combines mass-spring muscle control with image warping.

Preface

Computer science is on the forefront of natural human-computer interaction. The days of the blinking cursor beckoning a response to be barbarically pounded into an archaic array of inefficiently laid, lettered keys are reaching an end. People who have not the patience, nor the desire, and in some cases not the ability, to learn the unnatural modes of communication necessary to interact with their mechanized assistants are now bound for the comfort afforded to them by computers.

Computer scientists working in the field of human-computer interfaces are, in this light, professional mediators, and HCI is the study of communication. The scientist's role is to teach, not the user, but the computer to communicate better with the human as its humble servant, that it may better understand the desire and will of its master, and do so without inspiring fear, resentment, or frustration.

This holy grail of HCI is prismatic. Reaching it requires advances in language processing, visual reconstruction, object recognition and interpretation, and speech analysis and synthesis, to name a few.
It remains to be seen whether we really want computers to have human qualities, but without a doubt we want to give human (and super-human) abilities to our machine servants, as predicated by the oft-exclaimed: "Why won't it do what I tell it to do!" (Of course, this statement reflects a problem of language and semantics rather than one of obedience.)

The recent trend towards natural human-computer interfaces stems directly from our own tendency to be lazy. We are trained from birth how to interact and communicate with other humans in order to manipulate our surroundings in such a way that we ourselves benefit and simplify our lives. Most objects we encounter are simple tools that we can use with minimal effort. Humans, on the other hand, are much more complex objects, and as this complexity increases, so does our need to have a more complex relationship. Our ability to communicate with other humans greatly affects our ability to reap the benefits that the human-tool has to offer. Arguably, the computer is approaching the complexity of human objects, thus requiring its users to engage it in a more sophisticated relationship.

At least two distinct possibilities exist as models for this human-computer relationship. The first is to communicate on the level of the computer, requiring an intricate understanding of the underlying system. In effect, this is the relationship maintained by the computer programmer, who must understand the architecture of the system in order to manipulate it. A second model, gaining popularity, is the "natural" computer relationship. This model stresses the same kinds of relationship paradigms that one has with humans and other living entities. These paradigms include natural modes of communication such as speech, facial expression, and body gestures; hence the body of literature dealing with face and body gesture analysis and animation, and speech recognition and synthesis.

This model has the obvious merit of drawing from human experience and thus not requiring the learning of additional relationship paradigms. However, it also brings along all of the baggage of expectation. We expect humans to act a certain way, we expect a certain level of understanding, and we expect certain actions and reactions to be immediate. Therefore, to communicate effectively, the computer must not only present to the human a meaningful visage, but it must also interpret, understand, and react to the actions and intents of the human counterpart in real time.

While body posture, motion, and hand gestures all play a role in human-to-human interaction, communication is primarily carried out face to face. Hence, the human visage has evolved to enhance this mode of communication, on both the perceptual and production ends. My work therefore dwells in the spectrum of analysis and recognition of facial gestures.

The Problem of Expectation

The human face is capable of producing an infinite number of facial gestures. Each different gesture or combination of gestures manifests itself as a different expression. Some expressions appear identical to the untrained eye, but the emotional intent transferred to the observer can be dramatically different, especially when presented as a continuous stream of facial states (rather than a static snapshot).
Humans have the ability to detect minute changes in their surroundings. Often these changes do not register cognitively, but contribute to a general feeling or sense of the environment. The same holds true for human perception of facial gestures. One may not be able to formulate a coherent description of why they distrust someone with whom they previously had a conversation, but the feeling has registered internally. Similarly, people often describe facial animation as "fake" or "disturbing", with little understanding of why they come to that conclusion.

In the case of computer generated facial animation, the synthetic representations are often close enough to realism that we expect them to behave in a manner consistent with real humans. When the facial motion, co-articulation of muscles, and appearance of wrinkles are even slightly incorrect, we recognize this immediately, and the inability to resolve our expectation makes us judge the representation as wrong.

For cartooned/caricatured animation, we don't have the same expectations as we do for human faces. We maintain a different base expectation class, one of a "cartoon face", with different, looser properties than a real human face. The character is the creation of an animator, and hence the artist can build up the expectation in the viewer in any way she pleases.

Performance Driven Facial Animation

In 3D facial animation an animator is given control over a set of parameters they can manipulate to produce the final animation. The control interface typically requires a conceptual leap between the actions he or she is performing and the resulting actions of the character being animated. Manual mouse-click screen-based interfaces attempt to make the control mechanism intuitive, correlating a screen-based action to underlying parameter changes, and in return reflecting those changes appropriately to the user. However, even if the interface allows accurate reconstruction of the face, the animator must also set the parameters correctly over time or perceptual problems can arise.

Because of the separation between the control methodology and the synthesis parameters, and the sensitivity of the results to human perception, achieving realistic animation is very difficult. For example, in the film Final Fantasy we see highly realistic human face modeling. In static images one may imagine that the face is real. However, this realism is lost at several key points in the film when characters express emotions incongruent with our expectations.

By fusing the parameter sets of the control and synthesis modules we eliminate the intermediate interface. This simplifies the animation creation process dramatically and, with accurate parameter estimation, enforces a plausibility constraint not generally present in facial animation. This is the goal of performance driven facial animation. Motion capture has been used for this purpose, and results in mapping skin motion onto a 3D face. However, this mapping is very literal and difficult to edit without revisiting the individual vertices.

An alternative is to define a more concise and manageable physically inspired parameterization. However, this presents a new challenge: that of defining the parameters and extracting their values from the source face image.
Performance driven facial animation is thus a compelling and challenging end goal of facial gesture analysis.

Applications

There are several application areas to which facial gesture analysis is relevant. In cel-based animation of cartoon faces, the animator frequently has a set of face images in different configurations (different expressions and/or mouth shapes) and manually switches between them to create the desired sequence. The frames must be synchronized with a speech audio track, a process that can be tedious and error prone. Attaching the cartooned face configurations to individual face states in the face parameterization results in the ability to control the cartooned face with an actor's facial gestures.

Similarly, if the image samples are taken from a set of real images of a person's face instead of cartoons, and the reconstruction maintains the fidelity of the original images, we can use the face parameterization/reconstruction model in model-based coding (MBC). An appropriately parameterized MBC face with its reconstruction database can be used in 2D or 3D teleconferencing, saving huge amounts of bandwidth for other data such as high fidelity audio and scene animation data. By using an intuitive parameterization, we allow user intervention into the final representation. For instance, for cultural neutralization, we may want to re-map gestures or lessen their intensity. Similarly, we may want to intensify gestures to emphasize a particular internal sense. Teleconferencing is part of a larger class of applications we call avatar mediated communication. This includes potential applications such as distance learning, remote military command, and internet chat.

Facial gesture analysis is also relevant for psychological research, as indicated by the abundance of literature utilizing FACS to analyze correspondence between certain psychological disorders and facial motion. Finally, with the ability to assist in emotional understanding, it can be a key component in affective computing [Pic97].

Chapter 1

Introduction

1.1 Problem Statement

The term performance driven facial animation (PDFA) was coined by Williams [Wil90] to describe the act of controlling facial animation with the actions (performance) of a live subject. A control signal is derived from one or more input video streams of the performer's face and connected to a 2D or 3D face model with an embedded animation mechanism.

Existing PDFA control work is largely based on mapping motion literally from the performer's face to a 3D representation. This is a desirable approach if the end goal is realistic reproduction of the actor; however, this is often not the case. For example, if a human face is used to control a dog, the human skin motion can appear incorrect when transferred to the dog's face. With direct application of motion capture data, there is a limited ability to control the resulting animation.

The resulting animation can be more easily generated and edited if a flexible and intuitive abstraction of skin motion is introduced. The animator needs a set of abstract parameters that she can modify or map through that are complete and intuitive. Complete
in that they represent a large portion of face state space, and intuitive in that one can easily interpret the semantics of the parameters. Facial gestures are one such set of parameters.

Facial gestures are the visual manifestation of the contraction of one or more facial muscles. These gestures can be created intentionally in an attempt at active communication with another individual, but more often they are passive modes of communication reflecting some internal state (sentiment, emotion, desire, perception). Each gesture has associated with it a characteristic change of visual facial attributes. These changes come in the form of deformed or shifted facial features, skin motion, bulges and wrinkles. Each change is thus correlated to the underlying muscle contractions, and by observing the appearance changes of an individual's face one can also get an idea of the amount of contraction of the underlying muscles.

Determining muscle contractions is an appetizing goal of image based facial gesture analysis, as muscles are a complete basis for facial state. However, images only provide access to the skin surface affected by their contractions, and therefore facial analysis must draw inference between the appearance of muscle contractions (the effect they have on the skin surface geometry, color, or motion) and the contractions themselves.

Facial gesture analysis is a subfield of facial analysis with much active research. Unfortunately, most of this work focuses on binary gesture analysis: determining whether the gesture is active or not, without regard to the level of activity. To be used in PDFA, gestures must be determined with a very fine level of intensity and ideally in real-time. There are two fundamental problems that must be addressed to achieve these goals:

Problem 1: Data Dimensionality

Comparing the number of muscle degrees of freedom (approximately 50) to the number of image DOFs (10000 for a 100 pixel square image), it appears that intensity images of face state are an unnecessarily high dimensional representation for a fairly low degree of freedom phenomenon. However, even though the true facial state muscle basis is known, the phenomenon to be measured is a highly nonlinear function of the state coefficients. Furthermore, without attaching obtrusive devices to our subjects, the high dimensionality arising from the image basis must be dealt with.

Problem 2: Labeled Data

While unsupervised learning techniques such as principal component analysis (PCA) and independent component analysis (ICA) have been used extensively to extract features from images, training a gesture intensity classifier is highly dependent on the existence of a labeled data corpus. For certain facial states, such as the resting and maximum gesture actuation states, it is feasible for a human labeler to manually identify such states with good accuracy. However, gesture intensity is represented by several fine levels of change that are difficult (if not impossible) to identify manually. This severely limits the employable computational learning schemes. The first gesture analysis methods presented in this thesis are confined within these limitations and use correlation based template matching of gesture features extracted using PCA and ICA.
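As a concrete illustration of the correlation-based baseline just described, the following sketch (Python/NumPy, with synthetic data and hypothetical names such as smile_apex; it is not the thesis implementation) builds a small PCA basis from co-articulation region training frames and scores a new frame against manually labeled gesture templates by normalized correlation of the projection coefficients.

```python
import numpy as np

def pca_basis(X, k):
    """Columns of X are vectorized CR training samples; return the mean and top-k basis."""
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :k]

def ncc(a, b):
    """Normalized correlation between two coefficient vectors."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical data: 100x100-pixel CR patches flattened into 10000-vectors.
rng = np.random.default_rng(0)
train = rng.random((10000, 60))                      # 60 training frames for one CR
templates = {"neutral": train[:, 0],                 # manually labeled rest frame
             "smile_apex": train[:, -1]}             # manually labeled apex frame

mean, basis = pca_basis(train, k=10)
project = lambda x: (basis.T @ (x.reshape(-1, 1) - mean)).ravel()

test = rng.random(10000)                             # new, unlabeled frame
scores = {g: ncc(project(test), project(t)) for g, t in templates.items()}
print(max(scores, key=scores.get), scores)
```

Because only the rest and apex states carry labels, a classifier of this form can report which template a frame most resembles, but not a fine intensity level; that limitation is exactly what the G-Fold model introduced below is meant to address.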
Solutions

To address these two problems in the context of facial gesture analysis, Gesture Polynomial Reduction (GPR) is presented as a method for decomposing the high dimensional face image space into a concise low dimensional gesture model. The foundation for this method is the observation that by constraining the number of muscle degrees of freedom (DOFs) in a given sample region, a coherent low dimensional structure is discovered that may be used to concisely represent each gesture. The structural models are called Gesture Manifolds, or G-Folds, and provide a natural ordering of gesture samples relative to gesture intensity, thereby solving the labeling problem.

The remainder of this chapter motivates the choice of intensity images as basic features for analysis over optical flow and sparse feature motion, and introduces co-articulation regions, G-Folds, and the GPR as techniques for solving the dimensionality and labeling problems discussed above.

1.2 On Appearance Based Representation: Image Intensity as Basic Features

Intensity images of the face are easy to obtain and contain all of the information needed to judge facial state; however, it is common practice to extract motion information (sparse feature point motion or dense optical flow) from time ordered sequences of images and then perform further analysis on this motion data. While there are several reasons one may choose motion data, there are significant benefits to using raw-image muscle appearance information over dense optical flow or sparse feature motion, especially for PDFA applications. Table 1.1 summarizes the properties of the features under consideration.

Table 1.1: Comparison of properties of raw pixel, dense optical flow, and sparse feature point data as basic features for facial analysis.

  Property                                   | Raw Pixels | Dense Optical Flow                | Sparse Feature Points
  Sensitivity to static session variations   | HIGH       | MEDIUM                            | LOW
  Sensitivity to dynamic session variations  | HIGH       | HIGH                              | HIGH
  Preprocessing time                         | NONE       | > 1 min                           | real time
  Special requirements                       | NONE       | High resolution, low noise images | Presence of corners
  Dimensionality of data                     | HIGH       | HIGH                              | LOW
  Redundancy                                 | HIGH       | MEDIUM                            | LOW

Dense optical flow algorithms generally require high resolution, low noise images to produce acceptable results. Under these conditions, computation time on the order of minutes per frame is required. If dense motion is not essential, attention may be focused on regions with significant intensity variations to improve tracking accuracy. However, there are limited numbers of such regions on the face (eyes, eyebrows, mouth, ears) that can be detected with repeatable high accuracy in all individuals.

Sparse feature points have similar problems. For example, contraction of muscles such as the orbicularis oculi (the "squint" muscle) affects primarily the upper cheek area, which contains nearly no high contrast features and hence falls between the cracks of sparse feature motion sampling. These areas between the feature points contain rich information related to the underlying muscle contractions in the form of skin wrinkles and
bulges, often resulting in very subtle variations in image intensity. Important information is communicated to an observer through these minute variations.

Both optical flow and feature point tracking systems require a time sequence of images. This makes judgment of facial expression from a static image impossible. As a concrete example, consider Figure 1.1. The image in the center is a neutral expression, the right is discernibly positive/happy, and the left, negative/sad. The only feature point motion that has occurred is at the mouth corners, which have moved 1 pixel in the image plane. The remaining information exists in the textural variations and is sufficient to communicate critical information pertaining to the subject's internal emotional state.

Figure 1.1 (panels: Negative, Neutral, Positive): Minute feature motion can change the emotional content of the face. Though subtle, the images each communicate different emotional states to the observer.

Drawbacks of raw pixel data

Raw image data is not without its faults. Images suffer from static and dynamic sessional variations, geometric normalization problems, high data dimensionality, and high information redundancy. Static variations are visual characteristics of the face that remain constant from the beginning to the end of a performance session, but can change across different sessions. The primary static variations are due to global ambient lighting conditions and changes in facial hair, makeup, and hair style. Feature and optical flow tracking establishes correspondences between pairs or sets of images and is therefore only affected insomuch as the geometry of the face is changed or the ability to establish correspondences is compromised, which is likely to occur only under the most extreme circumstances (for example, applying opaque makeup over the entire face, thereby obscuring facial features).

Dynamic variations, in contrast, may or may not occur between sessions, but do so within a single session. Dynamic variations include occlusions, shadowing (due to self shadows by face features or other environmentally induced shadows), and skin coloration changes. The potentially detrimental effects of dynamic variations are also present implicitly in motion data. Clearly an occlusion due to a hand passing in front of the face, or a shadow cast by another person walking through the room, will adversely affect the motion estimation.

Geometric normalization is a problem specific to intensity and dense optical flow representations. With sparse motion features, the location of feature points defines a geometric structure and provides a one to one mapping of points between frames and
The final problem, (one also plaguing dense optical flow) is the high dimensionality of the data, where th at the subspace of all possible face images occupies only a tiny fraction of the global image space, implying that a standard linear algebraic representation of the data is highly redundant. This can make analysis of the data error prone and unnecessarily computationally expensive. Several linear and nonlinear dimensionality reduction techniques exist, of them linear principal component analysis is widely used for facial analysis. One can blindly apply such techniques to model face data, but significant results and insight can be gained if the structure of the underlying data is considered. Often, this structure is of lower dimensionality than the original data and can, in turn, be used to parameterize a lower dimensional subspace spanning the same set of properties as the original data. 8 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1.3 Co-articulation Regions, G-Folds and the G PR Because muscle contractions impart nonlinear appearance changes on the surface of the skin, it is unlikely that a meaningful low dimensional projection of face data can be found that contains the full set of muscle D OF’s. However, it is observed that muscle actuations do not create global deformations on the face and thus, pixels outside of a muscles region of influence are irrelevant to the appearance of that muscle. By constraining the number of muscle DOF’s in a given data set, a coherent low dimensional structure to the data is uncovered. This thesis introduces three concepts: Co-articulation Regions, Gesture Manifolds, and Gesture Polynomial Reduction. C o-A rticulation R egions The face is partitioned into a small set of local co-articulation regions configured to impose the degree of freedom constraint. The region of influ­ ence of each muscle is analyzed and a set of nine regions is assembled. G esture M anifolds Principal component analysis is applied at the co-articulation re­ gion level on data acquired for a set of uncorrelated facial gestures (the appearance of approximately independent muscle contractions.) A trend is observed of uncorre­ lated gestures tracing coherent paths in this space. These curves are termed gesture manifolds or G-Folds. G-Folds are a simple and visually intuitive model for facial gestures. G esture Polynom ial R eduction Following the manifold discovery phase, the discrete manifold samples are modeled with low dimensional continuous curves using Gesture Polynomial Reduction (GPR). 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1.4 Thesis Overview The remainder of this thesis describes a method to systematically decompose the face and facial gestures into low dimensional parametric G-Fold representations. These methods enable real-time, robust, person-specific gesture intensity classification for performance driven facial animation. In Chapter 3 co-articulation regions are introduced, a face-image partitioning scheme designed to constrain the number of muscle degrees of freedom in each sampling location. This is followed in Chapter 4 by a description of the potential sources of error faced during data acquisition and modeling. The normalization techniques used to mitigate these errors are presented with a focus on radial basis function image warping for CR template alignment. In Chapter 5 an unsupervised learning approach to CR analysis is presented. 
1.4 Thesis Overview

The remainder of this thesis describes a method to systematically decompose the face and facial gestures into low dimensional parametric G-Fold representations. These methods enable real-time, robust, person-specific gesture intensity classification for performance driven facial animation.

In Chapter 3 co-articulation regions are introduced, a face-image partitioning scheme designed to constrain the number of muscle degrees of freedom in each sampling location. This is followed in Chapter 4 by a description of the potential sources of error faced during data acquisition and modeling. The normalization techniques used to mitigate these errors are presented, with a focus on radial basis function image warping for CR template alignment.

In Chapter 5 an unsupervised learning approach to CR analysis is presented. Assessing the quality of gesture classification is problematic due to the absence of labeled gesture intensity data. This issue is discussed, along with alternatives to standard quality evaluation methods, in Section 5.2. The chapter concludes in Section 5.3 with a flip-book style discrete output space animation method.

Chapter 6 motivates and introduces Gesture Polynomial Reduction, a novel CR analysis technique that solves the gesture labeling problem by modeling gesture samples with low-order polynomials. Results of applying GPR to CR data are also presented. Chapter 7 explores the G-Fold space in more detail, building intuition for the meaning behind G-Folds and addressing issues of generalization.

In Chapter 8 a novel 2D performance driven facial animation method driven by output from the continuous G-Fold facial gesture classifier is described. Mass-spring muscles are used to deform neutral expression 2D face images. Chapter 9 concludes with a discussion of the limitations of the work presented in this thesis and future work stemming from the G-Fold model of facial gestures.

Appendix B describes an extension of my current work to facial expression analysis with application to gesture driven 3D facial animation. Appendix C presents an interactive media-art installation called Comfort Control, developed in collaboration with students from USC's School of Fine Arts, utilizing the facial expression analysis work.

Chapter 2

Background and Related Work

This thesis presents a facial analysis approach to performance driven facial animation control. In this section, related work is addressed, as well as drawbacks and limitations of current approaches in both of these fields.

2.1 Facial Analysis

The first facial analysis work was conducted in 1862 by Duchenne de Boulogne [dB90]. Duchenne attached electrodes to his subjects' faces and applied small amounts of electrical stimulation to the underlying muscles, thereby forcing their contraction. He demonstrated how expressions of pain, fear, and happiness could be induced from such stimulation.

Times changed, as did laws governing human experimentation, and alternative methods for facial analysis were born. In 1974 Paul Ekman developed the Facial Action Coding System (FACS) [EF78] as a tool to enable trained psychologists to perform quantitative experiments relating facial motion to underlying psychological activity. FACS identifies 46 independent areas of facial motion called Action Units (AUs) and establishes a methodology for the manual estimation of AU activity from static images of facial expression. Though still in wide use today (particularly by the psychology community), this cumbersome system requires extensive training and time-intensive data collection.

Automatic methods for performing such analysis have risen to meet the limitations and subjectivity of manual FACS data collection. Those methods specific to FACS, as well as other strategies for decomposing facial motion, have wide applicability beyond the psychophysical. From these methods we obtain a snapshot of a parameter set encoding the configuration of the face at a given point in time.
With a suitable parameterization we can drive 2D or 3D animation sequences reflecting the progression of facial state over time. It is also possible to infer facial expressions from combinations of the decomposed units.

2.1.1 Appearance-Based Facial Analysis

Facial images can be modeled as points in a high dimensional space whose dimensionality is defined by the number of pixels in the image. Appearance-based subspace methods of facial analysis typically define the characteristics (degrees of freedom) of a set of faces of interest, and then model the image subspace that spans this set. Identification of the properties of a new image can be posed as a classification problem, where the closest training image in the learned subspace dictates the properties of the test image. The general framework for these approaches consists of four steps:

DOF Selection: Define the specific properties we are interested in analyzing (pose, lighting variations, facial motion, facial structure, etc.).

Sample Acquisition: Acquire sample images representative of these properties.

Subspace Modeling: Perform subspace analysis on the above image set to model the subspace spanned by the images.

Classification: Analyze new images relative to the modeled subspace.

Subspace methods have a large benefit over more classical statistical formulations, in which probability densities must be estimated, as we can often get away with a sparse sampling of the face configuration manifold. This is important, as data collection can be a laborious process, and for some applications, such as Video Rewrite [BCS97], the size of the training data set may be fixed and small. A major difficulty in applying pixel-based subspace methods to facial gesture intensity analysis is the lack of motion state labels that are required to correlate face appearance to motion parameters. Motion templates are derived from finite element simulation of actual skin tissue on a fitted 3D model of a subject in [Ess95]. In general, manual labeling is very difficult for more than a few levels of gesture intensity. Labels are unnecessary for applications where we are only interested in correlating an observed state directly to an appearance. This approach has been taken in our early work [FNK+99]. Yin and Basu [YB97] present a low bit-rate model based coding system that partitions the face into Textures of Interest (TOI) and encodes a set of representative wrinkle textures using PCA. This is similar to [WS92], which also uses region based PCA for encoding local texture changes. However, without labels, only correlation to appearance is provided, with no semantic information about the gesture state.

2.1.2 Face Recognition

Facial recognition is a subproblem of facial analysis where the interest is in characterizing the differences in spatial arrangement and geometric structure between different people's faces for the purpose of classification. Many of the methods used in face recognition have been applied to analysis of face gestures and expressions. The eigenface method by Turk and Pentland [TP91] projects a source facial image into the orthogonal eigenspace computed with PCA from a set of training images (representative faces in neutral pose and expression) and selects a closest match using Euclidean distance.
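To make the subspace framework above concrete, the following is a minimal sketch of eigenface-style training and nearest-neighbor matching in Python with NumPy. It illustrates the general approach rather than the implementation of [TP91]; the array layout (one vectorized face per row), the number of retained components k, and all function names are assumptions introduced here.

import numpy as np

def train_eigenfaces(faces, k):
    # faces: (n_samples, n_pixels) array, one vectorized training face per row
    mu = faces.mean(axis=0)
    X = (faces - mu).T                                # columns are centered samples
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    E = U[:, :k]                                      # top-k eigenfaces (columns)
    coeffs = E.T @ X                                  # (k, n_samples) training projections
    return mu, E, coeffs

def nearest_face(image, mu, E, coeffs):
    # Project the new image and return the index of the closest training face.
    w = E.T @ (image - mu)
    distances = np.linalg.norm(coeffs - w[:, None], axis=0)
    return int(np.argmin(distances))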
Pose invariant appearance based eigenface models have been explored in [PMS94], where multiple modular eigenspaces are constructed for faces at varying viewing angles. Input faces are projected into the multiple modular spaces rather than a single all-encompassing eigenspace. Beymer extends the eigenface approach to pose invariant faces by generating synthetic views of an input face using prior information about face pose variations [BP95]. Hallinan shows that lighting variations on frontal face images can be represented in a three dimensional PCA space when the faces are treated as Lambertian surfaces [Hal94]. Subspace representations other than PCA have been used with varying success. In [BHK97] face images under varying lighting conditions are projected into a three dimensional subspace maximizing between-class distance using Fisher linear discriminant analysis (FLDA). The goal is to reduce variations due to non-Lambertian properties of the face surface and self shadowing effects not modeled in Hallinan's face representation. Independent component analysis is a refinement of PCA that attempts to extract statistically independent feature vectors (rather than merely uncorrelated features as with PCA). Applications of independent component analysis (ICA) to facial analysis have had more success than PCA. It is speculated in [BS97] that this is due to analysis of higher order statistics of the data, rather than the second order (pixel pair) statistics derived with PCA. ICA features have been used for face recognition in [Bar98][BS97].

2.1.3 Gesture Analysis

The majority of work on facial gesture analysis deals with recognition of binary gesture state: whether the face is in a neutral configuration or fully active with a given gesture. Though a large body of work is devoted to binary analysis, to use gesture data for animation or to determine expression intensity, a fine level of actuation intensity must be measured. The binary analysis trend arose from manual FACS coding, which also deals primarily with binary action unit assessment due to the difficulty in reliably discriminating between subtly different facial states. Motion data can assist in such coding, but dense optical flow requires high fidelity input imagery and can be computationally intensive. FACS AU parameters are analyzed using Hidden Markov Models (HMM's) trained on dense optical flow data and gradient information in [CZLYW99]. Classification of AU's using ICA was shown to perform better in comparisons with global PCA, and on par with Gabor wavelet and local eigen-features [DBH+99]. Optical flow features with correlation based classification are widely used for facial gesture analysis [CZLYW99] [CLK+98] [LKCL98b] [TKC00] [Mas91] [YD94]. Lien et al. augment dense optical flow (dimensionality reduced using PCA) with coarse facial feature locations and image gradient information (to account for high contrast wrinkle regions). These feature vectors are used to classify 9 FACS AU's (3 upper face AU's and 6 lower face AU's) using Hidden Markov Models. Intensity is estimated by finding the existing training state with highest correlation to a test vector. A label of this intensity is not provided, only the image representative of the intensity.
Furthermore, only full face expression intensity is estimated, not gesture intensity [Lie98], and not in real time. Essa uses optical flow and a finite element model of the face surface to estimate a set of actuation parameters called FACS+ that encapsulates gesture velocity information [Ess95]. Essa shows how image appearances can be correlated to muscle states derived from dynamic simulation of a physical model of the face. In [Hon00], 25 spontaneous gestures are identified. Hong defines a set of component subregions (rectangular sample grids) from which relatively sparse motion samples are acquired and used for classification of binary gestures. Many of the defined gestures are combinations of the independent gestures analyzed in this thesis. For example, Hong considers [left cheek raise + right cheek raise], [left cheek raise], and [right cheek raise] to be three separate gestures, requiring training data explicitly for all three. In contrast, in this thesis, data is acquired for the left and right gestures only, and the combined state is inferred from the combination of independent states. This provides more flexibility to recognize combinations of gestures not present during training.

2.1.4 Universal Expression Analysis

The term expression is often used to mean facial gestures or facial action [LKCL98a] [CLK+98] [LKCL98b] [Lie98]; however, early use of the term referred to a set of prototypical facial expressions (the six universals) defined by Ekman: HAPPINESS, SADNESS, ANGER, FEAR, DISGUST, and SURPRISE. Facial expressions of these emotions were empirically determined to be recognized across cultural boundaries [Ekm93] (hence their universality). The face reflects internal emotions in order to communicate with other humans, and these emotions will rarely fit into one of the restricted categories above. Essa uses peak muscle activations extracted from FACS+ actuation profiles as expression templates. Normalized profile feature vectors are compared to the templates to determine the active expression [EP97]. Neural network approaches to full face expression classification have been explored [FT01] [Zha99]. In particular, Franco utilizes an unsupervised local analysis phase similar to PCA and ICA. An eigenfaces approach is used in [LC01], where multiple samples of fully actuated expressions are used for training. Test images are projected into eigenspace and the minimal distance eigenexpression is selected. Choi et al. present a simple multimodal system [CH98] combining visual and audio features to recognize 6 basic expressions using nearest neighbor classification.

2.2 Performance Driven Facial Animation

Performance driven facial animation describes the act of controlling facial animation with the actions of a live subject. Control systems for PDFA can be loosely divided into direct and indirect methods.

Direct methods

Direct control methods are designed to provide a literal translation between the motion of the performer's face and the motion on the rendered model. Motion capture or motion vector transfer are the two most common examples. Motion capture data is acquired with multi-camera arrays from faces with attached markers or makeup. Stereo image pairs are used to estimate the 3D locations of markers fixed to the face.
The resulting 3D points are used to directly interpolate 3D models in [GGW+98]. Noh transfers 3D motion vectors derived from motion capture data and existing animation to models with different geometry in [NN01]. The LifeFX system uses motion capture data to drive a finite element model of a virtual actor, producing highly realistic results. A major drawback of these systems is the requirement of attaching physical markers to the face and the high cost and setup time for multi-camera acquisition systems. Furthermore, in many cases we are not controlling a likeness of the performer, and thus it may not be desirable to impart her motion onto the character. To solve this problem, indirect control methods have been used.

Indirect methods

Indirect control methods introduce an abstraction layer between the control signal and the animation mechanism. This gives more freedom to an animator, as animation can be driven by the semantics of facial state rather than requiring low-level motion correspondence. In physical/anatomically based methods, this abstraction layer consists of virtual facial muscles, or similar anatomically inspired parameter sets such as FACS action units and MPEG4 facial animation parameters (FAP). The set of facial muscles and their actuations provides an intuitive basis for the human face state-space, and consequently, much of this work has focused on mapping facial motion data to virtual muscles fit to a 3D model. An early muscle system was developed by Waters using the notions of flat sheet and ellipsoidal sphincter muscles. Elasticity properties were simulated by varying parameters of a falloff function controlling the amount of displacement due to muscle contraction [Wat87b]. Terzopoulos introduces a more sophisticated animation generation system, tying the muscles to a three-layer mass-spring based skin model that simulates muscle, bone, and fatty tissue with different spring parameters [TW90]. Though FACS provides a relatively complete parameterization of facial state, the ill-defined boundaries of action units make it a cumbersome and subjective interface for creating and editing facial animation. Each AU defines a 1D activation profile regardless of the number of muscles that affect it. An animator using FACS must then memorize each of the 46 AU's and build an intuition as to how each affects the animation. Muscles are more intuitive, as skin deformation occurs in the direction of, and proportional to, the muscle contraction. The most significant problem facing FACS based PDFA, however, is that no analysis system to date is able to discriminate a comprehensive set of AU's, in real time, or with intensity: all of which are needed for PDFA.

2.3 Facial Analysis for Animation

The drawbacks of 3D motion capture as a control signal for PDFA were discussed above. There are several other methods, however, that do not rely on marker based tracking.

2.3.1 2D Motion Capture and Optical Flow

In [FNK+99] a Gabor wavelet based tracking system developed by Eyematic Inc. is used to track 16 2D feature points. These points are projected onto a 3D model of the performer to estimate the 3D locations, then used to interpolate the geometry to generate animation. There are relatively few reliably tracked feature points, and hence
each point carries significant weight in the final animation. Wrinkle states are classified on the performer's face and triggered as dynamic textures on the animated model. Feature locations are mapped to MPEG4 facial animation parameters (FAP) in [EG98]. FAP's are similar to FACS action units in that several 1D parameterized local deformation units are identified on the face. A model based coding approach to animation is taken by combining FAP estimation with traditional block based coding of head and shoulder video sequences. Dynamic facial features such as wrinkles and eye blinking are block coded at each frame. A feedback loop is used to estimate FAP's to achieve a higher final signal-to-noise ratio. The process is offline (10 seconds per frame), and not suited for real-time application. In [EBDP96] the control theoretic FACS+ analysis/synthesis framework was applied to facial animation. Dense optical flow features are extracted from video and drive a finite element face model. Real time animation is performed by storing the control parameters extracted using optical flow features and indexing them using an intensity image of the face at key configurations. Intensity correlation is performed to retrieve the key image (and its stored control parameters) closest to the current video frame. In similar work by Choe, muscle parameters are extracted from sparse feature point flow and used to compose muscle basis geometry for 3D character animation [CK01]. A 3D model of the performer is created initially to assist in tracking, and an optimization phase is required to accurately fine-tune the muscle basis models.

2.3.2 Motion Processing for Animator Control

Because of the literal nature of motion capture, an animator is left with very little freedom to customize the resulting animation. Some work has attempted to get around this problem by introducing a processing layer between the motion capture data and the final animation. A technique called motion signal processing [BW95] was designed for body motion data, but experiments on face motion data were also performed [Noh02]. Animation data is decomposed into several frequency bands and the animator is given control over band weights. It is difficult to predict the specific results of adjusting the frequency parameters. These techniques are better suited when combined with other more intuitive control methods, such as muscles, to give the animator more flexibility. The EMOTE model was developed with similar goals using Laban movement analysis as a core set of features, but provides a more intuitive interface based on emotional output [CCZB00].

2.3.3 Feature Templates

Feature template based systems track the motion/deformation of geometric templates across frames of video. Parameterized feature templates are an improvement over feature points, as they contain more intuitive semantic information about the deformation process. In addition, if the templates are deformable, an accurate estimation of the contour of each feature provides a richer feature set than simply positional information. Unfortunately, this richness is not without its costs. Deformable templates typically involve an energy minimization over the curve parameters and image gradient or intensity. Best results are achieved with high resolution imagery, and these methods do not run in real time.
In [BFJ+00] ten position based facial parameters are extracted from input video and used to index a database of 2D hand drawn image samples for PDFA. A feature vector is associated with each sample in a preparation phase. A new input feature vector is projected into the 2D Delaunay triangulated feature space and modeled as a linear combination of the vertices of the containing triangle. Image morphing is performed between samples using the resulting coefficients. As there is no explicit parameterization of facial state and the extracted features do not have well-defined semantics, the preparation phase requires ad hoc association of features to samples. Energy minimizing curves termed snakes are specified on contours along the eyebrows and lips to drive the mass-spring muscle system of a synthetic character [TW93]. Facial makeup must be worn by the performer to enhance the contrast of contours for accurate tracking. As tracking requires iterative minimization of an error functional over the entire curve, real-time performance is not possible. Deformable templates derived by connecting MPEG4 facial definition parameters with 2D B-spline curves are tracked using energy minimization techniques similar to snakes [MP01].

2.3.4 Data Driven Approaches for PDFA

Data driven control/synthesis is an emerging sub-field of animation where statistics of real face performance are used to augment or drive existing animation. The majority of this work focuses on correlations between speech and facial gestures. Emotion based cartoon animation is driven entirely from speech in [LYX+01]. Training data of emotional speech is acquired and used to train a support vector machine classifier. Speech and emotion are decoupled from the input stream and used to drive both components of animation separately. Lip configurations and expression templates are composited together to produce the final animation frame. In [Bra99] statistical analysis is performed on synchronized facial motion capture data and digitized speech. Facial animation (2D and 3D) is produced entirely from audio input. The statistics of eyeball rotation from training data are used to improve the quality of speaker animation in [LBB02]. Image based control techniques have not been explored extensively for PDFA, largely due to the fact that existing facial analysis work is not well suited to animation. Full-face image analysis of expressions is used to interpolate base 3D models in [LC01] by correlating input frames to a small training set. The set of analyzed expressions is very small and thus subtleties of facial gestures are lost in the animation. This problem is addressed in the CoArt system [FN02] presented in Chapter 5. Freedom is given to the animator by allowing arbitrary image samples to be constructed and driven automatically with region analysis data, while the image samples are still constrained with anatomical semantics.

2.4 Summary of Benefits and Novelty of the Approach

The approach to PDFA control in this thesis differs from existing work in the following ways:

Facial analysis approach: Instead of simply using motion capture or feature point motion, a facial analysis inspired approach to animation is taken, whereby semantic information is extracted from face images. This semantic information is used to drive an animated character.
Hierarchical control: The description of facial state starts with low level semantic features (gestures) and builds up to full face expressions, making this information available to the animation on a frame-by-frame basis.

Intuitive parameters: Gestures/muscles are intuitive parameters and enable further processing. Adjusting parameters at the gesture level is much more intuitive than at the feature or motion capture point level, where several features must be manually coordinated to achieve a desired result.

The analysis approach also differs significantly from existing facial analysis work:

Region based: Most existing work either acts on full face images or on uniform rectangular windows on the face. G-Fold analysis, however, is region based, with the regions accounting for the area of influence of a set of contracting muscles. This constrains the information contained in the image samples and limits unwanted cross-talk between sample regions.

Gesture intensity: Most gesture analysis work detects whether a small set of gestures is on or off. Instead, this work detects very fine levels of gesture intensity, which is crucial for animation and subtle expression detection.

Person specific: Instead of generalizing the analysis to unseen individuals, measurements are made on person specific gesture data. This is particularly useful for PDFA applications, where a training database for the performer can be acquired.

Non-task-specific: Existing methods are narrow in scope, focusing on only a small set of full face expressions or gestures. Intensity estimation of gestures enables a much wider set of applications, including PDFA and expression intensity. Furthermore, binary gesture analysis can also be trivially performed by simply quantizing the intensity measurement.

Static images: As optical flow and feature points require a time ordered sequence of images, methods using these features cannot judge expression or gesture state from single static images. Though G-Fold analysis requires image sequences for training, facial state can be judged from a single snapshot of facial appearance.

Chapter 3
Co-Articulation Regions

3.1 Facial Gesture Set and Partitioning Motivation

Of the approximately 200 muscles in the human head, there are 26 facial muscles primarily responsible for communication. Many of these are, for the most part, co-articulated (contracted in unison for a given gesture), and hence may be lumped together into fewer muscle groups. From this set 15 muscles are selected for appearance analysis. While these muscles are not claimed to be a complete basis for facial state, they span a large portion of expressive space. The skin changes induced by these muscle contractions are key indicators of emotion and responsible for gestural communication [Fai90]. Work by Bassili has shown that there are key facial motions that indicate emotion [Bas79]. The gestures inducing these motions motivated, in part, the selection of gestures. As emotional gestures are the primary focus of this work, muscles such as the masseter, temporalis, and medial pterygoid, which are responsible for motion of the mandible, have been left out.
Figure 3.1: (left) List of analyzed muscle groups and their locations and directions of contraction: 0 Frontalis L, 1 Frontalis C, 2 Frontalis R, 3 Corrugator, 4 Orbicularis Oculi L, 5 Orbicularis Oculi R, 6 Levator Palpebrae L, 7 Levator Palpebrae R, 8 Levator Nasii, 9 Zygomatic Major L, 10 Zygomatic Major R, 11 Risorius L, 12 Risorius R, 13 Triangularis L, 14 Triangularis R, 15 Mentalis. (right) Labeled co-articulation regions and the list of contributing muscles for each region.

Because of the nonlinear deformation of the skin resulting from these muscle contractions, the number of DOF's in a full face image sample is prohibitively high. In the first phase of dimensionality reduction, the face is partitioned into a set of regions such that the number of muscle degrees of freedom of each region is constrained.

3.2 Co-articulation Regions

Each facial muscle group listed in Figure 3.1 is capable of contracting independently and causing secondary motion of the skin in a continuous and local area on the face. This area is defined as the muscle's region of influence (ROI). The state of each muscle is parameterized by a contraction value between 0 (relaxed) and 1 (fully contracted). When a muscle is actuated independently, the changes that are propagated to the skin surface are local and fully determined by the muscle's level of contraction. When two or more muscles have an overlapping region of influence, the resulting skin change is a combination of the effects of the involved muscles. This fact is exploited, and the changes occurring on the surface of the face are partitioned into a set of contiguous local regions of skin deformation called co-articulation regions (CR).

Figure 3.2: (top) Expressions designed to actuate the set of defined muscles. (bottom) Difference images between gesture and neutral face images that assist in identifying co-articulation regions.

A CR is analogous to a FACS AU but is more specifically defined as a nonempty intersection of n muscle regions of influence. The special case of a CR where n = 1 is called an Independent Region (IR). The activation level of a CR is defined by the n-tuple of activation levels of each muscle contributing to the region's deformation. (In contrast, a FACS AU defines a 1D activation level for each AU [EF78].) Each CR defines the state space for a local region on the face whose dimensionality is given by the number of muscles acting on the region. This CR subspace parameterization defines a mapping from a point in CR appearance space (the space defined by all possible image samples in a given CR) to a point in muscle space. For each muscle there is a related facial gesture that exercises the muscle (Figure 3.2). The ROI of each gesture is computed by subtracting the maximally actuated gesture image from a neutral expression image. The ROI boundary is identified manually around the area(s) exhibiting large variations in the difference image. ROI's with significant overlap are merged, and the resulting region boundaries define the nine CR's.
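A minimal sketch of the difference-image step described above is given below (Python/NumPy, with SciPy for smoothing). The smoothing width and the relative threshold are hypothetical tuning values, and the returned mask is only a visualization aid: as stated above, the actual ROI boundary is drawn manually around the high-variation area.

import numpy as np
from scipy.ndimage import gaussian_filter

def roi_candidate_mask(neutral, apex, sigma=2.0, frac=0.25):
    # neutral, apex: grayscale images of the pose-normalized face (same size).
    # Returns a binary mask highlighting pixels that change strongly between
    # the neutral frame and the maximally actuated gesture frame.
    diff = np.abs(apex.astype(float) - neutral.astype(float))
    diff = gaussian_filter(diff, sigma)      # suppress high-frequency sensor noise
    return diff > frac * diff.max()          # keep only the large-variation area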
A specific example of CR state space parameterization is illustrated in Figure 3.3.

Figure 3.3: State space defined for three co-articulation regions in the forehead area. The two inward pointing muscles are the corrugator and are always contracted in unison, hence the 2D parameterization of this CR.

3.2.1 The Canonical CR Template

The co-articulation regions define the canonical CR template (CCRT) in image space. A sampling region is defined for each CR by a rectangular binary mask, with nonzero pixels delineating membership, and its origin in image space. Each person's face is geometrically fit to the CCRT by methods described in Chapter 4. Figure 3.4 shows the 10 subjects used for testing the various methods in this thesis.

Figure 3.4: Identification numbers and images of subjects analyzed.

3.3 Gesture Sample Acquisition

Labeled data is required to model the appearance changes in the co-articulation regions. Ideally a uniform sampling of the entire CR state space could be achieved; however, it is difficult to acquire this data from a human subject. This would require the subject to contract their muscles precisely and be conscious of each amount of contraction, but most people do not have such fine (conscious) control over their facial muscles. Most subjects can, however, produce the gestures in Figure 3.2 with a little practice (the same gestures used to define the co-articulation regions). These gestures correspond to the basis actuations of a given CR, and by controlling the gesture from neutral to full contraction, the frames can be labeled with respect to their relative actuation levels for a single actuation. The subject is instructed to perform each gesture starting from a neutral expression, holding at full actuation, and releasing back to neutral, repeated six times. These gestures are recorded to a 320x240 pixel image sequence. The recorded gestures are broken into three segments: Positive, Apex, and Negative (Figure 3.5).

Figure 3.5: A gesture image sequence is broken into three segments (Positive-Apex-Negative) for training. Neutral images are independently labeled.

The positive segment spans the first non-neutral frame indicating gesture motion to the frame preceding maximum actuation. The apex segment consists of all frames of maximal actuation. The negative segment spans the first frame after the apex to the frame preceding the final neutral state. The video frame rate and actuation velocity determine the number of frames in a given segment, and hence the quantization resolution of the state space. Figure 3.6 shows the gestures and the muscles for which they provide appearance information.

Figure 3.6: Spontaneous gestures and the muscles for which they provide data: 1. Forehead Raise (0, 1, 2); 2. Eye Squint (4, 5); 3. Eye Blink (6, 7); 4. Smile (9, 10); 5. Frown (13, 14); 6. Cheek Pinch (11, 12); 7. Brow Furrow (3); 8. Chin Crinkle (15); 9. Snarl (8).

Labeling is performed separately for each CR to account for potentially asymmetric actuation of gestures. For each muscle affecting a given CR, a set of muscle actuation samples is extracted from the positive segment of the corresponding gesture. (For binary classification, neutral and apex frames are selected.) All samples are masked by the warped CR template, and only the pixels in the bounding box around the CR are retained. Each sample is transformed by horizontal scan to a column vector, and the vectors are concatenated to form the muscle data matrix, whose number of columns k depends on gesture velocity and can vary for different muscles. The data matrix for CR n is formed by concatenating muscle matrices according to the region-to-muscle mapping. Each column is implicitly labeled with respect to the gesture and intensity it represents. Intensity labeling will be discussed in subsequent chapters. A pictorial representation of the data matrix for CR6 is shown in Figure 3.7, where it contains the samples from each muscle gesture ordered by their temporal progression.

Figure 3.7: Gesture samples assembled into the data matrix used for training CR6. Actual samples are vectorized and comprise the columns of the matrix.

The data matrices defined above are used for gesture training in subsequent chapters.
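The following sketch illustrates how the per-muscle and per-CR data matrices described in this section might be assembled (Python/NumPy). The function names, the bounding-box representation, and the implicit intensity labels (j + 1)/k for the j-th positive-segment frame are assumptions for illustration, not the exact bookkeeping used in the system.

import numpy as np

def muscle_matrix(frames, cr_mask, bbox):
    # frames: positive-segment images for one gesture, ordered neutral -> apex
    # cr_mask: binary CR template mask (same size as the images)
    # bbox: (r0, r1, c0, c1) bounding box of the CR in template coordinates
    r0, r1, c0, c1 = bbox
    cols = []
    for img in frames:
        sample = img * cr_mask                         # zero out pixels outside the CR
        cols.append(sample[r0:r1, c0:c1].reshape(-1))  # horizontal-scan column vector
    return np.stack(cols, axis=1)                      # (pixels, k); column j ~ intensity (j+1)/k

def cr_data_matrix(gesture_frames_per_muscle, cr_mask, bbox):
    # Concatenate the muscle matrices for every muscle acting on this CR,
    # in the order given by the region-to-muscle mapping.
    mats = [muscle_matrix(f, cr_mask, bbox) for f in gesture_frames_per_muscle]
    return np.concatenate(mats, axis=1)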
Chapter 4
Normalization

4.1 Overview

This work is focused on analyzing the type and intensity of facial gestures from image data. Under ideal conditions, the only observed variations in the data would be indications of this intensity; however, this is far from the case. Geometric variations between subjects, sessional variations, and sensor noise all contribute to the final sample appearance yet have nothing to do with gesture intensity. These variations cloud the structure of the sample space, making it much more difficult to accurately classify a new sample. The role of normalization, therefore, is to remove or reduce variations in the image samples that are unrelated to the desired measurement. To do this effectively, the larger space in which the samples live must be defined in terms of the possible DOF's. The analysis may then focus on a smaller set of DOF's by removing, reducing, or controlling the unwanted DOF's. The identifiable sources of variation in image based facial analysis are listed below, each of which can introduce several possibly unknown DOF's.

• Lighting
• Expression
• Static Sessional Variations
  - Facial Hair
  - Makeup
  - Hair Style
• Dynamic Sessional Variations
  - Object Occlusion
  - Dynamic Lighting
  - Head Pose
  - Facial Proportions (Geometry)
• Sensor Noise

4.2 Reducing the DOF's

In this work, a few assumptions are made about the environment which eliminate some of the above variations. First, constant lighting conditions are assumed, meaning that light sources are not moved or changed either during or between training and performance sessions. (Practically, this amounts to constructing a controlled analysis environment.) Second, hair style is assumed not to occlude facial features and is negated by masking out all pixels outside of a pre-defined face region. (This is actually accomplished indirectly through CR sampling.) And third, we assume that occlusions will not occur during a performance (or that analysis will suspend temporarily in the case of occlusion, say, by the subject rubbing his/her eye). The remaining error sources must be accounted for by processing the data.
Sensor noise is likely to be negligible; however, each image is filtered with a Gaussian kernel to eliminate high frequency, low amplitude variations. Though the interest is in person specific performance analysis, and therefore the variability between different people's faces does not need to be normalized, geometric normalization is still required to account for the variability between each subject and the geometry of the CR template. Geometric normalization and head pose are more difficult to deal with, and hence will be discussed in more detail.

4.3 Head Pose

Head pose correction of facial imagery from a single view is a difficult (and under-constrained) problem with a large body of literature. There are several techniques to model/correct for head pose from a single image [OAvdM00]; however, these are very involved systems requiring extensive prior information about faces to be learned by the system. It is not the goal of this thesis to solve the larger pose problem. Two simple (and admittedly restrictive) solutions are employed to remove pose variations, allowing gesture analysis to be performed. The first involves requiring the user to don a pair of eyeglass frames with infrared LED's attached (Figure 4.1). The LED's are tracked and the image is corrected to remove the affine transformations of the head. The LED tracker was part of the CoArt system presented at Computer Animation 2002 [FN02].

4.3.1 LED Tracker

If at least three rigid points on the face can be identified and tracked across adjacent frames, the affine transformation to a canonical pose can be computed and the image corrected. The accuracy of this transformation relies entirely on the feature tracking. Several feature-tracking methods exist [NY98][TK91]. However, experiments using these methods to track features on the face showed large feature point drifts due to sensitivity to lighting conditions, rapid head motion, face deformation, and the low resolution and poor quality of acquired images. This problem is exacerbated by the fact that there really are no truly rigid points on the face. Even the eye corners deform during squinting. (The nose tip is arguably rigid, but this is very hard to find exactly.) To obtain a balance of accuracy, simplicity, computational efficiency, and subject freedom, a pair of easily detectable and trackable eyeglass frames outfitted with a quadrature of infrared LED's was created (Figure 4.1).

Figure 4.1: LED eyeglasses used for head tracking.

The four LED's attached to the glasses appear as high intensity spots in a grayscale image that are invariant to environmental lighting conditions. The image is thresholded and binarized, and a set of potential LED clusters is isolated in the image by collecting connected regions. The heuristic filters below are applied to the set to filter out outlier clusters.

1. Each LED is covered with a translucent spherical light diffuser, making the projection on the image plane approximately circular.

2. All 4 LED clusters will be approximately equal in area given the small distance between the LED's.
The circularity of a cluster is described by its compactness descriptor [GW93]. If p is the cluster’s perimeter and a is its area, the compactness c is: which is minimum for circular clusters. The area range is determined empirically. In practice for a working area of one to three feet from the camera, the LED projection lies within .13% and .8% of the image size. Using these simple heuristics the LED’s are automatically detected in most cases. In problematic cases (for example, if the actor is wearing a white spotted shirt) the LED’s can be manually identified in the first frame. LED tracking can be performed by re-application of the detection algorithm, but to reduce the computation time involved in re-computing all connected regions, a simple prediction based tracking method is used. A new LED position is predicted using the current velocity estimate of its centroid. The four predicted locations are tested for the presence of clusters meeting the heuristic criteria above. If they are not met, the detection algorithm is reapplied. 4 .3 .2 P o se n o rm a liza tio n w ith th e L E D tracker Given the 2D pixel coordinates of each LED and a set of reference coordinates the trans­ formation between the two coordinate frames is modeled as a bilinear distortion [GW93]. The transformed coordinates x’ and y’ are: 39 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure 4.2: Rigid camera hat used to fix subjects head pose relative to the camera. x' = c\x + c2y + c3xy + C 4 y' = c5x + c6y + c7xy + c8 (4.2) (4.3) Given the four pairs of corresponding LED positions, we have a set of 8 linear equations that allows us to solve for the coefficients (ci... c8). The warp equations are then applied to each pixel in the source image and the final normalized image is produced by bilinear interpolation. The bilinear model accurately corrects for head translation and rotation in the image plane. The errors increase significantly for large out-of-plane rotations. 40 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4 .3 .3 C am era H a t While the LED tracking works suitably for an online system with limited pose range, very high accuracy pose correction is required for training and testing the system. The easiest way to achieve this is to lock the subjects head in place, but this can lead to unnatural facial gestures. As an alternative, a simple head mounted camera was constructed and aimed at the subject’s face (Figure 4.2). The camera frame is held fixed relative to the head frame. Using this device it is assumed th at head pose is unchanged for the duration of the session. The relative pose differences between the canonical frame and the identified frame are accounted for in the geometric normalization process discussed next. 4.4 Geometric Norm alization Recall that the CCRT is defined for a generic individual. Because human faces all have different geometric proportions, the input face must first be warped into a canonical geometric frame consistent with the CCRT. In this effort, gesture analysis of unknown individuals is not the focus, therefore the criteria by which geometric normalization is evaluated is somewhat more relaxed. The only requirement is that the appearance of gestures belonging to a given CR is contained within the corresponding CR template after warping. Features within the CR’s need not align. 
4.4 Geometric Normalization

Recall that the CCRT is defined for a generic individual. Because human faces all have different geometric proportions, the input face must first be warped into a canonical geometric frame consistent with the CCRT. In this effort, gesture analysis of unknown individuals is not the focus; therefore the criteria by which geometric normalization is evaluated are somewhat more relaxed. The only requirement is that the appearance of gestures belonging to a given CR is contained within the corresponding CR template after warping. Features within the CR's need not align. For example, after warping, it is expected that the left half of the mouth and its surrounding area lies mostly within the boundary of its CR template. Exact containment is not essential, as long as a sufficient amount of the gesture region lies within the boundary. To achieve this, a set of feature points is defined on a 2D face template. The corresponding feature locations are identified in a pose normalized neutral expression image, allowing us to define a mapping between the different face geometries using a sparse set of feature points. The points not coinciding with the feature points are mapped using Radial Basis Function interpolation (discussed later). This mapping is used to warp the CR template, initially defined on a real face, to the generic face. The warped CR template becomes the canonical reference template.

4.4.1 Feature Point Selection

Forty-one feature locations are chosen that can be easily and reliably identified on the face (Figure 4.3). Many of these points duplicate those identified by Farkas [Far94] and in the MPEG4 specification [PE02]. In Farkas, the eyebrows have little physiological significance, and therefore are not valid features to measure geometric facial similarity. Instead, physical ridges in the bone structure are used. Unfortunately, as the skin tissue smooths out the appearance of the ridges, many of these points require physical contact with the subject to identify reliably, which we do not have the luxury of doing on a frame-by-frame basis. Furthermore, eyebrow motion is clearly significant in determining upper face expression and intensity. As this work is not interested in cross-subject geometric normalization, the eyebrow geometry is included in the feature correspondence. It is difficult to identify feature correspondences along the boundary of the face, but these are necessary to contain the extents of the face within the CR template. Boundary points are therefore defined in terms of the intersections of the boundary with lines defined by pairs of reliably identified feature points.

Figure 4.3: Labeled feature points identified on the subject's face. Circled points are those that can be identified with high reliability across different subjects. Circled points are used to define the boundary points.

For example, referring to Figure 4.3, we see
While optimization of the parameters (£, A , a) is possible, the problem can be simplified for the purposes of inter­ polation by enforcing £, = x\. thereby centering a basis function at each data point. By fixing A = 1 we get the pure radial sum case of the RBF: N S (x) = ^ 2 ai9(\\x-Xi\\) (4.5) i= 1 which can be solved for a by least squares given training data in the form of input/output pairs (x,S(x)). The RBF formulation is used for geometric warping of images as in [ADRY94] by defining a set of input (source) and output (target) anchor points in the image space 44 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Target Control Points {Template} Source Image and Control Points Figure 4.4: RBF image warping from source image to match the geometry of the feature point targets defined on the generic face. and evaluating the remaining image locations x using equation 4.5 with the weights a computed as above. This defines a mapping from the regular grid of points in the image plane to the warped space. Using the new warped coordinates, the input image is resampled to compute the new image intensities corresponding to the warped geometry. The resampling process is computationally expensive due the intensity interpolation necessary at each sampling location. Instead the image is applied as texture to an underlying geometric grid with a vertex defined for each pixel in the image. After applying the warp to the vertices, graphics hardware is used to render the warped image with a significant speedup over software warping. 45 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The choice of basis function can have a dramatic effect on the interpolation results. If a function with too broad of a support is chosen, distant points may affect each other which is undesirable for faces where deformation is localized. Very local support is problematic as well, as there may be gaps between feature points on the face that have little or no basis support: as is the case with poorly scaled Gaussians. were found using multiquadric functions of the form g(x) — 46 The best qualitative results ( x 2 + c2)q. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 5 CoArt 5.1 Unsupervised Statistical M ethods for CR Analysis The methods discussed in Chapter 4 can be thought of as a kind of user-guided normaliza­ tion. The degrees of freedom irrelevant to the analysis are identified and removed either by controlling the environmental conditions under which the data is acquired (lighting, occlusions, etc), controlling the data generation process (choosing the training gestures and controlling their actuation), or post-processing of the data (geometric normalization, difference imaging, etc). It may be possible that the data contains additional redundan­ cies or irrelevant D O F’s that can not be qualitatively identified. For this purpose, there exist many dimensionality reduction techniques which allow us to uncover such redun­ dancies and characterize the effect of removing them. PCA is one such unsupervised technique. 47 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Though not typically a dimensionality reduction technique, independent component analysis (ICA) is a refinement of PCA that finds statistically independent linear compo­ nents of non-Gaussian data. 
Chapter 5
CoArt

5.1 Unsupervised Statistical Methods for CR Analysis

The methods discussed in Chapter 4 can be thought of as a kind of user-guided normalization. The degrees of freedom irrelevant to the analysis are identified and removed either by controlling the environmental conditions under which the data is acquired (lighting, occlusions, etc.), controlling the data generation process (choosing the training gestures and controlling their actuation), or post-processing the data (geometric normalization, difference imaging, etc.). It may be possible that the data contains additional redundancies or irrelevant DOF's that cannot be qualitatively identified. For this purpose, there exist many dimensionality reduction techniques which allow us to uncover such redundancies and characterize the effect of removing them. PCA is one such unsupervised technique. Though not typically a dimensionality reduction technique, independent component analysis (ICA) is a refinement of PCA that finds statistically independent linear components of non-Gaussian data. This can result in insight into the data generation process and provide features better suited for classification. This chapter describes the underlying methods used in preliminary studies of CR appearance changes leading up to G-Folds.

5.1.1 Principal Component Analysis (PCA) and Eigenfaces

PCA is used in the eigenfaces approach for holistic facial recognition. The principal axes of a set of face images are the orthogonal eigenvectors of the mean centered covariance matrix of the data. The principal axis corresponding to the largest eigenvalue lies in the direction of highest variance, and subsequent axes (in decreasing order of the corresponding eigenvalues) lie in directions of decreasing variance. The covariance matrix for data centered about the mean \mu is given by:

C = cov(X) = \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T    (5.1)

The eigenvectors of C can be computed using singular value decomposition (SVD) of the centered data matrix \tilde{X} = [x_0 - \mu, x_1 - \mu, ..., x_k - \mu]:

\tilde{X} = U \Lambda V^T    (5.2)

where \Lambda is a diagonal matrix of singular values (the square roots of the eigenvalues of C), U and V^T are orthogonal matrices, and the columns of U are the eigenvectors of C. Figure 5.1 shows an example of the axes found by PCA on a hypothetical data set.

Figure 5.1: Example of projection axes computed using PCA on a hypothetical data set.

The original data is projected onto the new (rotated) PCA basis. Frequently there will be some number of zero (or small) eigenvalues. The corresponding eigenvectors point in directions of small variance, and are often interpreted as less important projection directions. The dimensionality of the original data set can be reduced by eliminating these eigenvectors and projecting into the lower dimensional subspace. Due to the success of the eigenfaces technique for face recognition, this approach was used for the initial analysis of the co-articulation regions [FN02].

5.1.1.1 Eigenanalysis of Co-articulation Region Data

As described in Chapter 3, a data matrix can be constructed for CR_i by concatenating the gesture sample matrices X_j belonging to region i:

\Gamma = horzcat[X_j], j \in muscleset_i    (5.3)

For the remainder of this chapter, the CR index i will be dropped, as the same procedure is applied to each region independently. The mean sample \mu = (1/N) \sum_{j=1}^{N} \Gamma_j is computed and the data is centered, \tilde{\Gamma}_j = \Gamma_j - \mu, j = 1 ... N. Using the SVD as above, the eigenvectors of the covariance matrix C = \tilde{\Gamma} \tilde{\Gamma}^T are computed. Let E be the matrix of eigenvectors with all eigenvectors corresponding to zero eigenvalues removed. Denote the original data projected onto the lower dimensional subspace \Phi^{PCA} = E^T \tilde{\Gamma}. These projections are termed muscle signatures, as they are transformed representations of the original gesture appearance samples. In the simple case, each training sample is treated as its own class center and a new sample x (extracted from region i) is classified by selecting the class with minimal Euclidean distance to the new sample's muscle signature \Phi^{PCA}(x) = E^T x:

c(x) = \arg\min_j \| \Phi^{PCA}(x) - \Phi_j^{PCA} \|    (5.4)

The muscle m to which \Phi_{c(x)}^{PCA} belongs is identified by construction of the CR data matrix. The class index is mapped to gesture intensity by dividing c(x) by the number of samples in the muscle set m.
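The per-CR training and classification of Section 5.1.1.1 can be sketched as follows (Python/NumPy). Variable names, the number of retained eigenvectors, and the (muscle, within-muscle index) bookkeeping are assumptions for illustration; intensity is mapped exactly as described above, by the sample's position within its muscle set.

import numpy as np

def train_cr_pca(G, muscle_counts, k):
    # G: (pixels, N) CR data matrix built by concatenating the muscle matrices.
    mu = G.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(G - mu, full_matrices=False)
    E = U[:, :k]                                   # retained eigenvectors
    sigs = E.T @ (G - mu)                          # muscle signatures of the training data
    labels = [(m, j) for m, n in enumerate(muscle_counts) for j in range(n)]
    return mu, E, sigs, labels

def classify_cr(x, mu, E, sigs, labels, muscle_counts):
    # Nearest training signature; intensity = position within the muscle's sample set.
    phi = E.T @ (x[:, None] - mu)
    c = int(np.argmin(np.linalg.norm(sigs - phi, axis=0)))
    muscle, j = labels[c]
    return muscle, (j + 1) / muscle_counts[muscle]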
5.1.1.2 Issues with PCA for Co-articulation Region Analysis

While PCA is generally applicable to dimensionality reduction and has optimal reconstruction properties [TK99], it is not always best suited to tasks involving classification. This is illustrated in Figure 5.2. If the two classes, which are clearly separable in the original space, are projected onto the direction of maximal variance, the projections will entirely obscure the class structure. In contrast, projection on the second PC will retain the structure while simultaneously reducing the dimensionality of the data.

Figure 5.2: Problems with PCA for classification purposes. Projection on the first principal component completely obscures the class structure.

For data classification an ideal transformation would maximize the distance between data in distinct classes. One possibility is to perform iterative selection to find which of the orthogonal projection axes maximize class distinction (i.e., by testing all possible sets of projection axes). A more elegant solution is to apply techniques designed specifically to separate classes. Fisher linear discriminant analysis (FLDA) [BHK97] uses the concept of intra- and inter-class scatter matrices to characterize the tightness of data in a class and the distance between classes, respectively. The FLDA formulation derives a projection from R^n to a lower dimensional space that attempts to minimize intra-class scatter while maximizing inter-class scatter. While theory suggests that there should be an improvement across the board, FLDA applied to face recognition has had varied results. Several examples report poor results from FLDA when used as a direct replacement for PCA in eigenface type systems such as [TP91]. The first comparison of the two methods by [BHK97] showed improved results with FLDA applied to the Yale face database with synthetic perturbations applied to extend the database. But the results suggest that FLDA is more sensitive to the number of training images per class. In the preliminary CR analysis case, each intensity sample represents a class in itself. The labeling problem discussed in the previous chapter makes it difficult to acquire multiple samples per class. While subsequent chapters will describe a better method for more accurate clustering of gesture data, another unsupervised learning approach was explored, independent component analysis, which has been shown to perform better on face data without additional supervision.

5.1.2 Independent Component Analysis

Like PCA, ICA is a linear transformation of the input data. However, where PCA derives an orthogonal basis from second order statistics of the data (the covariance matrix), ICA uses higher order statistics of the data to compute a non-orthogonal basis. While ICA is not intrinsically better for classification, Donato [DBH+99] has demonstrated empirically that features extracted from face images using ICA are better for classification of FACS action units than PCA features.
The no-noise case of ICA is expressed as:

X = A S    (5.5)

where X is a matrix with the observed data vectors along the columns (often called the signals), S is a matrix of statistically independent data vectors (the sources), and A is a mixing matrix. The observed signals are thus expressed as a linear combination of some unknown independent sources. Neither the source nor mixing matrices are known, yet with the assumption of statistically independent sources, they can be estimated with various ICA algorithms. Several ICA algorithms exist [Hyv99][BS95]; however, most have a choice of nonlinear transfer function as a free parameter. This parameter must be adjusted according to the distribution of the sources. This distribution is not always known (and in this thesis, it is not), so the nonlinearity parameters must be chosen empirically. The FastICA algorithm reportedly does not require such tuning and the default tanh nonlinearity is sufficient [Hyv99].

5.1.2.1 Co-articulation Region Analysis using ICA

The data matrix X is the same as in the PCA case: the ordered set of muscle appearance samples for a given CR. The independent sources are taken as a set of basis vectors B for CR appearance space. This follows the formulation used by Donato et al. for full face analysis of FACS Action Units. These basis vectors should resemble the eigenvectors computed in the PCA case, but exhibit a more local structure due to their statistical independence. The matrix A supplies the coefficients of representation in the appearance basis for each training sample in X. A new sample x can be represented in CR signature space by solving for the coefficients c in

x = B · c    (5.6)

As this is, in practice, an overdetermined set of linear equations, c is estimated by least squares, minimizing the residual r = B · c - x. The muscle signature of x is the normalized coefficient vector

\Phi^{ICA}(x) = c / \|c\|    (5.7)

Similarly, the muscle signatures of the training data samples \Phi_j^{ICA} are computed by normalizing the columns of A. Classification of the new muscle signature proceeds as in the PCA case:

c(x) = \arg\min_j \| \Phi^{ICA}(x) - \Phi_j^{ICA} \|    (5.8)

Because the basis matrix B is very large, iterative least squares methods are prohibitively expensive for real-time, online signature extraction. Instead, during training, the SVD back-substitution matrix M is computed, which guarantees the least squares solution to (5.6). If the SVD of the non-square matrix B is given by B = U D V^T, the pseudo-inverse of B that yields the least squares solution to (5.6) is

M = V D^{-1} U^T    (5.9)

For each new sample vector x, c = M · x is computed. M is constant following the ICA basis computation and may therefore be assembled once and stored for efficient online signature extraction.
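The online signature extraction of equations (5.6) through (5.9) reduces to a single matrix multiply once M is precomputed, as in the sketch below (Python/NumPy). The L2 normalization of the coefficient vector and the function names are assumptions; the ICA basis B itself would be estimated beforehand (for example with FastICA) and is taken as given here.

import numpy as np

def back_substitution_matrix(B):
    # M = V D^{-1} U^T, the pseudo-inverse giving the least-squares solution
    # of x = B c (eqs. 5.6 and 5.9).  B: (pixels, n_basis), assumed full column rank.
    U, d, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt.T @ np.diag(1.0 / d) @ U.T

def muscle_signature(x, M):
    # Normalized coefficient vector of a new CR sample x (eq. 5.7).
    c = M @ x
    return c / np.linalg.norm(c)

def classify_signature(phi, train_sigs):
    # Nearest training signature (eq. 5.8); train_sigs: (n_basis, N) columns.
    return int(np.argmin(np.linalg.norm(train_sigs - phi[:, None], axis=0)))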
5.1.3 Discussion of Methodology

It is important to note the structure of the data that this assignment assumes. Ideally, a new incoming sample from a CR would be analyzed and its exact location in CR state space determined, thereby revealing the contributions from each muscle basis vector. This turns out to be extremely difficult, as the resulting visual mixture is a nonlinear combination of skin appearances.

The problem is simplified by assuming there is a single active muscle in a CR at any point in time, but this does not account for the co-articulation ability of muscles in the CR. For example, in the expression shown in Figure 5.3 the center frontalis and corrugator muscle groups are mutually actuated. A possible solution to this problem is to treat the co-articulation gestures as pseudo-muscles by acquiring these samples in training and assigning the bounded gesture frames to a new basis vector of the CR. Classification would then proceed as already described. A drawback of this approach is that it puts extra burden on the actor during the training data acquisition phase and limits classification of co-articulation effects to those seen during training. This is the approach taken in [Bar98][Hon00][CZLYW99].

5.1.3.1 Nonlinearity of Gesture Profiles

The model assumes that there is a linear relationship between the temporal position of a sample in a gesture sequence and gesture intensity. If the gesture actuation velocity changes during muscle contraction, this intensity model fails. The GPR will be presented in Chapter 6 as a method to remove the dependence of sample labels on time.

By assigning gesture intensity to samples according to their temporal progression, a linear relationship is assumed between time (the image sample rate) and muscle contraction. Spontaneous gestures are not linear, as indicated in the literature [Ess95], but our subjects are coached to generate a smooth, continuous gesture from neutral to apex. However, as the facial muscles compress due to contraction, the resistance of the muscle fibers to further compression increases. The change in contraction therefore decreases over time, despite intentional control by the subject. In [Ess95] the actuation profile extracted from a finite element simulation of facial action unit actuation is compared to a commonly assumed linear profile. Essa shows that the change in intensity is not constant over time for any significant segment of the actuation. The unpredictability of gestures in the general case introduces considerable uncertainty into the accuracy score.

5.2 PCA and ICA Quantitative Results

5.2.1 On Gesture Classifier Assessment

Classification assessment turns out to be a considerable problem without intensity-labeled data. General statistical properties of the data (such as standard deviation and variance) are difficult to relate to the accuracy of the classifier. There are, however, a limited number of quantitative and qualitative measurements that can be made.

5.2.1.1 Quantitative Measurements

The neutral and apex frames of each gesture can be reliably labeled. Binary gestures have been studied in [Bar98][Hon00][CZLYW99][EF78]. Though binary gestures are clearly too limited for PDFA, they provide insight into the baseline performance of the classifier. More importantly, they help assess the amount and effects of cross-talk across co-articulation region boundaries.

5.2.1.2 Perturbed Samples

On the surface it appears that controlled synthetic errors can be introduced into the small set of labeled intensity samples to quantitatively assess the classifier performance under these perturbations.
However, raw classification rates that are suitable for the binary gestures do not directly translate into the quality of the animation connected to the gesture parameters. Classification errors that assign intensity labels farther from the true label will be perceived as worse than small intensity errors. The score function must therefore be modified to penalize misclassifications based on distance from the true intensity. In theory this can be accomplished by changing the classification indicator function to a distance metric relating the true and estimated classes, but this can only be done correctly if strong assumptions are made about the true gesture actuation profile. In our case, it is known that the linear assumption is a poor approximation (but the best we can do, given the data).

5.2.1.3 Indirect/Qualitative Assessment

Raw classification accuracy is not the only way to judge the effectiveness of the intensity classifier. Because the perceptual effects of classifier errors on animation are of primary interest, we can focus on subjective analysis of the animation results. Furthermore, by connecting the classifier output to the input of a similarly parameterized performance driven facial animation system, perceptual accuracy will be directly related to the accuracy of parameter extraction. For example, if the intensity of a smile gesture over time is mapped to a smile on an animated face, the result can be compared to the original video sequence. Elements such as the smoothness of the synthetic gesture and the accuracy of matched intensities will contribute to the perceived quality of parameter extraction.

5.2.2 Binary Gesture Classification

In binary gesture classification only the neutral and apex states of each gesture are considered. Two important quantities are described to demonstrate classification accuracy: the classification rate and the false alarm rate. The classification rate measures the frequency of mistakes the classifier makes when presented with data samples related to a given region. For example, the classification rate of CR0 is determined by training on 50% of the neutral and apex corpus for the left frontalis gesture and testing on the remaining corpus. False alarms can occur if there is cross-talk from the other regions. For example, contracting the zygomatic major muscles (smile) may induce skin compression in the eye regions, which can be incorrectly interpreted as contraction of the orbicularis oculi muscles (eye squint). We test for false alarms by training each region separately on 50% of the respective corpus, and simultaneously classifying all regions.

5.2.2.1 Classification Rate

Given labeled data D, each element with class label L_i, and a classification function g(x) mapping a data point to a class label, the classification rate is computed by aggregating the number of correct classifications and dividing by the number of samples:

    a_raw = (1/|D|) sum_{i=1..|D|} I(L_i, g(D_i))    (5.10)

where I is the indicator function

    I(a, b) = 1 if a = b, and 0 if a != b    (5.11)

Note that in the case of binary gesture classification, each label consists of a gesture identifier and whether it is a neutral or apex frame.

5.2.2.2 False Alarm Rate

The false alarm rate is the misclassification rate for a region tabulated over frames with unrelated gesture actuation.
Motion in a co-articulation region not caused explicitly by one of the member muscles is effectively noise, and can cause classification errors. In our case it helps to understand the effect of cross-talk between gestures belonging to different regions, enabling us to fine tune the region boundaries.

5.2.3 Results

Tables 5.1, 5.2, and 5.3 show classification and false alarm rates for the binary gesture classifier trained on 50% of the data and tested on the remaining 50%.

    Gesture   0     1     2     3     4     5     6     7
    PCA      100%  100%  100%  100%  100%  100%  100%  100%
    ICA      100%  100%  100%  100%  100%  100%  100%  100%

    Gesture   8     9     10    11    12    13    14    15
    PCA      100%  100%  100%  100%  100%  100%  100%  100%
    ICA      100%  100%  100%  100%  100%  100%  100%  100%

Table 5.1: Classification results for each gesture. Classification rate is averaged over 3 subjects. Average number of test frames per gesture is 180.

    Region    0     1     2     3       4       5     6       7
    PCA      100%  100%  100%  90.73%  93.66%  100%  91.96%  89.55%
    ICA      100%  100%  100%  90.73%  93.66%  100%  91.96%  90.07%

Table 5.2: False alarm rate for each co-articulation region. The quantity shown indicates the percent correct for each region. Decreases in performance can be attributed to cross-talk between neighboring regions. Average number of test frames per gesture is 1200.

    Region    0     1     2     3       4       5     6       7
    PCA      100%  100%  100%  92.47%  94.84%  100%  93.94%  90.10%
    ICA      100%  100%  100%  92.47%  94.84%  100%  93.94%  90.59%

Table 5.3: Overall classification accuracy for each CR.

Even with this simple classification method, perfect accuracy is achieved. Performance degrades, however, as we look at the false alarms due to cross-talk of gestures across region boundaries. This is most pronounced in the mouth and eye regions. Further analysis of the error locations indicates the following:

1. Mentalis (chin) contraction is the primary source of false alarms in the mouth regions, accounting for 99% of the error there.

2. Triangularis contraction (frown) accounts for 92% of CR7 false alarms.

3. CR3 and CR4 (eye region) error can be attributed to poor data. The lighting conditions on some of the data sets were such that crow's feet wrinkles (which are primary indicators of contraction of the orbicularis oculi muscles) were washed out. Therefore, some apex samples resemble neutral states and cause confusion during testing.

The results for ICA are nearly identical to those for PCA except for a marginal reduction in false alarms in CR7. This contradicts evidence presented by [DBH+99][Bar98] suggesting that ICs are a better representation for facial images. A possible explanation is that while in the above references ICA is applied to sets of full face images (holistic analysis), in this work ICs are extracted from a concise region of the face. It is possible that the data is approximately Gaussian, in which case ICA effectively reduces to PCA, as decorrelation implies independence for Gaussian data [Hyv99]. These results suggest that refinement of CR boundaries may improve overall classification results by reducing region cross-talk.
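For reference, the raw classification rate of equations 5.10 and 5.11 reduces to a few lines of code; the (gesture id, state) label encoding below is an illustrative assumption, not the thesis data format.

    def classification_rate(true_labels, predicted_labels):
        # Equation 5.10: fraction of samples whose predicted label matches the
        # true label; the indicator of equation 5.11 is the equality test.
        assert len(true_labels) == len(predicted_labels)
        correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
        return correct / len(true_labels)

    # Binary gesture labels: (gesture id, 'neutral' or 'apex').
    truth = [(0, 'apex'), (0, 'neutral'), (3, 'apex'), (3, 'apex')]
    preds = [(0, 'apex'), (0, 'neutral'), (3, 'neutral'), (3, 'apex')]
    print(classification_rate(truth, preds))   # 0.75

The false alarm rate is computed the same way, but tabulated only over frames whose true gesture belongs to a different region than the one being classified.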
5.3 Region Based Flipbook Animation

5.3.1 Method

As there is a direct mapping from a CR state to its visual appearance, a character is defined by creating a neutral face frame and assigning an explicit reconstruction sample to each CR state. Samples can take the form of hand drawn frames in the case of cartoon animation, or may be populated from the original training samples. If training samples are used, the region under each CR is used as a mask to extract reconstruction pixel data from each discrete CR state.

Figure 5.4: Reconstruction samples for the mouth-region CR of Maggie (left) and those extracted from the training video (right). Locations of samples in CR parameter space (center).

Figure 5.4 shows a set of appearance samples assigned to CRs in the construction of the character Maggie, as well as the original samples used for video reconstruction. Each CR state vector applied to the character definition maps to a set of reconstruction elements and alpha masks. Alpha masks are the same size as the reconstruction elements and have a value of 1.0 everywhere, with a 10 pixel gradient to 0 approaching the boundary. Reconstruction elements are composited into the final image using simple alpha blending with the neutral image as a base. Animation is thereby performed by reconstructing frames using the stream of CR state vectors.

CoArt is similar to [BCS97], which rearranges video samples extracted from footage of speaking people to generate animation. However, such systems analyze audible speech and account for phoneme co-articulation effects, which requires future context and therefore cannot be done in real time without noticeable delay. The CoArt system, in contrast, derives all information from instantaneous image data and introduces no such delay.

Figure 5.5: (top) Reconstructed video frames. (bottom) Original normalized video frames.

5.3.2 Results

Figure 5.5 shows results of the classification system applied to a video reconstruction set acquired from the same actor. Input images consisted of 360x240 pixel grayscale images of an actor exercising various expressions. Lighting conditions were kept constant throughout the performance. The classification and reconstruction are performed in real time. Note that multiple CRs are concurrently analyzed and reconstructed. The system is able to accurately match the source CR states. In Figure 5.6 the same training set is used to animate a hand drawn character. The same input data are used to drive both sequences, demonstrating the ability to mix and match reconstruction databases within the CoArt animation system.

5.4 Discussion of Limitations

Though the results of the CoArt system are promising, there are several inherent problems:

1. Input samples are mapped to a relatively sparse set of discrete states. This results in brittle classification when an input sample falls between existing training states. This is apparent from the video reconstruction results when viewed in real time, as there are high frequency jitters that occur on the decision boundary.

2. The model for gesture intensity is dependent on time in the training sequence and forces linearity.

3. Because of intensity labeling difficulty, only single gesture actuations can be used for training.

All of these issues are addressed by gesture polynomial reduction, discussed next.
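Before moving on, a minimal sketch of the per-region alpha-blend compositing used for flipbook reconstruction in Section 5.3.1; the array sizes, placement convention, and function name are illustrative assumptions rather than the thesis implementation.

    import numpy as np

    def composite_region(base, element, alpha, top_left):
        # Alpha-blend one CR reconstruction element over the neutral base image.
        # alpha is 1.0 inside the region and ramps to 0 near the CR boundary.
        out = base.astype(np.float32).copy()
        y, x = top_left
        h, w = element.shape
        patch = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = alpha * element + (1.0 - alpha) * patch
        return out.astype(base.dtype)

    # One call per active co-articulation region, neutral image as the base.
    frame = composite_region(np.zeros((240, 360), np.uint8),
                             element=np.full((60, 80), 128, np.float32),
                             alpha=np.ones((60, 80), np.float32),
                             top_left=(90, 140))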
Figure 5.6: Animation frames using gesture analysis to control a hand drawn character.

Chapter 6

G-Folds and Gesture Polynomial Reduction

Gesture analysis by ICA in Chapter 5 is dependent on the assumption that gesture intensity is directly proportional to the temporal position of a sample in a gesture actuation sequence. The accuracy of this assumption depends on the subject's ability to produce a smooth and monotonic gesture. Because it is impossible to control the gesture with the fidelity necessary to relate two different gesture actuations, and human labeling is very difficult, the analysis is limited to single gesture training sequences. Gesture polynomial reduction (GPR), therefore, is presented as a method to remove the temporal dependence of gesture samples and construct a parameterization of facial gestures based on true gesture intensity without additional manual labeling effort. The automatically labeled samples can then be used to perform gesture classification.

The foundation of GPR is the observation that independent gestures trace out coherent 1D curves in a low dimensional PCA space. The subspace spanned by gesture samples is modeled with low order polynomials, with the curve parameter directly proportional to gesture intensity. The following sections explain the existence of the curves and outline the steps taken to uncover and model this structure.

6.1 Appearance Manifolds

Gesture polynomials are related to models of appearance data called appearance manifolds. Appearance manifolds were first introduced by Nayar and Murase [NMN96] to model and analyze pose and lighting variations of rigid objects. PCA is performed on a set of images of an object at varying stages of rotation and lighting directions, and the first 10 eigenvectors are retained. The points are projected into this 10 dimensional space and modeled with a B-spline surface. By projecting onto this manifold and interpolating the sample parameters, the rotation and lighting parameters of a new input sample are estimated.

Though explored only for lighting and pose variation of rigid objects, any continuous process that exhibits high correlation between adjacent samples will induce a manifold in appearance space. Appearance manifold analysis has not previously been applied to the analysis of the deformation properties of non-rigid objects. The deformation of a human face by facial muscles is continuous and results in significant correlation between images at adjacent levels of muscle contraction. We therefore expect the existence of appearance manifolds due to facial gestures.

6.1.1 What causes appearance manifolds?

Property 6.1: Correlation in image space is inversely proportional to distance in PCA space.

The result of this property is that images with higher correlation will be closer in PCA space. If the images are a time-ordered sampling of a continuous process (such as
skin deformation due to muscle contraction) and adjacent pairs of images exhibit high correlation, the PCA space projections will trace a continuous surface. The shape of the manifold depends on the correlation properties of the process.

As an example, consider a set of images of a process, I = {i_0, i_1, ..., i_N}, and let c_{k,j} = corr(i_k, i_j). If for each k, c_{k,k-1} > c_{k,k-d} for all 1 < d <= k, then the correlation structure takes the form of a curve in PCA space.

Property 6.1 is guaranteed in the unaltered PCA space (all eigenvectors retained) or the complete (all non-zero eigenvectors retained) PCA space. In the case of a dimensionality-reduced space (dropping non-zero eigenvectors), this property is only true for the range of the truncated basis, which is a subspace of the original image space. (It is not necessarily the case, though implied in [NMN96], that the correlation property holds for the original image space after non-zero truncation of the PCA basis.)

6.1.2 Dimensionality Reduction and Appearance Manifolds

That these manifolds exist is interesting theoretically, but without parameterization, the manifold samples are nothing more than scattered points. Unfortunately, the manifolds exist in the relatively high dimensional unaltered PCA space, which makes visualization impossible and manifold modeling difficult. The PCA basis is frequently truncated to reduce the dimensionality of the data, but there can be significant information loss when the data is projected into the lower dimensional space.

Figure 6.1: Conceptual diagram of projection from image space to and from the original and truncated PCA spaces.

Figure 6.1 helps visualize the process of projection into the PCA space computed from a set of training samples X. A^i denotes the PCA matrix computed from X retaining the i eigenvectors corresponding to the i largest eigenvalues. Because the columns of A are orthonormal, A^T = A^-1. The diagram shows the space mappings provided by A^i for three key values of i.

The matrix A^N maps from image space to a rotated image space with axes aligned in the directions of maximum variance of X. The inverse maps back to the original image space. A^k corresponds to retaining only the nonzero eigenvectors. In this case, points in the original image space are projected into a k-dimensional subspace of I' spanned by X. Applying (A^k)^T, points in this subspace are mapped back to the rotated subspace in I. Similarly, A^m defines a smaller subspace of dimensionality m accounting for as much of the variance in X as possible.

It is important to note the loss of information occurring in each mapping, as this can affect the shape of the manifold after projection. In the case of A^N, there is no loss, as the transformation consists of a rotation of basis. With A^k there is no loss of information as long as the image samples belong to the k-D subspace: they are a linear combination of the original data points. If this is not the case, the result is an orthogonal projection onto the k-D subspace. The lost information is denoted in the figure by the residual r_k. This space is important, as the data points are reconstructed without loss in the original space.
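A small numpy sketch of the projections and residuals discussed around Figure 6.1, using random data as a stand-in for the training images; the shapes and names are illustrative assumptions.

    import numpy as np

    X = np.random.rand(100, 4096)            # 100 samples, e.g. 64x64 region images
    mean = X.mean(axis=0)
    Xc = X - mean

    # Principal directions of X, ordered by decreasing singular value.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt.T                                  # columns are principal directions

    def project(x, m):
        # Orthogonal projection of a sample onto the first m principal axes.
        Am = A[:, :m]
        return Am @ (Am.T @ (x - mean)) + mean

    x = np.random.rand(4096)                  # arbitrary new image
    x_k = project(x, len(s))                  # all nonzero eigenvectors retained
    x_m = project(x, 3)                       # heavily truncated basis
    r_k = np.linalg.norm(x - x_k)             # loss entering the k-D subspace
    r_m = np.linalg.norm(x_k - x_m)           # additional loss from truncation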
In the final case, where i = m < k even samples in X may lose information as they are projected into the m -D subspace. The residual due to projection from the k-D subspace to the m -D subspace is denoted by rm. The total residual r = rk + rm describes the total loss of detail resulting from an arbitrary image projection into the m -D subspace. While the residual characterizes the information loss for a single point projection, a global sense for how well the manifold structure will be maintained can be gleaned from inspection of the eigenvalue magnitudes. Projecting onto a PCA basis with all non-zero eigenvectors retained corresponds to rotating the subspace containing the original images. Clearly, this rigid transformation retains all correlation structure existing in the original subspace. However, if high energy eigenvectors are truncated, then we run the risk of “collapsing” the correlation information and potentially destroying significant manifold structure along the truncated dimensions. 6.1.3 Appearance manifolds and gesture polynomials It is expected that a manifold exists in a complete (non-truncated) PCA space because of Property 6.1.1. However it should not (in general) be expected that this manifold would be meaningful (retain the relevant structure) in extremely low dimensional PCA space (3 dim). While there are very few muscle degrees of freedom, the process being measured can have many more DOF’s as it results from the nonlinear deformation and reflectance properties of skin. This makes the discovery of gesture manifolds in 3-space even more compelling: It is not enough for the set of images to be correlated in image space, but 70 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. this correlation structure must not be lost or rendered meaningless when projected into the truncated space. The dimensionality of the manifold is related to the effective number of degrees of freedom of the process analyzed. “Effective” refers to the appearance realization of the underlying process. For example, smiling can be thought of as a one dimensional process (progressing from neutral to full smile), but the appearance generation mechanism in­ volves lighting, skin compression, and other factors that increase the effective degrees of freedom of the process. This discrepancy between process and appearance D O F’s is exem­ plified in [NMN96] where pose variations and lighting (6 DOF) required a 10 dimensional space to unambiguously represent the appearance manifold. C onjecture 6.1 The success of gesture m anifold analysis is dependent on the fa ct that the dim ensionality of the effective param eter space has been constrained to account for the locality of m uscle contractions by co-articulation region decomposition. The resulting model is very powerful as better intuition can be developed by visual­ izing the manifolds which is otherwise impossible in high dimensional parameter spaces. 6.2 G-Folds and G PR M odeling 6.2.1 Testing for Gesture Manifolds As discussed in Chapter 3, the face has been partitioned into co-articulation regions to limit the number of muscle degrees of freedom in the sample appearance space. Analysis is performed on each region independently. To test for low-dimensional gesture manifolds, CR data is analyzed using PCA. 71 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 0. 06. 0 .0 4 . 
Figure 6.2: Projection of gesture data for CR6 onto the first 3 principal components.

Let PCA_n(X) denote the projection of data set X onto the first n eigenvectors of the covariance matrix of X. Figure 6.2 shows a plot of Y = PCA_3(X). Matrix X consists of samples from six successive actuations of the three gestures affecting CR6. This initial transformation maps from image space to region state space and can be represented by the matrix M_1. There are several interesting observations that can be made from this figure:

1. Gesture samples lie in a tight subspace of image space that resembles 2D curves in gesture space.

2. The neutral starting points of each gesture originate from the same location.

3. The position of a sample along each curve is related to the contraction level of the respective gesture.

These observations indicate that the manifolds do in fact exist, and their structure is maintained in the 3D PCA space. Furthermore, the manifold structure can be modeled with low order polynomials: the principal ingredient of gesture polynomial reduction. These gesture manifolds are termed G-Folds.

Examples of coherent G-Folds are shown in Figure 6.3 for various subjects and regions. In addition to purely cosmetic differences in facial appearance (such as skin color, facial hair, and geometry), the differences in shape of the manifolds can be attributed to:

1. The fact that the gestures are voluntary. This implies that there is a potentially significant amount of variation in gesture execution between subjects due simply to conscious preference or subjective interpretation of the gesture.

2. Physiological differences between subjects, such as the relative angles between facial muscles, bone structure, and fatty tissue distribution. For some subjects, for example, a grimace may have more visual correlation to a smile than in other subjects.

Problems with voluntary gestures

Unfortunately, because the gestures are produced by voluntary muscle actuation, and the subjects are untrained, there can be significant variation between successive actuations of the training gestures.

Figure 6.3: Examples of G-Folds for different subjects and all regions.

Figure 6.4: Misbehaved G-Fold structure in a subject with poor mouth muscle control.

Figure 6.4 shows an example of G-Folds induced by a subject with difficulty controlling their facial muscles. All of the bad cases occurred in the mouth region, as some subjects had a difficult time isolating the frown gesture and, at times, differentiating the smile from the grimace. Using G-Fold analysis, a better representation can be constructed for the true gestures. This is achieved by gesture polynomial reduction, discussed next.
6.2.2 Subspace Parameterization by Least Squares Polynomial Regression

We want to embed each set of points y^i belonging to gesture i in a 2D space to facilitate polynomial parameterization. To do this, an additional transformation is made from region space to gesture space by the operator z^i = PCA_2(y^i), which can be represented by the matrix M_2^i. The two principal axes define a projection plane that retains the structure of the gesture samples. Writing Z^i = [z_0 z_1], the quadratic regression problem is expressed as

    z_1 = [z_0^2  z_0  1] [a_2  a_1  a_0]^T = V a    (6.1)

where V is known as the Vandermonde matrix, and we solve for the coefficients a^i of the polynomial approximating z_1^i using least squares:

    a^i = (V^T V)^-1 V^T z_1^i    (6.2)

The process of gesture polynomial reduction is summarized by the following pseudo-code:

    X_k  = gesture data for CR_k
    M_1  = PCA_3(X_k)          // first 3 eigenvectors of cov(X_k)
    for all gestures i
        C_i   = samples for gesture i
        y_i   = M_1 C_i        // project to CR space
        M_2^i = PCA_2(y_i)     // first 2 eigenvectors
        z_i   = M_2^i y_i      // project to gesture space
        a_i   = LSPoly(z_i)    // polynomial regression
    end

Figure 6.5 shows the successive GPR transformations from sample space to gesture space, and the final polynomial parameterization for CR6.

Figure 6.5: The projected data from each gesture is reprojected into a 2D PCA space and modeled with quadratic polynomials.

6.2.3 Reconstruction

The quality of the polynomial model can be qualitatively validated by reconstructing samples of the polynomial in the original sample space. The gesture polynomial y = a_2 x^2 + a_1 x + a_0 is evaluated at regular intervals of x. The resulting polynomial samples are transformed back to the original sample space by applying the inverse GPR transformation:

    x_hat = x_bar + M_1^T M_2^T c    (6.3)

where c is the point evaluated on the polynomial above and x_bar denotes the mean sample.

6.2.4 GPR Results

The following figures illustrate the results of GPR parameterization on CR6 from SID-0001. Each gesture is actuated six times. All figures consist of the following frames:

1. (top) Training data projected into the region space defined by the first 3 PCA axes. Blue points are neutral samples.

2. (center) Gesture sample projections and the corresponding polynomial approximation.

3. (bottom) 10 synthetic reconstructed images generated from each GP at equal sample intervals along the curve using equation 6.3.

An icon on each frame identifies the region being analyzed. The reconstructed frames are meant not to illustrate the reconstruction potential of the model, but to show that as the gesture polynomial is traversed, gesture intensity varies proportionally, thus achieving our goal of removing the temporal dependence of the gesture samples. Appendix A shows all nine regions and their associated gestures.
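A compact numpy version of the GPR pseudo-code above, assuming the region samples arrive as a frames-by-pixels array and that the frame indices belonging to each training gesture are known; the mean-centering convention and helper names are assumptions of this sketch, not the thesis implementation.

    import numpy as np

    def pca(X, n):
        # Return the mean and the first n principal directions of X (rows = samples).
        mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:n].T                 # columns are eigenvectors

    def gpr(region_samples, gesture_slices):
        # region_samples: (num_frames, num_pixels) appearance samples for one CR.
        # gesture_slices: list of index arrays, one per training gesture.
        mu1, M1 = pca(region_samples, 3)                  # region (CR) space
        models = []
        for idx in gesture_slices:
            Y = (region_samples[idx] - mu1) @ M1          # project to CR space
            mu2, M2 = pca(Y, 2)                           # gesture space
            Z = (Y - mu2) @ M2
            z0, z1 = Z[:, 0], Z[:, 1]
            V = np.column_stack([z0**2, z0, np.ones_like(z0)])   # Vandermonde (6.1)
            a = np.linalg.lstsq(V, z1, rcond=None)[0]            # coefficients (6.2)
            models.append((mu2, M2, a))
        return mu1, M1, models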
6.2.5 Gesture sign resolution

In the results from CR3 the order of the samples appears to be reversed: the eye progresses from closed to open, opposite to the recorded gesture, in which the open-eye state is regarded as neutral. Though the GPR correctly orders the samples with respect to gesture intensity (removing the temporal degree of freedom), the sign of the progression is unknown. This is resolved by projecting the mean neutral image into gesture space and computing the distance to the two endpoints of the polynomial. The closest endpoint is then regarded as gesture intensity zero.

6.3 Applying the GPR

The result of gesture polynomial reduction is a set of continuous polynomials, each representing the appearance properties of a given gesture. This section describes how the G-Folds are used for gesture intensity analysis for performance driven facial animation.

Figure 6.6: GPR transformation applied to X = CR5.

Figure 6.7: GPR transformation applied to X = CR3.

6.3.1 Intensity Labeling

Chapter 3 presented a simple method to assign intensity labels to gesture samples. With the assumption of a constant gesture velocity, the gesture index was normalized by the number of gesture samples. This assumes an unreasonable model for the true actuation profile, which can introduce large errors if the data is used for supervised learning. Assigning labels to gesture intensity samples, however, is very difficult for a human labeler. This is indicated by the fact that FACS does not deal with fine levels of intensity; in fact, binary decisions on action unit actuation are the only reliably coded states. While it may be possible for a human observer to classify an intensity image relative to another intensity image (i.e., whether a given sample reflects a contraction that is more intense or less intense than another), to do so for an entire database would require n! comparisons and is still likely to be error prone for samples that are close in intensity. Furthermore, this only provides relative intensity; to assign an absolute intensity, the labeler must supply the perceived distances between samples, which is unlikely to be determined accurately.

The results from GPR exploration suggest another method for labeling gesture samples that is insensitive to velocity variations between gesture actuations. Each gesture polynomial is bounded by its origin and apex, and is parameterized by intensity. After GPR on a region, each sample is labeled according to the value of the GP intensity parameter at its closest projection point on the curve. Figure 6.8 illustrates this process. The gesture polynomial is estimated first from the data, then the data is labeled according to its location along the curve.

Figure 6.8: A new sample is projected onto the polynomial by selecting the bin with the closest centroid.

Only a small segment of the polynomial is used for the gesture model, and therefore the closest point on the curve to an input sample is not necessarily its orthogonal projection.
In the general case, the closest point will be one of the following: t_neutral, t_apex, or t_orthogonal. The analytic solution for the orthogonal projection onto a quadratic curve can be found by solving for the roots of a cubic polynomial, presenting five possible solutions for the projection (the three roots plus the segment endpoints). As shown in Figure 6.9, the gesture polynomial is divided into N bins of size dt = (t_apex - t_neutral)/N, each represented by the median of its segment, d_i = (phi_i, p(phi_i)), where phi_i = t_neutral + (i - 1) dt and i = 1 ... N. The intensity of the gesture is given by alpha = j/N, where j = arg min_i ||d_i - x|| is the bin with minimum Euclidean distance to the input point x. Q(x) = d_j is the projection function of a point x in gesture space onto the curve.

6.3.2 Quantized gesture intensity classification

The same procedure used for intensity labeling can be built upon to perform gesture intensity classification. The process is illustrated in Figure 6.9.

Figure 6.9: Gesture intensity classification using the GPR representation.

After labeling existing training data with respect to gesture intensity, the GPR model is used to perform gesture classification on new unlabeled samples. Gesture k is determined by the closest gesture polynomial to the input sample in 3D gesture space, and intensity alpha is computed on the classified gesture by projection onto its manifold. The distance from an input sample to each gesture polynomial is computed by the Euclidean distance between the sample's gesture projection Q(x) and its CR projection. This process is illustrated in Figure 6.9 for CR6.

6.4 Discussion

The G-Fold representation has clear benefits over the CoArt model. A continuous model for facial gestures in each region is constructed instead of the sparse, disjoint set of image features. Discovering that gesture intensities are manifolds in PCA space enables labeling previously unlabeled data with respect to intensity. Though the selection of three or fewer eigenvectors for G-Fold representation has clear visualization benefits, it is possible that modeling in a higher dimensional space can increase performance and/or provide better generalization.

While the causes for differences in manifold structure are apparent, the reasons for the uncanny similarities between many manifolds for a given region are less so. These similarities suggest an internal semantic consistency in the gestures. For example, in CR0, the GPR spaces for each subject are different (different eigenvectors), yet the fact that the structures are very similar implies that the eigenvectors encode the same property in each case. Exploring the meaning behind these properties is intriguing and will be the subject of future work.
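As a concrete summary of the runtime use of the model in Sections 6.3.1 and 6.3.2, the sketch below bins a fitted gesture polynomial between its neutral and apex parameters and assigns a quantized intensity to a sample already projected into 2D gesture space; the bin normalization and variable names are illustrative assumptions.

    import numpy as np

    def build_bins(a, t_neutral, t_apex, n_bins):
        # Bin centroids along the quadratic y = a2*t^2 + a1*t + a0,
        # sampled at the midpoints of N equal parameter intervals.
        t = t_neutral + (np.arange(n_bins) + 0.5) * (t_apex - t_neutral) / n_bins
        y = a[0] * t**2 + a[1] * t + a[2]
        return np.column_stack([t, y])

    def intensity(sample_2d, bins):
        # Nearest-centroid projection; intensity is the bin index scaled to [0, 1].
        d = np.linalg.norm(bins - sample_2d, axis=1)
        j = int(np.argmin(d))
        return j / (len(bins) - 1), bins[j]

    bins = build_bins(a=np.array([0.5, -0.2, 0.01]),
                      t_neutral=0.0, t_apex=1.0, n_bins=10)
    alpha, proj = intensity(np.array([0.4, 0.05]), bins)

For full classification, the same distance computation would be repeated against the bins of every gesture polynomial in the region, keeping the gesture whose nearest centroid is closest to the input sample.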
Chapter 7

GPR Exploration and Intuition

Section 6.2 described the process of gesture polynomial reduction, reducing the high dimensional facial gesture space to a set of low dimensional curves using principal component analysis and polynomial regression. This solved the gesture labeling problem and provided a compact continuous representation for gesture space in each co-articulation region. The model was then used to classify new gesture samples with respect to their gesture intensity. In this chapter G-Folds are explored in more depth with a set of experiments designed to help build an intuition for GPR space and test the limitations of this representation.

7.1 Intuitive interpretation of points projected into gesture space

The location of a projected point is dependent on the correlation between the original image and the images used to generate the subspace. Two types of data may be presented to the system: trained and untrained gesture samples. This section explores where these two types of data are expected to project in GPR space. Note that a trained gesture does not imply the data belongs to the training set, just that examples of this type of gesture have been seen.

7.1.1 Gesture samples

Given a set of training points X from independent gesture actuations in a given co-articulation region, its truncated PCA basis B, and a new image sample x, where should we expect the sample to project in gesture space? If the point is similar to the existing training points, for example if it is a sample from a new actuation of one of the training gestures, there will be strong correlation with the existing training data. The point will therefore project close to the gesture manifold. Deviations from the manifold can result from sensor noise or lighting variations, or from slightly varying the actuation of the gesture. Variations that are not significantly correlated with any of the training samples are part of the residual r of Figure 6.1 and will be dissolved upon projection onto the truncated basis.

7.1.2 Unseen gesture samples

Quantized gesture analysis as presented in Section 6.3.2 is robust for the basis gestures defining CR state space, but does not correctly model all possible appearance changes in a CR. Unconstrained facial motion will result in combinations of the underlying basis gestures. We therefore would like to uncover the intensity of multiple active gestures in a given CR from a single sample. In these cases the input sample has an intuitive semantic interpretation in terms of the "actuation space" but does not share the linear correlation properties with the basis gestures.

Figure 7.1: Frown-Smile manifold.

As a concrete example, consider the space defined by the set of samples belonging to the smile and frown gestures (with a set of neutral samples). Figure 7.1 shows the manifolds traced out in 3D PCA space. For new input samples that are close to a smile, we would expect significant correlation with the samples from the smile manifold, and hence projection close to the smile curve; the same holds for frown-like samples.

If we consider the grimace gesture (contraction of the risorius muscle), semantically we may be inclined to think of this gesture as a combination of a smile and a frown. But the appearance space has no intrinsic semantic knowledge and only reflects the linear correlation between a given static sample and the existing training data, and an image of a grimace state is not a linear combination of a smile and a frown. There should
be significant correlation with low intensity grimace gesture samples (as these resemble neutral states), but as the intensity of the gesture increases, the correlation falls off.

Figure 7.2: High similarity between the low intensity states of the gestures for region CR6: (top) smile, (middle) grimace, (bottom) frown. In all cases, there is significant correlation along the left edge of the sample even at different intensities.

Due to the elastic properties of skin, the mouth deforms less farther from the muscle's insertion point near the corner of the mouth. This results in a greater similarity to the neutral state as the vertical midline of the lips is approached, shown in Figure 7.2. This correlation with neutral is maintained in the projection, while the variation at higher degrees of risorius contraction is lost. The lost information is precisely that which defines the grimace gesture intensity, with the result that the grimace samples project close to the neutral state. This is actually a double edged sword: on one hand, it is a desirable property, as noise coming from occlusions or other sources that are not correlated to the training data will be removed. On the other hand, it requires that the sample spaces be covered thoroughly with training data.

In Figure 7.3 one can also observe some similarity to the smile and frown gestures. For example, the bulge under the lip is reminiscent of the wrinkling due to triangularis depression, and the deepening of the nasolabial fold shares characteristics with contraction of the zygomatic major muscle.

Figure 7.3: Appearance similarities between the grimace, smile, and frown gestures.

Figure 7.4 illustrates this point. The PCA space is constructed from the first three eigenvectors of the data matrix with smile and frown data. Each panel shows a single actuation of the grimace gesture with each sample labeled with respect to gesture intensity (assigned by temporal progression).

7.1.3 Arbitrary expressions

As indicated in the previous section, when we introduce untrained gestures, the behavior in CR space is more difficult to define. Projection of a sample into CR space forces the sample to be a linear combination of the CR space basis vectors, which are optimized for the independent gesture cases alone. However, we know by virtue of skin physics that the appearance will not be linear.

Figure 7.5 shows a mouth opening gesture plotted through CR6 space. Intuitively, it seems that an opening mouth should have no correlation to any gestures, and hence project almost exclusively near neutral. However, upon inspection of the curves, there appears to be significant correlation with the frown gesture. An explanation for this is that when the frown gesture is actuated, the upper lip is pulled downwards along with the mouth corners. Upon close inspection, a similar lowering of the upper lip is present when the jaw is dropped, thereby attracting the projections towards the frown manifold.

Another interesting case is the open mouth smile, visualized as a deviation from the closed mouth training smile. In Figure 7.6 the smile manifold has been isolated to avoid confusion with the other gestures.
Figure 7.4: Grimace gesture in smile-frown space. Variations not correlated with the existing training data are lost. Similarity to the frown and smile gestures is indicated by deviation from neutral along the gesture axes. The remaining correlation is with the neutral state, and most samples project close to neutral.

Figure 7.5: Trajectory of the mouth opening gesture in CR6 space.

Figure 7.6: Trajectory of the open mouth smile shown as a deviation from the closed mouth smile.

The positive actuation of the closed mouth smile projects predictably along the smile manifold. As the mouth is opened, two interesting properties are observed. Lowering the jaw pulls the skin inward, making the nasolabial fold resemble a lower intensity smile gesture. The projection mimics this observation as it traverses the manifold in the negative direction. The deviation from the smile manifold occurs due to the variance present in the open mouth area. This is at least partly due to correlation with the frown gesture, as seen in the previous example.

7.1.4 Speech

Speech is arguably more complex than facial gestures, as a large number of muscles are active in a relatively small area of the face. Figures 7.7 and 7.8 explore the projection of speech gestures (visemes) in CR6 G-Fold space. Table 7.1 supplies the word context for the uttered phonemes. Overall, it is difficult to interpret the results, as there is little intuitive correlation with existing gestures. In fact, most samples hover near neutral. Some interesting cases to observe, however, are visemes such as /IH/ and /IY/, where the zygomatic major muscle comes into play. The activation along the smile manifold is clearly present.

Overall, the results indicate that the three-gesture GPR space for CR6 is insufficient for accurate characterization of all visemes. Many phonemes require the use of the orbicularis oris, masseter, temporalis, or medial pterygoid muscles, which are not as common in expressive gestures and as such were not included in the co-articulation region decomposition. Future work should include a finer decomposition of the mouth state space tuned specifically for speech.

Figure 7.7: Trajectory of "dee" (/D/ /IY/) in CR6 space. Note the correlation to the smile gesture in the /IY/ viseme.

Figure 7.8: Trajectories of various phonemes in CR6 space.
    Phoneme   Example   Translation
    AA        odd       AA D
    AE        at        AE T
    AH        hut       HH AH T
    AO        ought     AO T
    AW        cow       K AW
    AY        hide      HH AY D
    B         be        B IY
    CH        cheese    CH IY Z
    D         dee       D IY
    DH        thee      DH IY
    EH        Ed        EH D
    ER        hurt      HH ER T
    EY        ate       EY T
    F         fee       F IY
    G         green     G R IY N
    HH        he        HH IY
    IH        it        IH T
    IY        eat       IY T
    JH        gee       JH IY
    K         key       K IY
    L         lee       L IY
    M         me        M IY
    N         knee      N IY
    NG        ping      P IY NG
    OW        oat       OW T
    OY        toy       T OY
    P         pee       P IY
    R         read      R IY D
    S         sea       S IY
    SH        she       SH IY
    T         tea       T IY
    TH        theta     TH EY T AH
    UH        hood      HH UH D
    UW        two       T UW
    V         vee       V IY
    W         we        W IY
    Y         yield     Y IY L D
    Z         zee       Z IY
    ZH        seizure   S IY ZH ER

Table 7.1: Phonemes and example words used in the speech tests of Figures 7.7 and 7.8.

7.1.5 Combined Gestures

The previous examples have all dealt with the mouth region, as it is arguably the most complex region of the face. As a final example, we look at CR1, the central forehead region. The basis gestures consist of a brow raise and a brow furrow. While these are the more common gestures occurring in this region, a combination of the two can occur, and is often an indicator of fear or intense sadness. Figure 7.9 plots this gesture in CR1 space.

The early stages of the gesture actually correspond to the muscle actuations: activation exists along both the corrugator and frontalis manifolds. However, as the gesture is intensified, the sample projections begin to resemble a pure frontalis actuation. This is explained by the fact that the corrugator training gesture (brow corners deformed downwards and inwards) is acquired relative to the neutral face configuration. Therefore, when the frontalis muscle is raised, the brows are shifted upwards, losing the correlation with the original corrugator gesture. As the shifted brow corners are also closer together than they are in the frontalis gesture, the variations are truncated when projected onto the CR1 basis. The final result is that the high-intensity combined gesture projects to the frontalis manifold.

It is also interesting to note the different path taken during the release phase of the gesture. The results indicate that in the positive actuation the corrugator muscle is actuated first, followed by the frontalis muscle. However, in the release phase, the corrugator is released almost completely before the frontalis is returned to its resting state.

Figure 7.9: Trajectory of combined frontalis and corrugator contraction in CR1 space.

Despite the nonlinearity of gesture appearance, one consistent property maintained regardless of gesture type is the continuity of the gesture trajectory. Continuity is expected due to the nature of the gesture generation process. Skin deformation resulting from muscle contractions is continuous, so the correlation to the existing basis vectors changes continuously. If the unconstrained gestures are correlated to the basis gestures, the curve trajectory deviates from the neutral zone; otherwise, we expect the samples to project near neutral, as this is the only information not truncated by the GPR basis.

7.2 Generalization

It has been shown in the previous sections that a region and its corresponding gestures trace out a set of distinct paths in a low dimensional PCA space.
The analysis has been performed on single subjects at a time, producing a person specific gesture model. On inspection of gesture manifolds for different subjects, however, it is apparent that the gestures exhibit similar structure. This begs the question: are the cross-subject manifolds related?

If a predictable similarity can be uncovered between existing gesture manifolds in different subjects, this information can be exploited to generalize the gesture classification. This section explores the possibilities of generalization of gesture analysis by relating independent gesture manifolds from distinct subjects. The generalization test is performed by merging the CR data from several subjects and analyzing the resulting manifolds.

In Figure 7.10 appearance data from CR0 for subjects SID-0001, SID-0002, SID-0003, and SID-0004 has been merged into a single data matrix and the GPR transform has been applied. For comparison, the independent GPR transformations are shown above. Notice how in the independent cases a clear trajectory unfolds in GPR space for the CR data. In the combined space, the trajectories have changed, forcing the curves to become nearly planar.

Figure 7.10: Merged data generalization test on CR0 for subjects 1, 2, 3 and 4. Individual structures are shown above the merged case and are oriented to emphasize the 3D structure of the data. In the merged case, the data flattens as some of the structure is lost.

This effect is explained by the fact that there is too much appearance variation between the subjects' gesture data to be effectively described in the generalized 3D GPR space. Projecting into a space with insufficient dimensionality to account for the appearance variance causes the manifold structure to shift. Due to the properties of PCA, the shift will occur such that the maximum appearance variation is retained. In most observed cases where two subjects were merged, this was not a problem, as the new structure maintained the separation between distinct G-Folds. However, if many subjects are merged or there is a large disparity between the variance in the subjects' structure, this problem is exacerbated and we run the risk of collapsing all meaningful structure.

The variance disparity problem is illustrated in Figure 7.11 with appearance data from CR1 for subjects SID-0001, SID-0002 and SID-0003. In the cases of SID-0002 and SID-0003 the corrugator actuation exhibits lower appearance variation than in the SID-0001 case. This causes the corrugator manifold to collapse, in favor of retaining the larger variance frontalis gesture.

The important thing to note from these results is that the manifolds resulting from merged data analysis are distinct. There is no "global" structure that arises when merging the data. While it may be possible to build a large space of disparate manifolds and generalize the analysis by selecting the closest manifold, in addition to much more data this would likely require analysis in a much higher dimensional space.

Figure 7.11: Merged data for CR1 and subjects 1, 2, and 3. Low variance structure collapses in subjects 2 and 3, while the higher variance structure is retained.
7.2.1 Cross Subject Classification

The final behavior explored is the ability to classify a new subject's gesture data given training data from a distinct subject. Although the previous sections indicate that there can be significant variability between two subjects' gesture appearance, the empirical implications are tested in this section. Test subjects were chosen with similar G-Fold structure.

The classification plots for CR0 are shown in Figure 7.12. The intensity label plots from GPR labeling are shown under the classification plots and are treated as ground truth. Though differing in scale, in Figure 7.12 the manifolds are very similar and well aligned. The intensity subtleties are not correctly identified (the cross classification case decides full actuation for the second group of frontalis actuations), but the classification results are reasonable, and better than one might expect.

In the second case, however, it is apparent that the manifold structures are significantly different and do not align well. As a result, the cross-subject classification accuracy is very low. Figure 7.13 shows the structure of the new subject's gestures when projected into the trained space. In addition to a global scale problem, there are some significant structural differences that cause classification errors.

In some of the G-Fold cases, for example CR0 and CR1, the structures are very similar, and it would appear as though a rigid transformation would bring the manifolds into correspondence for effective classification. Unfortunately, to derive the transformation, samples of the gesture manifold itself are required. It is conceivable that observation of an untrained performance would allow the transformation parameters to be automatically learned, but this has not been explored.

7.3 Discussion

It is clear from the experiments in this section that the gesture manifold property does not generalize; there are no "global", low-dimensional G-Folds uncovered when multiple subjects' data are merged. This makes sense, as G-Folds are a result of tight correlation between samples. The whole reason humans perceive two faces as "different" is because they are significantly uncorrelated. These differences between manifolds make cross-subject classification error-prone, as shown in the examples.

However, the fact that G-Folds from different subjects appear strikingly similar shows that the internal correlation structure is similar between different subjects. This is quite interesting, as it implies that by removing the between-subject variations, it may be possible to generalize the representation.

Though expressions are the primary focus of this thesis, speech gestures were explored in the existing G-Fold space with little success. The gestures defining CR5 and CR6 are correlated to the zygomatic, risorius, and triangularis muscles, none of which are used significantly in speech production. The orbicularis oris and the muscles responsible for motion of the mandible are more important for speech. The results indicate that speech gestures are similarly not correlated to expression gestures. This is interesting, as it explains, in part, why a person can speak "neutrally" without being perceived as emoting.
Chapter 8
Muscle Morphing

CoArt is a method for performing flip-book style animation from quantized gesture states. While it is possible to adapt this to co-articulated states, it would require an animator to populate the entire state space for each CR. Instead, a fully parameterized 2D animation system, called muscle morphing, is presented that is able to reproduce co-articulated gesture states from a single neutral face image. The mass-spring musculature concept developed in [TW90] for 3D facial animation is applied to 2D, with the addition of a multi-layer facial warping technique based on radial basis functions (RBFs). Though the primary goal of muscle morphing is to assess the quality of gesture parameter extraction, it is a fairly general and flexible method for 2D performance-driven facial animation.

RBF image warping has been demonstrated for both image normalization [ADRY94] and video-based animation for speech [NN00]. These applications interpolate predefined target points to generate the final animation. However, muscles have an additive effect on the locations of points on the skin, which cannot be duplicated with target point interpolation alone. Target point interpolation is therefore not well suited for direct control by gesture intensity parameters. Instead of control point interpolation, in muscle morphing the target positions of control points are defined as a function of virtual muscle contractions. Gesture intensity parameters are linked directly to muscle contraction values.

8.1 Overview

The face is divided into three conceptual layers: skin, muscle, and bone. All image deformation occurs at the skin layer. The muscle layer is responsible for controlling the deformation, and the bone layer provides structure for the muscle system. Muscles are modeled as springs and are divided into two classes: active and structural. The contraction of an active muscle is linked to a gesture intensity, whereas structural muscles exist only to preserve the structural integrity of the musculature. Both types of muscle are defined by two masses and a connecting spring. At most one end of the spring can be fixed to the bone layer. Otherwise, the masses are embedded into the muscle layer as insertion points.

Figure 8.1: Skin, muscle insertion, and CR boundary points defined on the Maggie character.

Insertion points share positional information with one of three types of RBF control points: skin, muscle insertion, and CR boundary points. These points are shown in figure 8.1.

Skin: Skin points define the deformable structure of the face and function similarly to the standard control points used for geometric normalization in Chapter 4. When a skin point changes position, the image deforms to interpolate through the new point.
Insertion: Insertion points are connected to muscles. As a muscle contracts, the insertion point is moved, consistent with the mass attached to the muscle spring. Insertion points are connected to skin points by an RBF network.

CR boundary: Boundary points are fixed and define the region of influence of the skin deformation. If a skin point moves, the CR boundary hinders the propagation of the induced deformation past the boundary.

8.2 Character Creation

Creating a character involves choosing or drawing a 2D face image, constructing a musculature, and placing control points. While the generic musculature designed for our test case can be applied to a new face image, parameters must be tuned to suit the particular goal of the animator. In practice this would be an iterative process. Figure 8.2 shows examples of facial expressions created by manually specifying the contraction values of muscles.

Figure 8.2: Example muscle deformations on the Maggie character; the panels include the neutral pose.

8.3 Animation Process

Gesture parameters are extracted from an input face image using G-Fold analysis. The corresponding muscles are contracted in proportion to the gesture intensity. We solve for the musculature's equilibrium state and move the target RBF insertion control points to the new muscle insertion point mass locations. The insertion points are connected to the skin points by an RBF interpolation layer. By evaluating the skin points with the RBF network trained on the new insertion points, we propagate the muscle deformation to the skin layer. The final deformation is defined on the image surface via a second RBF network trained with the new skin control point positions and the fixed CR boundary points. A code sketch of this two-layer propagation is given below, after the musculature overview.

8.4 Mass Spring Musculature

This section describes the mass-spring muscle system in more detail. Mass-spring systems were introduced to computer graphics by Terzopoulos [TW90] and have since been applied to a variety of topics, including realistic 3D facial skin deformation and virtual fish simulation. Due to the computational complexity of integrating the equations of motion, mass-spring systems are generally limited to applications where the equations can be evaluated over a coarse mesh. For example, in [TW90] the face mesh consisted of a few hundred vertices with relatively sparse connectivity. Similarly, the fish structure in [Tu96] required only 15 mass-nodes with between four and eight spring connections per vertex. In muscle morphing, the mass-spring system defines only the musculature of the face, with the skin motion modeled by the RBF deformation. This allows for efficient animation despite the dense geometric support.

Figure 8.3: Complete mass-spring muscle system defined on a human character.

The musculature consists of two different types of springs: muscle and structure. Muscle springs are active, in that their contraction levels are set by the animator or in response to the analyzed gesture intensity. Muscle springs exert force proportional to their contraction level and move the system out of equilibrium. Structural muscles are passive, in that they only contract in response to external forces induced by muscle springs. Structural muscles are important to help constrain the motion of the masses, and they actually increase the speed with which the system reaches equilibrium.
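The following is a minimal sketch of the RBF interpolation layers described in sections 8.1 and 8.3, given here before the muscle assembly details. It assumes a Gaussian kernel and a small set of hand-picked control points; the thesis does not specify the kernel or solver, so these choices, along with the variable names and coordinates, are illustrative.

```python
import numpy as np

def rbf_fit(centers, values, sigma=30.0):
    """Solve for RBF weights so the warp interpolates the control points.
    centers: (n, 2) rest positions; values: (n, 2) target positions."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    K = np.exp(-(d / sigma) ** 2)                   # Gaussian kernel matrix (assumed)
    return np.linalg.solve(K, values - centers)     # weights for the displacement field

def rbf_eval(points, centers, weights, sigma=30.0):
    """Evaluate the interpolated displacement at arbitrary points and add it back."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    K = np.exp(-(d / sigma) ** 2)
    return points + K @ weights

# Layer 1: insertion points (moved by the equilibrated muscles) drive the skin points.
skin0 = np.array([[120., 80.], [140., 85.], [160., 90.]])    # rest skin points
ins0  = np.array([[125., 95.], [155., 95.]])                 # rest insertion points
ins1  = np.array([[125., 90.], [155., 91.]])                 # after muscle contraction

w1 = rbf_fit(ins0, ins1)
skin1 = rbf_eval(skin0, ins0, w1)

# Layer 2: new skin points plus fixed CR boundary points define the image warp.
boundary = np.array([[100., 60.], [180., 60.], [100., 120.], [180., 120.]])
ctrl0 = np.vstack([skin0, boundary])
ctrl1 = np.vstack([skin1, boundary])                         # boundary stays fixed
w2 = rbf_fit(ctrl0, ctrl1)

# Warp every pixel coordinate of the CR image region.
ys, xs = np.mgrid[60:120, 100:180]
pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
warped = rbf_eval(pixels, ctrl0, w2)
```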
8.4.1 Muscle Assembly

A spring element $k$ is modeled by a pair of nodal masses $i$ and $j$, each with 2D position $\mathbf{x}$, velocity $\mathbf{v}$, and acceleration $\mathbf{a}$, associated mass scalars $m_i$ and $m_j$, and a connecting spring with spring constant $s_k$. The resting length of the spring (the length at which the spring is in equilibrium) is given by $r$, and the current length is $l = \|\mathbf{x}_i - \mathbf{x}_j\|$. As a spring deforms from its rest length, the force exerted on its masses changes proportionally, parallel to the spring. Formally, given the deformation $d = r - l$ and the unit spring direction vector $\hat{\mathbf{s}} = (\mathbf{x}_i - \mathbf{x}_j)/\|\mathbf{x}_i - \mathbf{x}_j\|$, the nodal forces are defined by $\mathbf{f}_{ij} = d \cdot c \cdot \hat{\mathbf{s}}$, where $c$ is a constant that controls the stiffness of the spring.

Using the equation above to compute the internal spring forces, the new velocities are computed by integrating Newton's equations of motion, $\mathbf{F}_{ij} = m_i \mathbf{a}_i + \mathbf{f}_{ij}$. As there are no external forces, a fully implicit integration of the equations is used. A conjugate gradient linear solver is used to solve for the velocity vectors, which are then used to update the positions of the nodal masses. A small code sketch of this assembly and update step is given at the end of the chapter.

8.5 Results

The muscle system used to perform the animation is shown for the human character in figure 8.3. The levator palpebrae and orbicularis oris muscles have been left out, as the blink gesture cannot be reproduced with deformation alone, and it is difficult to visualize the effects of the orbicularis oculi on the surface of the skin without wrinkling effects.

Shown below are representative frames from the muscle morphing system connected to gesture analysis using the GPR model for various subjects. The results validate both the accuracy of the gesture analysis and the ability of the muscle animation system to deliver accurate semantic information.

Figure 8.4: Muscle Morphing results.
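Before the summary, here is the small sketch of the spring assembly and update step from section 8.4.1 referred to above. For brevity it uses semi-implicit Euler with a damping term rather than the fully implicit conjugate-gradient solve described in the text, and muscle contraction is modeled by scaling the rest length; these simplifications, and all names and constants, are illustrative assumptions.

```python
import numpy as np

class Spring:
    def __init__(self, i, j, rest, stiffness, active=False):
        self.i, self.j = i, j          # indices of the two nodal masses
        self.rest = rest               # rest length r
        self.c = stiffness             # stiffness constant c
        self.active = active           # active (muscle) vs. structural spring
        self.contraction = 0.0         # set from the analyzed gesture intensity

def step(x, v, m, fixed, springs, dt=0.005, damping=2.0):
    """One semi-implicit Euler step of the musculature (simplified integrator)."""
    f = -damping * v                                   # simple velocity damping
    for s in springs:
        xi, xj = x[s.i], x[s.j]
        l = np.linalg.norm(xi - xj) + 1e-9
        # Active muscles shorten their rest length in proportion to contraction.
        rest = s.rest * (1.0 - s.contraction) if s.active else s.rest
        d = rest - l                                   # deformation d = r - l
        dir_ij = (xi - xj) / l                         # unit spring direction
        f[s.i] += d * s.c * dir_ij                     # f_ij on mass i
        f[s.j] -= d * s.c * dir_ij                     # equal and opposite on mass j
    v = v + dt * f / m[:, None]
    v[fixed] = 0.0                                     # bone-attached ends do not move
    return x + dt * v, v

# Tiny example: one active muscle anchored to "bone" at node 0, inserting at node 1.
x = np.array([[0.0, 0.0], [10.0, 0.0]])
v = np.zeros_like(x)
m = np.array([1.0, 1.0])
muscle = Spring(0, 1, rest=10.0, stiffness=50.0, active=True)
muscle.contraction = 0.3                               # e.g. from a gesture intensity
for _ in range(2000):
    x, v = step(x, v, m, fixed=[0], springs=[muscle])
print(x[1])   # insertion point pulled toward the bone anchor
```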
Chapter 9
Summary

This thesis introduced a new model for facial gestures called G-Folds that exploits the hitherto unrecognized manifold-inducing property of gesture intensities in appearance space. The work was inspired by the goal of determining fine levels of gesture intensity from appearance data of human expressive images for performance driven facial animation. A representation for facial state was sought that was consistent across different facial geometries, yet flexible enough to capture the subtleties of facial expression. The coupling of gestures to muscles provides this flexibility. Previous work in this area neglects modeling of gesture intensity at the level necessary for performance driven facial animation and judgment of fine levels of expression intensity.

The most significant contribution of this thesis has been the revelation of the intensity-ordered structure of gestures in co-articulation regions and the solution to the intensity labeling problem that derives from it. GPR takes a set of samples and effectively orders them with respect to gesture intensity along a curve. This ordering allows us to automatically label the training samples associated with basis gestures regardless of the temporal ordering, thus opening the door to the extensive body of supervised learning techniques.

With G-Folds comes the ability to better model the true statistics of intensity samples. The gesture polynomial reconstruction can also be used to generate new data samples along the curve, increasing the size of our training database without additional effort on the actor's part. It is speculated that this will improve the robustness of the intensity classifier, but thorough testing of this hypothesis is left to future work. The G-Fold model eliminates the need for a linear model of gesture actuation and enables analysis of more accurate gesture intensity profiles.

The animation methods presented are novel. They represent two very different philosophical approaches to facial animation. The CoArt flip-book system gives full flexibility (and burden) to the animator, as she must hand-generate a visual representation of each gesture state. From an artistic point of view, this has some interesting implications, as the synthetic gesture states need not correspond to those analyzed. One can envision a filter mechanism whereby the analyzed input gestures are modulated or mapped to a set of semantically different visual representations. Muscle Morphing, on the other hand, requires only a neutral face to be selected or created. The animation results from the image deformation induced by the muscle system. As such, the animator has very little control over the final result. This can be desirable for cartoon chat or VR engagements where communication is of more importance than the art.

Chapter 10
Future Work

This is not the end. It is my hope that this thesis will spawn a new direction in facial gesture analysis: appearance manifold analysis of facial gestures. There are many promising avenues left to explore.

Speech decomposition
Speech was tested in expressive space and, understandably, the results were unsatisfactory. I expect that a G-Fold space can be constructed that is tuned to speech gestures, or that possibly accounts for both speech and expression. This may include a refinement of the co-articulation region decomposition, or superimposing a separate speech analysis region that operates independently of the expression analyzer.

Expressions
Appendix B presents a preliminary experiment abstracting expression from gesture states. This approach has significant benefits over holistic template-based expression analysis, as the subtleties of facial expressions can be represented without acquiring a sample for each expression.

Face recognition by temporal gesture matching
Though each person's G-Folds are similar, they are not identical. G-Folds can be used as a spatio-temporal signature for augmenting existing facial recognition systems. There is existing work that attempts to recognize faces under expression, but the results from this thesis imply that the expressions themselves, as they unfold over time, may be used as a signature.

Co-articulation Region Correlation
The data acquired from co-articulation regions has been treated independently.
In natural gestural communication, there is significant spatio-temporal correlation between region states. This correlation should be analyzed and incorporated into the classification mech­ anism. To be effective, an assessment of classifier confidence is needed. In addition to increasing the reliability of full face state classification, cross correlation of CR states will also allow estimation of region states with missing information: ie. in the case of nonfrontal or obscured images, where an entire region is corrupted, hidden, or returns very low confidence classifier results. 119 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. G PR Bootstrapping Most statistical learning algorithms reach peak performance with large amounts of well- selected training data. Bootstrapping is a common technique in semi-supervised learning tasks where a small set of labeled training data is augmented with a larger set of unlabeled data to increase the performance of the classifier. In my work, the set of labeled samples is orders of magnitude smaller than those typically used. The GPR, however, gives us a model of facial gesture actuation from the easier-to-acquire gesture labeled data instead of intensity labeled data. We can use the polynomial gesture model as a backbone for subsequent training data acquisition. Unlabeled data can be automatically clustered along the GP maintaining gesture intensity ordering. 120 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Reference List [ADRY94] [Aki82] [AS93] [BA93] [Bar98] [Bas78] [Bas79] [BCS97] [BFJ+OO] [BG95] [BHK97] N. Arad, N. Dyn, D. Reisfeld, and Y. Yeshurun. Image warping by radial basis functions: application to facial expressions. In Graphical Models and Image Processing, pages 161-172, 1994. K. Akita. Analysis of body motion image sequences. In Proceedings of the 6th International Conference on Pattern Recognition, pages 320-327, October 1982. T. Akimoto and Y. Suenaga. Automatic creation of 3d facial models. In IEEE Computer Graphics and Applications, pages 16-22, 1993. M.J. Black and P. Anandan. A framework for the robust estimation of optical flow. In Proceedings of the International Conference on Computer Vision, pages 231-236, May 1993. M.S. Bartlett. Face image analysis by unsupervised learning and redun­ dancy reduction. PhD thesis, UC San Diego, 1998. J. N. Basili. Facial motion in the perception of faces and of emotional expression. Journal of Experimental Psychology, 4:373-379, 1978. J. Basili. Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology, 37:2049-2059, 1979. C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of SIGGRAPH, 1997. I. Buck, A. Finkelstein, C. Jacobs, A. Klein, D. Salesin, J. Seims, and R. Szeliski. Performance-driven hand-drawn animation. In Proceedings of Non-photorealistic Animation and Rendering, 2000. B. Blumberg and T. Galyean. Multi-level direction of autonomous creatures for real-time virtual environments. In Proceedings of SIGGRAPH, 1995. P.N. Belhumeur, J. Hespanha, and D.J. Kriegman. Eigenfaces vs. fisher- faces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711— 720, 1997. 121 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. [BK94] [BLCD02] [BU78] [BN92] [BP93] [BP95] [Bra99] [Bra02] [BS95] [BS97] [BW95] [BY95] [CCZBOO] [CF90] [CGOO] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 669-672, 1994. C. Bregler, L. Loeb, E. Chuang, and H. Deshpande. Turning to the masters. In Proceedings of SIGGRAPH , 2002. J.F. Blinn. Simulation of wrinkled surfaces. In Proceedings of SIGGRAPH, pages 286-292, 1978. T. Beier and S. Neely. Feature-based image metamorphosis. Computer Graphics, 26(2), July 1992. R. Brunelli and T. Poggio. Face recognition: Features vs. templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042— 1052, 1993. D. Beymer and T. Poggio. Face recognition from one example view. In 5th International Conference on Computer Vision, pages 500-507, 1995. M. Brand. Voice puppetry. In Proceedings of SIGGRAPH, 1999. M. Brand. Incremental svd of incomplete and uncertain data. In European Conference on Computer Vision, pages 707-720, 2002. A. Bell and T. Sejnowski. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995. M.S. B artlett and T.J. Sejnowski. Viewpoint invariant face recognition using independent component analysis and attractor neworks. In Advances in Neural Information Processing Systems, 1997. A. Bruderlin and L. Williams. Motion signal processing. In Proceedings of SIGGRAPH, pages 97-104, 1995. M.J. Black and Y. Yacoob. Recognizing facial expressions under rigid and non-rigid facial motions. In Proceedings of the 1st International Workshop on Automatic Face and Gesture Recognition, pages 12-17, 1995. D. Chi, M. Costa, L. Zhao, and N. Badler. The emote model for effort and shape. In Proceedings of SIGGRAPH, pages 173— 182, 2000. G. Cottrell and M. Fleming. Face recognition using unsupervised feature extraction. In Proceedings of the International Neural Network Conference, pages 322-325, 1990. E. Cosatto and H.P. Graf. Photo-realistic talking-heads from image sam­ ples. IEEE Transactions on Multimedia, 2 (3): 152— 163, 2000. 122 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [CH98] [CHT91] [CK01] [CLK+98] [CTB92] [CZLYW99] [dB90] [DBH+99] [DEP94] [DiP91] [DM96] [EBDP96] [EDP94] L.S. Chen and T.S. Huang. Multimodal human emotion/expression recog­ nition. In Proceedings of the 3rd International Workshop on Automatic Face and Gesture Recognition, pages 366-371, 1998. C.S. Choi, H. Harashima, and T. Takebe. Analysis and synthesis of facial expressions in knowledge-based coding of facing image sequences. In Pro­ ceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2737-2740, 1991. B.W. Choe and H.S. Ko. Analysis and synthesis of facial expressions with hand-generated muscle actuation basis. In Proceedings of Computer A ni­ mation, 2001. J.F. Cohn, J.J. Lien, T. Kanade, W. Hua, and A. J. Zlochower. Beyond pro- totypic expressions: Discriminating subtle changes in the face. In Proceed­ ings of the IEEE Workshop on Robot and Human Communication, pages 33-39, 1998. I. Craw, D. Tock, and A. Bennett. Finding face features. In Proceedings of the European Conference on Computer Vision, 1992. J.F. Cohn, A. J. Zlochower, J. Lien, and T. Kanade Y. Wu. Automated face coding: A computer-vision based method of facial expression. Psychophys­ iology, 35(l):35-43, 1999. C.B. 
Duchenne de Boulogne (1862). The Mechanism of Human Facial Expression. Cambridge University Press, 1990. G. Donato, M. B artlett, J. Hager, P. Ekman, and T. Sejnowski. Classify­ ing facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 1999. T. Darrel, I. Essa, and A. Pentland. Correlation and interpolation net­ works for real-time expression analysis/symthesis. In Advances in Neural Information Processing Systems, 1994. S. DiPaola. Extending the range of facial types. The Journal of Visualiza­ tion and Computer Animation, 2(4): 129-131, October-December 1991. D. DeCarlo and D. Metaxas. Deformable model-based face shape and mo­ tion estimation. In Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition, pages 176-181, 1996. I. Essa, S. Basu, T. Darrell, and A. Pentland. Modeling, tracking, and interactive animation of faces and heads using input from video. In Pro­ ceedings of Computer Animation, 1996. I. Essa, T. Darrell, and A. Pentland. Tracking facial motion. In Proceedings of the Workshop on Motion of Nonrigid and Articulated Objects, pages 36- 42. IEEE Computer Society, 1994. 123 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [EF78] [EG98] [Ekm93] [ELTC96] [EP94] [EP95] [EP97] [Ess95] [ETC98] [Fai90] [Far94] [FCH98] [FN02] [FNK+99] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movements. Consulting Psychologists Press, Palo Alto, CA, 1978. P. Eisert and B. Girod. Analyzing facial expressions for virtual conferenc­ ing. IEEE CG & A, 18(5):70-78, 1998. P. Ekman. Facial expression of emotion. American Psychologist, 48:384- 392, 1993. G. J. Edwards, A. Lanitis, C.J. Taylor, and T.F. Cootes. Modelling the vari­ ability in face images. In Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition, pages 212-217, 1996. I. Essa and A. Pentland. A vision system for observing and extracting facial action parameters. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 76-83, 1994. I. Essa and A. Pentland. Facial expression recognition using visually ex­ tracted facial action parameters. In Proceedings of the International Work­ shop on Automatic Face and Gesture Recognition, 1995. I. Essa and A. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7) :757-763, 1997. A. Essa. Analysis, Interpretation, and Synthesis of Facial Expressions. PhD thesis, Massachusetts Institute of Technology, 1995. G.J. Edwards, C.J. Taylor, and T.F. Cootes. Interpreting face images using active appearance models. In Proceedings of the 3rd International Confer­ ence on Automatic Face and Gesture Recognition, pages 300-305, 1998. G. Faigin. The Artists Complete Guide to Facial Expressions. Watson- Guptill Publications, 1990. ISBN-0-8230-1628-5. L. Farkas. Anthropometry of the Head and Face. Raven Press, 1994. B.J. Frey, A. Colmenarez, and T.S. Huang. Mixtures of local linear sub­ spaces for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 820-826, 1998. D. Fidaleo and U. Neumann. Coart: Co-articulation region analysis for per­ formance driven facial animation. In Proceedings of Computer Animation, 2002 . D. Fidaleo, J. Y. Noh, T. Kim, R. Enciso, and U.Neumann. Classification and volume morphing for performance-driven facial animation. 
In Interna­ tional Workshop on Digital and Computational Video, 1999. 124 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [For90] [FT01] [GGW+98] [GW93] [Hal94] [HNvdM98] [HonOO] [Hyv99] [JP98] [Kal93] [KH92] [Kis94] [KMB94] [KMMTT92] D.R. Forsey. Motion Control and surface modeling of articulated figures in computer animation. PhD thesis, University of Waterloo, Waterloo, Ontario, September 1990. L. Franco and A. Treves. A neural network face expression recognition system using unsupervised local processing. In Second International Sym ­ posium on Image and Signal Processing and Analysis, pages 628-632, 2001. B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In Proceedings of SIGGRAPH, pages 55-66, 1998. R. Gonzalez and R. Woods. Digital Image Processing. Addison Wesley Publishing Company, 1993. P.W. Hallinan. A low dimensional representation of human faces for arbi­ trary lighting conditions. In International Conference on Image Processing, pages 995-999, 1994. H. Hong, H. Neven, and C. von der Malsburg. Online facial expression recognition based on personalized gallery. In Proceedings of the 3rd In­ ternational Conference on Automatic Face and Gesture Recognition, pages 354-359, 1998. H. Hong. Analysis, Synthesis, and Recognition of Facial Gestures. PhD thesis, University of Southern California, 2000. A. Hyvrinen. Fast and robust fixed-point algorithms for independent com­ ponent analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999. M.J. Jones and T. Poggio. Hierarchical morphable models. In IEEE Con­ ference on Computer Vision and Pattern Recognition, pages 820-826, 1998. P. Kalra. An interactive multi-modal facial animation system. PhD thesis, Ecole Polytechnique Federale de Lausanne, 1993. H. Kobayashi and F. Hara. Recognition of six basic facial expressions and their strength by neural networks. In IEEE International Workshop on Robot and Human Communication,'pages 381-386, 1992. P. Kishino. Virtual space teleconferencing system- real-time detection and reproduction of human images. In Proceedings of Imagina, pages 109-118, 1994. I. Kakadiaras, D. Metaxas, and R. Bajcsy. Active part-decomposition, shape and motion estimation of articulated objects: A physics-based ap­ proach. In Proceedings of CVPR, pages 980-984, 1994. P. Kalra, A. Magili, N. Magnenat-Thalmann, and D. Thalmann. Simula­ tion of muscle actions using rational free-form deformations. In Proceedings of Eurographics, volume 2, pages 59-69, 1992. 125 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [KMT93] [KNOK92] [KS90] [KSH+91] [KWT87] [LBB02] [LC01] [Lie98] [LKCL98a] [LKCL98b] [LM94] [LRF93] [LTC95] P. Kalra and N. Magnenat-Thalmann. Simulation of facial skin using tex­ ture mapping and coloration. In Proceedings ICCG ’ 93, pages 365-374, 1993. Y. Kitamura, Y. Nagashima, J. Ohya, and F. Kishino. Facial image syn­ thesis by hierarchical wire frame model. In SPIE Visual Communications and Image Processing, 1992. M. Kirby and L. Sirovich. Application of the karhunen-loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1): 103— 108, 1990. M. Kato, I. So, Y. Hishinuma, O. Nakamura, and T. Minami. Descrip­ tion and synthesis of facial expressions based on isodensity maps. Visual Computing, pages 39-56, 1991. M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. 
International Journal of Computer Vision, 1(4):321-331, 1987. S. Lee, J. Badler, and N. Badler. Eyes alive. ACM Transactions on Graph­ ics, 21(3):637-644, July 2002. H.C. Lo and R. Chung. Facial expression recognition approach for per­ formance animation. In Proceedings of Second International Workshop on Digital and Computational Video, pages 132-139, 2001. J.J. Lien. Automatic recognition of facial expressions using hidden markov models and estimation of expression intensity. PhD thesis, Carnegie Mellon University, 1998. J.J. Lien, T. Kanade, J. Cohn, and C. Li. Subtly different facial expression recognition and expression intensity estimation. In IEEE Conference on Computer Vison and Pattern Recognition, pages 853-859, 1998. J.J. Lien, T. Kanade, J.F. Cohn, and C.C. Li. Automated facial expression recognition based on facs action units. In Proceedings of the 3rd Inter­ national Conference on Automatic Face and Gesture Recognition, pages 390-395, 1998. P. Litwinowitcz and G. Miller. Efficient techniques for interactive texture placement. In Proceedings of SIGGRAPH, pages 119-122, 1994. H. Li, P. Roivainen, and R. Forchheimer. 3d motion estimation in model based facial image coding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):545-555, 1993. A. Lanitis, C. Taylor, and T. Cootes. A unified approach to coding and interpreting face images. In Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition, pages 368-373, 1995. 126 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [LTC97] [LW94] [LWS97] [LYX+01] [Mas91] [May97] [MMT97] [Mor95] [MPOl] [MSOK95] [MTCT93] [MTK+94] [MTPT88] A. Lanitis, C. Taylor, and T. Cootes. Automatic interpretation and cod­ ing of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):743-756, 1997. P. Litwinowitcz and L. Williams. Animating images with drawings. In Proceedings of SIGGRAPH, 1994. S. Lee, G. Wolberg, and S.Y. Shin. Scattered data interpolation with multi­ level b-splines. IEEE Transations in Visualization and Computer Graphics, 3(3):228-244, 1997. Y. Li, F. Yu, Y. Xu, E. Chang, and H. Shum. Speech-driven cartoon animation with emotions. In Proceedings of AC M Multimedia, 2001. K. Mase. Recognition of facial expression from optical flow. IEICE Trans­ actions, 74(10), 1991. P.S. Maybeck. Stochastic Models Estimation and Control. Academic Press, 1997. ISBN-0-12-480703-8. L. Moccozet and N. Magnenat-Thalmann. Dirichlet free-form deformations and their application to hand simulation. In IEEE Computer Animation, 1997. S. Morishima. Emotion model- a criterion for recognition, synthesis, and compression of face and emotion. In Proceedings of the 1st International Conference on Automatic Face and Gesture Recognition, pages 284-289, 1995. M. Malciu and F. Preteux. Mpeg-4 compliant tracking of facial features in video sequences. In Proceedings EUROIMAGE International Confer­ ence on Augmented, Virtual Enviroments and Three-Dimensional Imaging (ICAVSD’ 01), pages 108-111, Mykonos, Greece, May 2001. Lh. Moubaraki, K. Singh, J. Ohya, and F. Kishino. Facial wrinkle anima­ tion using color texture synthesis and blending. Technical Report of IEICE, pages 94-115, January 1995. N. Magnenat-Thalmann, A. Cazedevals, and D. Thalmann. Modeling facial communication between an animator and a synthetic actor in real-time. In Proceedings of Modelling in Computer Graphics, pages 387-396, 1993. Lh. Moubaraki, H. 
Tanaka, Y. Kitamura, J. Ohya, and F. Kishino. Homotopy-based 3d animation of facial expressions. Technical Report of IEICE, pages 94-37, July 1994. N. Magnenat-Thalmann, E. Primeau, and D. Thalmann. Abstract muscle action procedures for face animation. The Visual Computer, 3:290-297, 1988. 127 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [MWP98] [NHRD90] [NMN96] [N N O O ] [NNOl] [Noh02] [NY98] [OAvdMOO] [OEKN96] [OKT+95] [0098] [Par 72] [Par82] B. Moghaddam, W. Wahid, and A. Pentland. Beyond eigenfaces: proba­ bilistic maching for face recognition. In Proceedings of the 3rd International Conference on Automatic Face and Gesture Recognition, pages 30-35, 1998. M. Nahas, H. Huitric, M. Rious, and J. Domey. Facial image synthesis us­ ing skin texture recording. The Visual Computer, 6(6):337-343, December 1990. S. Nayar, H. Murase, and S. Nene. Parametric appearance representation. In Early Visual Learning. Oxford University Press, 1996. J. Noh and U. Neumann. Talking face. In IEEE Conference on Multimedia and Expo, pages 627-630, 2000. J.Y. Noh and U. Neumann. Expression cloning. In Proceedings of SIG- GRAPH, pages 277-288, 2001. J. Noh. Facial Animation by Expression Cloning. PhD thesis, University of Southern California, 2002. U. Neumann and S. You. Integration of region tracking and optical flow for image motion estimation. In IEEE International Conference on Image Processing, 1998. K. Okada, S. Akamatsu, and C. von der Malsburg. Analysis and synthesis of pose variations of human faces by a linear pcmap model and its ap­ plication for pose-invariant face recognition system. In Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition, pages 142— 149, 2000. J. Ohya, K. Ebihara, J. Kurumisawa, and R. Nakatsu. Virtual kabuki theatre: Toward the realization of human metamorphosis system. In Pro­ ceedings of the 5th IEEE International Workshop on Robot and Human Communication, pages 416-421, November 1996. J. Ohya, Y. Kitamura, H. Takemura, H. Ishi, F. Kishino, and N. Terashima. Virtual space teleconferencing: Real-time reproduction of 3d human im­ ages. Journal of Visual Communications an Image Representation, 6(1):1- 25, March 1995. T. Otsuka and J. Ohya. Recognizing abruptly changing facial expressions from time-sequential face images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 808— 813, 1998. F. Parke. Computer generated animation of faces. In Proceedings of ACM National Conference, number 1, pages 451-457, 1972. F. Parke. Parameterized models for facial animation. IEEE Computer Graphics and Applications, 2(9):61-68, 1982. 128 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [PB81] [PBV94] [PC97] [PE90] [PE02] [Pel91] [PG89] [PH91] [Pic97] [PLG91] [PMS94] [PRZ92] [PSO O ] [Ree90] [Roh94] S. P latt and N. Badler. Animating facial expression. In Proceedings of AC M SIGGRAPH, pages 245-252, 1981. A. Pelachaud, N. Badler, and M. Viaud. Final report to nsf of the standards for facial animation workshop. Technical report, Philadelphia, PA 19104- 6389, 1994. C. Padgett and G. Cottrell. Representing face images for emotion classifi­ cation. In Advances in Neural Information Processing (NIPS), 1997. T. Poggio and S. Edelman. A network that learns to recognize three di­ mensional objects. Nature, 343(6255):263-266, 1990. F. Pereira and T. Ebrahimi. The MPEG-4 Book. Prentice Hall, 2002. C. Pelachaud. 
Communication and coarticulation in facial animation. PhD thesis, University of Pennsylvania, Philadelphia, PA 19104-6389, October 1991. T. Poggio and F. Girosi. A theory of networks for approximation and learning. Technical report, Cambridge, MA, July 1989. A. Pentland and B. Horowitz. Recovery of nonrigid motion and struc­ ture. IEEE Transactions of Pattern Analysis and Machine Intelligence, 13(7):730-742, July 1991. R. Picard. Affective Computing. MIT Press, 1997. ISBN 0-262-16170-2. E.C. Patterson, P.C. Litwinowitch, and N. Greene. Facial animation by spatial mapping. In Proceedings of Computer Animation, pages 31-44. Springer-Ver lag, 1991. A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In Computer Vision and Pattern Recogni­ tion, 1994. S. Pieper, J. Rosen, and D. Zeltzer. Interactive graphics for plastic surgery: A task level analysis and implementation. In Proceedings of the Symposium on Interactive 3D Graphics, pages 127-134, 1992. P. Penev and L. Sirovich. The global dimensionality of face space. In Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition, pages 264-270, 2000. W. Reeves. Simple and complex facial animation: Case studies. In State of the A rt in Facial Animation: SIG GRAPH 1990 course notes, number 26, pages 88-106. 1990. K. Rohr. Towards model-based recognition of human movements in image sequences. Computer Vision, Graphics, and Image Processing, 59(1):94- 115, January 1994. 129 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [RP95] [Ryd87] [RYD96] [SI92] [SL95] [SMT97] [SVG95] [TK91] [TK99] [TKCOO] [TP91] [TS93] [Tu96] [TW90] D. A. Rowland and D. I. Perrett. Manipulating facial appearance through shape and color. IEEE Computer Graphics and Applications, 15(5):70-76, 1995. M. Rydfalk. CANDIDE: A Parameterized Face. PhD thesis, Linkoping University, October 1987. M. Rosenbum, Y. Yacoob, and L. Davis. Human expression recognition from motion using a radial basis function network. IEEE Transactions on Neural Networks, 7(5):1121— 1138, 1996. A. Samal and P.A. Iyengar. Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recognition, 25(l):65-77, 1992. S.Y. Shin S.Y. Lee, K.Y. Chwa. Image metamorphosis using snakes and free-from deformations. In Proceedings of SIGGRAPH, pages 439-448, 1995. G. Sannier and N. Magnenat-Thalmann. A flexible texture fitting model for virtual clones. In IEEE Proceedings of Computer Graphics, 1997. A. Saulnier, M.L. Viaud, and D. Geldreich. Real-time facial analysis and synthesis chain. In International Workshop on Automatic Face and Gesture Recognition, pages 86-91, 1995. C. Tomasi and T. Kanade. Detection and tracking of point features. Tech­ nical report, Carnegie Mellon University, 1991. S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999. ISBN-0-12-686140-4. Y. Tian, T. Kanade, and J. Cohn. Recognizing lower face action units for facial expression analysis. In Proceedings of the 4th International Confer­ ence on Automatic Face and Gesture Recognition, pages 484-490, 2000. M. Turk and A. Pentland. Eigenfaces for recognition. Cognitive Neuro­ science, 3(l):71-86, 1991. D. Terzopoulos and R. Szeliski. Tracking with kalman snakes. In Active Vision, pages 3-20. MIT Press, 1993. X. Tu. Artificial Animals for Computer Animation: Biomechanics, Lo­ comotion, Perception, and Behavior. PhD thesis, University of Toronto, 1996. D. 
Terzopoulos and K. Waters. Analysis of facial images using physical and anatomical models. In International Conference on Computer Vision, pages 727-732, 1990. 130 with permission of the copyright owner. Further reproduction prohibited without permission. [TW91] [TW93] [UR91] [VY92] [VY93] [WA93] [Wat87a] [Wat87b] [Wil90] [Wol90] [WS92] [WT91] [WT93] [YB97] D. Terzopoulos and K. Waters. Techniques for realistic facial modeling and animation. In Computer Animation Proceedings, 1991. D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569-579, 1993. S. Ullman and R.Basri. Recognition by linear combination of models. IEEE PAMI, 13(10) :992-1007, 1991. M. Viaud and H. Yahia. Facial animation with wrinkles. In Proceedings of the Third Eurographics Workshop on Animation and Simulation, Septem­ ber 1992. M. Viaud and H. Yahia. Facial animation with muscle and wrinkle sim­ ulation. In Proceedings of the Second International Conference on Image Communication (IMAGECON), pages 117-121, 1993. J.Y.A. Wang and E. Adelson. Layered representation for motion analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, 1993. K. Waters. A muscle model for animating three-dimensional facial expres­ sion. Computer Graphics, 21 (4) :17— 24, 1987. K. Waters. A muscle model for animating three-dimensional facial expres­ sions. In Proceedings of SIGGRAPH, volume 21, pages 17-24, 1987. L. Williams. Performance-driven facial animation. In Proceedings of SIG­ GRAPH, pages 235-242, 1990. G. Wolberg. Digital Image Warping. IEEE Computer Society Press, Los Alamitos, CA, 1990. W J. Welsh and D. Shah. Facial feature image coding using principal com­ ponents. Electronics Letters, 28:2066-2067, 1992. K. Waters and D. Terzopoulos. Modeling and animating faces using scanned data. The Journal of Visualization and Computer Animation, 2:123-128, 1991. K. Waters and D. Terzopoulos. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on PAMI, 15(6), June 1993. L. Yin and A. Basu. Mpeg4 face modeling using fiducial points. In Proceed­ ings of the International Conference on Image Processing, pages 109-112, 1997. 131 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [YD94] Y. Yacoob and L. Davis. Recognizing facial expressions by spatio-temporal analysis. In Proceedings of the 12th International Conference on Computer Vision and Pattern Recognition, 1994. [YHC92] A. L. Yuille, P.W. Hallinan, and D.S. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):9 9 -lll, 1992. [Zha99] Z. Zhang. Feature based facial expression recognition: Sensitivity, analysis and experiments with a multilayer perceptron. International Journal of Pattern Recognition and Artificial Intelligence, 13(6):893-911, 1999. 132 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Appendix A G PR Results 133 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. □ □□□□Emil Figure A.l: GPR transformation applied to X = C R q. 134 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure A .2: GPR transformation applied to X = CR\. Reproduced with permission of the copyright owner. 
Figure A.3: GPR transformation applied to X = CR2.

Figure A.4: GPR transformation applied to X = CR3.

Figure A.5: GPR transformation applied to X = CR4.

Figure A.6: GPR transformation applied to X = CR5.

Figure A.7: GPR transformation applied to X = CR6.

Figure A.8: GPR transformation applied to X = CR7.

Appendix B
Expression Analysis

B.1 Introduction

The term expression implies a limited set of prototypical facial gestures corresponding to emotions such as the six universals defined by Paul Ekman: HAPPINESS, SADNESS, ANGER, FEAR, SURPRISE, and DISGUST. Though the majority of the facial expression analysis literature focuses on this set, it is evident that these expressions, in their purest forms, occur relatively infrequently in daily interactions. Most human-human interaction involves a myriad of subtle and less quantifiable gestures, sometimes derivatives of the six universals, but often expressing more complex internal sentiments. Expressions, no matter how subtle, are a communicative resource between humans, involuntarily reflecting our internal emotional and mental state.

Human-machine interaction, however, presents a different scenario when it comes to facial expressions, as the frequency of the six universals is diminished even further. This is primarily due to 1) the inability of, or lack of desire for, machines to evoke such emotions, and 2) the inability of machines to respond to the expressions, negating our need to employ this method of communication.

Though we may not be compelled to engage a machine in an emotional dialog, we cannot help but reflect our internal state on our face. Thus, should a machine alter this state, intentionally or unintentionally, it also has access to the change it has instigated. This appearance then presents a task-dependent window into the soul of the observed individual.

This chapter describes two methods for facial expression analysis utilizing principles from CR analysis. The first is a simple holistic method requiring representative full-face expressions for all desired expression classes. Limitations of this method are addressed, and an alternative based on CR gesture parameters is presented, demonstrating the flexibility of CR analysis.
Expressions also present a unique method for qualitative assessment of the CR analysis system, as incorrect parameter extraction will lead to errors in expression judgment.

B.2 Holistic Expression Classifier

As a preliminary experiment, the plausibility of classifying facial expressions with intensity using the entire face (holistic analysis) was tested. A training phase is employed that derives an expression signature basis from a set of training samples using independent component analysis (ICA), similar to our initial unstructured CR analysis methods. New image samples are transformed into the signature space and classified with respect to the training samples, deriving an expression state vector.

Figure B.1: Holistic facial analysis using five prototypical facial expressions (Angry, Surprised, Happy, Sad, and Afraid). Intensity frames are acquired from a neutral state. Face images are masked prior to analysis.

B.2.1 Expression Training Data Acquisition and Preprocessing

For each expression in figure B.1, a set of training images X_i = {x_0, x_1, ..., x_k} is acquired by prompting a subject to actuate expression e_i from a neutral face state to full actuation and extracting the successive video frames. Images are normalized with respect to 2D head pose variations by tracking a set of infra-red LEDs attached to a pair of eyeglass frames and warping each frame into a canonical reference frame, as described in Chapter 4. Background changes are masked out of the normalized images using a static face mask identified once for each subject in the normalized image coordinates, and the image is cropped to the mask bounding rectangle. The face mask concentrates the analysis on the salient portion of the face responsible for what humans generally perceive as expression. To correct for lighting variations that can occur between performance sessions, each masked image is convolved with a Laplacian kernel.

The resulting processed images are reshaped and assembled into a matrix of column vectors. ICA is performed on the set of expression samples S = [X_0 ... X_4], and the independent components are treated as a basis for the actuated expression space, similar to CR appearance samples. The expression signature of a given image x is computed as the normalized set of coefficients of its approximation by the independent components of S. For a given test face image x, the closest expression signature from each training set X_i (maximal dot product) is stored as the i-th element of the expression state vector v. The elements of X_i are ordered by their temporal progression from neutral to full actuation, and therefore v[i] is interpreted as the magnitude of the contribution of expression i to the test image.

Figure B.2: Holistic expression analysis results.
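A minimal sketch of the signature computation and matching just described is given below. It uses scikit-learn's FastICA as a stand-in for the ICA step and plain dot products for matching; the preprocessing (pose normalization, masking, Laplacian filtering) is assumed to have already produced the sample vectors, and all names, the component count, and the interpretation of the best-match index as an intensity score are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

def train_signature_space(samples, n_components=20):
    """samples: (n_train, n_pixels) preprocessed expression frames, all classes stacked.
    Returns the fitted ICA model and the normalized signatures of the training samples."""
    ica = FastICA(n_components=n_components, random_state=0, max_iter=1000)
    train_sig = ica.fit_transform(samples)              # coefficients in the IC basis
    train_sig /= np.linalg.norm(train_sig, axis=1, keepdims=True)
    return ica, train_sig

def expression_state_vector(x, ica, train_sig, class_slices):
    """Signature of test image x matched against each expression's training set.
    class_slices: list of slices selecting each expression's rows in train_sig."""
    sig = ica.transform(x.reshape(1, -1))[0]
    sig /= np.linalg.norm(sig)
    v = np.empty(len(class_slices))
    for i, sl in enumerate(class_slices):
        scores = train_sig[sl] @ sig                     # dot products with one class
        # Position of the best match within the temporally ordered set, scaled to
        # [0, 1], read as the intensity contribution of expression i.
        v[i] = scores.argmax() / max(len(scores) - 1, 1)
    return v
```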
B.2.2 Results

The system was tested with five expressions of emotion: happy, angry, surprised, sad, and afraid. Figure B.2 shows a plot of the expression state vectors over time for a video sequence progressing through SAD, ANGRY, SURPRISED, HAPPY, SAD, SURPRISED, SAD, AFRAID over 800 frames. Scores are normalized to the range [0,1]. Peak values represent the dominant expression. The graph accurately reflects the transitions between expression states as intended by the subject. In Appendix C, an interactive media-art installation is presented utilizing the expression analysis methods above.

B.3 Gesture Space Representation for Expressions

Requiring representative samples for all expressions is a significant limitation of the holistic classifier described above. Most non-actors will supply a prototypical expression in response to labels such as HAPPY and SAD that is exaggerated and does not generally correspond to the real-world manifestations of such emotions. This implies that we must evoke the corresponding emotion or otherwise prompt the subject to make a natural expression, which is very difficult, if not impossible, in some cases. Unlike in the CR case, where there is evidence suggesting that the CR basis spans a wide variety of facial configurations, there is no reason to believe this is the case with full-face expressions. Hence, defining unseen expressions in terms of an expression basis is likely to be error-prone. In addition, there exists a large body of art, psychology, and physiology literature dissecting the relationship between emotion and facial expression that cannot be exploited without having corresponding visual representatives for training.

By representing expressions with physiological parameters, an expression database can be tuned and populated without explicitly involving a human subject. As the gestures defining co-articulation regions are related to largely independent muscle groups, and muscle contractions are responsible for facial expressions, expressions can be modeled in terms of gesture activation levels. This allows expressions to be defined as clusters in gesture space and new expression images to be classified with respect to these clusters. CR analysis can be used to map an input facial image to a point in a 16-dimensional gesture space. Figure B.3 illustrates the proposed expression classification system. An expression cluster E_i is defined by the mean μ and variance σ of a set of representative expression vectors. A gesture actuation vector v belongs to the class with minimal Euclidean distance. Note that with more training data, we could exploit the statistics of each cluster and use a more suitable distance metric such as the Mahalanobis distance. More sophisticated classification methods can clearly be used as well.

Figure B.3: Expressions represented as clusters in CR space.
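The cluster rule above is easy to state in code. The sketch below assumes each expression is represented by a handful of 16-dimensional gesture actuation vectors; the diagonal-covariance Mahalanobis variant is included as the noted alternative, and the sample data and names are placeholders rather than measured gesture intensities.

```python
import numpy as np

class ExpressionCluster:
    def __init__(self, name, samples):
        samples = np.asarray(samples, dtype=float)     # (n, 16) gesture vectors
        self.name = name
        self.mean = samples.mean(axis=0)
        self.var = samples.var(axis=0) + 1e-6          # per-dimension variance

    def euclidean(self, v):
        return np.linalg.norm(v - self.mean)

    def mahalanobis(self, v):
        # Diagonal-covariance Mahalanobis distance; usable once enough
        # training vectors are available to estimate the variances.
        return np.sqrt(np.sum((v - self.mean) ** 2 / self.var))

def classify(v, clusters, metric="euclidean"):
    dist = {c.name: getattr(c, metric)(v) for c in clusters}
    return min(dist, key=dist.get)

# Placeholder clusters built from hypothetical 16-D gesture actuation vectors.
rng = np.random.default_rng(0)
clusters = [
    ExpressionCluster("happy", rng.normal(0.2, 0.05, (10, 16))),
    ExpressionCluster("sad",   rng.normal(0.6, 0.05, (10, 16))),
]
print(classify(rng.normal(0.2, 0.05, 16), clusters))   # -> "happy"
```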
Appendix C
Comfort Control Art Installation

C.1 Introduction

Comfort Control is an art installation that incorporates custom-designed facial expression recognition software into an interactive system for the exploration of psychological responses to imagery. The participant enters a 10'x10'x10' cube with a 1960s-style entertainment room interior and sits in a chair facing a large television screen. The participant is then locked into the chair using magnetic wrist and head restraints controlled by a computer. The computer engages the participant in a game where appropriate facial expressions must be made in response to displayed images to gain freedom. Comfort Control is a collaborative effort between Douglas Fidaleo, Ann Page, Brian Cooper, and Tomoyuki Isoyama, shown in the spring at the RAID Projects gallery in Los Angeles.

C.2 Artistic Statement

On an artistic and social level, the piece deals with the process through which humans move from freedom to captivity by building walls of comfort around them. Outside of society and technology's grasp, we are free in motion and will, but are constrained by the sometimes unpleasant tasks of securing shelter and food. To alleviate these constraints we look to technology and societal law, and grab elements of progress to build our personal wall of comfort. The comfort we gain is accompanied by a loss of personal freedom. As freedom is a basic human necessity, this loss normally incurs an element of pain. This discomfort, however, can be alleviated by pleasurable stimuli or simulated freedom.

Figure C.1: The Comfort Control art installation. Exterior and interior of the cube. J.P. conjures his best fear expression in response to an image of a pair of capybaras.

The process of securing comfort through loss of freedom is reversible, but to do so one must willingly take on pain and move to a state of discomfort, which counters human nature. This potential for reversal is embodied in the game aspect of the installation, whereby one understands his or her captivity and must undergo a degree of personal discomfort to gain the reward of freedom from the restraints.

Facial expressions are not only prime indicators of emotion, but also have the ability to evoke, enhance, and maintain one's feeling of the emotion. This emotional feedback concept provides the foundation for the nature of the pain imposed on the participant in our work. The participant is asked to conjure facial expressions that conflict with the nature of the presented images. The internal quarrel between the self-actuated emotion and the externally imposed emotional stimulus is designed to create an internal disturbance, and embody the pain needed to reverse the process. The degree of conflict progresses through three levels of the game.

C.3 Game Overview

Game Flow

1. Prep Phase: The usher locks the participant into a comfortable cushioned recliner using medical wrist restraints and a headpiece preventing head motion. The usher leaves and starts the training sequence.

2. Training Phase: The screen displays an image and prompts the participant to make a specific expression. The computer records the facial expression. This is repeated for five expressions: HAPPY, SAD, ANGRY, AFRAID, DISGUSTED.

3. System Learning Phase: The expression classifier is trained using the data acquired in the previous step.

4. Game Phase Level 1: A level 1 expression/image pair is randomly selected. The expression label is presented to the participant for 3 seconds. The image is then displayed and he/she is prompted to make a facial expression. A snapshot of the expression is recorded and analyzed to determine the accuracy of the supplied expression. The participant is informed if the expression is correct or incorrect. If the expression is correct, the level is increased and a single restraint is released.

            Image Description    True Emotion    Opposite Emotion
  Level 1   Yawning Cat          Happy           Sad
  Level 2   Adolf Hitler         Angry           Happy
  Level 3   Decapitated Body     Disgust         Happy

Table C.1: Example image/expression label pairs for each game level.
5. Game Phase Level 2: Same process as Level 1, except image and expression pairs are selected from level 2.

6. Game Phase Level 3: Image/expression pairs are selected from level 3.

7. Exit Phase: After three correct responses are made at level 3, the player wins, the head restraint is released, and the chair inclines. Alternatively, after a total of 15 incorrect responses the player is informed that they have lost and is released from the restraints.

Levels and Imagery Selection

Each level in the game is designed to make it increasingly difficult for the participant to produce the expected facial expression. The level of discomfort of the participant (and the game difficulty) is estimated by the amount of internal conflict arising from conjuring an expression in response to an image that should evoke an opposite emotional response. The chosen images and expected responses are tuned to create various levels of difficulty. Table C.1 shows example image/expression pairs for each level. The probability of getting a request for an opposite emotional response increases with the level of imagery. Four independent labelers of different age, sex, race, and profession labeled the trigger images with respect to emotional content and three levels of intensity. Images were chosen for use in the installation if all four labelers agreed on these labels.

C.3.1 Hardware

An infrared-sensitive NTSC video camera with an infrared band-pass filter was mounted above the screen to capture the participant's facial expressions. An infrared source was mounted below the camera to illuminate the face. An array of sensors was interfaced to the computer through an OOPic microcontroller. A contact sensor on the seat of the chair detected the presence of the participant. Two pinball tilt sensors were mounted to trigger when the chair was fully reclined and inclined. A contact sensor was mounted to the top of the chair back as a backup if the recline tilt sensor failed to throw. A final trigger was mounted outside of the cube for the usher to initiate the game sequence.

A relay board was constructed and interfaced to the OOPic to control the chair's motorized incline/recline mechanism and three electromagnetic restraints. The wrist manacles were built from medical wrist restraints with an iron plate attached. The arms of the chair were spring loaded and held down by the force of the magnet. When the magnet is turned off, the tension in the springs forces the plate and restraints to pop off, leaving the participant free to move his or her arms.

Figure C.2: Comfort Control system diagram.

C.3.2 Software

The software system coordinates the interaction between the video screen, camera input, and chair electronics. In addition to controlling the progression of the game, it synchronizes the expression image acquisition with a countdown sequence used to prompt the participant for an expression. The expression analysis module classifies expression images and returns the result to the main game engine. The hardware interface communicates via a serial connection to the OOPic to set relay states and read sensor states. The software runs on a 1.4 GHz Pentium 4 CPU with 512 MB RAM. A sketch of the game-phase control logic follows.
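The game logic in section C.3 amounts to a small state machine. The sketch below captures the win/lose bookkeeping (advance on a correct expression, win after three correct responses at level 3, lose after 15 total incorrect responses); the classifier call, the restraint-release hook, and the image lists are stubs standing in for the real camera and OOPic interfaces, and their names are assumptions.

```python
import random

LEVEL_PAIRS = {
    1: [("Yawning Cat", "SAD")],          # level-1 image / prompted expression pairs
    2: [("Adolf Hitler", "HAPPY")],       # (placeholder subset of the real image sets)
    3: [("Decapitated Body", "HAPPY")],
}

def classify_expression(image_label, prompted):
    """Stub for the classifier of section C.4; returns the detected expression label."""
    return prompted if random.random() < 0.7 else "NONE"

def release_restraint(name):
    """Stub for the serial/OOPic relay command that frees one restraint."""
    print(f"releasing {name}")

def run_game(max_errors=15, wins_needed_at_3=3):
    level, errors, correct_at_3 = 1, 0, 0
    wrist_restraints = ["left wrist", "right wrist"]
    while errors < max_errors:
        image, prompted = random.choice(LEVEL_PAIRS[level])
        detected = classify_expression(image, prompted)
        if detected == prompted:
            if level == 3:
                correct_at_3 += 1
                if correct_at_3 >= wins_needed_at_3:
                    release_restraint("head")              # win: free the head, incline chair
                    return "win"
            else:
                release_restraint(wrist_restraints.pop(0)) # one restraint per level advance
                level += 1
        else:
            errors += 1
    return "lose"

print(run_game())
```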
C.4 Expression Classification

The expression classification module takes as input an image of the participant's face and returns one of five predefined emotion classes: e ∈ {HAPPY, SAD, ANGRY, AFRAID, DISGUSTED}. If there is no match, the module returns NONE.

System Constraints

Following an initial setup phase, the installation runs without manual intervention. We would like the participant to remain focused on the experience throughout the game, and not on the underlying technology. This imposes several constraints that affect the design of the classifier:

• Minimize Boredom: If training or classification takes too long, participants will lose their connection to the piece.

• Minimize Confusion: Classification inaccuracies can have three effects. First, the user may second-guess their responses and thereby modify their expressions, resulting in poor performance and frustration. Second, they may believe their expressions to be accurate and hence conclude that the system must be inaccurate, distracting them from the focus of the piece. Third, the user may start to think there is a trick to the game, such as being required to supply an expression opposite to the one they are prompted to make.

• Instantaneous Feedback: As the participant is required to match the expression supplied during training as closely as possible, immediate feedback on the accuracy of their response should enhance their ability to recreate the expression.

These constraints limit our choices for training and classification methods. The methods developed attempt to balance the trade-offs between computational complexity and accuracy.

C.4.1 Training and Expression Classification

Our classification method involves a training phase that derives an expression signature basis from a set of training samples using ICA. New image samples are transformed into the signature space and classified with respect to the training images. The training data consists of images of the participant making the expressions in the training phase. Environmental lighting conditions and the subject's head pose (by virtue of the head restraint) are relatively constant throughout the game, so pose and lighting normalization is not necessary. Prior to the start of the game, the usher specifies a rectangle around the participant's face in the camera image, focusing the analysis on the face and eliminating irrelevant background pixels.

The training matrix S is constructed as follows. A set of training vectors is acquired by prompting the subject to generate each expression e_i defined above. Thirty samples of each expression are acquired, and the cropped image samples are converted to column vectors by horizontal scan. S is formed by concatenating all vectors in order of acquisition. If x_j^i is the j-th image acquired for expression e_i, then

S = \left[\; x^{0}_{1} \cdots x^{0}_{k} \;\Big|\; x^{1}_{1} \cdots x^{1}_{k} \;\Big|\; \cdots \;\Big|\; x^{4}_{1} \cdots x^{4}_{k} \;\right], \qquad k = 30.

By construction, the position of a sample in S determines the expression class to which it belongs.
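For concreteness, the sketch below walks through this pipeline with NumPy and scikit-learn: it builds S from cropped samples, applies the PCA reduction described in the next paragraph, derives an ICA signature basis, and classifies new frames by the nearest class mean in signature space. The capture callback, the nearest-mean decision rule, and the optional rejection threshold are illustrative assumptions rather than the installation's actual code; the holistic classification actually used is discussed in Appendix A.

# Minimal sketch of the training and classification pipeline, assuming NumPy and
# scikit-learn.  capture_cropped_face, the nearest-mean rule, and reject_dist are
# hypothetical; the installation's holistic classifier is described in Appendix A.
import numpy as np
from sklearn.decomposition import PCA, FastICA

EXPRESSIONS = ["HAPPY", "SAD", "ANGRY", "AFRAID", "DISGUSTED"]
SAMPLES_PER_EXPRESSION = 30  # k = 30 in the text

def build_training_matrix(capture_cropped_face):
    """Build S by concatenating cropped, vectorized samples in acquisition order.

    capture_cropped_face(expression) is a hypothetical callback returning one
    cropped face image (2-D array) of the participant making that expression.
    """
    columns, labels = [], []
    for i, expr in enumerate(EXPRESSIONS):
        for _ in range(SAMPLES_PER_EXPRESSION):
            img = capture_cropped_face(expr)
            columns.append(img.astype(np.float64).ravel())  # horizontal scan -> vector
            labels.append(i)                                # position encodes the class
    S = np.column_stack(columns)       # one sample per column, as in the equation above
    return S, np.asarray(labels)

def train_signature_model(S, labels, n_pca=10):
    """PCA to ten components (>99% variance on average), then an ICA signature basis."""
    X = S.T                            # scikit-learn expects samples as rows
    pca = PCA(n_components=n_pca).fit(X)
    ica = FastICA(n_components=n_pca, max_iter=1000)
    signatures = ica.fit_transform(pca.transform(X))   # training samples in signature space
    # Class means in signature space; nearest class mean is only a simple stand-in.
    means = np.array([signatures[labels == c].mean(axis=0)
                      for c in range(len(EXPRESSIONS))])
    return pca, ica, means

def classify(img, pca, ica, means, reject_dist=None):
    """Project a new cropped face (same crop size as training) and label it, or NONE."""
    s = ica.transform(pca.transform(img.astype(np.float64).ravel()[None, :]))[0]
    d = np.linalg.norm(means - s, axis=1)
    if reject_dist is not None and d.min() > reject_dist:
        return "NONE"
    return EXPRESSIONS[int(d.argmin())]

With only ten retained components, both training and per-frame classification remain computationally cheap, which is consistent with the boredom and feedback constraints listed above.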
To eliminate redundancies in the samples and reduce the number of expression basis vectors to be computed, PCA is performed on the data in S, retaining the eigenvectors corresponding to the ten largest eigenvalues, which account on average for more than 99% of the variance of the data. S subsequently refers to the reduced-dimensionality version of the data set. Expression classification is performed holistically as discussed in Appendix A.

C.5 Discussion

The expression classification system developed is accurate, assuming the user supplies the same type of expression as was provided during training. This presented an interesting challenge, as the participants were required to shift from a passive response mode to an active performance mode. At one level, one could argue that this detracts from the emotional impact, as performance in non-actors may involve intellectualizing the act. (How do I make a happy face?) However, due to the biomechanical emotional feedback effect, we postulated that the desired emotional conflict would in fact arise. This hypothesis was supported in most post-installation debriefings with participants.

These discussions also uncovered several unexpected results. The first surprising observation was the unpredictability and variability of people's responses to the images. For the higher-level images, some people reported having no problem making the expression, while others had to consciously disassociate themselves from the image and focus purely on the text or on an imaginary scenario that matched the prompted expression. A few participants reported having an easier time making opposite expressions at level 3, as the images were so intense that disassociation became easier.

Mirrors were provided for the participants to practice making facial expressions prior to entering the cube; however, few people actually took advantage of them. Hence, several people reported an unexpected difficulty in associating an expression with the text during the game.

Although both computers and media have a broad impact on society, multimedia research is often (and understandably) narrowly focused on the problem at hand. In Comfort Control we turned the technology back on the issues, using multimedia computation itself to enable exploration of related social and psychological issues.