DATA-DRIVEN FACIAL ANIMATION SYNTHESIS BY LEARNING FROM FACIAL MOTION CAPTURE DATA

by Zhigang Deng

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2006

Copyright 2006 Zhigang Deng

UMI Number: 3237134. UMI Microform copyright 2007 by ProQuest Information and Learning Company. All rights reserved.

Dedication

To my grandparents, parents, and sisters in China.

Acknowledgments

While working towards my Ph.D. degree at the University of Southern California, I have been very fortunate to be surrounded by a group of wonderful people, whose contributions and help I would like to acknowledge.

First of all, I would like to express my deepest gratitude to my doctoral advisor, Ulrich Neumann, for his extraordinary guidance during my five-year study at USC. From the insightful discussions that have always kept me on the right track in my research, to the thorough advice on finding the topic and refining my dissertation, he has continuously been a great source of inspiration, energy, and invaluable knowledge. For these, and for many other reasons that helped complete this dissertation and turn my Ph.D. dream into a reality, I will always be indebted to him.

Many thanks to the other members of my Qualifying and Dissertation Defense Committee: J.P. Lewis, Shrikanth Narayanan, Isaac Cohen, Scott Fisher, Mathieu Desbrun, Fred Pighin, and Yizhou Yu, for their thorough and valuable comments that helped improve my dissertation. In particular, I would like to express my heartfelt thanks to J.P. Lewis for his selfless help and cooperation since he arrived at USC, from the early days of learning statistical learning to discussing innovative facial animation research ideas. His insightful knowledge of the visual effects industry greatly helped me define my career goal. Great thanks also to Shrikanth Narayanan and his emotion research team: Carlos Busso, Murtaza Bulut, Chul Min Lee, and Sungbok Lee, for wonderful cooperation on audio-visual speech processing and multi-modal human-computer interface research efforts. That cooperation constitutes a memorable chapter of my USC study.

I would like to thank my labmates in the Computer Graphics and Immersive Technologies Lab for the friendly and stimulating atmosphere that made the past years in the lab a wonderful experience. Special thanks to Suya You for continuous help and support, and to Jun-yong Noh, Tae-yong Kim, Douglas Fidaleo, and Clint Chua for introducing me to the computer animation field at my early study stage at USC. Many thanks to Zhenyao Mo, Jinhui Hu, Lu Wang, Kevin Chuang, Ismail Oner Sebe, Lin Ma, and Tae-hyun Rhee, and other officemates, including Xiaodan Wu and Cheng Zhi Anna Huang, for always being there to help, for the great conversations, and for the numerous weekends and nights spent together in the lab.

I also owe many thanks to the smart undergraduate research students who worked with me through the USC Undergraduate Research Program (URP): to Albin Cheenath, Pamela Fox, and Shawn Drost for face modeling work, and to Kimy Tran, Erica Murphey, and Stephanie Parker for human emotion study experiments. I also want to thank Ashlyn Buck and Nathan Sage for phoneme-alignment work, and Hiroki Itokazu and Bret St. Clair for their great face modeling work.
I have saved the last of my acknowledgments for the people whom I can never hope to adequately thank: my family. My love and gratitude to my grandparents, my parents, and my two sisters in China, for the many sacrifices they made so that I could achieve the best in my life. I always feel proud of being in such a great family, and their encouragement will motivate me to reach the next level of excellence in my career path.

Table of Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Problem Statement
  1.2 My Solution

2 Data Capture and Processing
  2.1 Facial Motion Capture
  2.2 Data Preprocessing
    2.2.1 Marker Labeling and Cleaning
    2.2.2 Phoneme-Alignment
    2.2.3 Head Motion Extraction
    2.2.4 Eye Motion Extraction

3 Realistic Eye Motion Synthesis
  3.1 Introduction
  3.2 Previous Eye Motion Work
  3.3 Eye Motion Synthesis
  3.4 Patch Size Selection
  3.5 Results and Evaluations
  3.6 Conclusions and Discussion

4 Natural Head Motion Synthesis
  4.1 Introduction
  4.2 Previous Head Motion Work
  4.3 Sample-based Head Motion Synthesis
    4.3.1 Head Motion Synthesis
    4.3.2 Search for the Weights
    4.3.3 Results and Applications
    4.3.4 Conclusions and Discussion
  4.4 Model-based Head Motion Synthesis
    4.4.1 Modeling Head Motion
    4.4.2 Head Motion Synthesis
    4.4.3 Results and Discussion
    4.4.4 Conclusions

5 Expressive Speech Animation Synthesis
  5.1 Speech Animation Related Work
    5.1.1 Expressive Facial Animation
    5.1.2 Speech Animation
  5.2 Model-based Approach
    5.2.1 Learning Speech Co-Articulation
    5.2.2 Construct Expression Eigen-Spaces
    5.2.3 Expressive Speech Animation Synthesis
    5.2.4 Results and Evaluations
    5.2.5 Conclusions and Discussion
  5.3 Sample-based Approach (eFASE)
    5.3.1 Facial Database Processing
    5.3.2 Expressive Phoneme-Isomaps
    5.3.3 Phoneme Motion Editing
    5.3.4 Speech Motion Synthesis
    5.3.5 Implementation Issues
    5.3.6 Results and Evaluations
    5.3.7 Conclusions and Discussion

6 Conclusions and Discussion
  6.1 Modeling "the Brain" of Facial Motion
  6.2 Linguistically Augmented Facial Animation
  6.3 Neck and Throat Animation
  6.4 Personified Facial Animation

Bibliography

List of Figures

1.1 Feature points defined in MPEG-4. FAPs are defined by motion of these feature points.
1.2 The two-step pipeline of the facial animation synthesis system. The top row is the learning stage; the bottom row is the synthesis stage.
1.3 The divide-and-conquer diagram of the facial animation synthesis system. Basic motions are illustrated in yellow ellipses.
2.1 A VICON motion capture system (left and middle) is used for this work. The right panel shows the facial marker layout used.
2.2 A snapshot of the interface of the VICON postprocessing package.
2.3 The left panel shows the numbering of the used facial markers. In the right panel, blue and red points together illustrate the captured markers, and red points illustrate the used markers.
2.4 Phoneme-alignment result for a recorded sentence, "that dress looks like it comes from asia" (using the WaveSurfer software [wav]).
2.5 The captured subject for eye motion capture.
2.6 The Y-coordinate motion of captured left and right eye blinks (green for left and red for right).
2.7 The labeled eye gaze signals.
2.8 Eye direction training signals are digitized with a custom GUI.
3.1 Signal synthesis with non-parametric sampling schemes. Regions in a sample signal (top) that are similar to the neighborhood of the signal being synthesized (bottom) are identified. One such region is randomly chosen, and new samples are copied from it to the synthesized signal.
3.2 Eye blink and gaze-x signals are plotted simultaneously.
3.3 Eye "blink" data (bottom) and a synthesized signal with the same autocovariance (top). A simple statistical model cannot reproduce the correlation and asymmetry characteristics evident in the data.
3.4 Synthesized eye blink motion (blue) vs. eye blink sample (red dots). The x-axis represents the frame number and the y-axis represents the openness of the eyelids.
3.5 The top is the synthesized eye gaze-x trajectory (blue) vs. the eye gaze-x sample (red). The bottom is the synthesized eye gaze-y trajectory (blue) vs. the eye gaze-y sample (red).
3.6 The top is the histogram of time intervals of "large eye movements". The bottom is the accumulated covered percentage vs. the time interval limit. When the time interval limit is 20, the covered percentage is 55.68%.
3.7 Some frames of synthesized eye motion.
3.8 One-way ANOVA (ANalysis Of VAriance) results of the evaluations. The p-value is 2.3977e-10.
4.1 Audio-Headmotion Database (AHD). Each entry in this database is composed of two parts: an AF-COEF (four-dimensional audio feature PCA coefficients) and a T-VEC (a six-dimensional head motion transformation vector).
4.2 Overview of the sample-based head motion synthesis pipeline. The first step is to project audio features onto the audio feature PCA space, the second step is to find K nearest neighbors in the AHD, and the third step is to solve for the optimal combination by dynamic programming.
4.3 The search space of the weights (W_a, W_n, W_r).
4.4 Plot of the search result after 20 initial weights are used (K=7). The global minimum is the red point, corresponding to the weights W_a = 0.31755, W_n = 0.15782, and W_r = 0.52463.
4.5 Plot of minimum TTE versus K. For each K, 20 runs of non-sequential random search are used.
4.6 Comparison of ground-truth head motion (red solid curve) and the synthesized head motion (dashed blue curve), when the subject says, with neutral speech, "Do you have an aversion to that?". Note that the motion tendency at most places is similar.
4.7 Some frames of a synthesized head motion sequence, driven by the recorded speech "By day and night he wrongs me; every hour He flashes into one gross crime or other..." from a Shakespeare play.
4.8 Head poses using Euler angles.
4.9 2D projection of Voronoi regions using 32-size vector quantization.
4.10 The HMM model-based head motion synthesis framework.
4.11 Spherical cubic interpolation.
4.12 Synthesized head motion, front view.
4.13 Synthesized head motion, side view.
5.1 Schematic overview of the model-based expressive facial animation synthesis system. It includes three main stages: recording, modeling, and synthesis.
5.2 Red markers are used for learning speech co-articulation.
5.3 Phoneme samples and a co-articulation area for phonemes /ah/ and /k/.
5.4 An example of diphone co-articulation functions (for a phoneme pair /ae/ and /k/: F_s(t) and F_e(t)).
5.5 Another example of diphone co-articulation functions (for a phoneme pair /ae/ and /p/: F_s(t) and F_e(t)).
5.6 An example of triphone co-articulation functions. It illustrates triphone co-articulation functions for the triphone (/ey/, /f/, and /t/): F_s(t), F_m(t), and F_e(t) (from left to right).
5.7 Another example of triphone co-articulation functions. It illustrates the co-articulation weighting functions of the triphone (/ey/, /s/, and /t/): F_s(t), F_m(t), and F_e(t) (from left to right).
5.8 Fitting error with respect to the degree of fitted co-articulation curves (red is for λ = 0, green is for λ = 0.00001, blue is for λ = 0.00005, and black is for λ = 0.0001).
5.9 Phoneme-based time-warping for the Y position of a particular marker. Although the phoneme timings are different, the warped motion (black) is strictly frame-aligned with the neutral data (red).
5.10 Extracted phoneme-independent angry motion signal from Fig. 5.9.
5.11 Plot of three expression signals on the PIEES. It shows that sad signals and angry signals overlap in some places.
5.12 Plot of two expression sequences in the PIEES. It shows that expression is just a continuous curve in the PIEES.
5.13 The junctures of adjacent diphones and triphones. The overlapping part (the semitransparent part) in the juncture of two triphones (left panel) needs to be smoothed. Note that there is another diphone-triphone configuration, similar to the middle triphone-diphone panel.
5.14 Schematic overview of mapping marker motions to blendshape weights. It is composed of four stages: data capture, creation of mocap-video pairs, creation of mocap-weight pairs, and RBF regression.
5.15 The used blendshape face model. The left panel shows the smooth shaded model and the right panel shows the rendered model.
5.16 Some frames of synthesized happy facial animation.
5.17 Some frames of synthesized angry facial animation.
5.18 Comparisons between ground-truth marker motion and synthesized motion. The red line denotes ground-truth motion and the blue denotes synthesized motion. The top illustrates marker trajectory comparisons, and the bottom illustrates velocity comparisons. Note that the sampling frequency here is 120 Hz. The phrase is "...Explosion in information technology...".
5.19 Overview of the eFASE pipeline. At the top, given novel phoneme-aligned speech and specified constraints, this system searches for best-matched motion nodes in the facial motion database and synthesizes expressive facial animation. The bottom illustrates how users specify motion-node constraints and emotions with respect to the speech timeline.
5.20 To construct a specific /w/ phoneme cluster, all expressive motion capture frames corresponding to /w/ phonemes are collected, and the Isomap embedding generates a 2D expressive Phoneme-Isomap. Colored blocks in the figure are motion nodes.
5.21 Schematic illustration of the organization of the processed motion node database. Here solid directional lines indicate predecessor/successor relations between motion nodes, and dashed directional lines indicate possible transitions from one motion node to another. The colors of motion nodes represent different emotion categories of the motion nodes.
5.22 Comparisons between 2D Phoneme-PCA maps and 2D Phoneme-Isomaps. The left panels are 2D Phoneme-PCA maps for /aa/ (top) and /y/ (bottom), and the right panels are 2D Phoneme-Isomaps for /aa/ (top) and /y/ (bottom). In all four panels, black is for neutral, red for angry, green for sad, and blue for happy. Note that some points may overlap in these plots.
5.23 A 2D expressive phoneme-Isomap for the phoneme /ay/. Here each point in the map corresponds to a specific 3D facial configuration. Note that gray is for neutral, red for angry, green for sad, and blue for happy.
5.24 Snapshots of motion editing for the phoneme /ay/. Here each trajectory (curve) represents one motion node and image color represents emotion category.
5.25 Illustration of how to specify a motion-node constraint via the phoneme-Isomap interface. When users want to specify a specific motion node for expressing a particular phoneme utterance, its corresponding phoneme-Isomaps are automatically loaded. Then, users can interact with the system to specify a motion-node constraint for this constrained phoneme.
5.26 A snapshot of phoneme-Isomap highlights for specifying motion-node constraints.
5.27 3D face model used in the eFASE system. The left panel is a wireframe representation and the right panel is a textured rendering (with eyeballs and teeth).
5.28 Feature point based face deformation.
The left shows that the motion of every vertex (green) is the summation of motion propagations of neighboring markers (red). The right shows that the propagation distance between two vertices is the shortest path (red) along edges, not the simple Euclidean distance (green).
5.29 A snapshot of the running eFASE system. The left is a basic control panel, and the right panel encloses four working windows: a synthesized motion window (top-left), a video playback window (top-right), a phoneme-Isomap interaction window (bottom-left), and a face preview window (bottom-right).
5.30 A part of the trajectory of marker #48 for the sad sentence "Please go on, because Jeff's father has no idea of how the things became so horrible." The red dashed line is the ground-truth trajectory and the blue solid line is the synthesized trajectory.
5.31 A part of the trajectory of marker #79 for the sad sentence "Please go on, because Jeff's father has no idea of how the things became so horrible." The red dashed line is the ground-truth trajectory and the blue solid line is the synthesized trajectory.

List of Tables

1.1 FAP groups in MPEG-4.
4.1 Results for different HMM configurations.
5.1 Speech motion synthesis algorithm.
5.2 The complete expressive facial animation synthesis algorithm.
5.3 Examples of an aligned phoneme input file (left) and an emotion modifier file (right). Its phrase is "I am not happy...". Here the emotion for the first 2.6 seconds is angry, and the emotion from 2.6 seconds to 16.6383 seconds is sadness.
5.4 Running time of synthesis of some example phrases. Here the computer used is a Dell Dimension 4550 PC (Windows XP, 1 GB memory, Intel 2.66 GHz processor).

Abstract

Synthesizing realistic facial animation remains one of the most challenging topics in the graphics community because of the complexity of deformation of a moving face and our inherent sensitivity to the subtleties of human facial motion. The central goal of this dissertation is data-driven facial animation synthesis that captures the dynamics, naturalness, and personality of facial motion while human subjects are speaking with emotions. The solution is to synthesize realistic 3D talking faces by learning from facial motion capture data. This dissertation addresses three critical parts of realistic talking face synthesis: realistic eye motion synthesis, natural head motion synthesis, and expressive speech animation synthesis.

A texture-synthesis based approach is presented to simultaneously synthesize realistic eye gaze and blink motion, accounting for any possible correlations between the two. The quality of the statistical modeling and the introduction of gaze-eyelid coupling are improvements over previous work, and the synthesized eye results are hard to distinguish from actual captured eye motion.
Two different approaches (sample-based and model-based) are presented to synthesize appropriate head motion. Based on the aligned training pairs between audio features and head motion, the sample-based approach uses a K-nearest-neighbors-based dynamic programming algorithm to search for the optimal head motion samples given novel speech input. The model-based approach uses Hidden Markov Models (HMMs) to synthesize natural head motion. HMMs are trained to capture the temporal relation between the acoustic prosodic features and head motion.

This dissertation also presents two different approaches (model-based and sample-based) to generate novel expressive speech animation given new speech input. The model-based synthesis approach accurately learns speech co-articulation models and expression eigen-spaces from facial motion data, and then it synthesizes novel expressive speech animations by applying these generative co-articulation models and sampling from the constructed expression eigen-spaces. The sample-based synthesis system (eFASE) automatically generates expressive speech animation by concatenating captured facial motion frames while animators establish constraints and goals (novel phoneme-aligned speech input and its emotion modifiers). Users can also edit the processed facial motion database via a novel phoneme-Isomap interface.

Chapter 1
Introduction

1.1 Problem Statement

People expect to communicate with computers in a natural way, beyond the traditional keyboard-mouse interaction. As such, talking faces offer an alternative for natural human-computer interaction. Computer facial animation, in particular talking faces, has various applications in many fields. In the entertainment industry, realistic virtual humans with facial expressions have been increasingly used in current film production, for example in The Lord of the Rings. In the education field, interactive talking faces may be more effective at grabbing and directing the attention of learners than prerecorded video. Talking faces can also enhance customer relation management systems, such as various automated check-in kiosks and ATMs [Cos02]. In these scenarios, facial animations not only make the interaction between users and machines more fun, but also provide a friendly interface and help to attract users. In the telecommunication field, facial animation technologies can be used for transmitting facial images over low-bandwidth channels [Pan02]. Other areas where facial animation technologies can be applied include medicine, newscasting, advertising, psychology, and so forth [PBV].

To meet these increasing facial animation applications, facial animation is supported in the MPEG-4 standard, an object-based multimedia compression standard. MPEG-4 specifies and animates 3D face models by defining Face Definition Parameters (FDP) and Facial Animation Parameters (FAP) [Ost98]. In order to define FAPs for arbitrary face models, MPEG-4 defines Face Animation Parameter Units (FAPU) that are used to scale FAPs for any face model. Fractions of distances between key facial features, for example the distance between the two eyes, are defined as FAPUs. 84 Feature Points (FP) are specified and grouped on the neutral face in MPEG-4 (Figure 1.1). After excluding the feature points that are not affected by FAPs, 68 FAPs are categorized into 10 groups (Table 1.1).

Table 1.1: FAP groups in MPEG-4.

  Group                                               Number of FAPs
  1. Visemes and expressions                           2
  2. Jaw, chin, inner lower lip, corner lip, mid lip  16
  3. Eyeballs, pupils, eyelids                        12
  4. Eyebrow                                           8
  5. Cheeks                                            4
  6. Tongue                                            5
  7. Head motion                                       3
  8. Outer-lip position                               10
  9. Nose                                              4
  10. Ears                                             4
FAPs in groups 2 to 10 are low-level parameters, since they precisely specify how much a given FP should be moved. On the other hand, FAP group 1 (visemes and expressions) contains high-level parameters, because these parameters are not precisely specified; for example, textual descriptions are used to describe expressions. As such, reconstructing facial animation depends on the implementation of decoder programs.

MPEG-4 only specifies the transmission of facial animation over networks, and most of the existing MPEG-4 research focuses on decoder programs [KGT00, AP99, LP99, Pan02]. Kshirsagar et al. [KGT00] present a deformation technique to deform arbitrary 3D face models upon receiving MPEG-4 FAP sequences. [AP99, LP99, Pan02] present MPEG-4 decoder systems that are used to reconstruct arbitrary 3D face models and interpret FAPs in real time.

Figure 1.1: Feature points defined in MPEG-4. FAPs are defined by motion of these feature points.

Previously, various 3D face modeling techniques have been presented [LTW95, PHL+98, GGW+98, ZLA+01, Fua00, Jon01]. For example, cyberware scanners are used to obtain depth and texture information [LTW95], and computer vision techniques that take images from multiple views [PHL+98, GGW+98, BL03] are used to reconstruct 3D face models. Methods based on single video [Fua00, ZLA+01] or volumetric data [Jon01] have also been presented.

Since Parke's seminal facial animation work [Par72], many facial animation techniques have been developed. An overview of various facial animation techniques can be found in the well-known facial animation book [PW96]. Physics-based approaches [PB81, Wat87, LTW93, LTW95, UGO98, ZPS01, KHS01, SNF05] animate faces by modeling the contraction/relaxation of the muscles in the human facial anatomy. For example, based on the Facial Action Coding System (FACS) developed by Ekman and Friesen [EF75], Platt and Badler [PB81] present a FACS-tension network to generate facial animation. In their approach, muscles are used to connect the surface of a face mesh (skin layer) and the inner bone. Waters [Wat87] treats muscles as special geometric deformers for simulating the effects of real muscles, and two types of muscles (linear and sphincter) are supported in the system. Lee et al. [LTW95] present a physics-based approach where a multiple-layer dynamic skin and muscle model is used to deform 3D surfaces and spring systems are used to interconnect different layers. The limitation of the above physics-based approaches [PB81, ZPS01, UGO98, LTW93, Wat87, LTW95] is that it is difficult to solve for desired muscle parameter values because of the complexity of real human facial muscles. Sifakis et al. [SNF05] use nonlinear finite element methods to determine accurate muscle actuations from the motions of sparse facial markers. In their work, an anatomically accurate model of facial musculature, passive tissue, and underlying skeleton structure is used. Their work demonstrated the success of this technique on anatomically accurate face models, but it is not clear whether this approach can be extended to handle general face models that are not anatomically accurate.

Deformation-based approaches [KMTT92, SF98, LCF00] use various deformation techniques to animate faces. By assigning weights to the control points of a Free Form Deformation (FFD) lattice, Kalra et al. [KMTT92] present a rational free form deformation technique that deforms only the region of interest on the face mesh for the purpose of simulating facial muscle actions.
Singh and Fiume [SF98] present a geometric deformation technique suitable for general deformation (including faces), for example generating wrinkles and creases on 3D faces. In their model, wires and domain curves are used to intuitively control the deformation. Lewis et al. [LCF00] present a pose space deformation technique that applies radial basis functions to general character animation, including facial animation. These deformation techniques [KMTT92, SF98, LCF00] provide convenient controls for animators, but it is impossible to automate these synthesis procedures without considerable effort.

Performance-driven facial animation techniques [Wil90, CK01, NN01, CB02, NFN02, CXH03, BPL+03, VBPP05, DCFN06] capture and analyze the facial performance of subjects and retarget it to various 3D face models, such as 3D face meshes [Wil90, CXH03], blendshape models [CB02, DCFN06], and muscle-actuation-inspired blendshape bases [CK01]. Emotion analysis can be done as a preprocessing step before retargeting the motion onto 3D face models [NFN02]. Noh and Neumann [NN01] present an "expression cloning" technique to transfer existing facial motion onto different 3D face models automatically. Pyun et al. [PKC+03] further extended the expression cloning technique to different face models. Vlasic et al. [VBPP05] present a method to map video-recorded performances of subjects to 3D face models or 2D faces of other subjects, based on a learned multilinear model. Deng et al. [DCFN06] present a semi-automatic approach to directly map facial motion capture data to general blendshape face models. These approaches can generate realistic facial animation, because the facial motion is extracted from real performances and played back on 3D face models. However, without generative models, it is difficult for these approaches to synthesize novel facial motion.

To acquire high-quality facial motion data, motion capture systems are popularly used for capturing facial motion [BPL+03, Sco03]. The motion acquired in this way is realistic, but there are several limitations: time-consuming off-line data processing prevents its usage for real-time applications, and the acquired data are exclusively used for planned scenarios. To make the captured motion data reusable, various data-driven synthesis and editing approaches [BCS97, Bra99, Cos02, EGP02, KT03, CFKP04] have been presented. These data-driven approaches can be approximately divided into two categories: sample-based and model-based. Sample-based synthesis approaches [Cos02, BCS97, KT03, CFKP04] directly recombine recorded motion samples to synthesize novel motion, while model-based approaches [Bra99, EGP02] learn generative models from recorded (training) data and then synthesize novel motion from the generative models.

Given novel audio input, Bregler et al. [BCS97] synthesize new speech animation by recombining existing triphone video segments extracted from training video footage. [Cos02, CFKP04] present sample-based talking faces, similar to [BCS97]. Both approaches search for motion samples in the pre-recorded database and re-order them, but the searching strategies and targets are different. "Triphone segments" are treated as basic units in [BCS97], while longer samples (crossing more than three phonemes) are also candidates in [Cos02, CFKP04]. Instead of constructing a phoneme-related facial motion database [Cos02, BCS97], Kshirsagar and Thalmann [KT03] present a syllable-motion database based approach to synthesize novel speech animation.
In their approach, phoneme motion sequences are further grouped into syllable-motion sequences, and then speech motion is synthesized by concatenating samples from the captured syllable-motion database. Realistic synthesized animation is achieved by the above sample-based approaches [BCS97, Cos02, KT03, CFKP04], but the major limitation comes from the requirement of a large facial motion database: these approaches need large training data sets to construct the system. Additionally, expressions, eye motion, and head motion cannot be synthesized by these approaches.

To decrease or remove the large facial motion database requirement, model-based approaches [Bra99, EGP02] learn generative models from limited training data. After the generative models are learned, novel motion is synthesized without retaining the original motion database. For example, Brand [Bra99] learns an HMM-based facial motion control model with an entropy minimization algorithm from example voice and video data and then effectively generates full facial motion for a new audio track. Successfully driving a deformable face mesh is demonstrated in [Bra99], but whether this method can be successfully applied to produce realistic facial animation has not been verified yet. Ezzat et al. [EGP02] present a Multidimensional Morphable Model (MMM) that requires a limited set of mouth image prototypes and effectively synthesizes new speech animation given a novel audio track. Compared with [BCS97], it needs only limited training data. However, these approaches [Bra99, EGP02] learn the audio-motion mapping as a monolithic system; as such, it is difficult for animators to conveniently control different facial parts separately, such as speech and expression.

1.2 My Solution

The central goal of this dissertation work is data-driven facial animation synthesis that captures the dynamics, naturalness, and personality of the facial motion of talking faces. To achieve this goal, my solution is to synthesize expressive 3D talking faces by learning from facial motion capture data. Specifically, a hybrid strategy mixing machine learning and divide-and-conquer is used.

3D talking faces are chosen in this work due to their broad applications, e.g. computer games and films, and because the visual quality of synthetic 3D faces is catching up to real images of human faces thanks to advanced 3D face modeling and rendering techniques. Machine-learning techniques are used to "learn models" from high-quality facial motion capture data. The overall pipeline can be decomposed into two stages: a learning stage and a synthesis stage. In the learning stage, Facial Motion Capture Data (FM-CD) are fed into learning algorithms that build the Facial Motion Knowledge Base (FM-KB); then, in the synthesis stage, given novel speech/text input as well as interactive user controls, corresponding expressive talking faces are synthesized based on the constructed FM-KB. Figure 1.2 illustrates this two-step pipeline.

Figure 1.2: The two-step pipeline of the facial animation synthesis system. The top row is the learning stage; the bottom row is the synthesis stage.

Because of the complexity of deformation of a moving human face, it is extremely difficult to find a unified solution for the whole moving face. Thus, in this work a divide-and-conquer strategy is adopted to partition the whole face into different parts and animate these parts separately. First, the system is divided into two parts: Strongly Speech-Related Motion (SSRM) and Weakly Speech-Related Motion (WSRM). SSRM is strongly correlated with the speech content. An obvious example of SSRM is mouth movement (visual speech).
On the contrary, WSRM is weakly correlated with or even approximately independent of the speech content. For example, although it is observed in the psychological literature [TSKES95, Ray98, MSL98, GB00, Gri01] that eye movements reflect event comprehension and sentence formulation, because people tend to look at what they are thinking about, there is no strong evidence for a deterministic linkage between spoken text and eye movements. Thus, I argue that eye motion is weakly correlated with the speech content. SSRM and WSRM are further divided into small Basic Motions (BM), such as mouth movement, eye motion, and so on. Figure 1.3 illustrates this divide-and-conquer diagram of the talking face synthesis system.

It is arguable that partitioning the whole moving face in this way breaks the connections among these basic motions. I argue that the simplified divide-and-conquer model in this work is useful for approximately modeling real talking faces with tractable computing cost. In a complete model, a "virtual brain" would coordinate the motions of the different facial parts.

Two different synthesis approaches (sample-based and model-based) are used in the system. If the synthesis system is viewed as a whole, a hybrid of the two types of synthesis approaches is used. For example, the sample-based approach is used to synthesize eye motion. For head motion and speech motion synthesis, both sample-based and model-based approaches are presented.

The facial animation synthesis system presented in this work provides more flexible controls and greater openness for users than the monolithic systems [Cos02, BCS97, Bra99, EGP02, KT03]. For each basic motion synthesis module, local controls are provided for tuning synthesis. For example, users can control the diversity of synthesized eye motion by adjusting the threshold value and patch size used in the eye motion synthesis algorithm (Chapter 3). On the other hand, since this system is hierarchically organized, users can conveniently replace basic motion modules (algorithms) without affecting others.

Figure 1.3: The divide-and-conquer diagram of the facial animation synthesis system. Basic motions are illustrated in yellow ellipses.

The remaining chapters of this dissertation are organized as follows. Chapter 2 describes how the facial motion data are captured and preprocessed. The preprocessing includes phoneme-alignment on the recorded speech using the Festival speech recognition system, head motion removal from the motion capture data using a statistical analysis method, and extracting eye motion signals (blink and eye gaze movement) from the data.

In Chapter 3, an automated eye motion synthesis algorithm [DLN03, DLN05a] is described. A texture synthesis based approach is presented to simultaneously synthesize realistic eye gaze and blink motion, accounting for any possible correlations between the two. The quality of the statistical modeling and the introduction of gaze-eyelid coupling are improvements over previous work, and the synthesized eye results are hard to distinguish from actual captured motion.

In Chapter 4, two different approaches (sample-based [DBNN04b] and model-based [BDNN05]) are presented to synthesize appropriate head motion. Based on the aligned training pairs between audio features and head motion (audio-headmotion), the sample-based approach [DBNN04b] uses a K-nearest-neighbors-based dynamic programming algorithm to search for the optimal audio-headmotion samples given novel speech input. The model-based approach [BDNN05] uses Hidden Markov Models (HMMs) to synthesize natural head motions. HMMs are trained to capture the temporal relation between the acoustic prosodic features and head motions.
The results show that the synthesized head motions follow the temporal dynamic behavior of real head motion.

In Chapter 5, two different approaches (model-based [DLN05b, DBNN04a, DNL+ar] and sample-based [DN06]) are presented to generate novel expressive speech animation given new speech/text input. The model-based expressive speech animation synthesis approach [DLN05b, DBNN04a, DNL+ar] accurately learns speech co-articulation models and expression eigen-spaces from facial motion data, and then it synthesizes novel expressive speech animations by applying these generative co-articulation models and sampling from the constructed expression eigen-spaces. The sample-based expressive facial animation synthesis system (eFASE) [DN06] automatically generates expressive speech animation by concatenating captured facial motion data while animators establish constraints and goals (novel phoneme-aligned speech input and its emotion modifiers). Users optionally specify "hard constraints" (motion-node constraints for expressing phoneme utterances) and "soft constraints" (emotion modifiers) to guide the search process. Users can also edit the processed facial motion database by inserting and deleting motion nodes via a novel phoneme-Isomap interface. Novel facial animation synthesis experiments and objective marker trajectory comparisons between synthesized facial motion and captured motion demonstrate that this system is effective for producing realistic expressive facial (speech) animations.

In Chapter 6, conclusions and future work are described.

Chapter 2
Data Capture and Processing

This chapter describes how the facial motion data were captured and how the captured data were preprocessed for novel facial animation synthesis. The preprocessing includes phoneme-alignment on the recorded speech using the Festival speech recognition system, head motion removal from the motion capture data using a statistical analysis method, and extracting eye motion signals (blink and eye gaze movement) from the data.

2.1 Facial Motion Capture

To ensure the realism of the synthesized facial animation, high-quality facial motion data were captured with a VICON motion capture system [vic] and camera rigs. The motion capture system captures facial motion at a 120 Hz sampling frequency (120 samples/second). With 102 markers on her face (the marker layout in the right panel of Figure 2.1), an actress was directed to speak a customized phoneme-balanced corpus (more than 250 sentences) four times. Each time the same sentences were spoken with a different emotion. (The recorded sentences for the different emotions are slightly different because the actress had difficulty speaking some sentences with certain emotions.) A total of four emotions (neutral, happiness, anger, and sadness) are considered in this work. The actress was asked to speak the corpus with the full specific emotion all the time. The motion of these facial markers, video, and the accompanying audio/speech were recorded simultaneously. The recorded data include more than 105,000 motion capture frames (approximately 135 minutes of recording time). Figure 2.1 illustrates the VICON motion capture system used.

Figure 2.1: A VICON motion capture system (left and middle) is used for this work. The right panel shows the facial marker layout used.

2.2 Data Preprocessing

The captured facial motion data cannot be directly used for novel facial animation synthesis. To make the data aligned and reusable, a preprocessing stage is necessary. It is composed of four steps: marker labeling and cleaning, phoneme-alignment, head motion removal, and eye motion extraction.
2.2.1 Marker Labeling and Cleaning

After the marker motions were captured, labeling the markers is the first step. Marker labeling is a semi-automatic process in the VICON postprocessing software package. Users first specify the correct labels for some key frames, and then the provided program can automatically label the in-between frames. In case of mislabeling, users have to manually check and rectify these mislabelings. Figure 2.2 shows a snapshot of the interface of the VICON postprocessing package.

Figure 2.2: A snapshot of the interface of the VICON postprocessing package.

Due to occlusions caused by head motion and tracking errors, some markers are missing or switched with each other at some frames. Although the VICON postprocessing package can deal with these marker missing or switching problems through automatic gap-filling and filtering, it is imperfect. To maintain the quality of the facial motion data used for this work, 90 of the 102 markers were kept. (These 90 markers were fully tracked.) Figure 2.3 illustrates the captured 102 markers and the 90 markers finally used in this dissertation work.

2.2.2 Phoneme-Alignment

The Festival speech recognition system [fes] was used to perform phoneme-alignment in order to align each phoneme with its corresponding motion capture segments. This alignment was done by inputting the audio and its accompanying text scripts into the speech recognition program in a forced-alignment mode. Accurate phoneme-alignment is important to the success of this work, and automatic phoneme-alignment is imperfect, so two linguistics experts manually checked and corrected all the phoneme-alignments by reading the corresponding spectrograms, with the assistance of the WaveSurfer software [wav]. Figure 2.4 visualizes the phoneme-alignment result for a recorded sentence, "that dress looks like it comes from asia".

Figure 2.3: The left panel shows the numbering of the used facial markers. In the right panel, blue and red points together illustrate the captured markers, and red points illustrate the used markers.

Figure 2.4: Phoneme-alignment result for a recorded sentence, "that dress looks like it comes from asia" (using the WaveSurfer software [wav]).

2.2.3 Head Motion Extraction

A translation is first applied to make a specific marker the zero center of each frame; then a Singular Value Decomposition (SVD) method [SG02] is applied to extract head motion from the motion frames. First, a neutral and closed frame X_0 is manually chosen as a reference frame. Then, for any other frame X_i, the following procedure is used to align the frame X_i with X_0.

1. Arrange both X_0 and X_i as n×3 matrices (n is the number of markers).
2. Calculate the SVD of X_0^T X_i in order to maximize the correlation between the two frames; thus X_0^T X_i = U D V^T.
3. The rotation matrix that aligns X_i with X_0 is V U^T.

Since this transformation is composed only of rotation and translation, it can be further decomposed into a six-dimensional transformation vector [mat]: three Euler angles (converted to radians) and three translational values. As such, a six-dimensional transformation vector (T-VEC) is generated. The difference between the T-VECs of two consecutive frames (say t_i and t_{i+1}) is the relative head motion at t_i.
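To make this alignment procedure concrete, the following is a minimal sketch (not the original implementation) of the per-frame SVD-based rigid alignment and the T-VEC differencing described above; the function names, the reflection guard, and the Euler-angle convention are assumptions added for illustration.

```python
import numpy as np

def head_rotation(X0, Xi):
    """Rotation aligning frame Xi with the reference frame X0 (Section 2.2.3).

    X0, Xi: (n, 3) marker matrices, each already translated so that the
    chosen reference marker sits at the origin of its frame.
    """
    # Step 2: the SVD of X0^T Xi maximizes the correlation between the frames.
    U, _, Vt = np.linalg.svd(X0.T @ Xi)
    # Step 3: the rotation that aligns Xi with X0 is V U^T.
    R = Vt.T @ U.T
    # Reflection guard (det = -1); a common safeguard, not part of the
    # three-step description in the text.
    if np.linalg.det(R) < 0:
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    return R

def rotation_to_euler(R):
    """Three Euler angles (radians) from R, assuming a Z-Y-X convention;
    the dissertation does not state which convention [mat] uses."""
    rx = np.arctan2(R[2, 1], R[2, 2])
    ry = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    rz = np.arctan2(R[1, 0], R[0, 0])
    return np.array([rx, ry, rz])

def relative_head_motion(tvecs):
    """Relative head motion: differences between consecutive six-dimensional
    T-VECs (three Euler angles plus three translations)."""
    return np.diff(np.asarray(tvecs), axis=0)
```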
Because the motions in three directions (X, Y, and Z) are strongly correlated, the eye blink motion in three dimensions can be 17 represented by a one dimensional “blink” signal based on the dominant Y (vertical) direction. Figure 2.6 illustrates the y-coordinate motions of captured left and right eye blinks. Figure2.5: Thecapturedsubjectforeyemotioncapture. Fromthecaptureddata,itisobservedthatthemotionofthelefteyelidisnearlysynchro- nizedwiththatoftherighteyelid. Assuch,onlythemotioncapturetraceforoneeyeis needed to create the eye blink texture signal. Scaling the y-coordinates of the eye blink motionintotherange[0,1]generatesaone-dimensionaleyeblinktexturesignal–here0 denotes a closed eyelid, 1 denotes the eyelid fully open, and any value between 0 and 1 represents a partially open eye. In this procedure, outliers are ignored and interpolating neighborpointsfillsthegaps. Corresponding eye gaze direction signals are obtained by manually estimating the eye direction incaptured eye video, frameby frame usingan “eyeball tracking widget”in a customGUI(Figure2.8). Whilethemanuallyestimateddirectiondataisnotcompletely 18 Figure2.6: TheY-coordinatemotionofcapturedleftandrighteyeblinks(greenforleft andredforright). Figure2.7: Thelabeledeyegazesignals. 19 Figure2.8: EyedirectiontrainingsignalsaredigitizedwithacustomGUI. accurate, it qualitatively captures the character and cadence of real human gaze move- ment, and the gaze durations are frame-accurate. (Information on automatic saccade identification techniques can be found in [SG00].) The resulting gaze signals are illus- trated at Figure 2.7. Note the rectangular or piecewise-continuous character of these signals, reflects the fact that gaze tends to fixate on some spot for a period of time and then rapidly shift to another spot. When doing a large change of gaze direction it appears that people often execute several smaller shifts rather than a single large and smoothmotion. Itisalsoobservedthatgazechangefrequentlyoccurduringblink. 20 Chapter3 RealisticEyeMotionSynthesis 3.1 Introduction As humans we are particularly aware of the nuances of the human face, making it a challenging subject for computer graphics. Of the parts of the face, the eyes are par- ticularly scrutinized, since eye gaze is one of the strongest cues to the mental state of anotherperson-whensomeoneistalking,theylooktooureyestojudgeourinterestand attentiveness, and we look into their eyes to signal our intent to talk [CPB + 94]. In fact, inattemptstocreaterealisticcomputeranimatedfaces,theeyesareoftenthethingsthat observerspointoutaslookingwrong 1 . Producing convincing eyes in computer graphics applications will require attention to a number of topics in modeling, rendering, and animation. In this chapter, improving the realism of aspects of eye movement is the focus, specifically, gaze saccades with correlated eyelid motion. The contribution of this work is to point out that synthesizing synthetic eye saccade signals can be formulated as a one-dimensional texture synthesis problem, and to show that recent simple and effective texture synthesis approaches can beappliedeffectivelytothisanimationdomain. 1 Special thanks go to G. Gutschmidt (Industrial Light and Magic) and L. Williams (Disney Feature Animation)forpersonaldiscussions. 21 3.2 PreviousEyeMotionWork Recently there have been several research efforts that model the eye gaze motions in different scenarios. Cassell et al. 
Chapter 3
Realistic Eye Motion Synthesis

3.1 Introduction

As humans we are particularly aware of the nuances of the human face, making it a challenging subject for computer graphics. Of the parts of the face, the eyes are particularly scrutinized, since eye gaze is one of the strongest cues to the mental state of another person: when someone is talking, they look to our eyes to judge our interest and attentiveness, and we look into their eyes to signal our intent to talk [CPB+94]. In fact, in attempts to create realistic computer-animated faces, the eyes are often the things that observers point out as looking wrong. (Special thanks go to G. Gutschmidt (Industrial Light and Magic) and L. Williams (Disney Feature Animation) for personal discussions.)

Producing convincing eyes in computer graphics applications will require attention to a number of topics in modeling, rendering, and animation. In this chapter, improving the realism of aspects of eye movement is the focus, specifically gaze saccades with correlated eyelid motion. The contribution of this work is to point out that synthesizing synthetic eye saccade signals can be formulated as a one-dimensional texture synthesis problem, and to show that recent simple and effective texture synthesis approaches can be applied effectively to this animation domain.

3.2 Previous Eye Motion Work

Recently there have been several research efforts that model eye gaze motions in different scenarios. Cassell et al. [CPB+94, CVB01] present rule-based approaches for generating animated conversation between multiple agents given specified text. Chopra-Khullar et al. [KB99] present a framework for computing the visual attending behaviors (e.g. eye and head motions) of virtual agents in dynamic environments, given high-level scripts. Vertegaal et al. [VDV00, VSDN01] investigate whether eye gaze direction cues can be used as a reliable signal for determining who is talking to whom in multiparty conversations. Lee et al. [LBB02] give a good summary of these and other investigations of eye movement.

Most of the above work takes a goal-directed approach to gaze, focusing on major gaze changes such as those for effecting conversational turn-taking. This high-level direction indicates what the eyes should do and where the eyes should look, but there is still some freedom as to how particular gaze changes should be performed: the detailed timing of eye saccades and blinks can convey various mental states such as excited or sleepy. Lee et al. [LBB02] present the first in-depth treatment of these "textural" aspects of eye movement, demonstrating the necessity of this detail for achieving realism and conveying an appropriate mental state. In the Eyes Alive model, signals from an eye tracker are analyzed to produce a statistical model of eye saccades. Eye movement is remarkably complex, however, and the Eyes Alive model does not consider all aspects of eye movement. In particular, only first-order statistics are used, and gaze-eyelid coupling and vergence are not considered. In this chapter, the first two of these issues are addressed by introducing a more powerful statistical model that simultaneously captures gaze-blink coupling.

Synthesizing eye movement can be considered an instance of a general signal-modeling problem; as such, some discussion of general approaches to signal modeling is appropriate. Parametric approaches are distinguished from non-parametric or data-driven approaches. In the former, the investigator proposes an analytic model of the phenomenon in question, with some number of parameters that are then fit to the data (a purely rule-based approach with no explicit dependence on the data is a further possibility). This can produce compact, economical models of the data, but it has the danger that the analytic model may fail to capture some important aspects of the data. "Data-driven" approaches provide an alternative in which the data themselves are queried to produce new signals as desired. These approaches have the disadvantages that sufficient training data must be available and there is no explicit model of the phenomena (hence no "understanding" is acquired), but they have the important advantage that characteristics of the data are not lost due to the choice of an incorrect or insufficiently powerful model.

Figure 3.1: Signal synthesis with non-parametric sampling schemes. Regions in a sample signal (top) that are similar to the neighborhood of the signal being synthesized (bottom) are identified. One such region is randomly chosen, and new samples are copied from it to the synthesized signal.

The data-driven approach has been applied to several problems in computer graphics recently. The "body motion synthesis" research [AF02, KGP02, LWS02] approaches human body motion synthesis by first partitioning human motion signals into small segments and then concatenating these segments together, chosen by an optimization algorithm. A number of recent and successful texture synthesis papers [EL99, LLX+01] also take a data-driven approach.
The basic idea is to grow one sample (or one patch) at a time given an initial seed, by identifying all regions of the sample texture that are sufficiently similar (controlled with a threshold) to the neighborhood of the sample, and randomly selecting the corresponding sample (or patch) from one of these regions (Figure 3.1). These texture synthesis algorithms have some resemblance to the above "body motion synthesis" approaches, though they differ in that the entire texture training data set is searched for matching candidates, whereas in the "body motion synthesis" case the data is divided into segments in advance and only transitions between these segments are searched; this trade-off assumes that possible matches in texture-like data are too many and too varied to be profitably identified in advance.

A data-driven texture synthesis approach is adopted for the problem of synthesizing realistic eye motion. The basic assumption is that eye gaze probably has some connection with eyelid motion (Figure 3.2), as well as with head motion and speech, but the connection is not strictly deterministic and would be difficult to characterize explicitly. For example, as suggested in Figure 3.2, gaze changes often appear to be associated with blinks. A major advantage of a data-driven approach is that the investigator does not need to determine whether these apparent correlations actually exist: if the correlations are in the data, the synthesis (properly applied) will reproduce them.

As such, a data-driven stochastic modeling approach is appropriate. Further, note that the combined gaze-blink vector signal does not have obvious segmentation points, as is the case with body motion capture data. Thus, the data-driven texture synthesis approaches are adapted to the problem of realistic eye motion modeling. Eye gaze and the aligned eye blink motion are considered together as an "eye motion texture" sample and are used for synthesizing novel but similar eye motions.

Figure 3.2: Eye blink and gaze-x signals are plotted simultaneously.

To justify this choice of a texture synthesis approach over a hand-crafted statistical model, consider the order-of-statistics classification of statistical models: the first-order statistics p(x) used in [LBB02] capture the probability of events of different magnitude but do not model any correlation between different events. Correlation E[xy] is a "second-order moment", or an average of the second-order statistics p(x,y). Third-order statistics would consider the joint probability of triples of events p(x,y,z), and so on. What "order" of statistics should be used to capture all the relevant characteristics of joint gaze-eyelid movement? Unfortunately, it is difficult to know the answer to this question, and whereas low-order statistics such as probability density and correlation clearly do not capture some visible features (Figure 3.3), higher-order models are algorithmically complex and perform poorly if derived from insufficient data.

Figure 3.3: Eye "blink" data (bottom) and a synthesized signal with the same autocovariance (top). A simple statistical model cannot reproduce the correlation and asymmetry characteristics evident in the data.

The possibility of using a Hidden Markov Model (HMM) should be mentioned, since HMMs can approximate all real-world probability distributions, and (as in the case of speech) the HMM architecture can also provide a deeper model of the phenomenon.
In the case of modeling eye movement, however, the required number of hidden states is not obvious as it is in the speech case, and because the model must potentially capture subtle "mental states" (such as agitated, distracted, etc.) as manifested in eye movement, the hidden states also may not be easily interpretable. Although an HMM approach would probably work for this problem, the effort of designing and training a suitable HMM may not be worthwhile if the goal is simply to synthesize animated movement mimicking an original sample. A data-driven approach removes this issue and lets the data "speak for itself", while easily achieving movement that is indistinguishable in character from captured eye signals (cf. Figure 3.4, Figure 3.5, and Section 3.5).

3.3 Eye Motion Synthesis

After the eye blink and gaze motion signals are extracted and aligned (Section 2.2.4), texture synthesis is used to synthesize new eye motions. In this chapter, the patch-based sampling algorithm [LLX+01] is used, due to its time efficiency. The basic idea is to grow one texture patch (of fixed size) at a time, randomly chosen from qualified candidate patches in the input texture sample. Figure 3.1 illustrates the basic idea. For more details about this algorithm, please refer to [LLX+01].

Each texture sample (analogous to a pixel in the 2D image texture case) consists of three elements: the eye blink signal, the x position of the eye gaze signal, and the y position of the eye gaze signal. Here t_i = (b, g_1, g_2) is used to represent one sample. First, the variance of each element is estimated in a standard way:

V_\tau = \frac{\sum_{i=1}^{N} (t_i^\tau - \bar{t}^\tau)^2}{N-1}    (3.1)

Here \tau = b, g_1, g_2, and the corresponding variances are denoted V_b, V_{g1}, and V_{g2} respectively. Each component is divided by its variance so that the components contribute equally to the candidate patch search. The distance metric between two texture blocks is defined as follows:

d(B_{in}, B_{out}) = \left( \frac{\sum_{k=1}^{A} d_{tex}(t_{in}^k, t_{out}^k)}{A} \right)^{1/2}    (3.2)

d_{tex}(t^1, t^2) = (b^1 - b^2)^2 / V_b + (g_1^1 - g_1^2)^2 / V_{g1} + (g_2^1 - g_2^2)^2 / V_{g2}    (3.3)

Here, as described in [LLX+01], t_{in} represents an input texture sample, t_{out} represents a synthesized (output) texture sample, and A is the size of the boundary zone that functions as a search window [LLX+01].

In this case, the patch size is 20 and the boundary zone size is 4. As described in [LLX+01], the patch size depends on the properties of a given texture, and a proper choice is critical to the success of the synthesis algorithm. If the patch size is too small, it cannot capture the characteristics of eye motions (e.g. it may cause the eye gaze to change too frequently and look too active); if it is too large, there are fewer possible matching patches, and more training data are required to produce variety in the synthesized motion. Section 3.4 describes how to determine the proper patch size from the data. Another parameter used in this algorithm is the distance tolerance [LLX+01]:

d_{max} = \epsilon \left( \frac{\sum_{k=1}^{A} (d_{tex}(t_{out}^k, 0_3))^2}{A} \right)^{1/2}    (3.4)

Here 0_3 is the three-dimensional zero vector. In this work, the tolerance constant \epsilon is set to 0.2. Figure 3.4 and Figure 3.5 illustrate synthesized eye motions. Note that the signals in these two figures (Figure 3.4 and Figure 3.5) are synthesized at the same time, which is necessary to capture the possible correlations between them.

Figure 3.4: Synthesized eye blink motion (blue) vs. eye blink sample (red dots). The x-axis represents the frame number and the y-axis represents the openness of the eyelids.

3.4 Patch Size Selection

As mentioned in Section 3.3, a proper patch size is critical for this eye motion synthesis algorithm. To determine the proper patch size, the transition interval distribution
is investigated. For eye blink data, all the time intervals (in terms of frame number) between two adjacent approximate eye blink actions are counted. If the eye blink value (openness of eyelid) is less then a threshold value (set to 0.2), then it is counted as an “approximateeyeblink”. (Notethatthisthresholdisusedonlyforthepurposeofchoos- ingthetexturepatchsize. Theeyeblinksynthesisusestheoriginalun-thresholdeddata.) For eye gaze data, all the time intervals between two adjacent “large saccadic move- ments” are counted. These are defined as places where either the x- or y-movement is larger than a threshold value (set to 0.1). Finally, all these time intervals are gathered , and their distributions (Figure 3.6) are plotted . In Figure 3.6, it is observed that a time interval of 20 is a transition point: the covered percentage increased rapidly when 29 Figure3.5: Thetopisthesynthesizedeyegazextrajectory(blue)vs. eyegaze-xsample (red). The bottom is the synthesized eye gaze y trajectory (blue) vs. eye gaze-y sample (red). 30 time interval is less than 20, while beyond 20 it slows down rapidly. Also when the timeintervallimitissetto20,itaccountsfornearly55.68%ofthe“largeeyemotions”. Thus,20ischosenastheproperpatchsizefortheeyemotionsynthesisalgorithm. Thesizeofboundaryzoneisusefultocontrolthenumberof“texture-blockcandidates”. If the size of boundary zone is too large, then few candidates are available and the diversity of the synthesized motion is impaired. On the other hand, if this size is too small,someofthehigher-orderstatisticsofeyemotionarenotcapturedandtheresulting synthesis looks “jumpy”. The similar strategy used in [LLX + 01] is adopted where the sizeofboundaryzoneisafractionofpatchsize,e.g. 1/6. Assuch,fourischosenasthe sizeofboundaryzoneandinpracticeitworkswell. 3.5 ResultsandEvaluations Onearbitrarysegmentfromextractedsamplesequenceisusedasan“eyemotiontexture sample”,andthensynthesizenoveleyemotionssimilartothissample. Itwasfoundthat this synthesis algorithm produces eye movement that looks alert and lively (rather than e.g. the drugged, agitated, or “schizophrenic” moods that observers attribute to random or inappropriate eye movements [LBB02]), although the realism is difficult to judge since the face model itself is not completely photo-realistic. Figure 3.7 shows some framesofsynthesizedeyemotions. To compare this method with other approaches, eye motions on the same face model using three different methods are synthesized, and then conduct subjective tests. In the first experiment (Method I), the Eyes Alive model [LBB02] is used to generate gaze motion and eye blink motion is sampled from a Poisson distribution (note that 31 Figure 3.6: The top is the histogram of time intervals of “large eye movements”. The bottom is the accumulated Covered Percentage vs. Time Interval limit. When the time intervallimitis20,thecoveredpercentageis55.68%. 32 Figure3.7: Someframesofsynthesizedeyemotion. “discrete event” in this Poisson distribution means “eyelid close” event). In the second experiment (Method II), random eye gaze and eye blink with a Poisson distribution are used together. In the third (Method III), the method presented in this chapter is used to synthesize eye blink and gaze motion simultaneously. Three eye motion videos with the same duration time are presented to 22 viewers in random order. The viewers are asked to rate each eye motion video on a scale from 1 to 10, with 10 indicating a completely natural and realistic motion. 
Figure 3.8 illustrates the average score and standard deviation of this subjective test. From Figure 3.8, both Method I (Eyes Alive) andMethodIII(texturesynthesis)receivemuchhigherscoresthanMethodII(random), andinfactviewersslightlypreferthesynthesizedeyemotionbythisapproach(Method III)overthatofMethodI. A second test is also conducted to see whether it is possible to distinguish the synthe- sized eye motion from original captured eye motion. In this test, 16 viewers (total 15 valid viewers 2 ) are asked to forcedly identify which is the original after they carefully watched two eye motion videos (one is original captured eye motion, the other is syn- thesized from this captured segment). Seven out of 15 make correct choices, and the other eight subjects make wrong choices. Using the normal approximation to the bino- mial distribution with p=7/15, it is observed that equality (p=0.5) is easily within the 95%confidenceinterval, althoughwiththissmallnumberofsubjectstheintervalistoo 2 One of the viewers thought these two were equivalent and declined to make any choice, and he/she wasconsideredas“aninvalidviewer”inthistest. 33 Figure 3.8: One-Way ANOVA (ANalysis Of VAriance) results of the evaluations. The p-valueis2.3977e-10. broad to fully conclude that the original and synthesized videos are truly indistinguish- able. 3.6 ConclusionsandDiscussion Eyemotion(includingblinkandgaze)playsanimportantroleforrealisticfacialanima- tion and 3D avatars in virtual environments, and it is complex and hard to model using explicitapproaches. Inthischapter,insteadofgeneratingeyemotionfromprogrammer- defined rules or heuristics, or using a hand-crafted statistical model, a “motion texture” strategy is explored and successfully applied to produce realistic eye motion. In so doing, it is also demonstrated that texture synthesis techniques can be applied in the animation realm, to the modeling of incidental facial motion. The quality of statistical modeling and the introduction of gaze-eyelid coupling are improvements over previous 34 work, and the synthesized results are hard to distinguished from actual captured eye motion. A limitation of this approach is that it is hard to know in advance how much data are needed to have to avoid getting “stuck”. However, after generating some animation it is easy to evaluate the variety of synthesized movement and more data can be obtained if necessary. This approach works reasonably well for applications where nonspecific but natural looking automated eye motions are required, such as for action game char- actersetc. Itisawarethatcomplexeyegazemotionsexistinmanyscenarios,especially communications among multiple agents. Realistic eye motion in these scenarios will require a combination of goal-directed gaze modeling (as described in the related work in Section 3.2) and realistic motion quality as demonstrated here. Such a combination isconceivable-ifthegoalis(forexample)tolookawayfromthespeakerforaparticu- lar amount of time and appear uninterested, a suitable gaze/blink signal of the required durationcouldbesynthesizedwiththisapproach. Infuturework,whetherdifferent“moods”(attentive,bored,nervous,etc.) canberepro- duced by this approach are needed to verify. Head rotation (and especially rotation- compensated gaze) and vergence will need to be addressed. Introducing “constraints” into the system is proposed to enhance the synthesis algorithm to deal with more com- plexscenarios. 
High-levelconstraints(suchas“lookaheadfor15seconds”)willbegen- erated by a goal-directed or other system, and a constrained texture synthesis approach will fill in the details of the eye motion. It should also be noted that fully realistic eye movement involves a variety of phenomena not considered here, such as upper eyelid shape changes due to eyeball movement, skin deformation around the eyes due to mus- clemovement,etc. 35 Chapter4 NaturalHeadMotionSynthesis 4.1 Introduction Humanscommunicateviatwochannels[BS96,MJC + 04]: anexplicitchannel(speech) andanimplicitchannel(non-verbalgestures). Significantefforthasbeenmadetomodel the explicit speech channel in the graphics community (refer to Section 1.1). However, speechproductionisoftenaccompaniedby non-verbalgestures,e.g. headmotion. Sur- prisingly, few efforts have focused on the generation of natural head motion, which is animportantingredientforrealisticfacialanimations. Infact,Munhalletal.[MJC + 04] reported that head motion is important for auditory speech perception, which suggests thatappropriateheadmotioncansignificantlyenhanceengaginghuman-computerinter- faces. Because of the complexity of the association between the speech channel and its accompanying head motion, making appropriate head motion for novel audio is time- consumingforanimators. Theyoftenmanuallymakekeyheadposesbyreferringtothe recorded video of actors reciting the audio/text or capturing the head motion of actors usingmotioncapturesystems. However,itisimpossibletoreusethecaptureddata/video for other scenarioswithoutconsiderable efforts. Furthermore, making appropriate head motionfortheconversationofmultiplehumans(avatars)posesamorechallengingprob- lemformanualandmotioncapturemethods. Inmanyapplications,suchasautonomous 36 avatars,interactivecharactersincomputergames,etc.,automatedheadmotionsynthesis is often a requirement. One useful and practical approach is to synthesize natural head motiondrivenbyspeech. Although human head motion is associated with many factors, such as speaker style, idiosyncrasies and affective states, linguistic aspects of speech play a crucial role. Kuratate et al. [KMR + 99] presented preliminary results about the relationship between head motions and acoustic prosodic features. Based on the strong correlation (r=0.8), they concluded that these two are somehow correlated, but perhaps under independent control. This suggests that the tone and the intonation of the speech provide important cues about head motion and vice versa [MJC + 04]. Notice that, here it is more impor- tant how the speech is uttered rather than just what is said. The work of [YKB00] even reports that about 80% of the variance observed in the pitch can be determined from head motion. In this chapter, two different approaches (sample-based [DBNN04b] and model-based[BDNN05])arepresentedtosynthesizeappropriateheadmotiondrivenby novelspeechinput. The sample-based head motion synthesis approach [DBNN04b] first constructs a data- baseofaudiofeaturesandalignedheadposes,thentheaudiofeaturesoftheinputnovel audio are used to find their K nearest neighbors in the database and put these nearest neighbors into a nearest neighbor candidate pool, then a dynamic programming tech- nique is used to search for the optimal nearest neighbor combination from the near- est neighbor candidate pool by minimizing a cost function. 
This method can be used for synthesizing head motions of autonomous avatars, and can also provide intuitive keyframe control: after key head poses are specifed, this method will synthesize appro- priate head motion that maximally meets the requirements of both the speech and key headposes. 37 The model-based head motion synthesis approach [BDNN05] generates natural head motion directly from acoustic prosodic features. First, vector quantization is used to produce a discrete representation of head poses. Then, a Hidden Markov Model is trained for each cluster, which models the temporal relation between the prosodic fea- tures and the head motion sequence. Given that the mapping is not one to one, the observation probability density is modeled with a mixture of Gaussians. The smooth- ness constraint is imposed by defining a bi-gram model (first order Markov model) on headposeslearnedfromthedatabase. Then,givennewspeechinput,theHMMs,work- ing as sequence generators, produce the most likely head motion sequences. Finally, a smoothing operation based on spherical cubic interpolation is applied to generate the finalsmoothheadmotionsequences. Thedifferencebetweenthemodel-basedapproach andthesample-basedapproachisthatthemodel-basedheadmotionsynthesisapproach only deals with head rotation (no translation), while the above sample-based approach alsomodelsheadtranslation. 4.2 PreviousHeadMotionWork Researchers have presented various techniques to model head motion. Pelachaud et al. [PBS94] generate facial expressions and head movements from labeled text using a set of custom rules, based on Facial Action Coding System (FACS) representa- tions [EF75]. Cassell et al. [CPB + 94] present an automated system that generates appropriate non-verbal gestures, including head motion, for the conversation among multiple avatars, but they address only the “nod” head motion in their work. Perlin and Goldberg [PG96] develop an “Improv” system that combines procedural and rule- based techniques for behavior-based characters. The character actions are predefined, 38 and decision rules are used to choose appropriate combinations and transitions. Kur- lander et al. [KSS96] construct a “comic chat” system that automatically generates 2D comics for online graphical chatting, based on the rules of comic panel composition. Chietal.[CCZB00]presentanEMOTEmodelbyimplantingLabanMovementAnaly- sis (LMA) and its efforts and shape components into character animations. It is suc- cessfully applied to arm and body movements, but the applicability of this method to headmovementswithspeechhasnotbeenestablished. Casselletal.[CVB01]generate appropriate non-verbal gestures from text, relying on a set of linguistic rules derived fromnon-verbalconversationalresearch. Thismethodworksontext,butthepossibility ofapplyingthismethodtoaudio(speech)inputhasnotbeenverifiedyet. Researchers have reported that there are strong correlations between speech and its accompanying head movements [KMR + 99, YKB00, CCL01, GCSH02, MJC + 04]. For example, Munhall et al. [MJC + 04] show that the rhythmic head motion strongly corre- lateswiththepitchandamplitudeofthesubject’svoice. Grafetal.[GCSH02]estimate the probability distribution of major head movements (e.g. “nod”) according to the occurenceofpitchaccents. [YKB00,KMR + 99]evenreportthatabout80%ofthevari- ance observed for fundamental frequency (F0) can be determined from head motion, although the average percentage of head motion variance that can be linearly inferred from F0 is much lower. Costa et al. 
[CCL01] use the Gaussian Mixture Model (GMM) tomodeltheassociationbetweenaudiofeaturesandvisualprosody. Intheirwork,only eyebrow movements are analyzed, and the connection between audio features and head motionswasnotreported. 39 4.3 Sample-basedHeadMotionSynthesis As mentioned in Chapter 2, head motion were extracted from the facial motion capture data. Additionally, the acoustic information of recorded speech is extracted using the Praat speech processing software [BW] with a 30-ms window and 21.6-ms of over- lap. The audio-features used are the pitch (F0), the lowest five formants (F1 through F5), 13-MFCC (Mel-Frequency Cepstral Coefficients) and 12-LPC (Linear Prediction Coefficients). These 31 dimensional audio feature vectors are reduced to four dimen- sions using Principal Component Analysis (PCA), covering 98.89% of the variation. AnaudiofeaturePCAspaceexpandedbyfoureigen-vectors(correspondingtothefour largesteigen-values)isalsoconstructed. In this way, a database of aligned audio-headmotion is constructed (illustrated in Fig 4.1). For simplicity, AHD (Audio-Headmotion Database) is used to refer to this database in the remaining sections. Each entry of the AHD is composed of a four dimensional audio feature PCA coefficients (AF-COEF) and a head motion transfor- mation vector (T-VEC) (how to construct T-VECs is described in Section 2.2.3). Note thattheAHDisindexedbytheAF-COEF. 4.3.1 HeadMotionSynthesis AftertheAHDisconstructed,theaudiofeaturesofgivennovelspeechinputarereduced intoAF-COEFsbyprojectingthemintotheaudiofeaturePCAspace(Eq.4.1). f =M T .(F− ¯ F) (4.1) 40 Figure 4.1: Audio-Headmotion Database (AHD). Each entry in this database is com- posed oftwo parts: aAF-COEF (four dimensionalaudio feature pca coefficients)and a T-VEC(asixdimensionalheadmotiontransformationvector). Here F is a 31 dimensional audio feature vector, f is its AF-COEF, and M is the eigen- vector matrix (31×4 in this case). Then, these AF-COEFs are used to index the AHD andsearchfortheirKnearestneighbors. Aftertheseneighborsareidentified,adynamic programming technique is used to find the optimal nearest neighbor combination by minimizing a cost function. Finally, the head motion of the chosen nearest neighbors is concatenatedtogethertoformthefinalheadmotionsequence. Figure4.2illustratesthis headmotionsynthesispipeline. Figure4.2: Overviewofthesample-basedheadmotionsynthesispipeline. Thefirststep is to project audio features onto the audio feature PCA space, the second step is to find K nearest neighbors in the AHD, and the third step is to solve the optimal combination bydynamicprogramming. 41 Givenaninput(inquiry)AF-COEF q,itsnearestKneighborsintheAHDarelocated. In thiscase,K(thenumberofnearestneighbors)isexperimentallysetto7(Section4.3.2). EuclideandistanceisusedtomeasurethedifferencebetweentwoAF-COEFs(Eq.4.2). dist = v u u t 4 ∑ i=1 (q i −d i ) 2 (4.2) Here d representsaAF-COEF ofaentryinthe AHD.Inthisstep, thisdistance (termed neighbor-distance in this chapter) is also retained. Numerous approaches are pre- sented to speed up the K-nearest neighbor search, and a good overview can be found in [Das91]. Inthiswork,KD-tree[FBF77]isusedtospeedupthissearch. Theaverage timecomplexityofaKD-treesearchis O(logN d ),N d isthesizeofthedataset. After the PCA projection and K nearest neighbors search, for a AF-COEF f i at time T i , itsKnearestneighborsarefound(assumeitsKnearestneighborsareN i,1 ,N i,2 ,...,N i,K ). Which neighbor should be optimally chosen at time T i ? 
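Before turning to that question, the projection and lookup steps above (Eq. 4.1-4.2) can be summarized in code. The following is a minimal sketch in Python with NumPy and SciPy, assuming the AHD has already been assembled as an array of AF-COEFs; the function and variable names are illustrative rather than those of the actual implementation.

import numpy as np
from scipy.spatial import cKDTree

def to_af_coef(F, M, F_mean):
    # Eq. 4.1: project a 31-D audio feature vector onto the 4-D audio feature PCA space
    return M.T @ (F - F_mean)

def build_ahd_index(ahd_af_coefs):
    # Index the AF-COEF part of the AHD with a KD-tree; the average query time is O(log N_d)
    return cKDTree(ahd_af_coefs)

def nearest_neighbors(tree, query_coef, K=7):
    # Eq. 4.2: Euclidean neighbor-distances and indices of the K nearest AHD entries
    dists, indices = tree.query(query_coef, k=K)
    return dists, indices

The returned indices select the candidate T-VECs that the dynamic programming step described next must choose among.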
In this work, a dynamic programming algorithm is used to find the optimum neighbor combination by minimizing a total "synthesis cost" ("synthesis error" and "synthesis cost" are used interchangeably in this section).

The synthesis cost (error) at time T_i is defined to include the following three parts:

• Neighbor-distance Error (NE): the neighbor-distance (Eq. 4.2) between the AF-COEF of a nearest neighbor (e.g. c_{i,j}) and the input AF-COEF f_i (Eq. 4.3):

NE_{i,j} = \| c_{i,j} - f_i \|^2    (4.3)

• Roughness Error (RE): represents the roughness of the synthesized head motion path; smooth head motion (small RE) is preferred. Suppose V_{i-1} is the T-VEC chosen at time T_{i-1} and TV_{i,j} is the T-VEC of the j-th nearest neighbor at time T_i. When the j-th neighbor is chosen at time T_i, RE_{i,j} is defined from the discrete second derivative of the head pose at time T_i as follows (Eq. 4.4):

RE_{i,j} = \| TV_{i,j} - V_{i-1} \|^2    (4.4)

• Away-Keyframe Error (AE): represents how far the current head pose is from the specified key head pose; head motion toward the specified key head poses decreases the AE. Suppose KP is the next goal key head pose at time T_i and P_{i-1} is the accumulated head pose at time T_{i-1}; then AE_{i,j} is calculated as (Eq. 4.5):

AE_{i,j} = \| KP - (P_{i-1} + TV_{i,j}) \|^2    (4.5)

If the j-th neighbor is chosen at time T_i, and W_n, W_r, and W_a (with W_n >= 0, W_r >= 0, W_a >= 0, and W_n + W_r + W_a = 1) are the weights for NE, RE, and AE respectively, the synthesis error err_{i,j} is the weighted sum of the above three errors (Eq. 4.6):

err_{i,j} = W_n \cdot NE_{i,j} + W_r \cdot RE_{i,j} + W_a \cdot AE_{i,j}    (4.6)

Since the decision made at time T_i only depends on the current K neighbor candidates and the previous state (e.g. the accumulated head pose) at time T_{i-1}, a dynamic programming technique is used to solve for the optimal nearest neighbor combination.

Suppose ERR_{i,j} represents the accumulated synthesis error from time T_1 to T_i when the j-th neighbor is chosen at time T_i, and PATH_{i,j} represents the neighbor chosen at time T_{i-1} when the j-th neighbor is chosen at time T_i. Assuming that all the NE_{i,j}, RE_{i,j}, AE_{i,j}, ERR_{i,j}, and PATH_{i,j} are available for 1 <= i <= l-1 and 1 <= j <= K, time step T_l is computed using the following equations (Eq. 4.7-4.8):

ERR_{l,j} = \min_{m=1 \ldots K} \left( ERR_{l-1,m} - W_a \cdot AE_{l-1,m} + W_r \cdot RE_{l,j} + W_a \cdot AE_{l,j} + W_n \cdot NE_{l,j} \right)    (4.7)

PATH_{l,j} = \arg\min_{m=1 \ldots K} \left( ERR_{l-1,m} - W_a \cdot AE_{l-1,m} + W_r \cdot RE_{l,j} + W_a \cdot AE_{l,j} + W_n \cdot NE_{l,j} \right)    (4.8)

Note that in the above equations 1 <= m <= K, and the term (ERR_{l-1,m} - W_a \cdot AE_{l-1,m}) removes the old away-keyframe error, because only the new AE is useful for the current search. PATH_{l,j} retains the retracing information: which neighbor is chosen at time T_{l-1} if the j-th nearest neighbor is chosen at time T_l.

Finally, the optimal nearest neighbor combination is determined by Equations 4.9-4.10, where S_i represents the nearest neighbor optimally chosen at time T_i:

S_n = \arg\min_{j=1 \ldots K} ERR_{n,j}    (4.9)

S_{i-1} = PATH_{i, S_i}, \quad 2 \le i \le n    (4.10)

With TV_{i,j} again denoting the T-VEC of the j-th nearest neighbor at time T_i, the final head pose HeadPos_i at time T_i (1 <= i <= n) is calculated as in Eq. 4.11:

HeadPos_i = \sum_{j=1}^{i} TV_{j, S_j}    (4.11)

The time complexity of this KNN-based dynamic programming synthesis algorithm is O(n \cdot \log N_d + n \cdot K^2), where K is the number of nearest neighbors, N_d is the number of entries in the AHD, and n is the number of input AF-COEFs; for example, if 30 head motion frames per second are synthesized and t is the total animation time in seconds, then n = t x 30.

4.3.2 Search for the Weights

As described in Section 4.3.1, the dynamic programming synthesis algorithm uses three weights \vec{W} = (W_n, W_a, W_r) to influence the outcome of the chosen nearest neighbors.
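Before turning to how those weights are chosen, the dynamic programming step itself can be sketched in code. The following is a minimal Python/NumPy illustration of Eq. 4.3-4.11; it assumes the neighbor candidates and their neighbor-distances have already been retrieved and, for simplicity, a single key head pose. All names are hypothetical, and the bookkeeping (in particular the frame-0 initialization) is a simplification of the actual implementation.

import numpy as np

def dp_head_motion(nn_tvecs, nn_ne, key_pose, weights):
    """
    nn_tvecs: (n, K, 6) T-VECs of the K nearest AHD neighbors at each frame
    nn_ne:    (n, K)    squared neighbor-distances NE_{i,j} (Eq. 4.3)
    key_pose: (6,)      goal key head pose KP (a single key pose, for simplicity)
    weights:  (W_n, W_a, W_r), as in Section 4.3.2
    Returns the chosen neighbor index per frame and the accumulated head poses (Eq. 4.11).
    """
    W_n, W_a, W_r = weights
    n, K, _ = nn_tvecs.shape
    ERR = np.full((n, K), np.inf)           # accumulated synthesis error (Eq. 4.7)
    PATH = np.zeros((n, K), dtype=int)      # retracing information (Eq. 4.8)
    pose = np.zeros((n, K, 6))              # accumulated pose if candidate j is chosen at frame i

    for j in range(K):                      # frame 0: no roughness term yet
        pose[0, j] = nn_tvecs[0, j]
        AE = np.sum((key_pose - pose[0, j]) ** 2)                        # Eq. 4.5
        ERR[0, j] = W_n * nn_ne[0, j] + W_a * AE

    for i in range(1, n):
        for j in range(K):
            best_m, best_cost = 0, np.inf
            for m in range(K):
                RE = np.sum((nn_tvecs[i, j] - nn_tvecs[i - 1, m]) ** 2)              # Eq. 4.4
                AE = np.sum((key_pose - (pose[i - 1, m] + nn_tvecs[i, j])) ** 2)     # Eq. 4.5
                old_AE = np.sum((key_pose - pose[i - 1, m]) ** 2)
                cost = (ERR[i - 1, m] - W_a * old_AE                     # drop the old AE (Eq. 4.7)
                        + W_n * nn_ne[i, j] + W_r * RE + W_a * AE)
                if cost < best_cost:
                    best_cost, best_m = cost, m
            ERR[i, j], PATH[i, j] = best_cost, best_m
            pose[i, j] = pose[i - 1, best_m] + nn_tvecs[i, j]

    chosen = np.zeros(n, dtype=int)         # backtracking (Eq. 4.9-4.10)
    chosen[-1] = int(np.argmin(ERR[-1]))
    for i in range(n - 1, 0, -1):
        chosen[i - 1] = PATH[i, chosen[i]]
    head_poses = np.cumsum(nn_tvecs[np.arange(n), chosen], axis=0)       # Eq. 4.11
    return chosen, head_poses

The triple-nested loop makes the per-frame cost O(K^2), matching the O(n \cdot K^2) term in the complexity given above; vectorizing over m and j would remove the Python-level overhead without changing that bound.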
What are the optimum weights for this head motion synthesis algorithm? Since it is assumed thatW a ≥0,W n ≥0,W r ≥0,andW a +W n +W r =1. thesearchingspacecanbeillustrated asFig.4.3. Figure4.3: Thesearchspaceoftheweights(W a ,W n ,W r ). Several speech segments (from the captured data, not those used for constructing the AHD) are used for cross-validation [HTF01]. For each speech segment, the key head poses at the start time and the ending time are specified as the same as the original 45 capturedheadposes. Foraspecificweightconfiguration,TotalEvaluationError (TEE) isdefinedasfollows(Eq.4.12): TEE(W n ,W a ,Wr)= N ∑ i=1 6 ∑ j=1 ( ˆ V j i −V j i ) 2 (4.12) Where N is the number of total cross-validation head motion frames, ˆ V i is the synthe- sizedheadposeatframei,andV i istheground-truthheadposeatframe i. A variant of gradient-descent method and non-sequential random search [Pie86] are combined to search the global minimum TEE (its weights are the optimum weights) (Eq. 4.13-4.14). Here only four basic directions are considered: ~ e 1 = (α,0,−α),~ e 2 = (−α,0,α),~ e 3 =(0,α,−α), and~ e 4 =(0,−α,α). α isthestepsize(experimentallyset to0.05inthiswork) j =arg min i=1..4 TEE( ~ W t +~ e i ) (4.13) ~ W t+1 = ~ W t +~ e j (4.14) Initial weight ~ W 0 is generated as follows: W a is randomly sampled from the uniform distribution [0..1], then W n is randomly sampled from uniform distribution [0...1- W a ], andW r isassigned1−W a −W n . Non-sequentialrandomSearch[Pie86]isusedtoavoidgettingstuckatalocalminimum in the weight space: a given number of initial weights are generated at random, then each initial weight performs an independent search, and finally, the winner among all the searches determines the optimum weights. Fig 4.4 illustrates the search result after 20 initial weights are used. The resultant optimum weights ~ W=[W a = 0.31755,W n = 0.15782,W r =0.52463]. 46 Figure 4.4: Plot of the search result after 20 initial weights are used (K=7). The global minimum is the red point, corresponding to the weights: W a = 0.31755,W n = 0.15782, andW r =0.52463. It is arguable that the optimal weights may depend on the subject, since the audio- headmotion mapping reflected in the constructed AHD may enclose the head motion personalityofthecapturedsubject. Furtherinvestigationisneededtocomparetheopti- malweightsofdifferentsubjects. Since the number of nearest neighbors is discrete, not like a continuous weight space, the optimized K is experimentally set to 7 using the following experiments: after K is set to a fixed number, the above searching method is used to search the minimum TEE. Figure4.5illustratestheminimumTTEwithrespecttodifferentK. 47 Figure 4.5: Plot of minimum TTE versus K. For each K, 20 times of non-sequential randomsearchareused. 4.3.3 ResultsandApplications Toevaluatethisapproach,ground-truthheadmotioniscomparedtothesynthesizedhead motion. A speech segment that was not used for training and cross-validation is used for comparisons, and approriate key head poses are also specified (only start head pose and ending head pose). Figure 4.6 illustrates the trajectory comparisons of synthesized headmotionandground-truthone. In many applications, such as avatar-based telepresence systems, automated head motion is required. This approach can be applied to these applications by simply set- ting the W a to zero. Therefore, the head motion is guided only by the roughness and neighbor-distance criterias. 
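For completeness, the weight search of Section 4.3.2 (Eq. 4.12-4.14) can be sketched as follows. This is a minimal Python/NumPy illustration in which synthesize stands in for running the dynamic programming synthesis of Section 4.3.1 with a given weight triple on a cross-validation speech segment; all names are assumptions, not the actual implementation.

import numpy as np

def total_evaluation_error(weights, validation_set, synthesize):
    # Eq. 4.12: sum of squared differences between synthesized and ground-truth
    # 6-D head poses over all cross-validation frames
    err = 0.0
    for audio_coefs, gt_poses in validation_set:
        err += np.sum((synthesize(audio_coefs, weights) - gt_poses) ** 2)
    return err

def search_weights(validation_set, synthesize, restarts=20, alpha=0.05, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    # the four search directions of Eq. 4.13 over (W_n, W_a, W_r); each keeps the sum equal to one
    directions = np.array([[ alpha, 0.0, -alpha],
                           [-alpha, 0.0,  alpha],
                           [ 0.0,  alpha, -alpha],
                           [ 0.0, -alpha,  alpha]])
    best_w, best_err = None, np.inf
    for _ in range(restarts):                               # non-sequential random restarts
        w_a = rng.uniform(0.0, 1.0)
        w_n = rng.uniform(0.0, 1.0 - w_a)
        w = np.array([w_n, w_a, 1.0 - w_a - w_n])           # initial (W_n, W_a, W_r)
        err = total_evaluation_error(w, validation_set, synthesize)
        for _ in range(max_steps):
            candidates = [w + d for d in directions if np.all(w + d >= 0.0)]
            if not candidates:
                break
            errs = [total_evaluation_error(c, validation_set, synthesize) for c in candidates]
            j = int(np.argmin(errs))
            if errs[j] >= err:
                break                                       # local minimum for this restart
            w, err = candidates[j], errs[j]
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

With K = 7 and 20 random restarts, the search in the experiments above converged to approximately W_a = 0.318, W_n = 0.158, and W_r = 0.525.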
In some cases, staying in the initial head pose is preferred, for example, the avatar speaking and paying attention only to one specific listener. By 48 Figure 4.6: Comparison of ground-truth head motion (red solid curve) and the synthe- sizedheadmotion(dashedbluecurve),whenthesubjectsayswithaneutralspeech“Do youhaveanaversiontothat?”. Notethatthemotiontendencyatmostplacesissimilar. automatically setting key head poses to the initial head pose, the system can simulate thesescenarios. Figure4.7illustratessomesynthesizedheadmotionframes. Figure4.7: Someframesofasynthesizedheadmotionsequence,drivenbytherecorded speech “By day and night he wrongs me; every hour He flashes into one gross crime or other...”fromaShakespere’splay. Keyframing is still a useful tool for animators. For example, in the case of the conver- sation of multiple avatars, head motion often accompanies the turn-taking. Therefore, animators can specify the appropriate key head poses, corresponding to the turn-taking time. This approach will automatically fill in the head motion gaps. If animators want thesynthesizedheadmotiontomorecloselyfollowkeyheadposes,animatorsjustneed toincreasetheweightW a . 49 4.3.4 ConclusionsandDiscussion In this section, an audio-based approach is presented for automatically synthesizing appropriate head motion. The audio-headmotion mapping is stored in the AHD, con- structed from the captured head motion data of a human subject. Given novel speech (audio) input and optional key head poses, a KNN-based dynamic programming tech- niqueisusedtofindtheoptimizedheadmotionfromtheAHD,maximallysatisfyingthe requirementsfrombothaudioandspecifiedkeyheadposes. Keyframecontrolprovides flexibilityforanimatorswithoutthelossofthenaturalnessofsynthesizedheadmotion. Thisapproachcanbeappliedtomanyapplications,suchasautomatedheadmotionand the conversation of multiple avatars. Flexiblely tuning the weights used in the algo- rithmandspecifyingappropriatekeyheadposeswillgeneratevarioussynthesizedhead motionwithdifferentstyles. A limitation of this approach is that it is hard to anticipate in advance the amount of training data needed for specific applications. For example, if the specified key head posesarebeyondthetrainingdata,theperformanceofthisapproachwilldegrade,since therearenotenough“matched”headmotionentriesintheAHDtoachievethespecified key head poses. But after some animation is generated, it is easy to evaluate the variety andappropriatenessofsynthesizedheadmotionandobtainmoredataifnecessary. Since head motion is not an independent part of the whole facial motion, and it may strongly correlate with eye motion, e.g. head motion-compensated gaze, appropriate eye motion will greatly enhance the realism of synthesized head motion. The linguis- tic structure of the speech also plays an important role in the head motion of human subjects [Pel91, PBS94]. Future work could be to combine the linguistic structure with 50 Figure4.8: HeadposesusingEulerangles thisapproachbyusingacombinationoflinguisticsandaudiofeaturestodrivethehead motion. 4.4 Model-basedHeadMotionSynthesis AsmentionedinChapter2,anexpressivefacialmotiondatabasewascapturedandhead motionswereextractedfromthedata. Inthemodel-basedsynthesisapproach,onlyhead rotationisconsidered(intheremainingsections,“headrotation”and“headmotion”are interchangably used). Head rotations are represented as Euler angles (Fig. 4.8). 
To extracttheprosodicfeatures,theacousticsignalswereprocessedbytheEntropicSignal Processing System (ESPS), which computes the pitch (F0) and the RMS energy of the audio. The window was set to 25-ms with an overlap of 8.3-ms. Notice that the pitch takes values only in voiced region of the speech. Therefore, to avoid zeros in unvoiced regions, a cubic spline interpolation was applied in those regions. Finally, the first and 51 second derivatives of the pitch and the energy were added to incorporate their temporal dynamics. To validate the close relation between head motion and acoustic prosodic features, as suggested by Kuratate et al. in [KMR + 99], Canonical Correlation Analysis (CCA) [DFC00]wasappliedtothisaudiovisualdatabase. CCAprovidesascale-invariantopti- mum linear framework to measure the correlation between two streams of data with different dimensions. The basic idea is to project both feature vectors into a common dimensionalspace,inwhichPearson’scorrelationcanbecomputed. Using pitch, energy and their first and second derivatives (6D feature vector), and the angles that define the head motions (3D feature vector), the average correlation com- putedfromtheaudiovisualdatabaseisr=0.7. Thisresultindicatesthatusefulandmean- ingful information can be extracted from the prosodic features of speech to synthesize headmotion. 4.4.1 ModelingHeadMotion HMMs (Hidden Markov Models) are chosen to model head motions, because they pro- vide a suitable and natural framework to model the temporal relation between acoustic prosodic features and head motions. HMMs are used to generate the most likely head motion sequences based on the given observation (prosodic features). The HTK toolkit[YEH + 02]isusedtobuildtheHMMs. The output sequences of the HMMs cannot be continuous, so a discrete representation of head motion is needed. For this purpose, the Linde-Buzo-Gray vector Quantization 52 Figure4.9: 2DprojectionofVoronoiregionsusing32-sizevectorquantization (LBG-VQ) algorithm [LBG80] is used to define K discrete head poses, V i . The 3D- space defined by the Euler angles is split into K Voronoi regions (Figure 4.9). For each region,themeanvectorU i andthecovariancematricesΣ i areestimated. Thepairs(U,Σ) define the finite and discrete set of code vectors called codebook. In the quantization step, the continuous Euler angles of each frame are approximated with the closest code vector in the codebook. For each of the clusters, V i , an HMM model will be created. Consequently,thesizeofthecodebookwilldeterminethenumberofHMMmodels. The posterior probability of being in cluster V i , given the observation O, is modeled accordingtoBayesruleas P(V i /O)=c·P(O/V i )·P(V i ) (4.15) 53 where c is a normalization constant. The likelihood distribution, P(O/V i ), is modeled as a Markov process, which is a finite state machine that changes state at every time unit according to the transition probabilities. A first order Markov model is assumed, in which the probabilistic description includes only the current and previous state. The probability density function of the observation is modeled by a mixture of M Gaussian densities which handle, up to some extent, the many-to-many mapping between head motion and prosodic features. Standard algorithms (Forward-backward, Baum-Welch re-estimation) are used to train the parameters of the HMMs, using the training data [Rab89, YEH + 02]. Notice that the segmentation of the speech according to the head poses clusters is known. 
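A minimal sketch of the quantization step may be helpful. The following Python/NumPy code uses plain k-means purely as a stand-in for the LBG-VQ algorithm, and the names are illustrative; it returns the per-cluster means U_i and covariances Sigma_i that form the codebook, plus the cluster label of every frame.

import numpy as np

def build_pose_codebook(euler_angles, K=16, iters=100, seed=0):
    # euler_angles: (N, 3) array of per-frame head rotations
    rng = np.random.default_rng(seed)
    X = np.asarray(euler_angles, dtype=float)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # assign every frame to its closest code vector, then update the code vectors
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    covariances = np.stack([np.cov(X[labels == k].T) if np.sum(labels == k) > 3 else np.eye(3)
                            for k in range(K)])
    return centers, covariances, labels

The label sequence returned by this function is precisely that known frame-level segmentation of the speech into head-pose clusters.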
Therefore, the HMMs were initialized with this known align- ment(forcealignmentwasnotneeded). The prior distribution, P(V i ), is used to impose a first smoothing constraint to avoid sudden changes in the synthesized head motion sequence. In this approach, P(V i ) is built using bi-gram models, which are learned from data (similar to standard bi-gram language models used to model word sequence probabilities [YEH + 02, HAH01]). The bi-gram model is also a first order state machine, in which each state models the prob- ability of observing a given output sequence (in this case, a specific head pose cluster, V i ). The transition probabilities are computed using the frequency of their occurrences. Inthetrainingdatabase,theinter-clustertransitionsarecountedandstored,andthesta- tistic learned is used to reward transitions according to their appearances. Therefore, in the decoding step, this prior distribution will penalize transitions that did not appear in thetrainingdata. 54 Figure4.10: TheHMMmodel-basedheadmotionsynthesisframework. 4.4.2 HeadMotionSynthesis Figure 4.10 describes this HMM-based head motion synthesis procedure. For each test sample,theacousticprosodicfeaturesareextractedandusedasinputoftheHMMs. The modelwillgeneratethemostlikelysequence, b V =( b V t i , b V t+1 j ...),where b V t i isdefinedby (U i ,Σ i ). ThemeanvectorU i willbeusedtosynthesizetheheadmotionsequences. The transitions between clusters will introduce breaks in the synthesized head motion signal, even if their cluster means are close (see Figure 4.11). Therefore, a second smoothing step needs to be implemented, to guarantee continuity of the synthesized head pose sequences. A simple solution is to interpolate each Euler angle separately. However,ithasbeenshownthatthistechniqueisnotoptimal,becauseitintroducesjerky movements and other undesired effects such as Gimbal lock [Sho85]. As suggested by Shoemake,abetterapproachistointerpolateinthequaternionunitsphere[Sho85]. The basic idea is to transform the Euler angles into a quaternion, which is an alternative rotationmatrixrepresentation,andtheninterpolatetheframesinthisquanternionspace. 55 Inthisapproach,sphericalcubicinterpolation[Ebe00],squad,isused,whichisbasedon spherical linear interpolation, slerp. For two quaternions q 1 and q 2 , the slerp function isdefinedas: slerp(q 1 ,q 2 ,μ)= sin(1−μ)θ sinθ q 1 + sinμθ sinθ q 1 (4.16) wherecosθ =q 1 ·q 2 andμ isavariablethatrangesfrom0to1anddeterminestheframe position of the interpolated quaternion. Given four quaternions, the squad function is definedas: squad(q 1 ,q 2 ,q 3 ,q 4 ,μ)=slerp(slerp(q 1 ,q 4 ,μ),slerp(q 2 ,q 3 ,μ),2μ(1−μ)) (4.17) After the Euler angles are transformed into quaternions, key-points are selected by down-sampling the quaternions at a rate of 6 frames per second (this value was empir- ically chosen). Then, spherical cubic interpolation is used in those key-points by using the squad function. After interpolation, the frame rate of the quaternions is 120 frames per second, as the original data. The last step in this smoothing technique is to trans- form the interpolated quaternions into Euler angles. Figure 4.11 shows the interporla- tionresultforoneofthesentences. Theresultingvectorsaredenoted b U i . Furtherdetails aboutsphericalcubicinterpolationcanbefoundin[Ebe00]. Finally,thesynthesizedheadpose,b x t ,attimet willbeestimatedas: b x t =(α,β,γ) T = b U i +W i (4.18) whereW i is a zero-mean uniformly distributed random white noise. Notice that b x t is a blurredversionofV i ’smean. 
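The two interpolation primitives used above can be written directly from Eq. 4.16-4.17. The following is a minimal Python/NumPy sketch; note that the second term of Eq. 4.16 should weight q_2 (the repeated q_1 in the printed equation appears to be a typo), and the short-arc and near-parallel handling here are standard practical additions rather than part of the equations.

import numpy as np

def slerp(q1, q2, mu):
    # Spherical linear interpolation between unit quaternions (Eq. 4.16)
    q1, q2 = np.asarray(q1, dtype=float), np.asarray(q2, dtype=float)
    dot = float(np.dot(q1, q2))
    if dot < 0.0:                       # take the shorter arc
        q2, dot = -q2, -dot
    if dot > 0.9995:                    # nearly parallel: fall back to normalized lerp
        q = q1 + mu * (q2 - q1)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - mu) * theta) * q1 + np.sin(mu * theta) * q2) / np.sin(theta)

def squad(q1, q2, q3, q4, mu):
    # Spherical cubic interpolation (Eq. 4.17)
    return slerp(slerp(q1, q4, mu), slerp(q2, q3, mu), 2.0 * mu * (1.0 - mu))

In the pipeline described above, these functions are applied to key quaternions sampled at 6 frames per second, and the interpolated sequence is evaluated back at the original 120 frames per second before being converted to Euler angles.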
Figure 4.11: Spherical cubic interpolation.

If the size of the codebook is large enough, the quantization error will be insignificant. However, the number of HMMs needed will increase and the discrimination between classes will decrease. Also, more data will be needed to train the models. Therefore, there is a tradeoff between the quantization error and the inter-cluster discrimination.

4.4.3 Results and Discussion

The topology of an HMM is defined by the number and the interconnection of its states. The most popular configurations are the left-to-right topology (LR), in which only transitions in the forward direction between adjacent states are allowed, and the ergodic topology (EG), in which transitions between all states are allowed. The LR topology is simple and needs less data to train its parameters. The EG topology is less restricted, so it can learn a larger set of state transitions from the data. For this particular problem, it is not clear which topology gives the better description of the head motion dynamics. Therefore, eight HMM configurations, described in Table 4.1, with different topologies, numbers of models (codebook sizes) K, numbers of states S, and numbers of mixtures M, were trained. Notice that the size of the database is not big enough to train more complex HMMs with more states, mixtures, or models than those described in Table 4.1.

As mentioned before, the pitch, the energy, and their first and second derivatives are used as acoustic prosodic features to train each of the proposed HMM configurations. Eighty percent of the database was used for training and twenty percent for testing.

To evaluate the performance of this approach, the prosodic features from the test data were used to generate head motion sequences, as described in the previous section. For those samples, the Euclidean distance, d_euc, between the Euler angles of the original frames and the Euler angles of the synthesized data, \hat{x}_t, was calculated. The average and the standard deviation over all frames of the testing data are shown in Table 4.1 (column D). Notice that the synthesized head motions are directly compared with the original data (not its quantized version), so the quantization error is included in the values of the table. As mentioned in the introduction, head motion does not depend only on prosodic features, so this level of mismatch is expected.

Table 4.1 also shows the canonical correlation analysis between the synthesized and original data. As can be observed, the correlation was around r = 0.85 for all the topologies. This result strongly suggests that the synthesized data follow the behavior of the real data, which validates this approach.

As can be seen from Table 4.1, the performance of the different HMM topologies is similar. The left-to-right HMM with a 16-size codebook, 3 states, and 2 mixtures achieves the best result. However, if the database were big enough, an ergodic topology with more states and mixtures could perhaps give better results. The next experiments were implemented using this topology (K=16, S=3, M=2, LR).

HMM configuration      D (mean / std)    CCA (mean / std)
K=16 S=5 M=2 LR        10.2 / 3.4        0.88 / 0.11
K=16 S=5 M=4 LR         9.3 / 3.4        0.87 / 0.11
K=16 S=3 M=2 LR         9.1 / 3.6        0.87 / 0.12
K=16 S=3 M=2 EG         9.1 / 3.4        0.87 / 0.10
K=16 S=3 M=4 EG         9.5 / 3.4        0.83 / 0.12
K=32 S=5 M=1 LR        12.8 / 4.0        0.83 / 0.14
K=32 S=3 M=2 LR        10.7 / 3.3        0.86 / 0.12
K=32 S=3 M=1 EG        10.4 / 3.1        0.86 / 0.11

Table 4.1: Results for different HMM configurations.

Figure 4.12: Synthesized head motion, front view.

Sentences not included in the corpus mentioned before were also recorded, in order to synthesize novel head motion using this approach. For each recorded audio clip, this approach was used to generate head motions. Figures 4.12 and 4.13 show frames of the synthesized data.
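For reference, the canonical correlation figures reported in Table 4.1 can be reproduced with a few lines of code. The following is a minimal Python/NumPy sketch of CCA between the synthesized and original Euler-angle streams, with a small ridge term added for numerical stability; it is an illustration only, not necessarily the procedure used to produce the numbers in the table.

import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    # X, Y: (n_frames, dim) data streams, e.g. synthesized and original Euler angles
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(K, compute_uv=False)   # canonical correlations, largest first

For example, canonical_correlations(synthesized_angles, original_angles)[0] gives the leading correlation for one test sentence.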
59 Figure4.13: Synthesizedheadmotion,sideview. 4.4.4 Conclusions Thissectionpresentsanovelapproachtosynthesizenaturalhumanheadmotionsdriven by speech prosody features. HMMs are used to capture the temporal relation between the acoustic prosodic features and head motions. The use of bi-gram models in the sequencegenerationstepguaranteessmoothtransitionsfromthediscreterepresentations of head movement configurations. Furthermore, spherical cubic interpolation is used to avoidbreaksinthesynthesizedheadmotions. The results show that the synthesized head motion sequences follow the temporal dynamic behaviors of real data. This proves that the HMMs are able to capture the close relation between speech and head motion. The results also show that the smooth- ing techniques used in this approach can produce continuous head motion sequences, evenwhenonlya16wordsizedcodebookisusedtorepresentheadmotionposes. 60 In this approach, it is showed that natural head motion can be synthesized by using just speech. In future work, more components could be added to the system. For example, iftheemotionofthesubjectisknown,asisusuallythecaseinmostoftheapplications, suitable models that capture the emotional head motion pattern can be used, instead of ageneralmodel. 61 Chapter5 ExpressiveSpeechAnimationSynthesis 5.1 SpeechAnimationRelatedWork In this section, speech animation related work is reviewed. An extensive overview can befoundinthewell-knownfacialanimationbookbyParkeandWaters[PW96]. 5.1.1 ExpressiveFacialAnimation Cassell et al. [CPB + 94] present a rule-based automatic system that generates expres- sions and speech for multiple conversation agents. In their work, the Facial Action Coding Systems (FACS) [EF75] are used to denote static facial expressions. Noh and Neumann [NN01] present an “expression cloning” technique to transfer existing expressive facial animation between different 3D face models. This technique and the extended variant by Pyun et al. [PKC + 03] are very useful for transferring expressive facial motion. However, they are not generative; they cannot be used for generating new expressive facial animation. Chuang et al. [CDB02] learn a facial expression map- ping/transformation from training footage using bilinear models, and then this learned mappingisusedtotransforminputvideoofneutraltalkingtoexpressivetalking. Intheir work, the expressive face frames retain the same timing as the original neutral speech, which does not seem plausible in all cases. Cao et al. [CFP03] present a motion editing 62 techniquethatappliesIndependentComponentAnalysis(ICA)ontorecordedexpressive facialmotioncapturedataandthenperformmoreeditingoperationsontheseICAcom- ponents, interpreted as expression and speech components separately. This approach is usedonlyforeditingexistingexpressivefacialmotion,notforthepurposeofsynthesis. Zhangetal.[ZLGS03]presentageometry-driventechniqueforsynthesizingexpression details on 2D face images. This method is used for static 2D expression synthesis, but the applicability of this method to animate images and 3D face models has not been established. Blanz et al. [BBPV03] present an animation technique to reanimate faces inimagesandvideobylearninganexpressionandvisemespacefromscanned3Dfaces. Thisapproachaddressesbothspeechandexpressions,butstaticexpressionposesdonot provideenoughinformationtosynthesizerealisticdynamicexpressivemotion. 
Thesuc- cess of expressive speech motion synthesis by voice puppetry [Bra99] depends heavily on the choice of audio features used, and as pointed out by Brand [Bra99], the optimal audiofeaturecombinationforexpressivespeechmotionisstillanopenproblem. Kshir- sagar et al. [KMT01, MsGST03] present a PCA-based method for generating expres- sive speech animation. In their work, static expression configurations are embedded in an expression and viseme space, constructed by PCA. Expressive speech animation is synthesizedbyweightedblendingbetweenexpressionconfigurations(correspondingto somepointsintheexpressionandvisemespace)andspeechmotion. 63 5.1.2 SpeechAnimation The key part of speech animation synthesis is modeling speech co-articulation. In lin- guistics literature, speech co-articulation is defined as follows: phonemes are not pro- nounced as an independent sequence of sounds, but rather that the sound of a particu- lar phoneme is affected by adjacent phonemes. Visual speech co-articulation is anal- ogous. Phoneme-driven methods require animators to design key phoneme shapes, and then empirical smooth functions [CM93, GB96, KG01, CMPZ02, KP05] or co- articulation rules [Pel91, Bes95, BP04] are used to generate speech animation. The Cohen-Massaro co-articulation model [CM93] controls each viseme shape using a tar- get value and a dominance function, and the weighted sum of dominance values deter- minesfinalmouthshapes. Recentco-articulationwork[GB96,CMPZ02,KP05]further improvedtheCohen-Massaroco-articulationmodel,forexample,Cosietal.[CMPZ02] added a temporal resistance function and a shape function for more general cases, such as fast speaking rates. Goff and Benoˆ ıt [GB96] calculated the model para- meter value of the Cohen-Massaro model by analyzing parameter trajectories mea- sured from a French speaker. Rule-based co-articulation models [Pel91, Bes95] leave some visemes undefined based on their co-articulation importance and phoneme con- texts. BevacquaandPelachaud[BP04]presentedanexpressivemodifier,modeledfrom recorded real motion data, to make expressive speech animation. Physics-based meth- ods [TW90, LTW95, WF95, KHS01] drive mouth movement by simulating the facial muscles. Physics-basedapproachescanachievesynthesisrealism,butitishardtosolve theoptimalparametervalueswithoutconsiderablecomputingtimeandtuningefforts. Data-driven approaches [BCS97, Cos02, CFKP04, Bra99, EGP02, KT03, MCP + 05] synthesize new speech animations by concatenating pre-recorded motion data or sam- pling from statistical models learned from real motion data. Bregler et al. [BCS97] 64 presentthe“videorewrite”methodforsynthesizing2Dtalkingfacesgivennovelspeech input, based on the collected “triphone video segments”. Instead of using ad hoc co- articulation models and ignoring dynamics factors in speech, this approach models the co-articulation effect with “triphone video segments”, but it is not generative (i.e. the co-articulation cannot be applied to other faces without retraining). [Cos02, CFKP04, MCP + 05] further extend “the triphone combination idea” used in “video rewrite” to longer phoneme segments in order to generate new speech animation. Brand [Bra99] learns a HMM-based facial motion control model by an entropy minimization learning algorithmfromtrainingvoiceandvideodataandtheneffectivelysynthesizesfullfacial motionsfromnovelaudiotrack. 
Thisapproachmodelsco-articulation,usingtheViterbi algorithm through vocal HMMs to search for most likely facial state sequence that is usedforpredictingfacialconfigurationsequences. Ezzatetal.[EGP02]learnamultidi- mensional morphable model from a recorded video database that requires a limited set of mouth image prototypes and use the magnitude of diagonal covariance matrices of phonemeclusterstorepresentco-articulationeffects: thelargercovarianceofaphoneme cluster means this phoneme has a smaller co-articulation, and vice versa. Instead of constructing phoneme segment database [BCS97, EGP02, Cos02, CFKP04, MCP + 05], KshirsagarandThalmann[KT03]presentasyllablebasedapproachtosynthesizenovel speech animation. In their approach, captured facial motion are categorized into syl- lable motions, and then new speech animation is achieved by concatenating syllable motions optimally chosen from the syllable motion database. However, most of the above data-driven approaches are restricted to synthesizing neutral speech animation and their applications for expressive speech animation synthesis have not been fully demonstratedyet. 65 5.2 Model-basedApproach In this section, an expressive facial animation synthesis system that learns speech co- articulation models and expression spaces from recorded facial motion capture data is presented. After users specify the input speech (or texts) and its expression type, the systemautomaticallygeneratescorrespondingexpressivefacialanimation. Figure5.1: Schematicoverviewofthemodel-basedexpressivefacialanimationsynthe- sissystem. Itincludesthreemainstages: recording,modelingandsynthesis. Figure5.1illustratestheschematicoverviewofthesystem. Thissystemiscomposedof threestages: recording,modelingandsynthesis. Intherecordingstage,expressivefacial motion and its accompanying audio are recorded simultaneously and preprocessed. In the modeling stage, a new approach is presented to learn speech co-articulation models from facial motion capture data, and a Phoneme-Independent Expression Eigen-Space (PIEES) is constructed. In the final synthesis stage, based on the learned speech co- articulation models and the PIEES from the modeling stage, corresponding expressive facial animation are synthesized according to the given input speech/texts and expres- sion. 66 This synthesis system consists of two sub-systems: neutral speech motion synthesis and dynamic expression synthesis. In the speech motion synthesis subsystem, it learns explicit but compact speech co-articulation models from recorded facial motion cap- ture data, based on a weight-decomposition method [PB02]. Given a new phoneme sequence, this system synthesizes corresponding neutral visual speech motion by con- catenating the learned co-articulation models. In the dynamic expression synthe- sis subsystem, first a Phoneme-Independent Expression Eigen-Space (PIEES) is con- structed by a phoneme-based time warpping and subtraction and then novel dynamic expression sequences are generated from the constructed PIEES by texture-synthesis approaches [EL99, LLX + 01]. Finally, the synthesized expression signals are weight- blended with the synthesized neutral speech motion to generate expressive facial ani- mation. The compact size of the learned speech co-articulation models and the PIEES makeitpossiblethatthissystemcanbeusedforon-the-flyfacialanimationsynthesis. This speech co-articulation modeling approach constructs explicit and compact speech co-articulation models from real human motion data. 
Both co-articulation between two phonemes (diphone co-articulation ) and co-articulation among three phonemes ( tri- phone co-articulation ) are learned. This co-articulation modeling approach offers two advantages: (1) It produces an explicit co-articulation model that can be applied to any facemodelratherthanbeingrestrictedto“re-combinations”oftheoriginalfacialmotion data. (2) It naturally bridges data-driven approaches (that accurately model the dynam- icsofrealhumanspeech)andflexiblekeyframingapproaches(preferredbyanimators). This co-articulation modeling approach can be easily extended to model the co- articulation of longer phoneme sequences, e.g. those with more than three phonemes, with the cost of requiring significant more training data because of the “combinational 67 sampling.” As a reasonable trade-off between training data and ouput realism, diphone andtriphoneco-articulationmodelsareusedinthisapproach. The dynamic expression synthesis approach presented in this approach shares similari- tieswith[KMT01,MsGST03],butthenotabledistinctionofthisapproachisthatexpres- sionsaretreatedasadynamicprocess,notasstaticposesasin[KMT01,MsGST03]. In general,theexpressiondynamicsincludetwoaspects: (1)ExpressiveMotionDynamics (EMD): Even in an invariant level of anger, people seldom keep their eyebrows at the same height for the entire duration of the expression. Generally, expressive motion is a dynamic process, not statically corresponding to some fixed facial configurations; (2) Expression Intensity Dynamics (EID): both the intensity of human expressions and the type of expression may vary over time, depending on many factors, including speech contexts. Varying blending weights over time in [KMT01, MsGST03] can simulate the EID, but the EMD are not modeled, because the same static expressive facial configu- rations are used. In this approach, the EMD is embodied in the constructed PIEES as continuouscurves. Theoptionalexpression-intensitycontrolisusedforsimulatingEID, similarto[KMT01,MsGST03]. 5.2.1 LearningSpeechCo-Articulation This section details the construction of explicit speech co-articulation models. Since only speech co-articulation modeling is concerned, only facial motion capture data with a neutral expression are used to learn “speech co-articulation.” To avoid the neg- ative effects of markers with low fidelity, only ten markers around the mouth area (see red points in the Figure 5.2) are used. The dimensionality of these motion vectors (concatenated ten-markers’ 3D motion) is reduced using EM-PCA algorithm [Row97], 68 because instead of directly applying Singular Value Decomposition (SVD) on the full dataset,EM-PCAalgorithmsolveseigen-valuesandeigen-vectorsiterativelyusingthe Expectation-Maximization (EM) algorithm, and it uses less memory than regular PCA. Themotiondataarereducedfromtheoriginalthirtydimensionstofivedimensions,cov- ering 97.5% of the variation. In this section, these phoneme-annotated five-dimension PCAcoefficientsareused. Figure5.2: Redmarkersareusedforlearningspeechco-articulation. Each phoneme with its duration is associated with a PCA coefficient subsequence. The middle frame of its PCA coefficient subsequence is chosen as a representative sample of that phoneme (termed a phoneme sample). In other words, a phoneme sample is a five dimensional PCA coefficient vector. Hence, the PCA coefficient subsequence between two adjacent phoneme samples captures the co-articulation transition between two phonemes (termed co-articulation area ). 
Figure 5.3 illustrates phoneme samples and co-articulationarea. Then, theweight decompositionmethod adoptedfrom [PB02] isusedtoconstructphoneme-weightingfunctions. AssumeamotioncapturesegmentM[P s ,P e ]foraspecificdiphonepair[P s ,P e ]isincluded in the database (see Figure 5.3). S s is the phoneme sample of the starting phoneme 69 Figure5.3: Phonemesamplesandaco-articulationareaforphonemes/ah/and/k/. (phoneme P s in this case) at time T s , S e is the phoneme sample of the ending phoneme (phoneme P e in this case) at time T e . Notice that subscript s stands for the starting phoneme and subscript e stands for the ending phoneme in the above notations. F j at timeT j isanyintermediatePCAcoefficientframeintheco-articulationareaof M[P s ,P e ]. Eq. 5.1 is solved to get the normalized timet j (0≤t j ≤1), and Eq. 5.2 is solved, in the least-squaresense,togettheweightofthestartingphoneme, W j,s ,andtheweightofthe endingphoneme,W j,e : t j =(T j −T s )/(T e −T s ) (5.1) F j =W j,s ×S s +W j,e ×S e (5.2) whereT s ≤T j ≤T e ,W j,s ≥0andW j,e ≥0. Thus,twotime-weightrelations <t j ,W j,s >and<t j ,W j,e >areobtainedforanyinter- mediate frame F j . Assume there are total N motion capture segments for this specific diphone pair [P s ,P e ] in the database, and the co-articulaton area of the i th segment has K i frames. Then, the gathered time-weight relations <t i j ,W i j,s > and <t i j ,W i j,e > (1≤ i≤ N and 1≤ j≤ K i ) encode all the co-articulation transitions for the diphone 70 [P s ,P e ](thesuperscriptofnotationsdenoteswhichmotioncapturesegment). Twopoly- nomialcurvesF s (t)andF e (t)areusedtofitthesetime-weightrelations. Mathematically, F s (t)andF e (t)aresolvedbyminimizingthefollowingerrorfunctions: e S (P s ,P e )= N ∑ i=1 K i ∑ j=1 (F s (t i j )−W i j,s ) 2 (5.3) e E (P s ,P e )= N ∑ i=1 K i ∑ j=1 (F e (t i j )−W i j,e ) 2 (5.4) Where F s (t) and F e (t) are referred as the Starting-Phoneme Weighting function and the Ending-Phoneme Weighting function respectively. Here F s (t) and F e (t) are exper- imentally constrained to third degree polynomial curves (see follow-up explanations). Fig. 5.4 and 5.5 illustrate two examples of these diphone co-articulation functions. In Fig. 5.4 and 5.5, the decrease of phoneme /ae/ during the transition from phoneme /ae/ tophoneme/k/isfasterthanthatofthetransitionfromphoneme/ae/tophoneme/p/. Figure 5.4: An example of diphone co-articulation functions (for a phoneme pair /ae/ and/k/: F s (t)andF e (t)). 71 Figure 5.5: Another example of diphone co-articulation functions (for a phoneme pair /ae/and/p/: F s (t)andF e (t)). For triphone co-articulations, Eq. 5.5 and 5.6 are analogously solved to get three time- weightrelations<t j ,W j,s >,<t j ,W j,m >,and<t j ,W j,e >: t j =(T j −T s )/(T e −T s ) (5.5) F j =W j,s ×S s +W j,m ×S m +W j,e ×S e (5.6) where S s , S m , and S e represent the three phoneme samples and the weight values are non-negative. In a similar way to Eq. 5.3 and Eq. 5.4, three polynomial weighting functions F s (t), F m (t), and F e (t) are used to fit these time-weight relations separately ande S (P s ,P m ,P e ),e M (P s ,P m ,P e ),ande E (P s ,P m ,P e )aresimilarlycalculated. F s (t),F m (t), and F e (t) are termed as the Starting-Phoneme weighting function , the Middle Phoneme WeightingFunction,andtheEnding-PhonemeWeightingFunction respectively. Fig.5.6 andFig.5.7illustratethesetriphoneco-articulationfunctionsfortwotriphonecases. 72 Figure 5.6: A example of triphone co-articulation functions. 
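For the diphone case, the decomposition and fitting steps of Eq. 5.1-5.4 can be sketched as follows. This is a minimal Python illustration using NumPy and SciPy's non-negative least squares; the data layout (one tuple of frame times, PCA frames, and the two phoneme samples per captured diphone instance) is an assumption about how the training segments are organized.

import numpy as np
from scipy.optimize import nnls

def decompose_frame(F_j, S_s, S_e):
    # Eq. 5.2: solve F_j ~ W_s * S_s + W_e * S_e in the least-squares sense with W_s, W_e >= 0
    A = np.column_stack([S_s, S_e])         # 5 x 2 (five PCA dimensions, two phoneme samples)
    w, _ = nnls(A, F_j)
    return w                                 # (W_{j,s}, W_{j,e})

def diphone_weight_functions(segments, degree=3):
    # segments: list of (times, frames, S_s, S_e), one entry per captured instance of the diphone
    ts, ws, we = [], [], []
    for times, frames, S_s, S_e in segments:
        T_s, T_e = times[0], times[-1]
        for T_j, F_j in zip(times, frames):
            t_j = (T_j - T_s) / (T_e - T_s)                  # Eq. 5.1: normalized time
            W_s, W_e = decompose_frame(F_j, S_s, S_e)
            ts.append(t_j); ws.append(W_s); we.append(W_e)
    Fs = np.polyfit(ts, ws, degree)                          # least-squares fit of Eq. 5.3
    Fe = np.polyfit(ts, we, degree)                          # least-squares fit of Eq. 5.4
    return Fs, Fe                                            # polynomial coefficients of F_s(t), F_e(t)

The triphone case (Eq. 5.5-5.6) is handled analogously with a three-column design matrix and a third weighting function.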
It illustrates triphone co- articulation functions for triphone (/ey/, /f/ and /t/): F s (t), F m (t), and F e (t) (from left to right). Figure 5.7: Another example of triphone co-articulation functions. It illustrates the co- articulation weighting functions of triphone (/ey/, /s/ and /t/): F s (t), F m (t), and F e (t) (fromlefttoright). To determine the optimal degree for polynomial fitting, a fitting cost including a model complexity term is minimized. The fitting cost function C(λ) is defined as follows (Eq.5.7to5.9): C(λ)= ∑ P i ,P j diC(P i ,P j )+ ∑ P i ,P j ,P k triC(P i ,P j ,P k ) (5.7) diC(P i ,P j )= ∑ θ=S|E e θ (P i ,P j )+λ ∑ θ={s,e} kF θ k 2 (5.8) 73 triC(P i ,P j ,P k )= ∑ θ=S|M|E e θ (P i ,P j ,P k )+λ ∑ θ=s|m|e kF θ k 2 (5.9) Here λ is the penalty value for model complexity, andk Fk 2 is the sum of function F coefficients’ squares. Fig. 5.8 illustrates the cost curve as a function of the degree of fitting curves. From Fig. 5.8, even without the penalty (λ = 0), n=3 is still a good trade-off point. In this work, n=3 is experimentally chosen for fitting polynomial co- articulationcurvesinthiswork. Figure5.8: Fittingerrorwithrespecttothedegreeoffittedco-articulationcurves(redis forλ=0,greenisforλ=0.00001,blueisforλ=0.00005,andblackisforλ=0.0001). 74 5.2.2 ConstructExpressionEigen-Spaces Since the same sentence material was used for capturing facial motions of the four dif- ferent expressions and spoken by the subject without different emphasis, the phoneme sequences, except for their timing, are the same. Based on this observation, a phoneme-based time warping and resampling (super-sample/down-sample) is applied to the expressive capture data to make them align strictly with neutral data, frame by frame. Note that the time warping assumption is just an approximation (the veloc- ity/acceleration in the original expressive motion may be impaired in this warpping), sinceexpressivespeechmodulationsdoinvolvedurationalmodifications[YBL + 04]. In this step, eyelid markers are ignored. Fig. 5.9-5.10 illustrate this time-warping proce- dureforashortpieceofangrydata. Figure 5.9: Phoneme-based time-warping for the Y position of a particular marker. Althoughthephonemetimingsaredifferent,thewarpedmotion(black)isstrictlyframe alignedwithneutraldata(red). 75 Figure5.10: Extractedphoneme-independentangrymotionsignalfromFig5.9. Subtracting neutral motion from aligned expressive motion produces pure expressive motion signals. Since they are strictly phoneme-aligned, it is assumed that the above subtraction removes “phoneme-dependent” content from expressive speech motion capture data. As such, the extracted pure expressive motion signals are Phoneme- IndependentExpressiveMotionSignals(PIEMS). The extracted PIEMS are high dimensional when the 3D motion of all markers are concatenated together. As such, all the PIEMS are put together and reduced to three dimensions, covering 86.5% of the variation. The EM-PCA algorithm [Row97] is used here. Inthisway,athree-dimensionalPIEES(PhonemeIndependentExpressionEigen- Space)isfoundwhereexpressionisacontinuouscurve. Fig.5.11andFig.5.12illustrate the PIEES and the PIEMS. Note that the personality of the captured subject may be irreversiblyreflectedinthePIEES,andonlyfourbasicexpressionsareconsidered. 76 Figure5.11: PlotofthreeexpressionsignalsonthePIEES.Itshowsthatsadsignalsand angrysignalsoverlapinsomeplaces. 
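The PIEES construction described above (phoneme-based time warping, subtraction of neutral motion, and dimensionality reduction) can be sketched roughly as follows. Standard PCA is used here as a stand-in for the EM-PCA solver of [Row97], and the segment lists and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA   # stand-in for the EM-PCA solver used in the thesis

def warp_to_neutral_timing(expr_motion, expr_segments, neutral_segments):
    """Phoneme-based time warping: resample each expressive phoneme segment so that it
    has the same number of frames as the corresponding neutral segment."""
    warped = []
    for (e_start, e_end), (n_start, n_end) in zip(expr_segments, neutral_segments):
        seg = expr_motion[e_start:e_end]                     # (segment_len, dim)
        n_len = n_end - n_start
        src_t = np.linspace(0.0, 1.0, len(seg))
        dst_t = np.linspace(0.0, 1.0, n_len)
        # linear super-/down-sampling of every motion dimension
        warped.append(np.stack([np.interp(dst_t, src_t, seg[:, d])
                                for d in range(seg.shape[1])], axis=1))
    return np.concatenate(warped, axis=0)                    # frame-aligned with neutral data

def build_piees(neutral_motion, warped_expressive_takes):
    """Subtract neutral motion from each aligned expressive take and reduce the pooled
    phoneme-independent signals (PIEMS) to a 3-D expression eigen-space (PIEES)."""
    piems = [warped - neutral_motion for warped in warped_expressive_takes]
    pooled = np.concatenate(piems, axis=0)
    pca = PCA(n_components=3)                                # EM-PCA in the original work
    curves = pca.fit_transform(pooled)                       # each expression becomes a 3-D curve
    return pca, curves
```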
5.2.3 ExpressiveSpeechAnimationSynthesis SpeechMotionSynthesis After co-articulation models and PIEES are constructed, this approach synthesizes new expressivefacialanimations,givennovelphoneme-annotatedspeech/textsinputandkey visememappings. Atotalof13keyvisemeshapes(eachcorrespondstoavisuallysim- ilar group of phonemes, e.g. /p/, /b/, and /m/.) are used. The mapped key shapes (key visemes) are 3D facial control point (marker) configurations. New speech anima- tionsaresynthesizedbyblendingthesekeyshapes,anddynamicphoneme-independent expressionsequencesaresynthesizedfromtheconstructedPIEESbyatexturesynthesis 77 Figure5.12: Plotoftwo expressionsequencesinthe PIEES.Itshowsthat expressionis justacontinuouscurveinthePIEES. approach. Finally, these two are weight-blended to produce expressive facial anima- tions. Figure 5.13: The junctures of adjacent diphones and triphones. The overlapping part (the semitransparent part) in the juncture of two triphones (left panel) needs to be smoothed. Notethatthereisanotherdiphone-triphoneconfiguration,similartothemid- dletriphone-diphonepanel. Given a phoneme sequence P 1 ,P 2 ,···,P n with timing labels, blending these key shapes generatesintermediateanimations. Thesimplestapproachwouldbelinearinterpolation between adjacent pairs of key shapes, but linear interpolation simply ignores any co- articulation effects. Inspired by [CFKP04], a greedy searching algorithm is proposed to concatenate these learned co-articulation models. Fig. 5.13 illustrates all possible juncture cases for adjacent diphones/triphones. As illustrated in Fig. 5.13, only the 78 left case in the figure needs motion blending. In this work, motion blending technique presented in [MCP + 04] is used. Eq. 5.10 describes the used parametric rational G n continuousblendingfunctions[MCP + 04]. b n,μ (t)= μ(1−t) n+1 μ(1−t) n+1 +(1−μ)t n+1 (5.10) wheret isin[0,1], μ isin(0,1),andn≥0. Algorithm 1 describes the procedure of the speech motion synthesis algorithm. Note thatinthecasethatdiphonemodelsforspecificdiphonecombinationsarenotavailable (notincludedinthetrainingdata),cosineinterpolationisusedasanalternative. Algorithm1MotionSynthesis Input: P 1→n ,Keys Output: Motion 1: i←1, prevTriphone←FALSE,Motion=φ 2: whilei<n do 3: ifi+2≤nandtriM(P i →P i+2 )existsthen 4: NewMo=synth(triM(P i →P i+2 ),Keys) 5: if preTriphonethen 6: Motion=catBlend(Motion,NewMo) 7: else 8: Motion=concat(Motion,NewMo) 9: endif 10: preTriphone=TRUE,i=i+1 11: else 12: if preTriphonethen 13: preTriphone=FALSE 14: else 15: NewMo=synth(diM(P i →P i+1 ),Keys) 16: Motion=concat(Motion,NewMo) 17: endif 18: i=i+1 19: endif 20: endwhile 79 ExpressiveMotionSynthesis Ontheexpressionside,fromFig.5.12,itisobservedthatexpressionisjustacontinuous curve in the low-dimensional PIEES. Texture Synthesis, originally used in 2D image synthesis, is a natural choice for synthesizing novel expression sequences. Here non- parametric sampling methods [EL99, LLX + 01] are used. The patch-based sampling algorithm [LLX + 01] is chosen due to its time efficiency. Its basic idea is to grow one texturepatch(fixedsize)atatime,randomlychosenfromqualifiedcandidatepatchesin the input texture sample. In this work, each texture sample (analogous to a pixel in the 2D image texture case) consists of three elements: the three coefficients of the projec- tionofamotionvectoronthethree-dimensionalPIEES.Theparametersofpatch-based sampling [LLX + 01] for this case (Figure 3.1), patch size = 30, the size of boundary zone=5,andthetoleranceextent=0.03. 
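A minimal sketch of the patch-based sampling step, assuming the input PIEES curve is given as an array of three-dimensional samples, is shown below. The fallback to the best-matching candidate when nothing passes the tolerance is an assumption added here to avoid the "getting stuck" case discussed later in this chapter.

```python
import numpy as np

def patch_based_expression_synthesis(sample, out_len, patch=30, boundary=5, tol=0.03, rng=None):
    """Grow a new expression curve patch by patch from an input PIEES curve.

    sample  : (n, 3) array, an input expression curve in the 3-D PIEES.
    out_len : desired number of synthesized frames.
    A candidate patch qualifies when its first `boundary` frames match the last
    `boundary` frames of the current output within `tol` (mean squared distance).
    """
    rng = rng or np.random.default_rng()
    starts = np.arange(0, len(sample) - patch)             # all candidate patch start indices
    out = list(sample[rng.choice(starts):][:patch])        # seed with a random patch

    while len(out) < out_len:
        tail = np.asarray(out[-boundary:])                 # boundary zone of current output
        costs = np.array([np.mean((sample[s:s + boundary] - tail) ** 2) for s in starts])
        qualified = starts[costs < tol]
        if len(qualified) == 0:                            # fall back to the best match
            qualified = starts[[np.argmin(costs)]]
        s = rng.choice(qualified)
        out.extend(sample[s + boundary:s + patch])         # append the non-overlapping part
    return np.asarray(out[:out_len])
```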
The expressive facial motion data used for extracting the PIEES are captured with full expressions, whereas in real-world applications humans usually vary their expression intensity over time. Thus, an optional expression-intensity curve scheme is provided to intuitively simulate the EID. This expression-intensity curve is used to control the weighted blending of synthesized expression signals and synthesized neutral visual speech. Ideally, the EID (expression-intensity curves here) should be automatically extracted from the given audio by emotion-recognition programs [CCT+01, Pet99, LNP03]. The optional expression-intensity control is a manual alternative to such programs.

Basically, an expression-intensity curve can be any continuous curve in time versus expression-intensity space, and its range is from 0 to 1, where zero represents "no expression" (neutral) and one represents "full expression." By interactively controlling expression-intensity curves, users can conveniently control expression intensities over time.

Mapping Marker Motion to 3D Faces

After the marker motion data are synthesized, it is necessary to map the marker motions to 3D face models. The target face model is a NURBS face model composed of 46 blendshapes (Fig. 5.15), such as {leftcheekRaise, jawOpen, ...}. The weight range of each blendshape is [0,1]. A blendshape model B is the weighted sum of some pre-designed shape primitives [LMDN05]:

B = B_0 + \sum_{i=1}^{N} W_i * B_i    (5.11)

where B_0 is a base face, B_i are delta blendshape bases, and W_i are blendshape weights. Here B and B_i (Eq. 5.11) are vectors that concatenate all markers' 3D positions.

An RBF-regression based approach [DCFN06] is used to directly map synthesized marker motions to blendshape weights. In the first stage (capture stage), a motion capture system and a video camera are simultaneously used to record the facial motions of a human subject. The audio recordings from the two systems are misaligned by a fixed time-shift because of slight differences in recording start time. The manual alignment of these two audio recordings results in strict alignments between mocap frames and video frames (referred to as mocap-video pairs). In the second stage, a few reference mocap-video pairs are carefully selected that cover the spanned space of visemes and emotions as completely as possible. In the third stage, motion capture data were reduced to a low-dimensional space by Principal Component Analysis (PCA). Meanwhile, based on the selected reference video frames (face snapshots), users manually tune the weights of the blendshape face model to perceptually match the model and the reference images, which creates supervised correspondences between the PCA coefficients of motion capture frames and the weights of the blendshape face model (referred to as mocap-weight pairs). Taking the reference mocap-weight pairs as training examples, the Radial Basis Functions (RBF) regression technique is used to automatically compute blendshape weights for new motion capture frames. Fig. 5.14 illustrates this process. More details about this mapping algorithm can be found in [DCFN06].

Figure 5.14: Schematic overview of mapping marker motions to blendshape weights. It is composed of four stages: data capture stage, creation of mocap-video pairs, creation of mocap-weight pairs, and RBF regression.

In summary, the complete synthesis algorithm can be described in Algorithm 2.
Here procedure MotionSynthesis synthesizes neutral visual speech using the above Algo- rithm 1, procedure ExprSynthesis synthesizes novel expression signals with specified 82 Figure 5.15: The used blendshape face model. The left panel shows the smooth shaded modelandtherightpanelshowstherenderedmodel. expressionfrom thePIEES,theprocedure Blend combinesthese twotogethertogener- ate expressive facial motion, and note that this blending is done on the motion marker level. The final procedure Map2Model maps the synthesized marker motion to a spe- cific3Dfacemodel. Algorithm2ExpressiveFacialAnimationSynthesis Input: a phoneme sequence with timing P[1...N], specified key shapes keyShapes, specfied expressioninformationExpr,specfied3DfacemodelsModel Output: AnimFace 1: SpeechSeq=MotionSynthesis(P,keyShapes) 2: ExpSignal =ExprSynthesis(PIEES,Expr) 3: ExprMotion=Blend(SpeechSeq,ExpSignal) 4: AnimFace=Map2Model(Model,ExprMotion) 5.2.4 ResultsandEvaluations To evaluate the performance of this expressive facial animation synthesis system, two differenttestsweredesigned. Thefirsttestistosynthesizenewexpressivevisualspeech 83 animation given novel audio/text inputs. The second test is used to verify this approach by comparing ground-truth motion capture data and synthesized speech motion via tra- jectorycomparisons. New sentences (not used in the training) and music are used for synthesizing novel speech animations. First, recorded speech (or music), with its accompanying texts (or lyrics),wasinputtedtothephoneme-alignmentprogram(speechrecognitionprogramin a force-alignment mode) to generate a phoneme sequence with timing labels, then this phonemesequencewasfedintothefacialanimationsynthesissystemtosynthesizecor- responding expressive facial animations. Fig. 5.16 and Fig. 5.17 illustrate some frames ofsynthesizedexpressivefacialanimations. Figure5.16: Someframesofsynthesizedhappyfacialanimation. Figure5.17: Someframesofsynthesizedangryfacialanimation. The learned speech co-articulation models are evaluated by marker trajectory compar- isons. Several original speaker’s sentences (not used in previous training stage) were used to synthesize neutral speech animations using this approach. The synthesized 84 motion of the same ten markers around the mouth area is then compared frame-by- framewiththecorrespondingground-truthdata. Inthisevaluation,manuallypickedkey shapes are used that may not perfectly match the motion capture geometry. Fig. 5.18 shows comparison results for a lower lip marker of one phrase. These trajectories in Fig. 5.18 are similar, but the velocity curves have more obvious differences at some places. Its underlying reason could be that in current work, only markers’ 3D positions are used during the modeling stage, while velocity information is ignored. Hence, an interesting future extension could be combining position and velocity for facial anima- tionlearning. 5.2.5 ConclusionsandDiscussion In this section, a novel system is presented for synthesizing expressive facial anima- tion. It learns speech co-articulation models from real motion capture data, by using a weight-decomposition method, and the presented automatic technique for synthesizing dynamicexpressionmodelsboththeEMDandtheEID,improvingonpreviousexpres- sionsynthesiswork[KMT01,MsGST03]. Thisapproachlearnspersonality(speechco- articulations and phoneme-independent expression eigen-spaces) using data captured from the human subject. The learned personality can then be applied to other target faces. 
The statistical models learned from real speakers make the synthesized expres- sivefacialanimationmorenaturalandlife-like. Thisco-articulationmodelingapproach can be easily extended to model the co-articulation effects among longer phoneme sequences(e.g. 5-6phonemelength) atthecostofrequiringsignificantlymoretraining data. 85 Figure 5.18: Comparisons with ground-truth marker motion and synthesized motion. Theredlinedenotesground-truthmotionandthebluedenotessynthesizedmotion. The top illustrates marker trajectory comparisons, and the bottom illustrates velocity com- parisons. Note that the sampling frequency here is 120Hz. The phrase is “...Explosion ininformationtechnology...”. Thissystemcanbeusedforvariousapplications. Foranimators,afterinitialkeyshapes areprovided,thisapproachcanserveasarapidprototypingtoolforgeneratingnatural- looking expressive facial animations while simultaneously preserving the expressive- nessoftheanimation. Afterthesystemgeneratesfacialanimations,animatorscanrefine theanimationbyadjustingthevisemesandtimingasdesired. Becauseoftheverycom- pactsizeoftheco-articulationmodelsandthePIEESlearnedbythisapproach,itcanbe 86 conveniently applied onto mobile-computing platforms (with limited memory), such as PDAsandcellphones. This co-articulation modeling approach is efficient and reasonably effective, but it does not differentiate between varying speaking rates. As such, future work could include extending current co-articulation models to handle different speaking rates and affec- tivestates,forexample,investigatinghowthelearnedco-articulationfunctions(curves) change when speaking rates are increased or decreased. Another major limitation of thisco-articulationmodelingisthatitdependson“combinationalsampling”fromtrain- ing data. Hence, the best results require that the training data have a complete set of phoneme combinations. Additionally, the current system still requires users to provide key viseme shapes. Future work on automatically extracting key visemes shapes from motioncapturedatawouldbeapromisingwaytoreplacethismanualstep. In terms of validating this approach, I am aware that objective comparisons are not enough, conducting audio-visual perceptual experiments could be another useful way to evaluate this approach, which I plan to pursue in the future. Another consideration is that only ten markers around the lips does not capture all the details of lip motion, for example, when the lips are close, inner lips could penetrate each other. I plan to usemoremarkersforfacialmotioncaptureinordertofurtherimproveandvalidatethis approach. Alimitationoftheexpressionsynthesisapproachisthattheinteractionbetweenexpres- sion and speech is simplified. I assume there is a PIEES extracted by phoneme-based subtraction. The time-warpping algorithm used in the expression eigen-space construc- tion may cause the loss of velocity/acceleration that is essential to expressive facial motion. I plan to investigate the possibility of learning statistical models for veloc- ity/acceleration patterns in captured expressive speech motion. Transforming these 87 learnedpatternsbacktothesynthesizedfacialmotionwillfurtherenhanceitsexpressive realism. A large amount of expressive facial motion data are needed to construct the PIEES, becauseitisdifficulttoanticipateinadvancehowmuchdataareneededtoavoidgetting “stuck” during the synthesis process. However, after some animations are generated, it iseasytoevaluatethevarietyofsynthesizedexpressionsandmoredatacanbeobtained ifnecessary. 
I plan to look into automatic ways to avoid "getting stuck" when the training data are insufficient; for example, if the synthesis algorithm cannot find enough qualified candidates under a predefined threshold value, the algorithm should adjust this threshold adaptively and automatically.

Another limitation of the expression synthesis work is that some visemes may lose their characteristic shapes when blending expressive motion with neutral motion. As future work, I plan to address this problem by using "constrained texture synthesis" for expression signal synthesis, which imposes hard constraints at specified places. Another promising way to avoid "the possible relaxation of characteristic shapes" is to learn a speech co-articulation model for each affective state.

Most learning-based systems face one common difficult concern: what are the optimal parameter decisions and trade-offs involved in the learning system, and how can these parameters be determined from the data itself? This system has similar issues. In the speech co-articulation learning part, I had to make experimental decisions on the fitting degree and the length of learned phoneme sequences (e.g., 3 for triphones). Additionally, a trade-off between the dimensionality of the learned expression space and the quality of synthesized expressions arises in the expression synthesis part. Understanding these trade-offs and their effects on system performance would be an important and interesting direction to pursue in the future.

I am aware that expressive eye motion and head motion are critical parts of expressive facial animation, since the eyes are one of the strongest cues to the mental state of a person, and head movement is correlated to some extent with speech content [GCSH02, BDNN05]. Simply adding pre-recorded head movement and eye motion onto newly synthesized talking faces that speak novel sentences may create unrealistic mouth-head gesture coordination. Future work on speech-driven expressive eye motion and head motion synthesis can greatly enhance the realism of synthesized expressive facial animations.

5.3 Sample-based Approach (eFASE)

In this section, a novel data-driven expressive facial animation synthesis and editing system (eFASE) is developed to generate expressive facial animations by concatenating and resampling captured data while users establish constraints and goals (Fig. 5.19). Its algorithm synthesizes an expressive facial motion sequence by searching for best-matched motion capture frames in the database, minimizing a search cost function based on the new speech phoneme sequence, the user-specified constrained expressions for phonemes, and user-specified emotion modifiers.

Users can browse and select constrained expressions for phonemes using a novel 2D expressive phoneme-Isomap visualization and editing tool. Simultaneously, users can optionally specify emotion modifiers over arbitrary time intervals. These user interactions are phoneme-aligned to provide intuitive speech-related control. It should be noted that user input is not needed to create motion sequences, only to impart them with a desired expressiveness. Figure 5.19 illustrates the high-level components of the eFASE system.

Figure 5.19: Overview of the eFASE pipeline. At the top, given novel phoneme-aligned speech and specified constraints, the system searches for best-matched motion nodes in the facial motion database and synthesizes expressive facial animation. The bottom illustrates how users specify motion-node constraints and emotions with respect to the speech timeline.
90 Besides the effective search algorithm and intuitive user controls, this eFASE system provides novel and powerful editing tools for managing a large facial motion database. Since facial motion capture is not perfect, contaminated marker motions can occasion- ally occur somewhere in a motion capture sequence. Eliminating these contaminated motions is difficult but very useful. The phoneme-Isomap based editing tool visualizes the facial motion database in an intuitive way, which can help users to remove contam- inated motion sequences (motion nodes), insert new motion sequences intuitively, and reusecaptureduncontaminatedmotionsefficiently. 5.3.1 FacialDatabaseProcessing As mentioned in Chapter 2, head motion was removed from the motion capture data and the motions of all 90 markers in one frame were packed into a 270 dimensional motionvector,PrincipalComponentAnalysis(PCA)isappliedontoallthemotionvec- tors to reduce its dimensionality. The reduced dimensionality is experimentally set to 25,whichcovers98.53%ofthevariation. Therefore,each270-dimensionalmotionvec- toristransformedintoareduced25-dimensionalvectorconcatenatingtheretainedPCA coefficients. Inthisapproach, Motion FramesareusedtorefertothesePCAcoefficient vectorsortheircorrespondingfacialmarkerconfigurations. To make the terms used in this approach consistent, I defined two new terms: Motion Nodes and Phoneme Clusters. Based on the phonemes’ time boundaries (from the phoneme-alignment results in Chapter 2), I chopped the motion capture sequences into small subsequences that span several to tens of motion frames, and each subsequence correspondstothedurationofaspecificphoneme. Eachphonemeoccursmanytimesin the spoken corpus, with varied co-articulation. I refer to these subsequences as Motion 91 Nodes. For each motion node, its triphone context that includes its previous phoneme and next phoneme is also retained. Putting all motion nodes of a specific phoneme togetherproducesthousandsofmotionframesrepresentingthefacialconfigurationsthat occurforthisphoneme. Allthemotion-framescorrespondingtoaspecificphonemeare referredtoasaPhonemeCluster. Eachmotion-frameinaphonemeclusterhasanemo- tionlabelandarelativetimeproperty(relativetothedurationofthemotionnodethatit belongs to). The specific phoneme that a motion node represents is called the phoneme of this motion node. Fig. 5.20 illustrates the process of constructing phoneme clusters andmotionnodes. Besides the above phoneme clusters, a facial motion-node database is also built. The processedmotionnodedatabasecanbeconceptuallyviewedasa3Dspace(spannedby sentence,emotion,andmotionnodeorder). Becausethesentenceistheatomiccaptured unit,eachmotionnodeo i (exceptthefirst/lastmotionnodeofasentencerecording)has apredecessormotionnode pre(o i )andasuccessivemotionnodesuc(o i )initssentence recording (illustrated as solid directional lines in Fig. 5.21). Possible transitions from one motion node to another motion node are illustrated as dashed directional lines in Fig. 5.21. Note that motion nodes for the silence phoneme /pau/ were discarded, and if the /pau/ phoneme appears in the middle of a sentence’s phoneme transcript, it will breakthesentenceintotwosub-sentenceswhenconstructingthemotionnodedatabase. Specialpostprocessingforthesilencephoneme/pau/willbedescribedinSection5.3.5. Figure5.21illustratestheorganizationoftheprocessedmotionnodedatabase. 
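The following sketch shows one possible way to organize motion nodes and phoneme clusters from a phoneme-aligned sentence recording. The data layout and field names are illustrative assumptions, not the actual eFASE data structures.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionNode:
    phoneme: str                               # the phoneme this subsequence covers
    frames: np.ndarray                         # (num_frames, 25) retained PCA coefficients
    emotion: str                               # neutral / angry / sad / happy
    triphone: tuple                            # (previous phoneme, this phoneme, next phoneme)
    prev_node: Optional["MotionNode"] = None   # captured predecessor in the same (sub-)sentence
    next_node: Optional["MotionNode"] = None   # captured successor in the same (sub-)sentence

def build_nodes_and_clusters(pca_frames, segments, emotion):
    """Chop one sentence recording into motion nodes and group frames into phoneme clusters.

    pca_frames : (num_frames, 25) PCA coefficients of one captured sentence.
    segments   : list of (phoneme, start_frame, end_frame) for that sentence.
    A mid-sentence /pau/ breaks the predecessor/successor chain, as in the thesis.
    """
    nodes, clusters, prev = [], {}, None
    for i, (pho, start, end) in enumerate(segments):
        if pho == 'pau':                       # silence nodes are discarded ...
            prev = None                        # ... and they split the sentence into sub-sentences
            continue
        prev_pho = segments[i - 1][0] if i > 0 else None
        next_pho = segments[i + 1][0] if i + 1 < len(segments) else None
        node = MotionNode(pho, pca_frames[start:end], emotion, (prev_pho, pho, next_pho))
        node.prev_node = prev
        if prev is not None:
            prev.next_node = node
        nodes.append(node)
        prev = node
        # every frame of the node joins its phoneme cluster, tagged with emotion and relative time
        for j, frame in enumerate(node.frames):
            rel_t = j / max(len(node.frames) - 1, 1)
            clusters.setdefault(pho, []).append((frame, emotion, rel_t))
    return nodes, clusters
```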
92 Figure 5.20: To construct a specific /w/ phoneme cluster, all expressive motion cap- ture frames corresponding to /w/ phonemes are collected, and the Isomap embedding generates a 2D expressive Phoneme-Isomap. Colored blocks in the figure are motion nodes. 5.3.2 ExpressivePhonemeIsomaps This section describes how the phoneme clusters are transformed into 2D expressive phoneme-Isomaps. The phoneme-Isomaps are needed to allow users to interactively browseandselectmotion-frames. SimilartotheapplicationofPCAtoaspecifictypeof human body motion (e.g. jumping) to generate a low-dimensional manifold [SHP04], 93 Figure 5.21: Schematic illustration of the organization of the processed motion node database. Here solid directional lines indicate predecessor/successor relations between motionnodes,anddasheddirectionallinesindicatepossibletransitionsfromonemotion node to the other. The colors of motion nodes represent different emotion categories of themotionnodes. each phoneme cluster is processed with the Isomap framework [TSL00] to embed the clusterinatwo-dimensionalmanifold(theneighbornumberissetto12). 2D Phoneme-Isomaps are compared with 2D Phoneme-PCA maps (two largest eigen- vector expanded spaces ). By visualizing both maps in color schemes, it is found that points for one specific color (emotion) were distributed throughout the 2D PCA maps, and thus, the 2D PCA display is not very useful as a mean for frame selection. The 2D Phoneme-Isomapsclustermanyofthecolor(emotion)pointsandleadtoabetterprojec- tion, so that the points from the various emotions are better distributed and make more sense. I also found that directions, such as a vertical axis, often corresponded to intu- itive perceptual variations of facial configurations, such as an increasingly open mouth. Figure 5.22 compares PCA projection and Isomap projection on the same phoneme clusters. 94 Figure5.22: Comparisonsbetween2DPhoneme-PCAmapsand2DPhoneme-Isomaps. Theleftpanelsare2DPhoneme-PCAmapsfor/aa/(top)and/y/(bottom),andtheright panelsare2DPhoneme-Isomapsfor/aa/(top)and/y/(bottom). Inallfourpanels,black is for neutral, red for angry, green for sad, and blue for happy. Note that some points mayoverlapintheseplots. The above point-renderings (Fig. 5.22) of 2D expressive phoneme-Isomaps are not directly suitable for interactively browsing and selecting facial motion-frames. A Gaussian kernel point-rendering visualizes the Isomaps, where pixels accumulate the Gaussian distributions centered at each embedded location. Pixel colors are propor- tionalto theprobabilityof acorrespondingmotion-framerepresenting thephoneme. In thisway,aphoneme-Isomapimageforeachphoneme-Isomapisgenerated(Fig.5.23). A 2D Delaunay triangulation algorithm is applied to the embedded 2D Isomap coor- dinates of each phoneme-Isomap to produce a triangulation network. Each vertex of these triangles corresponds to an embedded phoneme-Isomap point (a motion-frame in 95 the phoneme cluster). These triangles cover most of the points in the phoneme-Isomap image without overlap (some points around the image boundary are not covered by the triangulation network). Therefore, when a point in the Phoneme Isomaps is picked, its 2D position is mapped back to the 2D embedded Isomap coordinate system, then the mappedpositiondeterminestheuniquecoveringtriangle. Thebarycentricinterpolation isusedtointerpolatethreevertices(motion-frames)ofthecoveringtriangletogenerate anewmotion-frame(correspondingtothepickedpoint). 
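The triangulation and barycentric lookup could be implemented roughly as below, using SciPy's Delaunay triangulation as a stand-in for the precomputed triangulation network; variable names are assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def pick_motion_frame(picked_2d, embed_2d, cluster_frames):
    """Map a picked point in a phoneme-Isomap back to a new motion-frame.

    picked_2d      : (2,) point in the embedded Isomap coordinate system.
    embed_2d       : (n, 2) 2-D Isomap coordinates of the phoneme cluster's frames.
    cluster_frames : (n, 25) the corresponding motion-frames (PCA coefficient vectors).
    """
    tri = Delaunay(embed_2d)                   # precomputed and stored in the real system
    simplex = tri.find_simplex(picked_2d)
    if simplex < 0:
        return None                            # point falls outside the triangulation network
    # barycentric coordinates of the picked point inside its covering triangle
    T = tri.transform[simplex]
    b = T[:2].dot(picked_2d - T[2])
    weights = np.append(b, 1.0 - b.sum())
    verts = tri.simplices[simplex]             # indices of the three covering motion-frames
    return weights @ cluster_frames[verts]     # barycentric blend -> new motion-frame
```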
Aphoneme-Isomapimageisa visualizedrepresentationofacontinuousspaceofrecordedfacialconfigurationsforone specificphoneme(Fig.5.23). Thephoneme-Isomapimageofthe/ay/phonemeisshown in Fig. 5.23. Note that these phoneme-Isomap images and their mapping/triangulation informationwereprecomputedandstoredforlateuse. Basedontheaboveinterpolatedmotionframe(foranypickedpoint),a3Dfacemodelis deformedcorrespondingly. Afeaturepointbasedmeshdeformationapproach[KGT00] is used for this rapid deformation. More details about how to map a motion-frame to a 3DfacemodelaregiveninSection5.3.5. 5.3.3 PhonemeMotionEditing The captured facial motion database is composed of hundreds of thousands of motion captureframes,anditischallengingtomanageandeditthesehugedata. Thephoneme- Isomapimagesallowuserstoeditsuchhugefacialmotiondata. Userscaninteractively createandaddnewmotionnodesintothefacialmotiondatabase. AsdescribedinSection5.3.1,eachmotionnodeisasequenceofmotioncaptureframes ofonespecificphonemeintheirrecordingorder. Itisvisualizedasadirectedtrajectory (curve)inphoneme-Isomapimages(leftofFig.5.24). Sinceeachpointonthetrajectory 96 Figure 5.23: A 2D expressive phoneme-Isomap for phoneme /ay/. Here each point in the map corresponds to a specific 3D facial configuration. Note that gray is for neutral, redforangry,greenforsad,andblueforhappy. represents a specific facial configuration (see Fig. 5.23), and the image color behind a motion-node trajectory represents the emotion category of the motion node, users can intuitively and conveniently inspect any frame in the motion node (a point on the trajectory) as follows: when users click any point on the trajectory, its corresponding 3Dfacedeformationisinteractivelydisplayedinapreviewwindow. 97 If contaminated motion nodes are found, users can choose to select and delete these motionnodesfromthedatabase,sothatthefollow-upmotionsynthesisalgorithm(Sec- tion 5.3.4) could avoid the risk of being trapped into these contaminated motion nodes. For example, yellow curves in the middle panel of Fig. 5.24 represent two selected motionnodesthataretobedeleted. Basedonexistingmotionnodesandtheircorrespondingtrajectoriesinphoneme-Isomap images, users can create new motion nodes by drawing free-form 2D trajectories (each continuoustrajectorycorrespondstoanewmotionnode). Inthisway,userscanexpand the facial motion database. The right panel of Fig. 5.24 shows a new motion node (trajectory)beingcreated. Figure 5.24: Snapshots of motion editing for the phoneme /ay/. Here each trajectory (curve)representsonemotionnodeandimagecolorrepresentsemotioncategory. 5.3.4 SpeechMotionSynthesis Given a novel phoneme sequence and its emotion specifications as input, how motion synthesis algorithms synthesize corresponding facial motion will be described in this section. Thesystemisfullyautomaticwhileprovidingoptionalintuitivecontrols: users 98 can specify a motion-node constraint for any phoneme utterance (“hard constraints”) via the above phoneme-Isomap interface, and the algorithm will automatically regard the emotion modifiers as “soft constraints”. Under these hard and soft constraints, this algorithm searches for a best-matched path of motion nodes from the processed facial motion node database by minimizing a cost function using a constrained dynamic pro- grammingtechnique. SpecifyMotion-NodeConstraints Usersinteractivelybrowsephoneme-Isomapimagestospecifymotion-nodeconstraints and tie them to a specific phoneme utterance’s expression. 
This time is referred to as a constrained time and its corresponding phoneme as a constrained phoneme. Phoneme timing is included in the preprocessed phrase (phoneme) transcript, so phoneme- Isomapsareautomaticallyloadedonceaconstrainedphonemeisselected(Fig.5.25). Figure 5.25: Illustration of how to specify a motion-node constraint via the phoneme- Isomap interface. When users want to specify a specific motion node for expressing a particular phoneme utterance, its corresponding phoneme-Isomaps are automatically loaded. Then,userscaninteractwiththesystemtospecifyamotion-nodeconstraintfor thisconstrainedphoneme. 99 To guide users in identifying and selecting proper motion nodes, this system auto- matically highlights recommended motion nodes and their picking points. Assum- ing a motion node path o 1 ,o 2 ,...,o k is obtained by the automatic motion-path search algorithm (the follow-up Section 5.3.4 details this algorithm), users want to specify a motion-nodeconstraintforaconstrainedtime T c (assumeitscorrespondingconstrained phoneme is P c and its motion-frame at T c is F c , called current selected frame). The constrained time T c is first divided by the duration of the constrained phoneme P c to calculate its relative time t c (0≤t c ≤ 1). Then, for each motion node in the phoneme cluster, the system highlights one of its motion frames whose relative time property is theclosesttocurrentrelativetimet c . Thesemotionframesarereferredtoastime-correct motionframes. As mentioned in Section 5.3.1, the specific triphone context of each motion node was also retained. By matching the triphone context of the constrained phoneme with those of existing motion nodes in the phoneme cluster of P c , this system identifies and high- lightsthemotionnodesinthephonemeclusterthathavethesametriphonecontextasthe constrainedphoneme(termedcontext-correctmotionnodes ). Forexample,inFig.5.25, thecurrentconstrainedphonemeis/w/,anditstriphonecontextis[/iy/,/w/,/ah/],sothe systemwillidentifythemotionnodesofthe/w/phonemeclusterthathavethetriphone context [/iy/, /w/, /ah/] as the context-correct motion nodes. In this way, by picking their representative time-correct motion frames, users can choose one of those motion nodes as a motion-node constraint for P c . This motion node constraint is imposed per phonemeutterance. Inotherwords,ifonespecificphonemeappearsmultipletimesina phoneme input sequence, users can specify different motion-node constraints for them. Figure 5.26 shows a snapshot of phoneme-Isomap highlights for specifying motion- nodeconstraints. Notethatthephoneme-Isomapimageisalwaysthesameforaspecific phoneme, but these highlighting symbols (Fig. 5.26) are related to current relative time 100 t c and current triphone context. So, these markers are changed over time (even for the samephoneme). Figure5.26: Asnapshotofphoneme-Isomaphighlightsforspecifyingmotion-nodecon- straints. SearchfortheOptimalConcatenations This motion-node path search problem can be formalized as follows: Given a novel phomeme sequence input Ψ = (P 1 ,P 2 ,···,P T ) and its emotion modifiers Θ = (E i ,E 2 ,···,E T ) (E i only can be one of four possible values: neutral, angry, sad and 101 happy),andoptionalmotion-nodeconstraints Φ=(C t 1 =o i 1 ,C t 2 =o i 2 ,···,C t k =o i k ,t i 6= t j ),itsgoalistosearchforabest-matchedmotion-nodepath Γ ∗ =(o ∗ ρ 1 ,o ∗ ρ 2 ,···,o ∗ ρ T )that minimizes a cost function COST(Ψ,Θ,Φ,Γ ∗ ). Here o i represents a motion node with indexi. 
To make the definition of the above cost function clear, I first leave out the constraint parameterΦanddefineaplainversionCOST(Ψ,Θ,Γ)withoutmotion-nodeconstraints. How the constraint parameterΦ affect the cost function and the search process will be described later in this section. The cost function COST(Ψ,Θ,Γ) is the accumulated summation of Transition Cost TC(o ρ i ,o ρ i+1 ), Observation Cost OC(P i ,o ρ i ), and Emo- tionMismatchPenaltyEMP(E i ,o ρ i ),asdescribedinEq.5.12. COST(Ψ,Θ,Γ)= T−1 ∑ i=1 TC(o ρ i ,o ρ i+1 )+ T ∑ i=1 (OC(P i ,o ρ i )+EMP(E i ,o ρ i )) (5.12) Transition cost TC(o ρ i ,o ρ i+1 ) represents the smoothness of the transition from one motionnodeo ρ i totheothermotionnodeo ρ i+1 . Ifo ρ i isthecapturedpredecessormotion node of o ρ i+1 (pre(o ρ i+1 ) =o ρ i ), their transition cost is set to zero (an expected perfect transition). If the phoneme of pre(o ρ i+1 ) exists and belongs to the same viseme cate- gory of that of o ρ i , then the cost value is the weighted sum of Direct Smoothing Cost DSC(o ρ i ,pre(o ρ i+1 ))andPositionVelocityCostPVC(o ρ i ,o ρ i+1 ). Theconsiderationhere isthatifthesourcemotionnodehasthesamevisemecategorywiththeexpectedprede- cessor motion node, it is reasonable to somehow consider it. If these two motion nodes do not share the same viseme category, then an additional penalty value PNT is added. If pre(o ρ i+1 ) does not exist, a big penalty value α∗PNT is assigned. Here PNT is a 102 large constant penalty value and α is a magnifying coefficient (> 1). This transition costisdefinedinEq.5.13. TC(o ρ i ,o ρ i+1 )=                                                  0 if pre(o ρ i+1 )=o ρ i β∗DSC(o ρ i ,pre(o ρ i+1 ))+PVC(o ρ i ,o ρ i+1 ) ifviseme(o ρ i )=viseme(pre(o ρ i+1 )) β∗DSC(o ρ i ,pre(o ρ i+1 ))+PVC(o ρ i ,o ρ i+1 )+PNT ifviseme(o ρ i )6=viseme(pre(o ρ i+1 )) α∗PNT if pre(o ρ i+1 )=NIL (5.13) To compute DSC(o ρ i ,pre(o ρ i+1 )), time-warp o ρ i to make it align with pre(o ρ i+1 ) frame by frame, then do a linear blend on the time-warped motion warp(o ρ i ) and pre(o ρ i+1 ), finally,computetheintegralofthesecondderivativeoftheblendedmotionastheDirect Smoothing Cost. PVC(o ρ i ,o ρ i+1 ) is the weighted sum of position gap and velocity gap between the end of o ρ i and the start of o ρ i+1 . Eq. 5.14 to Eq. 5.15 define these two cost definitions. DSC(o ρ i ,pre(o ρ i+1 ))= Z Blending(warp(o ρ i ),pre(o ρ i+1 )) 00 dt (5.14) PVC(o ρ i ,o ρ i+1 )=η∗PosGap(o ρ i ,o ρ i+1 )+VeloGap(o ρ i ,o ρ i+1 ) (5.15) 103 Observation Cost OC(P i ,o ρ i ) that measures the goodness of a motion node o ρ i for expressing a given phoneme P i is computed as follows: if the phoneme of o ρ i is P i orP i isthesilencephoneme/pau/,thecostissettozero. Iftheyarethesameintermsof viseme category, then it is set to a discounted penalty value (0<γ <1), otherwise, the costisapenaltyvalue. Eq.5.16definestheobservationcost. OC(P i ,o ρ i )=          0 ifP i = pho(o ρ i )orP i =/pau/ γ∗δ∗PNT ifviseme(P i )=viseme(o ρ i ) δ∗PNT otherwise (5.16) If the emotion label of a motion node o ρ i is same as the specified emotion modifier E i , the emotion mismatch penalty is set to zero, otherwise it is set to a constant penalty value. Eq.5.17describesthisdefinition. EMP(E i ,o ρ i )=    0 ifE i =emotion(o ρ i ) ϕ∗PNT otherwise (5.17) Note that in the above equations, constant parametershα,β,η,γ,δ,ϕi are used to bal- ancetheweightsofdifferentcosts. Theexperimentaldeterminationoftheseparameters willbedescribedinSection5.3.5. 
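The cost terms of Eq. 5.12 to 5.17 can be sketched as follows. The Direct Smoothing Cost is stubbed out, the viseme lookup table is assumed to be supplied elsewhere, and the penalty constant PNT is a placeholder; the trade-off parameter values are those reported in Section 5.3.5.

```python
import numpy as np

# trade-off values from the cross-validation described in Section 5.3.5
ALPHA, BETA, ETA, GAMMA, DELTA, PHI = 3.6052, 5.7814, 6.8282, 0.2903, 0.315, 0.109
PNT = 1000.0                      # large constant penalty (placeholder value)
VISEME_OF = {}                    # phoneme -> viseme category; assumed to be filled elsewhere

def viseme(phoneme):
    return VISEME_OF.get(phoneme, phoneme)

def dsc(o_a, o_b):
    """Direct Smoothing Cost (Eq. 5.14): time-warp o_a onto o_b, blend, and integrate the
    second derivative of the blended motion.  Stubbed here to keep the sketch short."""
    return 0.0

def pvc(o_a, o_b):
    """Position-Velocity Cost (Eq. 5.15): weighted position and velocity gaps at the juncture."""
    pos_gap = np.sum((o_a.frames[-1] - o_b.frames[0]) ** 2)
    vel_a = o_a.frames[-1] - o_a.frames[-2]
    vel_b = o_b.frames[1] - o_b.frames[0]
    return ETA * pos_gap + np.sum((vel_a - vel_b) ** 2)

def transition_cost(o_i, o_j):
    """TC(o_i, o_j) of Eq. 5.13."""
    if o_j.prev_node is o_i:                   # captured predecessor: perfect transition
        return 0.0
    if o_j.prev_node is None:                  # no captured predecessor exists
        return ALPHA * PNT
    cost = BETA * dsc(o_i, o_j.prev_node) + pvc(o_i, o_j)
    if viseme(o_i.phoneme) != viseme(o_j.prev_node.phoneme):
        cost += PNT                            # different viseme category: extra penalty
    return cost

def observation_cost(p_i, o):
    """OC(P_i, o) of Eq. 5.16."""
    if p_i == o.phoneme or p_i == 'pau':
        return 0.0
    if viseme(p_i) == viseme(o.phoneme):
        return GAMMA * DELTA * PNT
    return DELTA * PNT

def emotion_penalty(e_i, o):
    """EMP(E_i, o) of Eq. 5.17."""
    return 0.0 if e_i == o.emotion else PHI * PNT
```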
Basedontheabovecostdefinitions(Eq.5.12toEq.5.17),adynamicprogrammingalgo- rithmisusedtosearchforthebest-matchedmotion-nodesequence Γ ∗ (o ∗ ρ 1 ,o ∗ ρ 2 ,···,o ∗ ρ T ). Assume there are total N motion nodes in the processed motion node database. This 104 searchalgorithmcanbedescribedasfollows: (1)Initialization(for1≤i≤N): ϕ 1 (i)=OC(P 1 ,o i )+EMP(E 1 ,o i ) (5.18) (2)Recursion(for1≤ j≤N;2≤t≤T): ϕ t (j)=min i {ϕ t−1 (i)+TC(o i ,o j )+OC(P t ,o j )+EMP(E t ,o j )} (5.19) χ t (j)=argmin i {ϕ t−1 (i)+TC(o i ,o j )+OC(P t ,o j )+EMP(E t ,o j )} (5.20) (3)Termination: COST ∗ =min i {ϕ T (i)} (5.21) ρ ∗ T =argmin i {ϕ T (i)} (5.22) (4)Recoverpathbybacktracking(t fromT−1to1): ρ ∗ t =χ t+1 (ρ ∗ t+1 ) (5.23) In this way, the best-matched motion-node path Γ ∗ = (o ∗ ρ 1 ,o ∗ ρ 2 ,···,o ∗ ρ T ) can be found. The time complexity of the above search algorithm isΘ(N 2 ∗T), here N is the number ofmotionnodesinthedatabaseandT isthelengthofinputphonemes. Now I describe how the specified motion-node constraints Φ = (C t 1 = o i 1 ,C t 2 = o i 2 ,···,C t k =o i k ,t i 6=t j )affecttheabovesearchalgorithmtoguaranteethatthesearched motion-nodepathpassesthroughthespecifiedmotionnodesatspecifiedtimes. Thecon- straintsaffectthesearchprocessbyblockingthechancesofothermotionnodes(except 105 thespecifiedones)atcertainrecursiontime. Eq.5.19-5.20intheabovesearchalgorithm arereplacedwiththefollowingnewequations(5.24-5.26). ϕ t (j)=min i {ϕ t−1 (i)+TC(o i ,o j )+OC(P t ,o j )+EMP(E t ,o j )+B t (j)} (5.24) χ t (j)=argmin i {ϕ t−1 (i)+TC(o i ,o j )+OC(P t ,o j )+EMP(E t ,o j )+B t (j)} (5.25) B t (j)=    0 if∃m,t m =t and j =i m HugePenalty otherwise (5.26) Giventheoptimalmotion-nodepath Γ ∗ ,itsmotionnodesareconcatenatedbysmoothing theirneighboringboundaries(detailedinSection5.3.5)andtransformingfacialmotions of the motion nodes from their retained PCA space to markers’ 3D space (Eq. 5.27). Finally, the synthesized marker motion sequence is transferred onto specific 3D face models(detailedinSection5.3.5). MrkMotion=MeanMotion+EigMx∗PcaCoef (5.27) 5.3.5 ImplementationIssues In this section, the implementation issues of some non-trivial parts of the eFASE sys- tem are described, including how a cross-validation approach is used to determine the costtrade-offparameters(Section5.3.4),howtotransferthesynthesizedmarkermotion sequence to a specific 3D face model, and how smoothing and resampling techniques areusedinthesystem. 106 Cross-ValidationforCostTrade-Off AsmentionedinSection5.3.4,parametershα,β,η,γ,δ,ϕibalancetheweightsofcosts from different sources, which is critical to realistic facial animation synthesis. Addi- tional twenty-four sentences (each emotion has six), not used in the training database, are used to determine these optimal parameters by cross-validation [HRF01]. A metric (Eq. 5.28) is introduced to measure the error between ground-truth motion capture data andsynthesizedmarkermotion. Gradient-descentmethods[Pie86]areusedtosearchfor optimalparametervalues. Incasethesearchprocessmaystuckatalocalminimum,the gradient-descentprocesswasrunhundredsoftimesandeachtimeitstartedatarandom place. Finally, a certain combination of parameters that leads to the minimum value among all searched results was picked as the optimal parameter setting (α = 3.6052, β =5.7814,η =6.8282,γ =0.2903,δ =0.315,andϕ =0.109). 
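For reference, the constrained search of Eq. 5.18 to 5.26 can be sketched as a Viterbi-style dynamic program. This is a minimal illustration that reuses the cost helpers sketched above, not the actual eFASE implementation; the time and node indexing is zero-based here.

```python
import numpy as np

def search_motion_path(phonemes, emotions, nodes, constraints=None):
    """Constrained dynamic-programming search over motion nodes (Eq. 5.18-5.26).

    phonemes, emotions : length-T input sequences P_1..P_T and E_1..E_T.
    nodes              : the N motion nodes of the processed database.
    constraints        : optional {time_index: node_index} motion-node constraints.
    """
    HUGE = 1e12
    N, T = len(nodes), len(phonemes)
    constraints = constraints or {}
    block = lambda t, j: 0.0 if (t not in constraints or constraints[t] == j) else HUGE  # B_t(j)

    # initialization (Eq. 5.18), with constraint blocking applied at every step
    phi = np.array([observation_cost(phonemes[0], nodes[j]) +
                    emotion_penalty(emotions[0], nodes[j]) + block(0, j) for j in range(N)])
    back = np.zeros((T, N), dtype=int)

    # recursion (Eq. 5.19-5.20 / 5.24-5.25); overall complexity Theta(N^2 * T)
    for t in range(1, T):
        new_phi = np.empty(N)
        for j in range(N):
            scores = phi + np.array([transition_cost(nodes[i], nodes[j]) for i in range(N)])
            back[t, j] = int(np.argmin(scores))
            new_phi[j] = (scores[back[t, j]] + observation_cost(phonemes[t], nodes[j]) +
                          emotion_penalty(emotions[t], nodes[j]) + block(t, j))
        phi = new_phi

    # termination and backtracking (Eq. 5.21-5.23)
    path = [int(np.argmin(phi))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))               # indices of the best-matched motion-node sequence
```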
Err = K ∑ i=1 (( N i ∑ j=1 MrkNum ∑ k=1 kSYN j,k i −ORI j,k i k 2 )/(N i ∗MrkNum)) (5.28) Here K is the number of cross-validation sentences, N i is the total number of frames of thei th sentence,SYN j,k i isthe3Dpositionofthek th syntheticmarkerofthe j th frameof thei th sentence,andORI j,k i isforthemotioncaptureground-truth. Deform3DfaceModels After facial marker motions are synthesized, it is necessary to transfer these motions ontoaspecific3Dfacemodel. A3Dfacemodelwasbuiltforthesamecapturedactress (refining a rough 3D face model created by computer vision technologies). Figure 5.27 showsthebuilt3Dfacemodel. 107 Figure 5.27: 3D face model used in the eFASE system. The left panel is a wireframe representationandtherightpanelisatexturedrendering(witheyeballandteeth). Although Radial Basis Functions (RBF) have been shown to have successful applica- tionsinfacialanimationdeformation[PHL + 98,LCF00,NN01],thefeaturepointbased mesh deformation approach [KGT00] is chosen to deform the 3D face model due to its efficiency. Specifically, for each vertex of the face model, the contributing weight of each feature point (marker) are precomputed and stored - assuming there is a gaussian distributed motion propagation centered at each marker. The distance metric between two vertices are the shortest path along edges, not the euclidean distance (Figure 5.28). Then, after the precomputed contributing weights are loaded, this system can deform the face on-the-fly at an interactive rate given any synthesized marker motion frame (Eq.5.29). Figure5.28illustratesthefeaturepointbaseddeformation. VtxMotion i = MrkNum ∑ k=1 w k,i ∗MrkMotion k , MrkNum ∑ k w k,i =1 (5.29) HereVtxMotion i is the motion of the i th vertex of the face model, and MrkMotion k is themotionofthe k th marker,and w k,i istheweightofthe k th markercontributingtothe motionofthei th vertex. 108 Figure 5.28: Feature point based face deformation. The left shows the motion of every vertex (green) is the summation of motion propagations of neighboring markers (red). Therightshowsthepropagationdistancebetweentwoverticesistheshortestpath(red) alongedges,notthesimpleeuclideandistance(green). SmoothingandResampling Undesired changes may exist between concatenated motion nodes, so smoothing tech- niques are needed to smooth transitions. In this approach, the trajectory-smoothing technique based on the spline function [MCP + 04, EGP02] are used. The smoothing operationfromamotionnodeo j toanothermotionnodeo k canbedescribedasfollows: assuming smoothing window size is 2s, f 1 to f s are the ending s frames of the motion node o j , f s+1 to f 2s are the starting s frames of next motion node o k , and f i = f(t i ), 1≤ i≤ 2s, a smooth curve g(t) to best fit f(t i ) is found by minimizing the following objectivefunction: s ∑ i=1 W j ∗(g i − f i ) 2 + 2s ∑ i=s+1 W k ∗(g i − f i ) 2 + Z t 2s t 1 g 00 (t)dt (5.30) Here W j (or W k ) is a pre-assigned weight factor for the phoneme of the motion node o j (or o k ). Because this value depends on the specific phoneme (of one motion node), in this eFASE system, all phonemes are categorized into three groups (mainly based on the phoneme categories proposed in [Pel91]): high visible phonemes (e.g. /p/, /b/, 109 /m/) for a high weight factor (=2.5), low visible phonemes (e.g., /k/, /g/, /ch/) for a low weight factor (=0.4), and intermediate visible phonemes for a middle weight factor (=1.0). 
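A minimal sketch of the juncture smoothing of Eq. 5.30 is given below. The integral smoothness term is discretized here as squared second differences weighted by a placeholder coefficient lam, which is an assumption of this sketch rather than the thesis formulation.

```python
import numpy as np

def smooth_juncture(f, s, w_left, w_right, lam=1.0):
    """Fit a smooth curve g to the 2s frames around a motion-node juncture (Eq. 5.30).

    f        : (2s, dim) array; the last s frames of node o_j followed by the first s frames of o_k.
    w_left   : weight factor W_j for the left node's phoneme (e.g. 2.5 / 1.0 / 0.4).
    w_right  : weight factor W_k for the right node's phoneme.
    """
    n, dim = f.shape                                   # n == 2 * s
    w = np.concatenate([np.full(s, w_left), np.full(n - s, w_right)])
    # quadratic objective: sum_i w_i*(g_i - f_i)^2 + lam * sum_i (g_{i-1} - 2 g_i + g_{i+1})^2
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]               # second-difference operator
    A = np.diag(w) + lam * D.T @ D
    return np.linalg.solve(A, w[:, None] * f)          # closed-form least-squares solution
```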
Note that in the system, half smoothing window size s is experimentally set 4, andthissmoothingtechniqueisonlyappliedtotwoconcatenatedmotionnodeswithout acapturedpredecessor/successorrelation. As mentioned in Section 5.3.1, motion nodes for the silence time (the /pau/ phoneme) werediscardedwhenconstructingtheprocessedmotionnodedatabase. Therefore,when computing the observation cost for the /pau/ phoneme time (Eq. 5.16), as long as P i = /pau/,theobservationcostissettozero. Inotherwords,anymotionnodeisperfectfor expressingthesilencetimeduringthemotionnodesearchprocess(Section5.3.4). After motion nodes are concatenated and smoothed, these synthesized frames corresponding to the silence time are postprocessed: first identify these silence-time frames based on the input phoneme transcript and then regenerate these frames by performing a linear interpolationontheboundaryofnon-silenceframes. During the postprocessing stage, it is necessary to resample motion frames. When motion nodes are concatenated, the number of frames of the motion node may not exactly match the duration of the input phoneme. The time-warping technique is used to resample the searched motion nodes to obtain the desired number of motion frames. Thisresamplingisdoneat120Hz(thesameastheoriginalmotioncapturerate). Notice that the synthesized marker motion frames are 120 frames/second, the resulting anima- tions are often at an ordinary animation rate of 30 frames/second. As such, before the synthesized marker motion frames are transferred onto a specific 3D face model, these motionframesaredown-sampledtomeettheordinaryanimationrate. 110 Figure 5.29: A snapshot of the running eFASE system. The left is a basic control panel,andtherightpanelenclosesfourworkingwindows: asynthesizedmotionwindow (top-left), a video playback window (top-right), a phoneme-Isomap interaction window (bottom-left),andafacepreviewwindow(bottom-right). 5.3.6 ResultsandEvaluations The eFASE system was developed using VC++ on the MS Windows XP system. Fig. 5.29 shows a snapshot of the running eFASE system. The left is a basic con- trol panel, and the right panel encloses four working windows: a synthesized motion window (top-left), a video playback window (top-right), a phoneme-Isomap interac- tion window (bottom-left), and a face preview window (bottom-right). The synthe- sized motion window and the face preview windows can switch among several display 111 0.122401pau 2.6angry 0.24798ay 16.6383sad 0.328068ae 0.457130m 0.736070n ... Table5.1: Examplesofanalignedphonemeinputfile(left)andanemotionmodifierfile (right). Its phrase is “ I am not happy...”. Here the emotion of the starting 2.6 second is angry,andtheemotionfrom#2.6secondto#16.6383secondissadness. modes, including marker-drawing mode and deformed 3D face mode. In the basic con- trol panel, users can input a novel speech (WAV format) and its aligned phoneme tran- script file, and an emotion specification (modifier) file, then the system automatically synthesizes its corresponding expressive facial animation (shown in the synthesized motion window). Table 5.1 shows examples of a phoneme input file and an emotion specification file. Once the facial motion is synthesized, users can interactively browse every frame and play back the animation in the synthesized motion window (top-left in Fig. 5.29). Additionally, the system can automatically compose an AVI video (audio- synchronized), which user can play back immediately in the video playback window (top-rightinFig.5.29)tocheckthefinalresult. 
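As an illustration of the input format shown in Table 5.1, the two files might be parsed as follows; interpreting the first column as segment end times is an assumption based on the table caption.

```python
def load_phoneme_file(path):
    """Parse an aligned phoneme transcript like the left column of Table 5.1.
    Each line is read as '<end_time_in_seconds> <phoneme>' (assumed interpretation)."""
    phones, prev_end = [], 0.0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            t, pho = line.split()
            phones.append((pho, prev_end, float(t)))      # (phoneme, start, end)
            prev_end = float(t)
    return phones

def load_emotion_file(path):
    """Parse an emotion modifier file like the right column of Table 5.1:
    '2.6 angry' means the emotion up to second 2.6 is angry, and so on."""
    segments, prev_end = [], 0.0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            t, emo = line.split()
            segments.append((emo, prev_end, float(t)))    # (emotion, start, end)
            prev_end = float(t)
    return segments
```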
On the user interaction side, users can edit the facial motion database and impose motion-node constraints via the phoneme-Isomap interaction window (bottom-left in Fig. 5.29) and the face preview window (bottom-right in Fig. 5.29). If a point in the phoneme-Isomap interaction window is picked, the face preview window will show the deformed3Dface(orcorrespondingfacialmarkerconfiguration)interactively. A running time analysis was conducted on the eFASE system. The used computer is a Dell Dimension 4550 PC (Windows XP, 1GHz Memory, Intel 2.66GHz Processor). 112 phrases(numberofphonemes) time(second) “Iknowyoumeantit”(14) 137.67 “Andsoyoujustabandonedthem?”(24) 192.86 “Pleasegoon,becauseJeff’s fatherhasnoidea”(33) 371.50 “Itisafactthatlongwordsaredifficult toarticulateunlessyouconcentrate”(63) 518.34 Table5.2: Runningtimeofsynthesisofsomeexamplephrases. Herethecomputerused isaDellDimension4550PC(WindowsXP,1GHzMemory,Intel2.66GHzProcessor). Table 5.2 encloses the running time of some example inputs. As mentioned in Sec- tion 5.3.4, the motion node searching part (the most time-consuming part of the eFASE system)hasatimecomplexityofΘ(N 2 ∗T)thatislineartothelengthofinputphonemes (assuming N is a fixed value for a specific database). The computing time enclosed in theTable5.2isapproximatelymatchedwiththisanalysis. The synthesized expressive facial motion were also compared with ground-truth cap- tured motion. Twelve additional sentences were exclusively used for test comparisons. One of these sentences was “Please go on, because Jeff’s father has no idea of how the things became so horrible.” with sad expression. A right cheek marker (#48 marker) in an expression-active area and a lower lip marker (#79 marker) in a speech-active area were chosen for the comparisons (the left panel of Fig. 2.3). A part of the synthesized sequence and ground truth motion for these marker trajectory comparisons was plotted in Fig. 5.30 (the right cheek marker) and Fig. 5.31 (the lower lip marker). It was found thatthetrajectoriesofthesynthesizedmotionsarequiteclosetotheactualmotionscap- tured from the actress. Notice that the synthesized motions for these comparisons were automatically generated without the use of motion-node constraints. Numerous expres- sivefacialanimationswerealsosynthesizedusingnovelrecordedorarchivalspeech. 113 Figure5.30: Apartofmarker(#48marker)trajectoryofthesadsentence“Pleasegoon, becauseJeff’sfatherhasnoideaofhowthethingsbecamesohorrible.”Thereddashed lineisthegroundtruthtrajectoryandthebluesolidlineisthesynthesizedtrajectory. 5.3.7 ConclusionsandDiscussion Asample-basedexpressivefacialanimationsynthesisandeditingsystem(eFASE)with intuitive phoneme-level control is presented in this section. Users control the facial motion synthesis process by specifying emotion modifiers and expressions for certain phoneme utterances via novel 2D expressive phoneme-Isomaps. This system employs a constrained dynamic programming algorithm that satisfies hard constraints (motion- node constraints) and soft constraints (specified emotions). Objective trajectory com- parisons between synthesized facial motion and captured motion, and novel synthesis experiments, demonstrate that this eFASE system is effective for producing realistic expressivefacialanimations. 114 Figure5.31: Apartofmarker(#79marker)trajectoryofthesadsentence“Pleasegoon, becauseJeff’sfatherhasnoideaofhowthethingsbecamesohorrible.”Thereddashed lineisthegroundtruthtrajectoryandthebluesolidlineisthesynthesizedtrajectory. 
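The objective trajectory comparison could be computed with a small routine in the spirit of Eq. 5.28; the array shapes and function names here are illustrative assumptions.

```python
import numpy as np

def trajectory_error(synth, truth):
    """Average per-marker, per-frame squared position error between a synthesized marker
    sequence and its ground-truth capture for one test sentence.

    synth, truth : (num_frames, num_markers, 3) arrays.
    """
    per_marker = np.linalg.norm(synth - truth, axis=2) ** 2   # squared distance per marker
    return per_marker.mean()                                  # averaged over frames and markers

def cross_sentence_error(pairs):
    """Sum the per-sentence averages over all K comparison sentences, as in Eq. 5.28."""
    return sum(trajectory_error(s, t) for s, t in pairs)
```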
ThismethodintroducestheIsomapframework[TSL00]forgeneratinganintuitivelow- dimensional manifold for each phoneme cluster. The advantage of the Isomap (over PCA, for example) is that it leads to a better projection of motion frames with different emotions,anditmakesbrowsingandeditingexpressivemotionsequences(andframes) more intuitive and convenient. An interactive and intuitive way of browsing and select- ing among the large number of phoneme variations is itself a challenging problem for facialanimationresearch. Asthisisanewapproachtofacialanimationsynthesisandediting,severalissuesrequire further investigations. The quality of novel motion synthesis depends on constructing 115 a large facial motion database with accurate motion and phoneme alignment. Building thisdatabasetakescareandtime;integratedtoolscouldimprovethisprocessimmensely. In this system, users optionally specify motion-node constraints via phoneme-Isomaps byselectingmotionnodesfrompre-existingmotiondatabase. Thissystemalsooffersa novelwaytointeractivelycreatenewmotionnodes. Extensionsoffacialanimationedit- ingtechniques[JTDP03,ZLGS03]thatautomaticallymodifythewholefaceinresponse to a local user change could be another promising method to expand the facial motion database. The motions of the silence phoneme (the /pau/ phoneme in the Festival system) are not modeled. This phoneme and other non-speaking animations (e.g. yawning ) need to be represented as motion nodes to allow more flexibility and personified realism. Lastly, thereareopenquestionsastowhethercombiningthespeakingstylesofdifferentactors intoonefacialmotiondatabasewouldresultinprovidingagreaterrangeofmotionsand expressions, or if such a combination would muddle the motion-frame sequencing and expressiveness. Currently, this system cannot be used for real-time applications. Optimizations could further improve efficiency by reducing the size of the facial motion database through clusteringmethods,orbyfindingmoreefficientsearchinganddeformationalgorithms. Phoneme-visememappingsarewidelyusedinadhocfacialanimationsystems,butthere are no uniform phoneme-viseme mappings, and it is difficult to evaluate any mapping scheme. The 2D expressive phoneme-Isomaps introduced in this work could provide a basisforevaluatingthesemappingschemesordeterminingnewphoneme-visememap- ping schemes based on the probabilities in the phoneme-Isomaps and the variations amongthephoneme-Isomapsproducedbydifferentsubjects. 116 Chapter6 ConclusionsandDiscussion A novel data-driven 3D expressive facial animation synthesis system that learns from facial motion capture data is presented in this dissertation. In the system, a divide- and-conquer strategy is employed to construct a hierarchical structure where terminal nodes are basic facial motions, such as eye motion, speech motion, etc. For each basic facial motion, different modeling methods are used to learn and synthesize its motion. Users can conveniently control the synthesis of each basic facial motion. This system is an open system, since the system is hierarchically organized, users can conveniently replacebasicmotionmodules(algorithms)withoutaffectingothers. In this dissertation, efficient techniques are presented to synthesize three basic facial motions: eye motion, head motion, and expressive visual speech motion. A texture- synthesis based approach is presented to simultaneously synthesize realistic eye gaze andblinkmotion,accountingforanypossiblecorrelationsbetweenthetwo. 
Thequality of statistical modeling and the introduction of gaze-eyelid coupling are improvement over previous work, and the synthesized eye results are hard to distinguish from actual capturedeyemotion. Two different approaches (sample-based and model-based) are presented to synthe- size appropriate head motion. Based on the aligned training pairs between audio features and head motion (audio-headmotion), the sample-based approach uses a K- Nearest Neighbors-based dynamic programming algorithm to search for the optimal 117 headmotionsamplesgivennovelspeechinput. Themodel-basedapproachusestheHid- denMarkovModels(HMMs)tosynthesizenaturalheadmotions. HMMsaretrainedto capture the temporal relation between the acoustic prosodic features and head motions. Theresultsshowthatthesynthesizedheadmotionsfollowthetemporaldynamicbehav- iorsofrealheadmotions. This dissertation also presents two different approaches (model-based and sample- based) to generate novel expressive speech animation given new speech input. The model-based expressive speech animation synthesis approach accurately learns speech co-articulation models and expression eigen spaces from facial motion data, and then it synthesizes novel expressive speech animations by applying these generative co- articulation models and sampling from the expression eigen spaces. The sample-based expressive facial animation synthesis system (eFASE) automatically generates expres- sive speech animation by concatenating captured facial motion data while animators establish constraints and goals (novel phoneme-aligned speech input and its emotion modifiers). Users optionally specify “hard constraints” (motion-node constraints for expressingphonemeutterances)and“softconstraints”(emotionmodifiers)toguidethe searchprocess. Userscanalsoedittheprocessedfacialmotiondatabasebyinsertingand deleting motion nodes via a novel phoneme-Isomap interface. Novel facial animation synthesisexperimentsandobjectivemarkertrajectorycomparisonsbetweensynthesized facial motion and captured motion demonstrate that this system is effective for produc- ingrealisticexpressivefacialanimations. The following future work should be addressed for synthesizing more realistic 3D talk- ingfacesthatarehardtodistinguishfromrealhumanfaces. 118 6.1 Modeling“theBrain”ofFacialMotion As described in section 1.2, the divide-and-conquer diagram is used to reduce the com- plexity of a moving face and make it tractable. Indeed, there are some connections among different basic facial motions, although these connections can be weak or hard to explicitly model. For example, how do emotions play a role during visual speech production? and what is the relation between head motion and eye motion? Answering allthesequestionswillneedtomodel“theBrain”offacialmotion. 6.2 LinguisticallyAugmentedFacialAnimation Whenpeoplespeak,theirfacialmotions,especiallyfacialgestures,areconnectedtothe linguistics features 1 . For example, suppose there is a short dialog between two persons (namedBandC)asfollows: B:Iamgoingtogotoschoolnow. C:really? todayisaholiday,areyoucrazy? Generally, in this dialog, B’s eyes will gaze at C, and C’s eyebrows will move up a little bit. Probably, C’s head will also move up. If the “speech acts” are examined in this dialog, it can be found that the sentence spoken by B is classified as a “statement”, and that by C is a “question”. There are some connections between speech acts and facialgesture,althoughitisnotclearwhethertheseconnectionsdependonindividuals. 
Not only speech acts but also other linguistic features, such as intonation and hedges, have close connections with accompanying facial gestures. Although previous linguistics rule-based methods have been presented [Pel91, PBS94], a data-driven approach could be used to model these implicit connections, and it would enhance the realism of synthesized 3D talking faces.

6.3 Neck and Throat Animation

When people speak, the neck and throat also move according to certain rules; it is obvious that neck and throat motion is SSRM. Unfortunately, this phenomenon is often ignored or underemphasized in the graphics community, even though this subtle motion is important to realistic talking faces. A data-driven approach would be a promising solution to neck and throat motion synthesis.

6.4 Personified Facial Animation

The current work synthesizes novel facial animation based on the recorded facial motion capture database of a specific subject. One concern is that this may not provide sufficient generality. An interesting direction for future work is to learn personalities from the facial motion capture data of different subjects, and then to synthesize personified facial animation. This has various promising applications, for example animating the 3D face model of one subject to mimic the personality of another subject (e.g. famous actors/actresses or historical figures).
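As a deliberately crude sketch of one possible starting point for such personified synthesis (and not the method developed in this dissertation), one could build a small per-subject expression space with PCA and re-express one subject's motion using another subject's mean and principal directions. The assumption of a shared marker layout, the function names style_basis and retarget_style, and the choice of five principal directions are all illustrative.

```python
import numpy as np

def style_basis(frames, n_components=5):
    """Per-subject mean and principal directions of facial motion frames.

    frames : (T, D) array of stacked marker displacements for one subject.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD-based PCA; rows of vt are principal directions in marker space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def retarget_style(src_frames, src_style, dst_style):
    """Re-express source motion in the target subject's style space (toy transfer)."""
    src_mean, src_dirs = src_style
    dst_mean, dst_dirs = dst_style
    coeffs = (src_frames - src_mean) @ src_dirs.T      # (T, n_components)
    return dst_mean + coeffs @ dst_dirs                # (T, D)

# Synthetic stand-ins for two subjects' motion capture data
# (D = 90, e.g. 30 markers x 3 coordinates).
rng = np.random.default_rng(0)
subject_a = rng.normal(size=(400, 90))
subject_b = 2.0 * rng.normal(size=(400, 90)) + 1.0
a_style, b_style = style_basis(subject_a), style_basis(subject_b)
mimicked = retarget_style(subject_a, a_style, b_style)
print(mimicked.shape)   # (400, 90)
```

Such a linear transfer would at best capture coarse differences in motion range between subjects; learning genuine personality would require richer temporal and contextual models, but the sketch indicates where per-subject statistics could enter the synthesis pipeline.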
Bibliography

[AF02] O. Arikan and D. A. Forsyth. Interactive motion generation from examples. In ACM Trans. Graph. (Proc. of ACM SIGGRAPH'02), volume 21, pages 483–490. ACM Press, 2002.
[AP99] G. A. Abrantes and F. Pereira. Mpeg-4 facial animation technology: Survey, implementation, and results. IEEE Transactions on Circuits and Systems for Video Technology, 9(2):290–305, 1999.
[BBPV03] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. Computer Graphics Forum (Proc. of Eurographics 2003), 22(3), 2003.
[BCS97] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proc. of ACM SIGGRAPH'97, pages 353–360, 1997.
[BDNN05] C. Busso, Z. Deng, U. Neumann, and S. Narayanan. Natural head motion synthesis driven by acoustic prosody features. Journal of Computer Animation and Virtual Worlds (Special Issue of Best Papers of CASA 2005), 16(3-4):283–290, July 2005.
[Bes95] J. Beskow. Rule-based visual speech synthesis. In Proc. of Eurospeech 95, Madrid, 1995.
[BL03] G. Borshukov and J. P. Lewis. Realistic human face rendering for “The Matrix Reloaded”. In Proc. of ACM SIGGRAPH 2003 Sketches and Applications, San Diego, 2003.
[BP04] E. Bevacqua and C. Pelachaud. Expressive audio-visual speech. Journal of Visualization and Computer Animation, 15(3-4):297–304, 2004.
[BPL+03] G. Borshukov, D. Piponi, O. Larsen, J. P. Lewis, and C. T. Lietz. Universal capture: Image-based facial animation for “The Matrix Reloaded”. In Proc. of ACM SIGGRAPH 2003 Sketches and Applications, 2003.
[Bra99] M. Brand. Voice puppetry. In Proc. of ACM SIGGRAPH 1999, pages 21–28, Los Angeles, 1999.
[BS96] R. Banse and K. Scherer. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3):614–636, 1996.
[BW] P. Boersma and D. Weenink. Praat speech processing software, Institute of Phonetic Sciences of the University of Amsterdam. http://www.praat.org.
[CB02] E. Chuang and C. Bregler. Performance driven facial animation using blendshape interpolation. CS-TR-2002-02, Department of Computer Science, Stanford University, 2002.
[CCL01] M. Costa, T. Chen, and F. Lavagetto. Visual prosody analysis for realistic motion synthesis of 3d head models. In Proc. of Int'l Conf. on Augmented, Virtual Environments and Three-Dimensional Imaging, Ornos, Mykonos, Greece, 2001.
[CCT+01] R. Cowie, E. D. Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellens, and J. G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Proc. Mag., 18(1):32–80, 2001.
[CCZB00] D. Chi, M. Costa, L. Zhao, and N. Badler. The emote model for effort and shape. In Proc. of ACM SIGGRAPH'00, pages 173–182, 2000.
[CDB02] E. S. Chuang, H. Deshpande, and C. Bregler. Facial expression space learning. In Proc. of Pacific Graphics 2002, pages 68–76, 2002.
[CFKP04] Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin. Real-time speech motion synthesis from recorded motions. In SCA '04: Proc. of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 345–353. ACM Press, 2004.
[CFP03] Y. Cao, P. Faloutsos, and F. Pighin. Unsupervised learning for speech motion editing. In Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
[CK01] B. W. Choe and H. S. Ko. Analysis and synthesis of facial expressions with hand-generated muscle actuation basis. In IEEE Computer Animation Conference, pages 12–19, 2001.
[CM93] M. M. Cohen and D. W. Massaro. Modeling coarticulation in synthetic visual speech. In Magnenat-Thalmann N., Thalmann D. (Editors), Models and Techniques in Computer Animation, Springer Verlag, pages 139–156, 1993.
[CMPZ02] P. Cosi, C. E. Magno, G. Perlin, and C. Zmarich. Labial coarticulation modeling for realistic facial animation. In Proc. of Int'l Conf. on Multimodal Interfaces 02, pages 505–510, Pittsburgh, PA, 2002.
[Cos02] E. Cosatto. Sample-based talking-head synthesis. Ph.D. Thesis, Swiss Federal Institute of Technology, 2002.
[CPB+94] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Proc. of ACM SIGGRAPH'94, pages 413–420, 1994.
[CVB01] J. Cassell, H. Vilhjalmsson, and T. Bickmore. Beat: The behavior expression animation toolkit. In Computer Graphics (Proc. of ACM SIGGRAPH'01), pages 477–486, Los Angeles, 2001.
[CXH03] J. Chai, J. Xiao, and J. Hodgins. Vision-based control of 3d facial animation. In Proc. of Eurographics/SIGGRAPH Symposium on Computer Animation, 2003.
[Das91] B. V. Dasarathy. Nearest Neighbor Pattern Classification Techniques. IEEE Computer Society Press, 1991.
[DBNN04a] Z. Deng, M. Bulut, U. Neumann, and S. S. Narayanan. Automatic dynamic expression synthesis for speech animation. In Proc. of IEEE Computer Animation and Social Agents (CASA) 2004, pages 267–274, Geneva, Switzerland, July 2004.
[DBNN04b] Z. Deng, C. Busso, S. S. Narayanan, and U. Neumann. Audio-based head motion synthesis for avatar-based telepresence systems. In Proc. of ACM SIGMM 2004 Workshop on Effective Telepresence (ETP 2004), pages 24–30, New York, NY, Oct. 2004.
[DCFN06] Z. Deng, P. Y. Chiang, P. Fox, and U. Neumann. Animating blendshape faces by cross-mapping motion capture data. In Proc. of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, March 2006.
[DFC00] C. Dehon, P. Filzmoser, and C. Croux. Robust methods for canonical correlation analysis. In Data Analysis, Classification, and Related Methods, pages 321–326, Springer-Verlag, Berlin, 2000.
[DLN03] Z. Deng, J. P. Lewis, and U. Neumann. Practical eye movement model using texture synthesis. In Proc. of ACM SIGGRAPH 2003 Sketches and Applications, San Diego, 2003.
[DLN05a] Z. Deng, J. P. Lewis, and U. Neumann. Automated eye motion synthesis using texture synthesis. IEEE Computer Graphics and Applications, pages 24–30, March/April 2005.
[DLN05b] Z. Deng, J. P. Lewis, and U. Neumann. Synthesizing speech animation by learning compact co-articulation models from motion capture data. In Proc. of Computer Graphics International (CGI) 2005, Long Island, NY, June 2005. IEEE Computer Society Press.
[DN06] Z. Deng and U. Neumann. eFASE: A data-driven expressive facial animation synthesis and editing system. In submission, 2006.
[DNL+ar] Z. Deng, U. Neumann, J. P. Lewis, T. Y. Kim, M. Bulut, and S. Narayanan. Expressive facial animation synthesis by learning speech co-articulation and expression spaces. IEEE Transactions on Visualization and Computer Graphics, 2006 (to appear).
[Ebe00] D. Eberly. 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2000.
[EF75] P. Ekman and W. V. Friesen. Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Prentice-Hall, 1975.
[EGP02] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. ACM Trans. Graph. (Proc. of ACM SIGGRAPH'02), 21(3):388–398, 2002.
[EL99] A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV'99, pages 1033–1038, 1999.
[FBF77] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, 1977.
[fes] http://www.cstr.ed.ac.uk/projects/festival/.
[Fua00] P. Fua. Regularized bundle-adjustment to model heads from image sequences without calibration data. International Journal of Computer Vision, 38(2):153–171, 2000.
[GB96] B. L. Goff and C. Benoit. A text-to-audiovisual-speech synthesizer for French. In Proc. of the Int'l Conf. on Spoken Language Processing (ICSLP), pages 2163–2166, 1996.
[GB00] Z. M. Griffin and K. Bock. What the eyes say about speaking. Psychological Science, 11(4):274–279, 2000.
[GCSH02] H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. Visual prosody: Facial movements accompanying speech. In Proc. of IEEE Int'l Conf. on Automatic Face and Gesture Recognition (FG'02), Washington, D.C., May 2002.
[GGW+98] B. Guenter, C. Grimm, D. Wolf, H. Malvar, and F. Pighin. Making faces. In Proc. of ACM SIGGRAPH'98, pages 55–66, 1998.
[Gri01] Z. M. Griffin. Gaze durations during speech reflect word selection and phonological encoding. Cognition, 82:B1–B14, 2001.
[HAH01] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice-Hall, Upper Saddle River, NJ, USA, 2001.
[HRF01] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2001.
[Jon01] M. W. Jones. Facial reconstruction using volumetric data. In Proc. of Vision, Modeling and Visualization 2001, 2001.
[JTDP03] P. Joshi, W. C. Tien, M. Desbrun, and F. Pighin. Learning controls for blend shape based realistic facial animation. In Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
[KB99] S. C. Khullar and N. Badler. Where to look? Automating visual attending behaviors of virtual human characters. In Proc. of Third ACM Conf. on Autonomous Agents, pages 16–23, 1999.
[KG01] G. Kalberer and L. V. Gool. Face animation based on observed 3d speech dynamics. In IEEE Computer Animation Conference, pages 20–27, 2001.
[KGP02] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. ACM Trans. Graph. (Proc. of ACM SIGGRAPH'02), 21(3), 2002.
[KGT00] S. Kshirsagar, S. Garchery, and N. M. Thalmann. Feature point based mesh deformation applied to mpeg-4 facial animation. In Proc. Deform'2000, Workshop on Virtual Humans by IFIP Working Group 5.10 (Computer Graphics and Virtual Worlds), pages 23–34, November 2000.
[KHS01] K. Kähler, J. Haber, and H. P. Seidel. Geometry-based muscle modeling for facial animation. In Proc. of Graphics Interface 2001, 2001.
[KMR+99] T. Kuratate, K. G. Munhall, P. E. Rubin, E. V. Bateson, and H. Yehia. Audio-visual synthesis of talking faces from speech production correlates. In Proc. Eurospeech'99, 1999.
[KMT01] S. Kshirsagar, T. Molet, and N. M. Thalmann. Principal components of expressive speech animation. In Proc. of Computer Graphics International, 2001.
[KMTT92] P. Kalra, A. Mangili, N. M. Thalmann, and D. Thalmann. Simulation of facial muscle actions based on rational free form deformation. Computer Graphics Forum (Proc. of Eurographics'92), 2(3):59–69, 1992.
[KP05] S. A. King and R. E. Parent. Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341–352, 2005.
[KSS96] D. Kurlander, T. Skelly, and D. Salesin. Comic chat. In Proc. of ACM SIGGRAPH'96, pages 225–236, 1996.
[KT03] S. Kshirsagar and N. M. Thalmann. Visyllable based speech animation. Computer Graphics Forum (Proc. of Eurographics'03), 22(3), 2003.
[LBB02] S. P. Lee, J. B. Badler, and N. Badler. Eyes alive. ACM Trans. Graph. (Proc. of ACM SIGGRAPH'02), 21(3):637–644, 2002.
[LBG80] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, Jan 1980.
[LCF00] J. P. Lewis, M. Cordner, and N. Fong. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proc. of ACM SIGGRAPH 2000, 2000.
[LLX+01] L. Liang, C. Liu, Y. Q. Xu, B. Guo, and H. Y. Shum. Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics, 20(3), 2001.
[LMDN05] J. P. Lewis, J. Mooser, Z. Deng, and U. Neumann. Reducing blendshape interference by selected motion attenuation. In Proc. of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3DG) 2005, pages 25–29, Washington DC, 2005. ACM Press.
[LNP03] C. M. Lee, S. Narayanan, and R. Pieraccini. Recognition of negative emotions from the speech signal. In Proc. of Automatic Speech Recognition and Understanding, Trento, Italy, 2003.
[LP99] F. Lavagetto and R. Pockaj. The facial animation engine: Toward a high-level interface for the design of mpeg-4 compliant animated faces. IEEE Transactions on Circuits and Systems for Video Technology, 9(2):277–289, 1999.
[LTW93] Y. Lee, D. Terzopoulos, and K. Waters. Constructing physics-based facial models of individuals. In Proc. of Graphics Interface'93, 1993.
[LTW95] Y. Lee, D. Terzopoulos, and K. Waters. Realistic modeling for facial animation. In Proc. of ACM SIGGRAPH'95, pages 55–62, 1995.
[LWS02] Y. Li, T. Wang, and H. Y. Shum. Motion texture: A two-level statistical model for character motion synthesis. In Proc. of ACM SIGGRAPH'02, 2002.
[mat] http://skal.planet-d.net/demo/matrixfaq.htm.
[MCP+04] J. Ma, R. Cole, B. Pellom, W. Ward, and B. Wise. Accurate automatic visible speech synthesis of arbitrary 3d model based on concatenation of diviseme motion capture data. Computer Animation and Virtual Worlds, 15:1–17, 2004.
[MCP+05] J. Ma, R. Cole, B. Pellom, W. Ward, and B. Wise. Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics (online), 2005.
[MJC+04] K. G. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. V. Bateson. Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2):133–137, Feb 2004.
[MsGST03] A. S. Meyer, S. Garchery, G. Sannier, and N. M. Thalmann. Synthetic faces: Analysis and applications. International Journal of Imaging Systems and Technology, 13(1):65–73, 2003.
[MSL98] A. S. Meyer, A. Sleiderink, and W. J. M. Levelt. Viewing and naming objects: Eye movements during noun phrase production. Cognition, 66:B25–B33, 1998.
[NFN02] J. Y. Noh, D. Fidaleo, and U. Neumann. Gesture driven facial animation. CS-TR-02-761, Department of Computer Science, University of Southern California, 2002.
[NN01] J. Y. Noh and U. Neumann. Expression cloning. In Proc. of ACM SIGGRAPH'01, pages 277–288, 2001.
[Ost98] J. Ostermann. Animation of synthetic faces in mpeg-4. In Proc. of IEEE Computer Animation, 1998.
[Pan02] I. S. Pandzic. Facial animation framework for the web and mobile platforms. In Proc. of the 7th Int'l Conf. on 3D Web Technology, 2002.
[Par72] F. Parke. Computer generated animation of faces. In Proc. ACM Nat'l Conf., volume 1, pages 451–457, 1972.
[PB81] S. M. Platt and N. I. Badler. Animating facial expressions. Computer Graphics (Proc. of ACM SIGGRAPH'81), 15(3):245–252, 1981.
[PB02] K. Pullen and C. Bregler. Motion capture assisted animation: Texturing and synthesis. In ACM Trans. Graph. (Proc. of ACM SIGGRAPH'02), pages 501–508. ACM Press, 2002.
[PBS94] C. Pelachaud, N. Badler, and M. Steedman. Generating facial expressions for speech. Cognitive Science, 20(1):1–46, 1994.
[PBV] C. Pelachaud, N. I. Badler, and M. L. Viaud. http://hms.upenn.edu/pelachaud/workshop face/workshop face.html.
[Pel91] C. Pelachaud. Communication and Coarticulation in Facial Animation. Ph.D. Thesis, Univ. of Pennsylvania, 1991.
[Pet99] V. Petrushin. Emotion in speech: Recognition and application to call centers. Artificial Neu. Net. In Engr., pages 7–10, 1999.
[PG96] K. Perlin and A. Goldberg. Improv: A system for scripting interactive actors in virtual worlds. In Proc. of ACM SIGGRAPH'96, pages 205–216, 1996.
[PHL+98] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. Synthesizing realistic facial expressions from photographs. In Proc. of ACM SIGGRAPH'98, pages 75–84, 1998.
[Pie86] D. A. Pierre. Optimization Theory With Applications. General Publishing Company, 1986.
[PKC+03] H. Pyun, Y. Kim, W. Chae, H. W. Kang, and S. Y. Shin. An example-based approach for facial expression cloning. In SCA '03: Proc. of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 167–176. Eurographics Association, 2003.
[PW96] F. I. Parke and K. Waters. Computer Facial Animation. A K Peters, Wellesley, Massachusetts, 1996.
[Rab89] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[Ray98] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372–422, 1998.
[Row97] S. Roweis. EM algorithm for PCA and SPCA. Neural Information Processing Systems (NIPS)'97, pages 137–148, 1997.
[Sco03] R. Scott. Sparking life: Notes on the performance capture sessions for The Lord of the Rings: The Two Towers. Computer Graphics, 37(4):17–21, 2003.
[SF98] K. Singh and E. Fiume. Wires: A geometric deformation technique. In Proc. of ACM SIGGRAPH'98, pages 405–414, 1998.
[SG00] D. D. Salvucci and J. H. Goldberg. Identifying fixations and saccades in eye-tracking protocols. In Proc. of the Symposium on Eye Tracking Research and Applications (ETRA), pages 71–78, 2000.
[SG02] M. B. Stegmann and D. D. Gomez. A brief introduction to statistical shape analysis, Mar 2002.
[Sho85] K. Shoemake. Animating rotation with quaternion curves. Computer Graphics (Proceedings of SIGGRAPH 85), 19(3):245–254, July 1985.
[SHP04] A. Safonova, J. K. Hodgins, and N. S. Pollard. Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Trans. Graph., 23(3):514–521, 2004.
[SNF05] E. Sifakis, I. Neverov, and R. Fedkiw. Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Trans. Graph., 24(3):417–425, 2005.
[TSKES95] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy. Integration of visual and linguistic information in spoken language comprehension. Science, 268:1632–1634, 1995.
[TSL00] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2333, 2000.
[TW90] D. Terzopoulos and K. Waters. Physically-based facial modeling, analysis, and animation. Journal of Visualization and Computer Animation, 1(4):73–80, 1990.
[UGO98] B. Uz, U. Güdükbay, and B. Özgüç. Realistic speech animation of synthetic faces. In Proc. of IEEE Computer Animation'98, 1998.
[VBPP05] D. Vlasic, M. Brand, H. Pfister, and J. Popović. Face transfer with multilinear models. ACM Transactions on Graphics, 24(3), 2005.
[VDV00] R. Vertegaal, G. V. Derveer, and H. Vons. Effects of gaze on multiparty mediated communication. In Proc. of Graphics Interface'00, pages 95–102, Montreal, Canada, 2000.
[vic] http://www.vicon.com.
[VSDN01] R. Vertegaal, R. Slagter, G. V. Derveer, and A. Nijholt. Eye gaze patterns in conversations: There is more to conversational agents than meets the eyes. In Proc. of ACM CHI 2001 Conference on Human Factors in Computing Systems, pages 301–308, 2001.
[Wat87] K. Waters. A muscle model for animating three-dimensional facial expression. Computer Graphics (Proc. of ACM SIGGRAPH'87), 21(4):17–24, 1987.
[wav] http://www.speech.kth.se/wavesurfer/.
[WF95] K. Waters and J. Frisble. A coordinated muscle model for speech animation. Proc. of Graphics Interface'95, pages 163–170, 1995.
[Wil90] L. Williams. Performance-driven facial animation. In Proc. of ACM SIGGRAPH'90, pages 235–242. ACM Press, 1990.
[YBL+04] S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan. An acoustic study of emotions expressed in speech. In Proc. of ICSLP'04, 2004.
[YEH+02] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book. Entropic Cambridge Research Laboratory, Cambridge, England, 2002.
[YKB00] H. Yehia, T. Kuratate, and E. V. Bateson. Facial animation and head motion driven by speech acoustics. In 5th Seminar on Speech Production: Models and Data, pages 265–268, Kloster Seeon, Germany, May 1-4, 2000.
[ZLA+01] Z. Zhang, Z. Liu, D. Adler, M. Cohen, E. Hanson, and Y. Shan. Cloning your own face with a desktop camera. In ICCV 2001, page II:745, 2001.
[ZLGS03] Q. Zhang, Z. Liu, B. Guo, and H. Shum. Geometry-driven photorealistic facial expression synthesis. In Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
[ZPS01] Y. Zhang, E. C. Parkash, and E. Sung. A physically-based model with adaptive refinement for facial animation. In Proc. of IEEE Computer Animation 2001, pages 28–39, 2001.