Generating Gestures from Speech for Virtual Humans Using Machine Learning Approaches
by
Chung-Cheng Chiu
Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2014
Copyright 2014 Chung-Cheng Chiu
Table of Contents
List of Tables
List of Figures
Acknowledgements
Abstract
Chapter 1 Introduction
1.1 Motivation
1.2 Key Challenges
1.3 Approaches
1.4 Evaluations
1.5 Contribution
1.6 Outline
Chapter 2 Background
2.1 Gesture overview
2.1.1 Gesture classes
2.2 Previous works on modeling gestures
2.3 Motion interpolation and synthesis
2.4 Gaussian process latent variable model
2.5 Restricted Boltzmann machines and the temporal extensions
2.5.1 Restricted Boltzmann Machines
2.5.2 Conditional Restricted Boltzmann Machines
2.5.3 Factored CRBM
2.5.4 Deep belief nets
2.6 Deep learning and sequential data prediction
Chapter 3 Decompose and constrain the learning problem
3.1 Gesture categories
3.2 Gestural signs
Chapter 4 Predicting Gestural Signs from Speech with Deep Conditional Neural Fields
4.1 Deep Conditional Neural Fields
4.1.1 Prediction
4.1.2 Gradient calculation
4.1.3 Parameter updates
4.2 Dataset for training the prediction process
4.2.1 Gestural sign annotation
4.2.2 Linguistic features
4.2.3 Prosodic features
4.2.4 Video segmentation
Chapter 5 Low-Dimensional Embeddings with GPLVMs
5.1 Motion generation with manifolds
5.2 Motion selection based on gestural signs
5.3 Motion transition with manifolds
Chapter 6 The Generation-based Approach with Hierarchical Factored Conditional Restricted Boltzmann Machines
6.1 Decomposing the learning problem
6.2 Hierarchical factored conditional Boltzmann machines
6.3 Gesture Generator
6.3.1 Learning the Gesture Generator
6.3.2 Gesture Generation
6.3.3 Smoothing
6.3.4 Comparison with Previous Approaches
6.4 Human subject experiment
6.4.1 Evaluation
Chapter 7 Evaluations
7.1 Assessment Experiment for DCNFs
7.1.1 Baseline models
7.1.2 Methodology
7.1.3 Results and discussion
7.1.4 Handwriting recognition
7.1.5 Baseline models
7.1.6 Results and discussion
7.2 Assessment Experiment for GPLVM-based Motion Synthesis
7.2.1 Transition probability between all pairs of gestures
7.2.2 Transition probability for gesture generation tasks
7.3 Subjective Evaluation Study for the Overall Framework
7.4 Subjective Evaluation Study Comparing the Synthesis-Based Framework with the Generation-Based Framework
7.5 Assessment Experiment for Comparing Manually-Annotated Gestural Signs with Automatic Clustering
Chapter 8 Conclusion
8.1 Conclusion
8.2 Future Work
Appendices
Appendix A Additional works toward improving the gestural sign prediction
A.1 Improve models for predicting sequential dependency
A.1.1 Latent dynamic deep conditional neural fields
A.1.2 Semi-Markov deep conditional neural fields
A.1.3 Preliminary experiment results and discussions
A.2 Attempts to improve prediction quality by changing gestural signs
A.2.1 Prediction quality analysis
Appendix B Assessment of extended RBMs on the direct generation task
B.1 Prosody-gated FCRBMs
B.2 CRBMs and Hierarchical CRBMs
B.3 CRBMs vs. RCRBMs
Reference List
List of Tables
3.1 The list of our gestural signs and their descriptions.
7.1 Results of coverbal gesture prediction.
7.2 Results of handwriting recognition. Both the results of NeuroCRF and Structured prediction cascades are adopted from the original report.
A.1 Transition probability between gestural signs in percentage. 1: Rest, 2: Palm face up, 3: Head nod without arm gestures, 4: Wipe, 5: Everything, 6: Frame, 7: Dismiss, 8: Block, 9: Shrug, 10: More-Or-Less, 11: Process, 12: Deictic.Other, 13: Deictic.Self, 14: Beats.
A.2 The top 5 gestural signs in terms of F1 scores.
List of Figures
1.1 The difference between the manual approach and using a data-driven gesture generator. (a) The manual approach builds data specifically for each project. (b) To build a gesture generator, the approach realizes the relation between speech and gestures from human conversation data with machine learning approaches and then applies the resulting generator to generate gestures for novel utterances. The resulting gesture generator can be applied to various projects involving virtual character interactions.
1.2 The structure of the decomposition framework. The framework uses a novel approach called deep conditional neural fields (DCNFs) to predict gestural signs from speech signals, gestural signs to signify the relation between speech and gestures, and an approach based on Gaussian process latent variable models (GPLVMs) to synthesize gestures. Both DCNFs and the GPLVM-based approach are further introduced in section 1.3.
1.3 The overview of DCNFs for predicting gestural signs from speech. The framework extracts words and part-of-speech tags from transcripts and extracts prosody from speech audio, and applies the proposed approach to learn to predict gestures at each time frame.
2.1 Restricted Boltzmann machines. (a) An RBM where v denotes the visible layer, h denotes the hidden layer, and W represents the connection between the two layers. (b) An RBM with simplified representation.
2.2 (a) A CRBM with order 2, where t represents the index in time, v_{t-1}, v_{t-2} are visible layers for the data at t-1, t-2, and A and B are directed links from v_{t-1}, v_{t-2} to v and h respectively. Each rectangle in the figure represents a layer of nodes. (b) An FCRBM with order 2, where z denotes the feature layer, y represents labels, and triangles represent factored multiplications.
2.3 DBNs first train networks layer-by-layer with input data, where the output of the lower layers serves as input for the higher layer, and then fine-tune the connection weights of the entire network with both input features and labels.
4.1 The structure of our DCNF framework. The neural network learns the nonlinear relation between speech features and gestural signs. The top layer is a second-order undirected linear chain which takes the output of the neural network as input and models the temporal relation among gestural signs. The model is trained with the joint likelihood of the top undirected chain and deep neural networks.
5.1 The synthesis process reuses motion data to generate gesture animations.
5.2 GPLVMs derive a low-dimensional space from the given motion samples. After deriving this space with GPLVMs, we can sample a trajectory in the space and map the trajectory back to the original data dimension, resulting in a gesture animation. Colors of the space indicate the density of data points, where warmer colors correspond to denser areas.
6.1 The architecture of the generation process. The gesture generator models gestures conditioned on audio features. The model generates the next motion frame based on given audio features and past motion frames.
6.2 (a) A HFCRBM with order-2 CRBMs and order-2 FCRBMs. The model uses CRBMs to convert the input data of the FCRBM. The CRBM shown here is the same CRBM applied to different input layers of the FCRBM and different input sets. The figure shows components at different time steps with different gray scales. (b) A HFCRBM with simplified representation.
6.3 The training process of HFCRBMs for gesture generators. (a) The HFCRBM trains bottom CRBMs with motion frames. (b) After the training of CRBMs is completed, the motion data goes bottom-up through CRBMs, and the FCRBM models the output of the CRBMs' hidden layers conditioned on audio features. The parameters of the CRBMs are not updated at this stage.
6.4 An example of the generation flow where gray color indicates that components are inactive. (a) The model takes motion frames from t_1 to t_4 as input for CRBMs. (b) The FCRBM takes the bottom-up output of the CRBMs and audio features from t_1 to t_5 to generate the next data at t_5. (c) The CRBM maps the data generated by the FCRBM to a motion frame. (d) To generate the motion frame at t_6, the generation process includes the generated frame as part of the new input to perform the next generation loop.
6.5 The video we used for evaluation. We use a simple skeleton to demonstrate the gesture instead of mapping the animation to a virtual human to prevent other factors that can distract participants.
6.6 Average rating for gesture animations. Dashed lines indicate two sets are not significantly different, solid lines indicate two sets are significantly different. For the significant cases, the p values are all less than 10^-7. (a) The average scores of the original, unmatched, and generated animations are 17.4, 7, and 17.6 respectively. (b) The average numbers of rated-first among the original, unmatched, and generated animations are 5.7, 2.15, and 6.15.
7.1 Assessment experiments for motion transitions. (a) The GPLVM-based approach allows 91.46% of co-articulations among all pairs of gestures while the conventional approach only allows 2.69%. (b) The transition success rate of our GPLVM-based approach with different transition time limits. The conventional approach failed to generate gesture animations for this task.
7.2 Evaluation results for the overall framework with 95% confidence intervals. Both results show that the proposed approach is significantly better than the state-of-the-art.
7.3 Comparing animations generated by our method and the previous work. (a) Each video displays a pair of gesture animations generated for the same speech audio by different approaches. (b) The percentage of animations voted as best matching the speech; the difference is statistically significant.
A.1 The structure of the LDDCNF framework. The CRF model is replaced with LDCRFs.
A.2 The structure of the semi-DCNF framework. The CRF model is replaced with semi-CRFs.
A.3 Transition potentials learned by semi-DCNFs. Each row represents a gestural sign listed from top to bottom in the same order as the list in table A.1. Each column, from left to right, represents the potential for transition to the same gestural sign after time 1 to 100. Brighter color represents higher potentials while darker color represents lower potentials.
B.1 Sample generation results where the horizontal axis represents time step and the vertical axis represents rotation value. Each curve corresponds to the rotation value around an axis of a joint. (a) Curves of rotation vectors generated by FCRBMs. FCRBMs lead the generated values to a constant in a few steps. The first 6 frames are the initial sequence and therefore are omitted in the figures. (b) Generation results of HFCRBMs for the same case. The first 12 frames are the initial sequence and therefore are omitted in the figures.
B.2 Sample generation results where the horizontal axis represents time step and the vertical axis represents rotation value. This is the same case as Figure B.1. (a) The hierarchical CRBM takes audio features as part of the input of past visible layers. u denotes audio features. u_t does not necessarily contain only the audio features at t. For example, with a window of size n, u_t includes audio features from t-n to t+n. (b) Curves of rotation vectors generated by HCRBMs. The first 12 frames are the initial sequence and therefore are omitted in the figures.
B.3 (a) A RCRBM is a CRBM without connection A. This can be observed by comparing with Figure 2.2a. (b) A HFCRBM with RCRBMs at the bottom.
B.4 Sample results generated by RCRBM-based HFCRBMs. Dramatic changes can be observed at the beginning of the curves. This kind of dramatic change can make the gesture animation seem unnatural.
Acknowledgements
I would like to thank Prof. Stacy Marsella for the years of positive support on research and being patient with last-minute paper revisions, great lessons on paper writing and presentation giving, and friendship and guidance on the overall research direction. I would like to also thank Prof. Louis-Philippe Morency, without whom the development of deep conditional neural fields would not be possible, and thank him for the education on machine learning, advice on my work, and great help and support during the thesis proposal and dissertation. I would like to thank Prof. Jonathan Gratch, Prof. Ulrich Neumann, Prof. Stephen Read, Prof. Shri Narayanan, and Prof. Gerard Medioni for their suggestions for improving the thesis. I would like to thank Dr. Ari Shapiro and Dr. Wei-Wen Feng for their help and knowledge on character animation. I would like to thank Dr. Cathy Ennis, Prof. Carol O'Sullivan, and Prof. Rachel McDonnell for sharing the motion capture and conversation data which facilitated this thesis work. I would like to thank my parents and Shiou-Shoiu for their caring and warmhearted support.
Abstract
There is a growing demand for animated characters capable of simulating face-to-face interaction using the same verbal and nonverbal behavior that people use. For example, research in virtual human technology seeks to create autonomous characters capable of interacting with humans using spoken dialog. Further, as video games have moved beyond first-person shooters, there is a tendency for gameplay to comprise more and more social interaction, where virtual characters interact with each other and with the player's avatar. Common to these applications, the autonomous characters are expected to exhibit behaviors resembling those of a real human.
The focus of this work is generating realistic gestures for virtual characters, specifically the coverbal gestures that are performed in close relation to the content and timing of speech. A conventional approach to animating gestures is to construct gesture animations for each utterance the character speaks, by handcrafting animations or using motion capture techniques. The problem with this approach is that it is costly in time and money, and it is not even feasible for characters designed to generate novel utterances on the fly.
This thesis applies machine learning approaches to learn a data-driven gesture generator from human conversational data that can generate behavior for novel utterances and thereby saves development effort. This work assumes that learning to generate gestures from speech is a feasible task. The framework exploits a gesture classification scheme to provide domain knowledge about gestures and help the machine learning models realize the generation of gestures from speech. The framework decomposes the overall learning problem of generating gestures into two tasks: one realizes the relation between speech and gesture classes and the other performs gesture generation based on the gesture classes. To facilitate the training process, this research has used real-world conversation data involving dyadic interviews and motion capture data of humans gesturing while speaking. The evaluation experiments assess the effectiveness of each component by comparing with state-of-the-art approaches and evaluate the overall performance by conducting studies involving human subjective evaluations. An alternative machine learning framework has also been proposed to compare with the framework addressed in this thesis. Overall, the evaluation experiments show the framework outperforms state-of-the-art approaches.
The central contribution of this research is a machine learning framework capable of learning to generate gestures from conversation data that can be collected from different individuals while preserving the motion style of specific speakers. In addition, our framework allows the incorporation of data recorded through other media and thereby significantly enriches the training data. The resulting model provides an automatic approach for deriving a gesture generator which realizes the relation between speech and gestures. A secondary contribution is a novel time-series prediction algorithm that predicts gestures from the utterance. This prediction algorithm can address time-series problems with complex input and be applied to other applications that require classifying time-series data.
Chapter 1
Introduction
This thesis proposes a framework to learn to generate gestures from speech for virtual characters. To animate gestures, the framework takes utterance content and prosody as input and outputs rotation values for the joints in a virtual character's skeleton. This is the first fully machine-learning-based framework for generating gestures from speech that can generate gestures coupled to the content of the utterance. The framework addresses how to combine domain knowledge about gestures with machine learning approaches, and two machine learning approaches have been developed to help realize the task.
1.1 Motivation
Virtual characters (VCs) are autonomous, digital characters that exist within virtual worlds and are designed to perceive, understand, and interact with other virtual characters or real-world humans, using the same verbal and non-verbal behavior, such as gestures and facial expressions, that people use to interact with each other. Virtual humans (VHs), specifically, are a kind of virtual character designed especially to interact with real-world humans. They have been applied to an increasingly wide range of applications, including human-computer interaction [Cassell, 2000], social skills training [Rickel and Johnson, 2000] and entertainment [Hartholt et al., 2009]. For example, through a virtual human, the computer can potentially exploit the familiarity and effectiveness of human face-to-face interactions to facilitate human-computer interaction. The creation of these characters must address a wide range of research challenges in perception, cognition, dialog management and behavior generation. The focus of the work presented here is on providing the virtual human with a capacity to use gestures, specifically coverbal gestures that are performed in close synchrony with speech. Although coverbal gestures are mainly expressed with the hands and head, this work learns to generate movement for the whole body to provide more natural gesture animation.
While this capability can be achieved by recording speech audio and manually creating gesture motions for each sentence the character utters, that approach suffers from several problems: it does not scale well for projects that involve large numbers of utterances, the results are unique to the project, and the approach cannot even be used in projects that use open-ended dialog generation techniques. An alternative, more general and efficient way is to collect a set of human conversation data including speech and gesture motions and apply machine learning approaches to realize the mapping from speech to gestures. This approach builds a gesture generator which can be exploited in applications involving virtual characters to generate gestures for novel utterances. A comparison between the manual approach and the machine-learning-based approach is shown in Figure 1.1.
[Figure 1.1 diagram: panel (a) Manual approach and panel (b) Exploiting a gesture generator. Both panels map speech input over time (POS tags, lexical content, and prosody such as f0) to output rotations for each joint in the skeleton used to animate virtual characters; panel (b) inserts a data-driven gesture generator derived from machine learning approaches in place of manual animation.]

Figure 1.1: The difference between the manual approach and using a data-driven gesture generator. (a) The manual approach builds data specifically for each project. (b) To build a gesture generator, the approach realizes the relation between speech and gestures from human conversation data with machine learning approaches and then applies the resulting generator to generate gestures for novel utterances. The resulting gesture generator can be applied to various projects involving virtual character interactions.
The machine-learning-based approach realizes the mapping relation between speech and gestures through human conversation data and therefore preserves detailed correlations between speech and motion. These correlations are useful for capturing personal style or motion dynamics tied to subtle affective features, but are challenging to specify manually.
1.2 Key Challenges
The main criterion for building such a controller lies in modeling the correlation between speech and gesture motion.
[Figure 1.2 diagram: speech signals (lexical content, POS tags, prosody such as f0) are collapsed into gestural signs by deep conditional neural fields (the prediction stage); the gestural signs signify the relation between speech and the physical forms of gestures; GPLVM-based motion synthesis then produces joint rotations from the gestural signs.]

Figure 1.2: The structure of the decomposition framework. The framework uses a novel approach called deep conditional neural fields (DCNFs) to predict gestural signs from speech signals, gestural signs to signify the relation between speech and gestures, and an approach based on Gaussian process latent variable models (GPLVMs) to synthesize gestures. Both DCNFs and the GPLVM-based approach are further introduced in section 1.3.
There is a tight coupling between gesture motion and the evolving prosody of speech as well as the content of the utterance. The coupling between speech and gestures, however, is complex, as it is the product of the information conveyed through both speech and gestures [Calbris, 2011], and the information conveyed through speech and gesture may be shared at a hidden, abstract level [McNeill, 1985]. These properties suggest that realizing the generation of gestures from speech requires an effective representation associating speech and gestures.
Because of this hidden common association, I argue the task of realizing a gesture generator requires defining gestural signs [Calbris, 2011] that provide the coupling between speech and gestures. The related information conveyed through both speech and gestures is defined as gestural signs. Gestural signs correlate speech and gestures by referring to a physical form within the content of the utterance.
The framework I have developed uses these gestural signs to decompose the whole task
into two components: a prediction process that models the relation between speech and
gestural signs and a motion synthesis process that models the relation between gestural
signs and gesture motions. The structure of the resulting decomposition is illustrated in
Figure 1.2. The prediction process predicts the gestural signs from the speech, the gestural signs serve as the interface between the prediction process and the motion synthesis, and the motion synthesis generates motions based on the given gestural signs.
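As an illustration of the decomposition, the sketch below chains the two processes at generation time. The functions predict_gestural_signs and synthesize_motion are hypothetical placeholders for the DCNF predictor of chapter 4 and the GPLVM-based synthesis of chapter 5; they return dummy values here so the example is self-contained, and the input values are made up.

def predict_gestural_signs(words, pos_tags, prosody):
    # Placeholder for the DCNF predictor (chapter 4): one sign per frame.
    return ["rest"] * len(prosody)

def synthesize_motion(signs):
    # Placeholder for the GPLVM-based synthesis (chapter 5): one pose per sign,
    # with an arbitrary pose dimension for this illustration.
    return [[0.0] * 3 for _ in signs]

def generate_gesture_animation(words, pos_tags, prosody):
    # Stage 1: collapse per-frame speech features to discrete gestural signs.
    signs = predict_gestural_signs(words, pos_tags, prosody)
    # Stage 2: synthesize joint rotations that realize those signs, handling
    # co-articulation between consecutive signs.
    return synthesize_motion(signs)

frames = generate_gesture_animation(
    words=["I", "like", "watching", "movies"],
    pos_tags=["PRP", "VBP", "VBG", "NNS"],
    prosody=[0.21, 0.35, 0.30, 0.44])   # e.g. per-frame f0 values (made up)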
The decomposition structure requires a set of gestural signs which summarize the functions and forms of coverbal gestures to allow the prediction of gestures from speech signals. The set of gestural signs needs to be sufficiently expressive to represent general gestures; at the same time, the size of the set needs to be sufficiently small to make learning from a limited set of conversation data feasible. Moreover, because some gestures cannot be determined from the utterance alone [Cassell and Prevost, 1996], the set of gestural signs used in this work focuses on those gestures that can be determined from speech and also balances the trade-off between expressiveness and conciseness.
The prediction process predicts gestural signs from speech input. There are two challenges for this task: (1) Speech signals are high-dimensional and have a nonlinear relation with gestural signs, and therefore the process requires a model capable of learning this complex correlation. (2) Gestural signs are assumed to be temporally related, and therefore the process requires a model that can realize the temporal relation of gestural signs and use the relation to improve the prediction. One promising approach to this problem, as seen in a state-of-the-art work [Levine et al., 2010], applies conditional random fields (CRFs) [Lafferty et al., 2001]. CRFs make predictions for every frame based on the observed signals and the likelihood of the entire sequence, and therefore are a promising approach in terms of realizing the temporal consistency of gestures. The limitation of CRFs is that they do not work well at modeling a nonlinear relation between input signals and labels, nor do they capture well a nonlinear relation among input signals. This limitation is an issue since the prediction process requires mapping two complex high-dimensional temporal signals jointly to low-dimensional, discrete gestural signs, and therefore the learning model needs to be capable of learning this complex relation.
The motion synthesis process generates gestures that match the given gestural signs. The major challenge in synthesizing gestures based on gestural signs is the issue of smooth transitions between gestures, namely, gesture co-articulation. One approach to gesture co-articulation restricts it to combining only pre-existing motion segments that can transition smoothly [Kovar et al., 2002]. However, this greatly reduces the quality of the gesture animation, as generating animations for novel utterances in general requires novel sequential compositions of gestures, and this often involves connecting two motion segments with very distinct motion dynamics. Not supporting such novel combinations restricts the expressivity of the character, while supporting it requires some means to realize smooth transitions between various motion segments.
[Figure 1.3 diagram: input speech features over time (POS tags, lexical content, prosody such as f0; e.g. "I like watching movies" with its POS tags) feed a DCNF, which outputs a gestural sign (e.g. Gesture 1, Gesture 3) at each time frame as the gesture output for virtual human animations.]

Figure 1.3: The overview of DCNFs for predicting gestural signs from speech. The framework extracts words and part-of-speech tags from transcripts and extracts prosody from speech audio, and applies the proposed approach to learn to predict gestures at each time frame.
1.3 Approaches
To address the challenges described in the previous section, this thesis proposes a framework based on the decomposition of gestural signs. The composition of the proposed framework is illustrated in Figure 1.2. The design of each component is described in the following paragraphs.
The gestural signs defined for this thesis are a set of gesture classes which take advantage of the previous literature on gestures [Kipp, 2004, Calbris, 2011, Marsella et al., 2013] to provide an effective gesture representation. Specifically, the framework focuses on learning three types of gestures that tend to have stronger coupling with the content of the speech: metaphorics, abstract deictics, and beats. Details behind these choices are described in chapter 3.
To predict gestural signs from the spoken utterance, I have proposed a deep, temporal model to address the challenge of realizing the mapping from both utterance content and prosody to gestures and their temporal coordination. The structure of the gestural sign prediction process is shown in Figure 1.3. The proposed approach is a novel model called deep conditional neural fields (DCNFs), which combines the advantages of deep neural networks for nonlinear mapping and an undirected second-order linear chain for modeling the temporal coordination of speech and gestures. Details are described in chapter 4.
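As an illustration of what such a model computes, the following minimal sketch pairs a one-hidden-layer network (the nonlinear mapping) with a first-order linear chain and Viterbi decoding over random toy parameters. The actual DCNF uses a deeper network, a second-order chain, and joint training of all parameters, so this is only a simplified stand-in with made-up dimensions.

import numpy as np

def mlp_scores(X, W1, b1, W2, b2):
    # Per-frame nonlinear mapping: X is (T, D) speech features,
    # returns (T, K) unary scores over K gestural signs.
    H = np.tanh(X @ W1 + b1)
    return H @ W2 + b2

def viterbi(unary, trans):
    # Linear-chain decoding: unary (T, K) per-frame scores, trans (K, K) transition scores.
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans + unary[t][None, :]   # (K, K) candidate scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                       # sign index per frame

# Toy usage with random (untrained) parameters.
rng = np.random.default_rng(0)
T, D, H_DIM, K = 20, 30, 16, 5                 # frames, feature dim, hidden units, signs
X = rng.normal(size=(T, D))
W1, b1 = rng.normal(size=(D, H_DIM)), np.zeros(H_DIM)
W2, b2 = rng.normal(size=(H_DIM, K)), np.zeros(K)
trans = rng.normal(size=(K, K))
signs = viterbi(mlp_scores(X, W1, b1, W2, b2), trans)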
For synthesizing gestures from gestural signs, I have proposed an approach based on Gaussian process latent variable models (GPLVMs) [Lawrence, 2005] that generates gestures and allows more flexible co-articulation [Chiu and Marsella, 2014]. The GPLVM-based approach learns a low-dimensional space which captures human motion dynamics, thereby providing a means to generate natural, novel transitions between gesture motions. Specifically, the approach applies GPLVMs with an additional dynamic term in the objective function (also known as GPDMs, Gaussian process dynamical models [Wang et al., 2008]) to derive the dynamic constraint of human motion and the respective manifold. The process of gesture generation embeds gesture motions into the learned manifold and determines the trajectory for the transition between the gestures, and the process then uses the manifold to map the trajectory back to the original high-dimensional motion space to generate the composed gesture animations. Details are described in chapter 5.
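As a rough illustration of the mechanics, the sketch below assumes a precomputed latent embedding X of the motion frames Y, uses an RBF kernel, and simply interpolates linearly in the latent space before decoding with the Gaussian process posterior mean. The actual approach instead determines the transition trajectory under the learned dynamics, so this is an approximation of the idea rather than the method of chapter 5; all sizes are made up.

import numpy as np

def rbf(A, B, gamma=1.0):
    # Squared-exponential kernel between two sets of latent points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * d2)

def gp_mean(X_star, X, Y, gamma=1.0, noise=1e-4):
    # GP posterior mean: maps latent points back to the pose space.
    K = rbf(X, X, gamma) + noise * np.eye(len(X))
    return rbf(X_star, X, gamma) @ np.linalg.solve(K, Y)

def transition(x_end, x_start, X, Y, n_frames=30):
    # Interpolate a latent trajectory from the end of one gesture to the
    # start of the next, then decode it into joint-rotation frames.
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
    traj = (1 - alphas) * x_end + alphas * x_start
    return gp_mean(traj, X, Y)

# Toy usage: X is a 2-D embedding of 50 poses, Y the corresponding 60-D poses.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 2)), rng.normal(size=(50, 60))
frames = transition(X[10], X[40], X, Y)   # (30, 60) transition poses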
In addition to comparing the proposed approach with state-of-the-art approaches, an interesting research question is whether the task of learning gesture generators can be realized with a generation-based approach, as opposed to the synthesis-based approach applied in the framework. In other words, the generation-based approach realizes a model of gesture movement and generates frame-by-frame motion based on the derived model. As no previous work has addressed applying a motion-generation-based model to realizing gestures from speech, I have proposed a novel algorithm based on deep energy models, called hierarchical factored conditional restricted Boltzmann machines (HFCRBMs), to realize the relation between speech and gestures as an autoregressive process [Chiu and Marsella, 2011a]. The evaluation experiment compared the proposed synthesis-based approach with the proposed generation-based approach on generating gestures from prosody features. The evaluation asked participants, using Mechanical Turk, to vote on which animation matched the speech best. The experiment shows that the proposed synthesis-based approach outperforms the proposed generation-based approach significantly. Details are described in chapter 6.
1.4 Evaluations
The evaluation experiments analyze the effectiveness of the individual components and the overall framework with the following studies:
1. The performance of DCNFs on the prediction process is assessed by comparing the accuracy of predicting gestural signs from speech with state-of-the-art approaches. The input is extracted from a series of interview videos with the interviewees' gestures annotated as the prediction targets. To further assess the generalization of the DCNF model on predicting sequential data, the approach has also been applied to a well-evaluated handwriting recognition task [Taskar et al., 2004] to compare with existing state-of-the-art approaches. Both experiments show that DCNFs outperform other approaches. Details are described in chapter 4.
2. The performance of the GPLVM-based approach is assessed by comparing its effectiveness at co-articulating between gestures with state-of-the-art approaches. The results show that exploiting the proposed motion synthesis approach allows more flexible co-articulation compared to the approaches applied in previous work [Levine et al., 2010, Kovar et al., 2002]. Details are described in chapter 5.
3. The overall performance is evaluated through human subject studies involving participants evaluating the generation results. The evaluation experiment compares the framework with the state-of-the-art approach on generating gestures from linguistic features and prosody features of the speech. Since the input signals include features related to the content of the speech, the study showed participants videos with animations generated by both the proposed approach and the state-of-the-art approach, and asked participants to choose which is more natural and which better matches the utterance content. The study, using Mechanical Turk, shows the proposed framework outperforms the state-of-the-art approach significantly. These studies are described in chapter 7.
Another interesting research question is whether we can derive gestural signs automatically from human conversation data, in contrast to manually annotating gestural signs. In other words, this research question seeks to determine whether it is crucial to manually annotate gestural signs. To evaluate this question, I used unsupervised learning approaches to cluster the data and analyzed the correlation with the manually annotated results. Details are described in chapter 7.
1.5 Contribution
This thesis has four major contributions:
1. The first machine learning approach that can generate beyond beat gestures.
Previous fully machine-learning approaches toward learning gesture generators use only prosody as input signals, and therefore are limited to mainly realizing the mapping for a subset of gesture types that correlate closely to prosody, a form of rhythmic gesture called beat gestures. The proposed framework is the first approach that exploits the content of the utterance and generates more expressive gestures beyond beat gestures.
2. A novel time-series prediction algorithm that can be applied to other applications for classifying time series.
The DCNF is a discriminative algorithm for recognizing time series and has shown successful results in both predicting gestures from speech and a handwriting recognition task. It can be applied in many other applications that require realizing sequential relations, such as gesture recognition.
3. A novel motion synthesis algorithm that can be applied to other applications that require smooth transitions between motions while preserving the gesture style of the performer.
The GPLVM-based approach not only provides flexible transitions among general motion segments but also preserves the gesture style of the performer. This algorithm can be applied in any application involving character animations.
4. A decomposition structure for learning data-driven gesture generators.
The proposed framework decomposes the learning task with gestural signs, which allows training the prediction process and the synthesis process with two different sets of data and therefore provides a more flexible training process. The defined gestural signs provide a set of gesture annotations which can also be applied in other machine learning approaches for predicting gestures from speech.
1.6 Outline
The remainder of the thesis is organized as follows:
1. Chapter 2 reviews related work and the background of related algorithms.
2. Chapter 3 provides further insight about the decomposition framework and describes the set of gesture classes used for gestural signs.
3. Chapter 4 describes the work on DCNFs used for the prediction process.
4. Chapter 5 describes the GPLVM-based approach that generates gestures from gestural signs and allows flexible gesture co-articulation.
5. Chapter 6 describes the alternative gesture generation technique based on HFCRBMs.
6. Chapter 7 describes the objective assessment experiments and studies involving participant evaluations.
7. Chapter 8 discusses conclusions and proposes future work.
8. Appendix A describes approaches that have been explored to extend DCNFs for the gestural sign prediction task.
9. Appendix B describes extensive assessment experiments for models related to HFCRBMs.
Chapter 2
Background
This chapter describes related work and background knowledge regarding our algorithms. As the thesis work focuses on learning coverbal gestures, section 2.1 gives a brief review of gestures and section 2.2 discusses previous work on modeling coverbal gestures. Section 2.3 describes previous work on motion synthesis and interpolation. In our annotation-based framework we apply Gaussian process latent variable models (GPLVMs) to human motion to learn a manifold which helps to facilitate motion transition, and therefore section 2.4 describes Gaussian processes and GPLVMs. An alternative frame-by-frame direct generation framework is proposed based on extensions of restricted Boltzmann machines (RBMs) [Hinton et al., 2006], and therefore section 2.5 gives some brief background on these models. The deep belief net [Hinton et al., 2006] is related to RBMs and the gestural sign prediction task, and therefore section 2.5.4 gives a brief description. Section 2.6 talks briefly about recent advances in deep learning and their relation to the proposed DCNFs.
2.1 Gesture overview
Gestures are natural movements accompanying speech and are tightly linked to its timing or communicative intent. They can be performed intentionally to help clarify the meaning of the speech, or they can be spontaneous movements that reveal the mental state of the speaker. Their connection with speech shows a correlation with the composition process of the speech, and therefore one theory about gestures suggests that speech and gestures are composed together to co-express a single idea [Kendon, 2000]. As the goal of this work is to build a gesture controller, which requires an input with rich information to determine the generation content, we choose speech as the required input and did not adopt the co-generation theory, for practical concerns. Below we give a brief overview of the different classes of gestures.
2.1.1 Gesture classes
There are different ways of classifying gestures in the previous literature [Kendon, 1983, Mcneill, 1992, Ekman and Friesen, 1969], and here we adopt the scheme proposed in [Mcneill, 1992] for categorizing gestures.
Iconics
Iconic gestures depict specific things and are usually performed by mimicking certain properties of the referent. They are commonly seen in illustrating the shape of an object or the spatial relation between objects in an event, such as describing a movement in a sports game.
Metaphorics
Metaphoric gestures are similar to iconic gestures in their pictorial essence but differ in that the referents are abstract, such as an idea or a concept. For example, saying "open your mind" while opening a closed hand and swinging the arms outward illustrates the concept of "open".
Deictics
Deictic gestures are actions pointing to some direction, location, or object, which can exist in reality or refer to an abstract or imaginary instance. An example of referring to an abstract instance is pointing toward a direction orthogonal to the speaker and recipient when saying "somewhere else".
Beats
Beat gestures are actions synchronized with speech that usually do not reflect the semantic meaning. Some beat gestures have a specific purpose, like striking the hand at a specific word of the speech for emphasis, but often beat gestures move rhythmically without emphasizing specific words.
Cohesives
Cohesive gestures serve to reconnect thematically related concepts across different times in the speech. The connection is realized by repeating the same gestural form, movement or location in space. For example, a speaker may refer back to a prior concept in the dialog by repeating the gesture form.
2.2 Previous works on modeling gestures
Data-driven approaches to generating coverbal gestures for intelligent embodied agents have become more and more popular in recent research. [Stone et al., 2004] took the co-generation perspective, in which the framework synthesizes both speech and gestures based on the determined utterance during the conversation. The matching relation among speech, gestures, and utterance is defined manually. [Neff et al., 2008] addressed modeling individual gesture styles through analyzing the mapping relation from the extracted utterance information to gestures. The extracted utterance information contains semantic tags automatically mapped from transcripts and structure-related information of the utterance which is manually annotated. [Kopp and Bergmann, 2012] also took the co-generation perspective and focused on modeling individual styles of iconic gestures to improve human-agent communication. Their framework takes information about the depicted objects and the intention about the objects as input. Some of the previous work focuses on realizing the relation between prosody and motion dynamics [Levine et al., 2009, Levine et al., 2010, Chiu and Marsella, 2011a]. Specifically, [Levine et al., 2009] applies hidden Markov models (HMMs), [Levine et al., 2010] applies conditional random fields (CRFs), and [Chiu and Marsella, 2011a] applies hierarchical restricted Boltzmann machines [Chiu and Marsella, 2011b] to realize gesture prediction. Realizing only prosody allows the models to run in real time, but at the same time the approaches are limited to mainly realizing the mapping for a subset of gesture types that correlate closely to prosody, a form of rhythmic gesture called beat gestures.
With respect to prior work on gesturing, our approach does not aim to replace it but rather is designed to be integrated into previous work and resolve its limitations. For example, the limitation of [Neff et al., 2008] is the manual effort required for annotation. Our technique can be applied to learn to predict this information, and their approaches can then be applied to accomplish the gesture generation process. With respect to previous work focused on realizing prosody [Levine et al., 2009, Levine et al., 2010, Chiu and Marsella, 2011a], our approach goes beyond prosody to realize a mapping from the utterance content to more expressive gestures and can be integrated to extend existing work to generate animations beyond beat gestures.
An alternative to data-driven machine learning approaches are handcrafted rule-based approaches [Lee and Marsella, 2006, Cassell et al., 2001, Marsella et al., 2013]. These exploit expert knowledge about speech and gestures to specify the mapping from utterance features to gestures. While earlier works based on this approach focused on addressing the mapping relation between only linguistic features and gestures [Lee and Marsella, 2006, Cassell et al., 2001], recent work [Marsella et al., 2013] has also addressed how to use acoustic features to help gesture determination.
When compared with existing approaches for gesture prediction [Levine et al., 2009, Levine et al., 2010, Chiu and Marsella, 2011a], our work is the first to introduce an effective representation of gestures and a deep, temporal model capable of realizing the relation between speech and the proposed gesture representation. [Chiu and Marsella, 2011a] adopt the concept of unsupervised training of deep belief nets [Hinton et al., 2006], but without an effective gesture representation and the supervised training phase the prediction task is much more challenging, and their approach has therefore been limited to realizing the relation between prosody and rhythmic movement. Compared to previous work [Do and Artieres, 2010] on joining CRFs and deep neural networks, our model introduces a more state-of-the-art technique for training deep networks and higher-order CRFs, which bring significant improvements over their work.
2.3 Motion interpolation and synthesis
The previous works mentioned in the above section address realizing the mapping relation between dialogue and gesture annotations, but emphasize less the quality of the motion synthesis. The synthesis process is, however, crucial in that gesture animations need to maintain natural movement, and without a synthesis process preserving natural human dynamics the quality of the animations will be greatly reduced. The annotation-based framework seeks to improve the synthesis process, which also has the potential to be integrated with previous work to improve animation quality.
The annotation-based framework applies the Gaussian process latent variable model (GPLVM) [Lawrence, 2005] to embed motion in a latent space. Previous works have shown success in synthesizing character animations with this approach [Grochow et al., 2004, Urtasun et al., 2008, Wang et al., 2008, Ye and Liu, 2010, Levine et al., 2012], and the annotation-based framework extends it further to deal with highly variant gesture motions. Synthesizing gesture motions requires an effective interpolation algorithm to perform transitions among heterogeneous motions. We propose an interpolation algorithm that makes use of knowledge of motion dynamics to generate natural motion when transitioning between gestures. Previous works have proposed various interpolation algorithms for motion synthesis [Rose et al., 1998, Mukai and Kuriyama, 2005, Torresani et al., 2007, Chiu and Marsella, 2011b, Min and Chai, 2012]. The common goal of these algorithms is transitions between similar motions, and dealing with heterogeneous motions requires filling the gap with intermediate motions. This approach requires a large motion set, and the demand is significant for highly variant gesture motions. Our interpolation algorithm is proposed for transitioning between motions with dramatic differences without additional intermediate motion segments.
2.4 Gaussian process latent variable model
The GPLVM is a dimension reduction approach which determines a low-dimensional space that better represents the given data, and its core idea is based on the Gaussian process. To help explain the idea, we first briefly describe the Gaussian process. A Gaussian process is a stochastic process which models the distribution of the predicted variable y as a Gaussian in which the mean is in general set to 0 and the covariance is a function of the input variable x. Specifically, the covariance function is represented as a kernel function K where K_{i,j} = k(x_i, x_j). Its log likelihood function is:

\ln p(y) = -\frac{1}{2} \ln|K| - \frac{1}{2} y^T K^{-1} y - \frac{N}{2} \ln(2\pi)

where N is the number of data points. Here we omit some parameters of the Gaussian process to keep the equation uncluttered. The GPLVM is an unsupervised learning algorithm in which the original predicted variable y of the Gaussian process is now given while x becomes the parameter to be determined, and the goal of the learning algorithm is to infer the x and the corresponding Gaussian process that jointly maximize the likelihood p(y | x, \theta), where \theta denotes the parameters of the Gaussian process. GPLVMs find a low-dimensional projection x for y while preserving the similarity relation among the original data y. Data points that are far away from each other in the original space will also be far apart in the low-dimensional manifold.
An extension of GPLVMs called the Gaussian process dynamical model (GPDM) [Wang et al., 2008] has been proposed to include dynamics in determining the low-dimensional projection. GPDMs contain the same process as GPLVMs for determining the low-dimensional projection but with an extra autoregressive likelihood function p(x_t | x_{t-1}) to maximize. With this additional function, the optimization process for determining the low-dimensional projection has an extra objective of maximizing the likelihood of p(x_t | x_{t-1}). In other words, GPDMs need to allocate x in such a way that when two points x_t and x_{t'} are close to each other, their consecutive points x_{t+1} and x_{t'+1} also have to be close to each other. As a result, the projection x in GPDMs also reflects the dynamics of the time-series data.
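As a numerical illustration, the sketch below evaluates the Gaussian process log likelihood above for a candidate latent configuration X, together with a simple first-order smoothness term standing in for the GPDM dynamics objective. A real implementation would optimize X and the kernel parameters with gradients rather than merely evaluating the objective; the RBF kernel and the Gaussian dynamics penalty are illustrative choices, not the exact model used in this thesis.

import numpy as np

def gp_log_likelihood(X, Y, gamma=1.0, noise=1e-3):
    # ln p(Y | X): an independent GP over each output dimension of Y,
    # all sharing the kernel K built from the latent points X.
    N, D = Y.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * gamma * d2) + noise * np.eye(N)
    sign, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)
    return (-0.5 * D * logdet
            - 0.5 * np.trace(Y.T @ Kinv_Y)
            - 0.5 * N * D * np.log(2 * np.pi))

def dynamics_log_likelihood(X, sigma=0.1):
    # Crude stand-in for the GPDM term: penalize large jumps x_t -> x_{t+1}.
    diffs = X[1:] - X[:-1]
    return -0.5 * (diffs ** 2).sum() / sigma ** 2

rng = np.random.default_rng(0)
Y = rng.normal(size=(40, 60))   # 40 motion frames, 60 rotation values each (made up)
X = rng.normal(size=(40, 3))    # candidate 3-D latent trajectory
objective = gp_log_likelihood(X, Y) + dynamics_log_likelihood(X)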
2.5 Restricted Boltzmann machines and the temporal extensions
2.5.1 Restricted Boltzmann Machines
The RBM, as shown in Figure 2.1, is a complete bipartite undirected graphical model with a hidden layer h, a visible layer v, and a symmetric connection W connecting the two layers. RBMs learn to generate the observed data with W, in which the posterior distribution over a visible node v_i is (bias omitted) p(v_i = 1 | h) = \sigma(\sum_j h_j w_{ij}), where \sigma is the logistic function.
[Figure 2.1 diagram: (a) an RBM drawn with individual visible nodes v connected to hidden nodes h through W; (b) the simplified layer-level representation.]

Figure 2.1: Restricted Boltzmann machines. (a) An RBM where v denotes the visible layer, h denotes the hidden layer, and W represents the connection between the two layers. (b) An RBM with simplified representation.
After learning the parameters for generating the observed data, it becomes more effective to associate labels with the observed data based on these parameters than with a direct association approach. In the domain of image recognition, for example, since a natural image follows a certain structure, identifying a new basis which better represents the image provides better information than the original raw data.
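For concreteness, a minimal sketch of the two conditional distributions and a single contrastive-divergence (CD-1) weight update follows; biases are omitted as in the equation above, and the sizes and learning rate are arbitrary.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.01, rng=np.random.default_rng(0)):
    # p(h_j = 1 | v) and p(v_i = 1 | h), biases omitted for brevity.
    ph0 = sigmoid(v0 @ W)                        # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T)                      # reconstruct the visible layer
    ph1 = sigmoid(pv1 @ W)                       # hidden probabilities for the reconstruction
    # CD-1 gradient: data correlations minus reconstruction correlations.
    return W + lr * (v0.T @ ph0 - pv1.T @ ph1)

rng = np.random.default_rng(0)
v_batch = (rng.random((16, 20)) < 0.5) * 1.0     # 16 binary visible vectors of size 20
W = 0.01 * rng.normal(size=(20, 10))             # 20 visible units, 10 hidden units
W = cd1_step(v_batch, W)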
2.5.2 Conditional Restricted Boltzmann Machines
Conditional restricted Boltzmann machines (CRBMs) [Taylor et al., 2007], as shown in Figure 2.2a, extend RBMs with visible layers that take past observed data as input. The additional visible layers have directed links to the hidden layer and the current visible layer, which introduces parameters A and B, where A denotes directed links from prior visible layers to the current visible layer and B denotes directed links to the hidden layer. CRBMs generate the motion frame at time t by inferring the posterior distribution of the hidden layer based on the motion frames at times t-1, t-2, ... and then constructing the data at the current visible layer. The model has shown successful results in modeling human walking motion [Taylor et al., 2007].
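A rough sketch of that generation loop is given below; the mean-field updates, Gaussian (real-valued) visible units, and parameter shapes are simplifying assumptions rather than the exact procedure of [Taylor et al., 2007], and the parameters are random rather than learned.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_generate(history, W, A, B, n_frames=10, mf_iters=5):
    # history: the most recent past frames (order = len(history)).
    # A maps the concatenated past frames to the current visible layer,
    # B maps them to the hidden layer, W couples visible and hidden units.
    frames = list(history)
    order = len(history)
    for _ in range(n_frames):
        past = np.concatenate(frames[-order:])   # condition on the last `order` frames
        v = frames[-1].copy()                    # initialize the new frame from the last one
        for _ in range(mf_iters):                # alternate mean-field updates of h and v
            h = sigmoid(v @ W + past @ B)
            v = h @ W.T + past @ A               # Gaussian visible units: use the mean
        frames.append(v)                         # the generated frame feeds the next step
    return np.stack(frames[order:])

# Toy usage: 60-D motion frames, order-2 history, 30 hidden units, random parameters.
rng = np.random.default_rng(0)
D, order, H = 60, 2, 30
W = 0.01 * rng.normal(size=(D, H))
A = 0.01 * rng.normal(size=(order * D, D))
B = 0.01 * rng.normal(size=(order * D, H))
motion = crbm_generate([rng.normal(size=D) for _ in range(order)], W, A, B)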
[Figure 2.2 diagram: (a) a CRBM with past visible layers v_{t-2}, v_{t-1}, current visible layer v_t, hidden layer h, and connections W, A, B; (b) an FCRBM with an additional label input y and feature layer z gating the factored connections.]

Figure 2.2: (a) A CRBM with order 2, where t represents the index in time, v_{t-1}, v_{t-2} are visible layers for the data at t-1, t-2, and A and B are directed links from v_{t-1}, v_{t-2} to v and h respectively. Each rectangle in the figure represents a layer of nodes. (b) An FCRBM with order 2, where z denotes the feature layer, y represents labels, and triangles represent factored multiplications.
2.5.3 Factored CRBM
The factored CRBM (FCRBM) [Taylor and Hinton, 2009], as shown in Figure 2.2b, is an extension of the CRBM to model time-series data with additional contextual information. The term FCRBM here refers to the factored conditional restricted Boltzmann machine with contextual multiplicative interaction, not a factored version of the CRBM. The CRBM captures the transition dynamics of motion frames in an unsupervised way. It generates motion frames based only on the information of past motion frames. The FCRBM includes additional prosody information (contextual information) to generate motion frames. The example application in the original work [Taylor and Hinton, 2009] is to model human walking motion, where the contextual information is style labels annotating different kinds of walking, such as sad or drunk.
[Figure 2.3 diagram: a stack of layers pretrained on the data, then fine-tuned with both data and labels.]

Figure 2.3: DBNs first train networks layer-by-layer with input data, where the output of the lower layers serves as input for the higher layer, and then fine-tune the connection weights of the entire network with both input features and labels.
The FCRBM has an input y for prosody, and a feature layer z to further expand the information from the contextual values. The matrices W, A, B of the CRBM break into (W_h, W_v, W_z), (A_{v'}, A_v, W_z), and (B_{v'}, B_v, W_z), where the prosody values influence the transitions within these three pathways through the multiplication with W_z. In the standard FCRBM the three W_z's are actually different matrices. The FCRBM used in this work, as shown in Figure 2.2b, uses the same matrix W_z for all three connections. This simplification reduces the complexity of the model and also maintains good performance [Taylor and Hinton, 2009]. For the original version of the FCRBM, please refer to [Taylor and Hinton, 2009].
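To make the factored multiplicative interaction concrete, the sketch below shows how context features can gate the visible-to-hidden pathway through a shared factor layer. It covers only the W pathway and omits the history pathways A and B, the label input y, and all biases, so it illustrates the gating mechanism rather than the full FCRBM; the dimensions are made up.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factored_hidden_activation(v, z, W_v, W_h, W_z):
    # Three-way factored interaction: the context features z gate the
    # visible-to-hidden pathway through element-wise products per factor.
    f = (v @ W_v) * (z @ W_z)     # factor activations, gated by the context
    return sigmoid(f @ W_h)       # hidden activations

rng = np.random.default_rng(0)
D, C, F, H = 60, 8, 20, 30        # visible dim, context dim, factors, hidden units
v = rng.normal(size=D)            # a motion frame
z = rng.normal(size=C)            # prosody/context features (the feature layer)
W_v = 0.01 * rng.normal(size=(D, F))
W_z = 0.01 * rng.normal(size=(C, F))
W_h = 0.01 * rng.normal(size=(F, H))
h = factored_hidden_activation(v, z, W_v, W_h, W_z)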
2.5.4 Deep belief nets
The deep belief net (DBN) is a framework proposed to improve the training process of
conventional artificial neural networks with multiple layers. Conventional artificial neural
networks increase their complexity by stacking multiple layers, and by doing so the model
can learn complex functions. But increasing the number of layers can eventually lead to
poor performance, as the required training time increases dramatically and the solution
frequently converges to a local optimum. DBNs introduce an unsupervised learning process
that first trains the network connections with the input data, and then allow the conventional
backpropagation training process to accomplish supervised learning. The unsupervised
pre-training process initializes the network connections, and the supervised training process
after this initialization in general tends to converge to a better solution, while the
combined training time is shorter than training with only the supervised learning process.
The unsupervised training algorithm is a greedy layer-wise algorithm which identifies
patterns in the input data and learns a better feature representation. The training flow
of DBNs is shown in Figure 2.3. It is a common belief that the benefit of this
unsupervised learning for the later supervised learning process comes from shaping the
network connections with a better representation of the data.
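The two-phase procedure can be summarized with the following schematic sketch; the three callbacks are hypothetical helpers standing in for contrastive-divergence training of one RBM, its deterministic up-pass, and standard supervised backpropagation:

```python
def train_dbn(rbm_layers, data, labels, rbm_train, rbm_up, backprop_finetune):
    """Schematic sketch of the two-phase DBN training described above."""
    # Phase 1: unsupervised, greedy layer-wise pre-training.
    activations = data
    for rbm in rbm_layers:
        rbm_train(rbm, activations)              # fit this RBM on its current input
        activations = rbm_up(rbm, activations)   # hidden output feeds the next layer
    # Phase 2: supervised fine-tuning of the whole stack, starting from the
    # pre-trained weights, with both input features and labels.
    backprop_finetune(rbm_layers, data, labels)
```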
2.6 Deep learning and sequential data prediction
Deep learning has shown fruitful results in various domains; in particular, convolutional
deep neural networks have shown significant improvement over previous work [Deng
et al., 2013, Goodfellow et al., 2014]. However, these models do not take sequential relations
into account. A common approach to extend deep learning for modeling sequential data
is to train the deep network and a linear-chain graphical model separately. For example, in
speech recognition [rahman Mohamed et al., 2012] the common approach is to train the deep
network on individual frames and then apply hidden Markov models (HMMs) over
the hidden states. Our approach instead trains both the deep neural network and the CRF
with a joint likelihood. Previous works have adopted a similar perspective on extending
CRFs with deep structure [Yu et al., 2009, Do and Artieres, 2010] and show improvement
over single-layer CRFs or CRFs combined with a shallow layer of neural network [Peng
et al., 2009].
Chapter 3
Decompose and constrain the learning problem
Our work uses gestural signs [Calbris, 2011] to decompose the learning task into two
processes: a prediction process and a motion synthesis process. The prediction process
predicts gestural signs from linguistic and prosodic features of the speech, and the motion
synthesis process generates joint rotations based on the given gestural signs. The framework
allows the learning of data-driven gesture generators, and the resulting generator
can be applied to generate gesture animations for novel utterances. Gestural signs play
an important role in this framework, as they decompose the original challenging task of
learning the relation between speech signals and joint rotations into two more feasible
processes; at the same time, the effectiveness of the framework relies on whether the gestural
signs encapsulate well the correlation between speech signals and joint rotations.
The overall structure is shown in Figure 1.2. We start the description of the framework
by first introducing our design for gestural signs.
The learning task relies on the gestural signs to help realize the relation between
speech signals and joint rotations. There are several challenges in defining gestural signs.
First, the gestural signs need to be able to summarize and distinguish the functions and
forms of gestures, and the set of gestural signs needs to be small to make the
learning task feasible. Additionally, some of the information conveyed through gestures
is not necessarily included in the utterance content or prosody [Cassell and Prevost,
1996], and therefore we may not be able to predict these gestures by using speech signals
alone. For example, when a speaker describes a ball by saying "I saw a ball with size
like this" and then gestures with two palms facing each other, depicting the actual size of
the ball through the physical distance between the palms, the information about the size
is conveyed through the gesture but is missing in the utterance content. For these types
of gestures it will at best be difficult for gestural signs to correlate speech signals with the
physical forms.
3.1 Gesture categories
We address these challenges by focusing on gesture categories that can be more reliably
predicted from the utterance content and prosody: abstract deictic, metaphoric, and
beat gestures. Abstract deictic gestures are pointing movements that indicate an object,
a location, or abstract things which are not physically present in the current surroundings.
An instance of an abstract deictic gesture is a speaker mentioning that he comes from
the other side of town while pointing in a direction orthogonal to the speaker and
recipient to refer to the concept of "other side". Metaphoric gestures exhibit abstract
concepts as having physical properties. An instance of a metaphoric gesture is the "frame"
gesture, in which the speaker holds two hands with palms facing each other and uses the
physical distance to illustrate the concept of importance. Beat gestures are rhythmic
actions synchronized with speech, and they tend to correlate more with prosody than
with utterance content.
3.2 Gestural signs
We designed our dictionary of gestural signs based on the previous literature on gestures [Kipp,
2004, Calbris, 2011, Marsella et al., 2013] and the three gesture categories, and then
calculated their occurrences in the motion capture data to filter out those that rarely appeared.
We then applied the set of gestural signs in the annotation process on our dataset for
training the prediction process (we will describe the data in the next section), calculated
the occurrence of these gestural signs in that data, and then again removed those that rarely
occurred. The final set of gestural signs has size 14; the list and their descriptions
are shown in Table 3.1.
Gestural sign: Description
Rest: Rest.
Palm face up: Lift hands, rotate palms facing up or slightly inward, and hold for a while. Usually happens when starting a conversation.
Head nod: Head nod without arm gestures.
Wipe: Flat hands, palms down, start near (above) each other and move apart in a straight, wiping motion.
Whole: Moving both hands along outward arcs with palms facing forward and downward. Usually happens when saying "all", "everything", "never", "completely". Similar to Wipe, but Wipe moves along a straight line while this category moves along an arc.
Frame: Both hands are held some inches apart, palms facing each other, as if something is between the hands. The initial motion may be similar to Whole, but it differs in the hand holding.
Dismiss: Hand wipes through the air to the side in an arc as if chasing something away. Could be wiping inward or outward.
Block: Hand is positioned in front of the speaker, palm toward the front.
Shrug: Hands are opened in an outward arc, ending in a palm-up position, usually accompanied by a slight shrug and a face saying "I don't know".
More-Or-Less: The open hand, palm down, swivels around the wrist.
Process: Hand moves in circles.
Deictic.Other: Hand points toward a direction other than the self.
Deictic.Self: Speaker points with one hand or both hands to him/herself.
Beats: Beats.
Table 3.1: The list of our gestural signs and their descriptions.
Chapter 4
Predicting Gestural Signs from Speech with Deep
Conditional Neural Fields
A common function of the parallel use of speech and gesture is to convey meaning in which
gesture plays a complementary or supplementary role [Goldin-Meadow et al., 1993], and
gestures may help to convey complex representations by expressing complementary
information about abstract concepts [McNeill, 1985]. Realizing this relation between
speech and gesture requires realizing the hidden abstract concept. The other important
property is their temporal coordination. There is a close connection in the production
of the two modalities, and "the mutual co-occurrence of speech and gesture reflects a
deep association between the two modes that transcends the intentions of the speaker to
communicate" [Iverson and Thelen, 1999]. Because of the correlation at the abstract hidden
level and the temporal coordination, the two modalities in general are argued to have
a deep common root [Alibalia et al., 2000]. These properties suggest that a model that
predicts coverbal gestures needs to be capable of (1) realizing the hidden abstract concepts
in the utterance and the deep association between speech and gestures, and (2) maintaining
the temporal consistency of gestures.
A state-of-the-art work [Levine et al., 2010] applies conditional random fields (CRFs)
to learning coverbal gesture prediction. CRFs make predictions for every frame based on
the observed signals and the likelihood of the entire sequence, and are therefore a promising
approach in terms of realizing the temporal consistency of gestures. The limitation of
CRFs is that they do not model well the nonlinear relation between input
signals and labels, nor do they capture well the nonlinear relations among input signals.
This limitation is an issue since the relation between high-dimensional speech signals and
gestures is complex, and would at best be difficult to realize with a simple linear
model.
In terms of realizing this complex relation between speech and gestures, we argue deep
neural networks are the most promising approach. Recent advances in deep learning have
shown significant improvement over existing approaches on problems in which
the input signals and prediction targets have a complex relation. The limitation of deep
learning is the lack of temporal modeling capability, which is crucial for realizing the
temporal coordination between speech and gestures.
4.1 Deep Conditional Neural Fields
To address these challenges, we propose a novel algorithm called deep conditional neural
fields (DCNFs) that combines state-of-the-art deep learning techniques with CRFs for
predicting gestures from utterance content and prosody. The prediction task takes the
transcript of the utterance, part-of-speech tags of the transcript, and prosodic features
of the speech audio as input $x = \{x_1, x_2, \ldots, x_N\}$, and learns to predict a sequence of
Figure 4.1: The structure of our DCNF framework. The neural network learns the
nonlinear relation between speech features (words, POS tags, prosody) and gestural signs. The top layer is a second-order
undirected linear chain which takes the output of the neural network as input and
models the temporal relation among gestural signs. The model is trained with the joint
likelihood of the top undirected chain and the deep neural network.
gestural signs $y = \{y_1, y_2, \ldots, y_N\}$, where the sequence has length $N$. Each class $y_i$ is
contained in the set of all possible classes, $y_i \in Y$, and each input $x_i$ is a feature vector
$x_i \in \mathbb{R}^d$. Upon making predictions for a novel utterance, the derived model takes the
corresponding features $x'$ of the utterance as input and predicts which $y$ the embodied
agent should perform at each time frame.
Our deep conditional neural fields (DCNFs), as shown in Figure 4.1, are represented
as:
$$
P(y \mid x; \lambda, \mu, \theta, w) = \frac{1}{Z(x)} \sum_{t=1}^{N} \exp\Big[ \sum_k \lambda_k\, g_k(y_{t-1}, y_t)
+ \sum_l \mu_l\, g_l(y_{t-1}, y_t, y_{t+1})
+ \sum_i \theta_{i, y_t}\, f_i(x_t; w) \Big]
\qquad (4.1)
$$
where $Z(x)$ is the normalization term, $g_k(y_{t-1}, y_t)$ and $g_l(y_{t-1}, y_t, y_{t+1})$ denote edge functions,
$\lambda, \mu, \theta$ are model parameters, and $f_i(x_t; w)$ is the function corresponding to the output
of an output node of the deep neural network, where $w$ represents the network connection
parameters of the neural network. To be more specific, in a DCNF with $m$ layers of neural
networks,
$$
f_i(x_t; w) = h(a_{m-1} w_{m-1}) \quad \text{where} \quad
a_i = h(a_{i-1} w_{i-1}), \; i = 2 \ldots m-1.
\qquad (4.2)
$$
Here $a_i$ represents the output at the $i$th layer, $w_i$ represents the connection weights between
the $i$th and $(i+1)$th layers, and $h(x)$ is the activation function. We applied the sigmoid function
as the activation function in this work.
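As an illustration, a minimal numpy sketch of the feed-forward pass of Equation 4.2 (bias terms omitted, names illustrative) is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcnf_forward(x_t, weights):
    """Sketch of Equation 4.2: feed one input frame x_t through the m - 1
    weight matrices of the network (bias terms omitted for brevity)."""
    a = np.asarray(x_t, dtype=float)
    activations = [a]
    for w in weights:
        a = sigmoid(a @ w)        # a_i = h(a_{i-1} w_{i-1})
        activations.append(a)
    # The final activation plays the role of f(x_t; w): it is handed to the
    # linear-chain layer as its observation features.
    return a, activations
```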
4.1.1 Prediction
Given a sequence $x$, the prediction process of DCNFs predicts the most probable sequence
$y$:
$$
y = \arg\max_{y} P(y \mid x).
$$
To help prevent the co-adaptation of network parameters, which results in overfitting, the model
applies the dropout technique [Hinton et al., 2012] to change the feed-forward results of
$f_i(x_t; w_i)$ in the training phase. With dropout, during the feed-forward phase the
output of each hidden node has a probability of being disabled. In other words, during
prediction in the training phase:
$$
\hat{a}_i = a_i \cdot \delta, \qquad
\delta =
\begin{cases}
0, & r < \gamma \\
1, & \text{otherwise}
\end{cases}
$$
in which $\hat{a}_i$ is the output at layer $i$ in the training phase, $r$ is a value randomly sampled
between 0 and 1, and $\gamma$ is the dropout threshold. Consequently the output of the hidden nodes
in the training phase ($\hat{a}_i$) differs from that of the testing phase ($a_i$). The dropped-out
nodes are re-sampled at every feed-forward pass. This stochastic behavior encourages
hidden nodes to model distinct patterns and therefore further prevents overfitting.
The dropout technique is not applied during the testing phase.
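A minimal sketch of this dropout rule, assuming the mask is redrawn at every feed-forward pass and no rescaling is applied (the text above does not mention rescaling), could look like:

```python
import numpy as np

def dropout_forward(a, gamma, training=True, rng=np.random):
    """Sketch of the dropout rule above: during training every hidden output
    is zeroed when a uniform sample r falls below the threshold gamma; at
    test time activations pass through unchanged."""
    if not training:
        return a
    r = rng.uniform(size=a.shape)    # one sample per hidden node
    return a * (r >= gamma)          # hat{a}_i = a_i when r >= gamma, else 0
```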
4.1.2 Gradient calculation
The more challenging part of calculating the gradient of DCNFs is calculating the gradient
of $w$. Since the parameters of the lower layers are hidden within the neural network functions
$f_i(x_t; w)$, deriving the gradient of a lower layer requires applying a series of chain rules.
We solve the issue by realizing the calculation with backpropagation. Backpropagation
requires determining the error term $\delta_{m-1}$ of $w_{m-1}$, in which $\nabla w_{m-1} = \delta_{m-1} \hat{a}_{m-1}$, where
$\nabla w_{m-1}$ denotes the gradient of $w_{m-1}$. As the gradient of $w_{m-1}$ is given by:
$$
\begin{aligned}
\frac{\partial \log P}{\partial w_{m-1}}
&= \sum_{t}^{N} \sum_{i} \Big[ \theta_{i,y_t} \frac{\partial f_i(x_t; w)}{\partial w_{m-1}}
 - \sum_{\tilde{y}} p(\tilde{y} \mid x_t)\, \theta_{i,\tilde{y}} \frac{\partial f_i(x_t; w)}{\partial w_{m-1}} \Big] \\
&= \sum_{t}^{N} \sum_{i} \Big[ \theta_{i,y_t} \frac{\partial h(\hat{a}_{m-1} w_{m-1})}{\partial w_{m-1}}
 - \sum_{\tilde{y}} p(\tilde{y} \mid x_t)\, \theta_{i,\tilde{y}} \frac{\partial h(\hat{a}_{m-1} w_{m-1})}{\partial w_{m-1}} \Big] \\
&= \sum_{t}^{N} \sum_{i} \Big[ \theta_{i,y_t}\, h'_i(\hat{a}_{m-1} w_{m-1})\, \hat{a}_{m-1}
 - \sum_{\tilde{y}} p(\tilde{y} \mid x_t)\, \theta_{i,\tilde{y}}\, h'_i(\hat{a}_{m-1} w_{m-1})\, \hat{a}_{m-1} \Big]
\end{aligned}
\qquad (4.3)
$$
we can decompose the gradient term and derive
$$
\delta_{m-1} = \theta_{i,y_t}\, h'(\hat{a}_{m-1} w_{m-1})
 - \sum_{\tilde{y}} p(\tilde{y} \mid x_t)\, \theta_{i,\tilde{y}}\, h'(\hat{a}_{m-1} w_{m-1}).
$$
DCNFs propagate $\delta_{m-1}$ to the lower layers to calculate the gradients of those layers. One
thing to notice is that the gradient is calculated with $\hat{a}_{m-1}$ instead of $a_{m-1}$ due to the
influence of dropout.
4.1.3 Parameter updates
To prevent overfitting of DCNFs, the model has a regularization term for all parameters,
and the objective function is as follows:
$$
L(\Theta) = \sum_{t=1}^{N} \log P(y_t \mid x_t; \Theta) - \frac{1}{2\sigma^2} \lVert \Theta \rVert^2,
$$
in which $\Theta$ denotes the set of model parameters and $\sigma$ corresponds to the regularization
coefficient. The regularization term on the deep neural network encourages
weight decay, which limits the growth in complexity of the network connections over the
parameter updates. We applied stochastic gradient descent for training DCNFs, with a
decaying learning rate to encourage the convergence of the parameter updates.
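A sketch of one such update step is shown below; the learning-rate schedule and the constants are illustrative rather than the actual settings used in this work:

```python
def sgd_update(params, grads, step, lr0=0.1, decay=1e-3, sigma2=10.0):
    """Sketch of the regularized update implied above: stochastic gradient
    ascent on log P with an L2 (weight-decay) term and a decaying learning
    rate.  lr0, decay, and sigma2 are illustrative values."""
    lr = lr0 / (1.0 + decay * step)          # decaying learning rate
    for name in params:
        # gradient of the log-likelihood minus the weight-decay term theta / sigma^2
        params[name] += lr * (grads[name] - params[name] / sigma2)
    return params
```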
4.2 Dataset for training the prediction process
Our dataset consists of more than 9 hours of video recordings taken from a large-scale study
of semi-structured interviews [Gratch et al., 2014]. Our experiment focused
on predicting the interviewees' gestures from the utterance content and prosody.
There are 15 videos in which the interviewees exhibit a rich set of coverbal gestures. All
the videos were segmented and transcribed using the ELAN tool from the Max Planck
Institute for Psycholinguistics [Brugman et al., 2004]. Each transcription was reviewed
for accuracy by a senior transcriber.
4.2.1 Gestural sign annotation
In the annotation process, we first taught the annotator the definitions of the gestural
signs and showed a few examples of each gestural sign. The annotator then used the
ELAN tool, looked at the behavior of the participants only when they were speaking, and
marked the gestural signs of the behaviors along with their beginning and ending times in
the video. The annotation criterion is that there is at most one gestural sign at
any time in the data. If a gesture does not correspond to one of our defined gestural signs, it
is marked as Beat. Annotators were told to focus on the physical forms of the gestures
and ignore the utterance content. The annotation results were inspected to analyze their
accuracy and to ensure the annotator had understood the definitions of the gestural signs.
4.2.2 Linguistic features
Linguistic features best encapsulate the utterance content and help determine the corresponding
gestures. The extracted data has 5250 unique words, but most of them are
unique to a few speakers. To make the data more general, we removed words that occur
fewer than 10 times across all 15 videos, which reduces the number of unique words
to 817. When a word appears, its appearance is also marked in the previous
and next time frame.
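A sketch of this encoding step, with a hypothetical data layout (a list of per-frame word lists), is:

```python
from collections import Counter

def build_word_features(frame_words, min_count=10):
    """Sketch of the word-feature encoding above: words occurring fewer than
    min_count times are dropped, and a word's indicator is also set in the
    previous and next frame.  frame_words is an assumed list of per-frame
    word lists."""
    counts = Counter(w for words in frame_words for w in words)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    n = len(frame_words)
    features = [[0] * len(vocab) for _ in range(n)]
    for t, words in enumerate(frame_words):
        for w in words:
            if w in index:
                for tt in (t - 1, t, t + 1):     # mark adjacent frames too
                    if 0 <= tt < n:
                        features[tt][index[w]] = 1
    return vocab, features
```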
The data collection process extracted text from the transcript and also ran a part-of-speech
tagger [Bird et al., 2009] to determine the grammatical role of each word. POS
tags are encoded at the word level and are automatically aligned with the speech audio
using the analysis tools of FaceFX.
4.2.3 Prosodic features
For prosody, the data extraction produced the following audio features: normalized amplitude
quotient (NAQ), peak slope, fundamental frequency (f0), energy, energy slope, and
spectral stationarity [Scherer et al., 2013]. The sampling rate is 100 samples per second.
The extraction process also determines whether the speaker is speaking based on f0, and
for the periods of the speech identified as not speaking, all audio features are set to
zero.
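A minimal sketch of this silence handling, assuming a frame counts as non-speaking when its f0 value is zero (the exact voicing test is not specified above), is:

```python
import numpy as np

def zero_non_speaking(prosody, f0):
    """Sketch of the silence handling above: frames identified as
    non-speaking have all prosodic features set to zero.  prosody has shape
    (T, d), f0 has shape (T,); both are sampled at 100 Hz."""
    out = np.asarray(prosody, dtype=float).copy()
    out[np.asarray(f0) <= 0, :] = 0.0    # assumed voicing test based on f0
    return out
```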
4.2.4 Video segmentation
The extracted data is segmented into clips based on the speaking periods. A segmentation
boundary can be due to a long pause or the interviewer asking a question. Each frame in the
sample data is defined to be 1 second of the conversation. Some of the clips contained
only a very short sentence in which the interviewee replied to the interviewer's question
with a short answer such as "yes/no". We removed all sentences shorter than 3
seconds. The resulting dataset has a total of 637 clips with an average length of 47.54 seconds.
Gesture annotators annotated the gestural signs of the interviewees in the 15 videos
based on the gestural signs defined in chapter 3. They annotated the types and
durations of gestures, and behaviors were annotated only when the interviewee was speaking.
For example, if the interviewee is talking without performing any arm or head movement,
then he or she is annotated as Rest. If the interviewee exhibits certain nonverbal
behaviors without accompanying speech, then the gestures are not annotated.
The assessment experiments for gestural sign prediction are described in chapter 7.
Chapter 5
Low-Dimensional Embeddings with GPLVMs
The motion synthesis process takes gestural signs as input and generates joint rotations
according to the gestural signs. Upon receiving a sequence of gestural signs, the motion
synthesis process applied in our framework finds the motions that match the specified
gestural signs in the motion library and concatenates the corresponding sequences to generate
the resulting joint rotations. The process is shown in Figure 5.1.
One of the central challenges in gesture synthesis is to allow flexible co-articulation
between gestures. The task is difficult in that joint rotations live in a high-dimensional
space. Algorithms used by conventional motion transition approaches commonly blend
using weighted averages of motion frames drawn from the two motion segments. Viewing
each motion frame as a data point and a motion segment as a sequence of data points, the
motion transition between two motion segments can then be understood as interpolating
a trajectory that connects the two sequences. The problem with interpolation in
the original high-dimensional space is that it does not necessarily preserve the movement
dynamics of the joint rotations, nor does it take the constraints of human motion, such as
Figure 5.1: The synthesis process reuses motion data to generate gesture animations.
joint or movement limits, into consideration. As a result, the generated motion can have
abrupt movements or infeasible, unnatural joint rotations.
The issue can be resolved by learning, from human gesture data, a manifold that
realizes natural gesture motion. The idea of realizing human motion with a manifold
has been applied to other human motions such as walking, golf swings, and punching [Elgammal
and Lee, 2004, Lawrence, 2005, Levine et al., 2012]. After deriving a manifold
of gestures, we can determine motion transitions with an interpolation algorithm that
finds a smooth trajectory connecting the two specified sequences. As the manifold realizes
the relation among gestures and motion dynamics in terms of spatial relations, a
smooth trajectory in the manifold corresponds to natural gesture motion. Thus, instead
of performing the interpolation in the original unconstrained, high-dimensional motion
Figure 5.2: GPLVMs derive a low-dimensional space from the given motion samples.
After deriving this space with GPLVMs, we can sample a trajectory in the space and
map the trajectory back to the original data dimension, resulting in a gesture animation.
Colors of the space indicate the density of data points, where warmer colors correspond to
denser areas.
space, our framework derives a manifold with respect to gestures and motion dynamics
to facilitate the interpolation.
This chapter first describes how the motion generation process selects motions from
the manifold based on the assigned gestural signs and then describes how the
process performs motion transitions with the manifold.
5.1 Motion generation with manifolds
The motion generation process builds the manifold from the given motion data; in the
resulting manifold each data point corresponds to a motion frame in the motion data, and
a trajectory in the manifold corresponds to the original motion sequence. To perform
motion generation and motion transition with a manifold, the manifold needs to satisfy
three criteria:
1. The manifold needs to be able to realize the sequential relation among the motion
frames, since the process needs to find a trajectory in the manifold for motion
transition.
2. The manifold needs to allow mapping from the low-dimensional space to reconstruct
motions in the high-dimensional space. The generation process projects from the
low-dimensional manifold to the high-dimensional space, and it is crucial that this
projection does not lose the detailed information in the original motion data, so
that the generated gesture animations can express the same detailed movements
exhibited in the data.
3. The process needs to be able to derive the manifold from a small set of motion data,
since motion capture data usually has limited size.
Among all manifold learning techniques, Gaussian process latent variable models
(GPLVMs) [Lawrence, 2005] satisfy these criteria the best.
GPLVMs are effective at modeling human motion and the respective motion dynamics.
An extension of GPLVMs called the Gaussian process dynamical model (GPDM) [Wang et al.,
2008] has been proposed to include dynamics in determining the low-dimensional
projection. Both GPLVMs and GPDMs have shown success in synthesizing human motion
with the derived manifold. In our framework we apply GPDMs to learn the manifold.
The dynamics term added to the objective function allows the manifold to incorporate motion
dynamics. As the GPDM is a GPLVM with an additional objective function term that
maintains the dynamic relation, we follow the convention and refer to it as a GPLVM in the
rest of this document. An example of a gesture manifold derived with a GPLVM is shown in
Figure 5.2.
After deriving the manifold with GPLVMs, each point in the manifold corresponds to a
gesture frame, and sampling a trajectory and mapping it back to the original dimension
through the GPLVM results in a gesture animation. For transitions between two motion
segments, the framework finds a trajectory connecting the two segments in the GPLVM
manifold, and the resulting trajectory is mapped back to the original space to generate
the transition motion between gestures.
Although GPLVMs in general preserve the detailed movement of the motion data, they
can still smooth out some of the subtle movements if these motions rarely appear in the
data and at the same time their joint rotation values are distinct from other motion frames in
the data. This causes problems in generating head movement, as the head joint performs
little or no movement in most of the motion data. As a result, the generated animations
will perform little or no head movement when projecting trajectories containing head
movements to the motion space. To address this issue, the generation process treats
the head as a special joint and directly reuses its rotation values from the motion data, as
opposed to using the projection results, which is the approach applied for generating
rotation values for the other joints. On the other hand, in the case of motion transitions, the
process still uses the projection results for animating head movements.
5.2 Motion selection based on gestural signs
Upon receiving a sequence of gestural signs, the motion generation process finds motion
clips that match the assigned gestural signs, extracts the associated trajectories in the
manifold, and then concatenates the trajectories. When there is more than one motion
sequence that matches the specified gestural sign, the selection process finds the motion
segment whose length best matches the specified length of the gestural sign. If there is still
more than one motion sequence that satisfies both criteria, then the selection process
chooses the motion sequence that provides the smoothest transition from the motion
frame it is being concatenated to. After the concatenation, the generation process marks
the concatenation point and performs the motion transition around that point.
5.3 Motion transition with manifolds
We treat the problem of finding the transition trajectories in the manifold as an optimization
problem. As the desired transition trajectories need to be smooth and close to
the manifold, the optimization criterion is to maximize the fitness to the manifold and
spread the trajectory evenly in the space. To be more specific, the objective criterion to be
minimized is
$$
\mathrm{var}_{GP}(x) + \lambda \lVert x_{1 \ldots n-1} - x_{2 \ldots n} \rVert_2^2
$$
in which $x$ denotes the trajectory points we want to infer, $n$ is the length of the trajectory,
$\mathrm{var}_{GP}(x)$ represents the variance of $x$ under the Gaussian process prediction derived by the
GPLVM, and $\lambda$ is a coefficient that balances the two terms of the objective function. Points have higher
variance when they are farther away from the manifold, and therefore minimizing the
variance amounts to increasing the closeness to the manifold. The variance of the Gaussian
process prediction has the form
$$
\mathrm{var}_k(x) - k(x) K^{-1} k(x)^T
$$
where $\mathrm{var}_k(x)$ is the variance at $x$ in the GPLVM, $k(x)$ is the covariance of $x$ with the
observed samples in the GPLVM, and $K$ is the covariance of the observed samples of the
GPLVM. The gradient with respect to $x_i$ is then
$$
-2\, k'(x_i) K^{-1} k(x_i)^T + \lambda (2 x_i - x_{i-1} - x_{i+1}).
$$
For the transition between two motion segments, the process needs to decide the length of
the trajectory. Different pairs of motion segments require different transition lengths, and
therefore we apply an adaptive approach to determine the length of the trajectory for each
interpolation. The transition process starts by setting a short length $n$ for the trajectory,
and then applies the optimization process to infer trajectories. After a trajectory is
generated, the interpolation process checks whether the generated movements are within
a certain velocity threshold. If the value exceeds the threshold, the transition process
increases the length and repeats the optimization process until an admissible trajectory
is derived.
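The following sketch puts the objective, its gradient, and the adaptive-length loop together; `gp_var_grad` is a hypothetical callback for the gradient of the GPLVM predictive variance, and the step size, iteration count, and velocity threshold are illustrative:

```python
import numpy as np

def transition_trajectory(start, end, gp_var_grad, lam=1.0, n=5,
                          max_vel=0.5, lr=0.01, iters=200, max_n=30):
    """Sketch of the transition search described above.  start and end are
    the fixed endpoints of the two motion segments in the latent space;
    gp_var_grad(x) returns the gradient of the GPLVM variance at x."""
    while n <= max_n:
        # initialize the free points on a straight line between the endpoints
        traj = np.linspace(start, end, n + 2)
        for _ in range(iters):
            grads = np.zeros_like(traj)
            for i in range(1, n + 1):
                smooth = 2 * traj[i] - traj[i - 1] - traj[i + 1]
                grads[i] = gp_var_grad(traj[i]) + lam * smooth
            traj[1:-1] -= lr * grads[1:-1]   # endpoints stay fixed
        # adaptive length: accept only if the motion stays below the
        # velocity threshold, otherwise retry with a longer trajectory
        if np.max(np.linalg.norm(np.diff(traj, axis=0), axis=1)) <= max_vel:
            return traj
        n += 1
    return traj
```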
Chapter 6
The Generation-based Approach with Hierarchical
Factored Conditional Restricted Boltzmann Machines
While the framework developed in this thesis synthesizes gestures by selecting gesture
segments from the motion library, an alternative design is to learn to generate frame-by-frame
gestures explicitly. Following this perspective I have proposed a model called
hierarchical factored conditional restricted Boltzmann machines (HFCRBMs) to learn gesture
generators with a direct mapping formulation. It learns a model of gestures from motion
capture data of human conversation. The model treats speech features and gesture
motion as two sets of time series data and learns to generate motion frames based on
both time series. The framework is shown in Figure 6.1. The evaluation of this
framework focuses on using prosodic features of the speech, and this chapter addresses
the framework with this design. Further comparison with the approach developed in
this thesis is described in chapter 7. This chapter starts by explaining the idea behind
modeling the learning problem with this approach and then describes HFCRBMs in
section 6.2.
Figure 6.1: The architecture of the generation process. The gesture generator models
gestures conditioned on audio features. The model generates the next motion frame based
on the given audio features and past motion frames.
6.1 Decomposing the learning problem
The space of all possible arm positions is very large, while natural gestures move
through the space in very constrained ways. The framework reduces the complexity of
the learning problem by decomposing it. To learn a model that generates natural gestures,
we first learn a model that encapsulates the joint, spatial, and movement constraints of
gestures, and then learn the relation between prosody and gestures with the representation
derived by this model. The training data contains audio and motion capture data
of people having conversations, and the model learns the detailed dynamics of the gesture
motions. After training, the gesture generator can be applied to generate animations
with recorded speech for a virtual human. The generator is not designed to learn all
kinds of gestures, but rather motions related to prosody, like the rhythmic movements of beat
gestures. Gestures tied to linguistic information, like iconic gestures, pantomimes, deictics,
and emblematic gestures, are not considered. These gestures play an important role in
communication, but the approach excludes them due to practical concerns. The space of
this kind of linguistic feature is large, and therefore a gesture generator would require
a rich set of knowledge to map general utterances to gestures. The dataset applied for
learning the model is small compared to the entire space of linguistic information, and
therefore the knowledge it might reveal for mapping uttered content to gestures is
sparse. Thus, the approach focuses on prosody-based gesture generation and excludes the
modeling between linguistic information and gestures. The type of gesture addressed in
this work is similar to the idea of a motor gesture [Krauss et al., 2000]. Prosody and
motion correspond to emphasis, and both of them can exhibit the emotional state of the
speaker.
Although the approach does not attempt to learn gesture generation from semantic
content, it does not exclude semantics-based gestures from the training data. This is because
these gesture motions still preserve the consistent relation between prosody and
movement. For the purpose of modeling this relation between the two time series, this
type of data provides valid samples for the learning process. Motion capture data
is expensive to create, and the recorded data is limited compared to the possible space
of human gestures. Thus, it is important to preserve more training data for the learning
model. In addition, this design makes the framework more flexible in that no manual
process is required for extracting the training data.
Consider the motion generation mechanism of human gestures. The motion is driven
by a set of motor signals that in combination produce a sequence of movements. In
contrast, the raw data we have on human gesturing is largely in the form of motion
capture data, represented as real-valued vectors that in essence obscure the factors and
constraints that were involved in generating the data. Whereas it is conceivable to model
directly the relation between speech and this data on joint angles, we argue it is more
reasonable to model the relation between speech and the underlying causes of motion.
This argues for a decomposition: first learn a model of the hidden factors that underlie
the causes of motion, and then in turn learn a model of the relation between speech and
those factors.
Recent work in image recognition argues for the benefit of this design [Hinton, 2010b]:
first learn a model that can generate images by inferring the hidden factors that compose
images; after the hidden factors of image generation have been derived, it becomes easier
to associate them with image labels. One generative model consistent with this idea is
the deep belief net (DBN) [Hinton et al., 2006]. A DBN uses restricted Boltzmann
machines (RBMs) to learn the hidden factors for generating the observed data.
To apply the same concept to gesture modeling, the framework needs to find a model
that can effectively infer hidden factors for generating human motion. Human gestures are
constrained by three types of factors: joint constraints, movement constraints, and gesture
space constraints. (1) Joint constraint: each joint has physical limitations, so its possible
values are bounded within a range smaller than the original data space. (2) Gesture
space constraint: people gesture within a subspace of possible movements, and therefore
the reasonable space for generating gestures can be further narrowed. (3) Movement
constraint: a gesture at each time step depends on previous movement, and since our
limbs move at reasonable speeds, the possible space of the current gesture is narrowed given
previous gestures.
It is in this movement constraint that we see a difference with the use of RBMs in image
recognition. Whereas the RBM is capable of identifying a better basis for the individual
motion frames and the corresponding joint and gesture space constraints implicit in those
frames, it disregards the temporal relation. As a consequence, the derived basis cannot
embed the movement constraint.
The conditional restricted Boltzmann machine (CRBM) [Taylor et al., 2007] is a
model that can find a basis which follows the three constraints. It has the same two-layer
structure as RBMs, but has additional visible layers which take past data as input.
This architecture allows the model to learn to generate gestures conditioned on past
motion frames, and therefore the generation of the next motion frame follows the movement
constraint. Thus, our framework applies CRBMs to learn a new basis for motion frames.
After deriving a better representation for gestures, the next challenge is to model the
relation between speech and gestures. Gesture motion at each time step relates to not
only speech but also past motion. For this reason, a reasonable model uses both past
motion and speech to generate gestures. Our framework chooses factored conditional
restricted Boltzmann machines (FCRBMs) to learn the generation of gestures based on
past motion and speech information. The combination of CRBMs and FCRBMs
formulates a model called the hierarchical FCRBM (HFCRBM).
Figure 6.2: (a) A HFCRBM with order-2 CRBMs and an order-2 FCRBM. The model
uses CRBMs to convert the input data of the FCRBM. The CRBMs shown here are the same
CRBM applied to different input layers of the FCRBM and different input sets. The
figure shows components at different time steps with different gray scales. (b) A HFCRBM
with a simplified representation.
6.2 Hierarchical factored conditional restricted Boltzmann machines
HFCRBMs use CRBMs to model motion frames and apply FCRBMs to model the output
of the CRBMs' hidden layer conditioned on prosody. We would like to clarify that there is
only one CRBM at the bottom level of the HFCRBM. The multiple CRBMs shown in
Figure 6.2a are the same model applied to the input sequence at different time intervals.
This representation indicates the relation between the input sequence, the CRBMs, and the input
of the FCRBM: the CRBM uses the input data at $t-4, t-3, t-2$ to generate the input for $v_{t-2}$
of the FCRBM, the data at $t-3, t-2, t-1$ corresponds to $v_{t-1}$ of the FCRBM, and
so on. A simplified representation of the HFCRBM is illustrated in Figure 6.2b, where
all CRBMs are the same model. In other words, the HFCRBM uses the time series data to
train one CRBM, and uses it to convert the data into hidden factors as the input of the top
FCRBM.
CRBMs find hidden factors for generating gestures, and these factors formulate a new
basis which reduces the complexity for FCRBMs of modeling audio features and motion.
Each hidden factor corresponds to a component that composes gestures, and therefore
the generation process with these hidden factors has a higher probability of generating
natural gestures. This benefit is especially important when generating gestures for
novel utterances.
6.3 Gesture Generator
This section explains the learning process and the generation process of our gesture
generator, and then describes a smoothing process for refining the generated animations.
A discussion of its advantages over previous work is given at the end of the section.
6.3.1 Learning the Gesture Generator
The first step of learning the HFCRBM for gestures is to perform unsupervised learning
on the motion data to identify hidden factors for generating human motion. After this motion
modeling step by the CRBM, the HFCRBM can map the motion data onto a new basis with
the CRBM and train the top-layer FCRBM conditioned on audio features. The training process
of our gesture generator is shown in Figure 6.3. Both the CRBM and the FCRBM in
the HFCRBM model gesture generation, in which the CRBM learns with only motion
frames, and the FCRBM uses prosody information to further model gestures that are not
predicted well by the CRBM.
Besides prosodic features, the framework also defines a correlation parameter as one
piece of the contextual information. The correlation parameter at each time step indicates how
Figure 6.3: The training process of HFCRBMs for gesture generators. (a) The HFCRBM
trains the bottom CRBM with motion frames. (b) After the training of the CRBM is completed,
the motion data goes bottom-up through the CRBM, and the FCRBM models the output
of the CRBM's hidden layers conditioned on audio features. The parameters of the CRBM are
not updated at this stage.
strongly the motion should be correlated with prosody. It provides a handle for users to
tune the motion of the virtual human: the higher the correlation parameter, the more the
velocity of the motion will be correlated with the prosodic features.
We have also added a sparse coding criterion to the unsupervised learning (training
the CRBM) step, because in our initial investigations it further improved the accuracy of
gesture generation. Different from [Lee et al., 2008], where only the bias term
of the hidden layer is updated to encourage sparsity, we also update the connection weights of
the CRBM. The objective function regarding the sparsity term is expressed as the cross
entropy between the desired and actual distributions of the hidden layer, as described in
[Hinton, 2010a].
6.3.2 Gesture Generation
Having learned a HFCRBM containing the model of gestures, the generation framework
works by taking previous motion frames and audio features as input to generate the next
motion frame. A generation loop is shown in Figure 6.4. The generation of a sequence of
motion is done in a recurrent way, in that the generated motion frame becomes part of the
input of the next generation step. The process can generate a motion sequence with the same
length as the audio. Since the generation process depends on previous motion frames,
the initial step requires a preliminary short sequence of motion. A designer can sample
a motion sequence that he wants the avatar to perform from the motion capture data as
the preliminary motion sequence.
6.3.3 Smoothing
The motion sequence generated by HFCRBMs can contain some noise, and the differences
between frames may be greater than in natural gestures. Although each motion frame
is still a natural gesture and this kind of noise is rare in the output, users are highly
sensitive to discontinuities in the animation, and a short unnatural motion can ruin the
entire animation. Therefore, after gesture motions are generated, an additional smoothing
process is performed on the output.
The smoothing process computes the wrist position of each generated frame and calculates
the acceleration of the wrist movement. If the wrist acceleration of a motion frame
exceeds some threshold, we reduce the acceleration by modifying the joint rotation velocity
of that motion frame to be closer to the velocity of the previous joint rotations. The new
motion frame at time $t$ is computed by:
r = 0.2
x'_t = x_t
while (wrist acceleration of x'_t) > threshold and r < 1 do
    x'_t = (1 - r)(x_t - 2x'_{t-1} + x'_{t-2}) + 2x'_{t-1} - x'_{t-2}
    r += 0.1
end while
where $x'$ is the smoothed motion frame and $x$ is the original output motion frame. The threshold
is chosen as the maximum wrist acceleration value of the human gesturing motion
observed in the motion capture data. The equation inside the while-loop adjusts
the velocity of the current frame toward an interpolation between its original velocity and the
velocity of the previous frame, and $x'_t$ is the resulting new motion frame corresponding to
the smoothed velocity. The smoothness criterion, wrist acceleration, is computed based
on the translation of the wrist joints, while the values $x$ within the update equation are
joint rotations. A motion frame at time $t$ that does not exceed the acceleration
threshold is smoothed as:
$$
x'_t = 0.8\, x_t + (x'_{t-1} + x_{t+1})/15 + (x'_{t-2} + x_{t+2})/30
$$
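Putting the two rules together, a Python sketch of the smoothing pass might look like the following; `wrist_accel` is a hypothetical helper that computes wrist acceleration from three consecutive frames via forward kinematics of the wrist joints:

```python
import numpy as np

def smooth_frames(x, wrist_accel, threshold):
    """Sketch of the smoothing pass above.  x is the generated joint-rotation
    sequence with shape (T, d)."""
    x = np.asarray(x, dtype=float)
    xs = x.copy()                       # xs holds the smoothed frames x'
    for t in range(2, len(x) - 2):
        if wrist_accel(xs[t], xs[t - 1], xs[t - 2]) > threshold:
            r = 0.2
            while wrist_accel(xs[t], xs[t - 1], xs[t - 2]) > threshold and r < 1:
                # damp the acceleration: keep (1 - r) of it on top of the
                # previous frame's velocity
                xs[t] = (1 - r) * (x[t] - 2 * xs[t - 1] + xs[t - 2]) \
                        + 2 * xs[t - 1] - xs[t - 2]
                r += 0.1
        else:
            # frames under the threshold get the light weighted-average filter
            xs[t] = 0.8 * x[t] + (xs[t - 1] + x[t + 1]) / 15.0 \
                    + (xs[t - 2] + x[t + 2]) / 30.0
    return xs
```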
6.3.4 Comparison with Previous Approaches
The major improvement of our method over previous approaches is that it can encode
significantly more information. The information capacity directly determines the expressiveness
of the gesture generator. Our framework shares one common design with previous
approaches [Levine et al., 2009, Levine et al., 2010]: it projects motion data onto a basis
and uses the new representation to learn a relation with speech. An essential difference is that
our framework uses a basis whose space is significantly larger than that of other approaches.
For example, the HMM-based approach clusters motions onto a basis with $n$ states where
$n < 100$ in general. On the other hand, the basis inferred by our model has a space
with $2^n$ possible states, where $n$ is the number of the CRBM's hidden nodes. In our
experiment, $n$ is set to 300.
Modeling gesture motion with fewer than 100 states limits the ability of gesture generators
to encode gestures. The derived states will in general be quite different from each
other, and the gap between states may result in "motion jumps" when concatenating
them to formulate animations. Thus, an additional synthesis algorithm is required to
make smooth transitions. Moreover, with a small set of gesture states, the derived model
can be viewed as an m-to-n mapping function which maps given audio features onto certain
motion patterns provided by the training data. On the other hand, the significantly
richer capacity of our model allows the gesture generator to learn a generalized relation
between prosody and gestures, and therefore it can generate more related motion for new
speech audio. While previous works limit the generation process to using a small set of
motion sequences to synthesize animations, our design allows the generator to generate
new motions for new speech audio based on the model derived from the training samples.
6.4 Human subject experiment
We evaluated the quality of the generated gestures by having subjects compare the gestures
the model generates for utterances with the original motion capture for those utterances, as
well as with motion capture from different utterances. We used a dataset created for examining
how audio and body motion affect the perception of virtual conversations [Ennis
et al., 2010]. The dataset contains audio and motion of groups of people having conversations.
There are two groups in the dataset, a male group and a female group. Each group
has three people. There are two types of conversations, debate and dominant speaker,
and each type has five topics. We used the debate conversation data of the male group
for our experiments. The motion capture data contains the skeletons of the subjects, and the
recorded joint movement is a vector with 69 degrees of freedom. Since this work focuses
mainly on arm gestures, we removed leg, spine, and head movement from the data. Joint
rotation axes containing all zeros throughout the data set (i.e., the joint never moves along
that axis) are also removed. After removing these elements, the resulting joint rotation
vector has 21 degrees of freedom.
We extracted pitch and intensity values from the audio using Praat [Boersma, 2001].
The values of pitch and intensity range from zero to hundreds. To normalize the pitch
and intensity values to help model learning, the pitch values are adjusted by taking
$\log(x + 1) - 4$ and setting negative values to zero, and the intensity values are adjusted
by taking $\log(x) - 3$. The new pitch values range from 0 to 2.4, and the new intensity values
range from 0 to 1.4. The log normalization also corresponds to humans' logarithmic perception
of these properties.
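A sketch of this normalization, assuming the offsets are subtracted as reconstructed above, is shown below; the small floor on intensity is an added safeguard against log of zero and is not part of the original description:

```python
import numpy as np

def normalize_prosody(pitch, intensity):
    """Sketch of the log normalization above.  Negative pitch values are
    clipped to zero as described."""
    p = np.log(np.asarray(pitch, dtype=float) + 1.0) - 4.0
    p[p < 0] = 0.0                                            # clip negatives
    i = np.log(np.maximum(np.asarray(intensity, dtype=float), 1e-8)) - 3.0
    return p, i
```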
The original motion capture data contains non-gesture frames, and therefore an extraction
process was devised to identify valid training data. We analyzed the motion
data to identify gesture motions and non-gesture motions, and determined what y-coordinate
value of the wrists best separates these two sets. Motion frames having at least
one wrist's y-coordinate higher than this value are defined as gestures. This criterion
is then applied to extract gesture motions automatically. Among the valid motion frames,
only animation segments longer than 2 seconds are kept. After the valid
motion frames are identified, the corresponding audio features are extracted. The time
constraint on data selection excludes gestures with a short duration. The rationale
for this constraint is that in our data analysis most gestures performed in less
than 2 seconds are either iconic gestures or arm movements unrelated to the utterance,
and neither case is a gesture we want to learn. In the motion capture data most
gestures last for a longer period of time.
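A sketch of this extraction rule is shown below; the frame rate and the data layout (per-frame left and right wrist heights) are assumptions, and `y_threshold` stands for the separating value found by the analysis:

```python
def extract_gesture_segments(wrist_y, y_threshold, fps=30, min_len_sec=2.0):
    """Sketch of the extraction rule above.  wrist_y is an assumed list of
    (left_wrist_y, right_wrist_y) values per motion frame."""
    min_len = int(min_len_sec * fps)
    segments, start = [], None
    for t, (left_y, right_y) in enumerate(wrist_y):
        gesturing = left_y > y_threshold or right_y > y_threshold
        if gesturing and start is None:
            start = t                         # a gesture segment begins
        elif not gesturing and start is not None:
            if t - start >= min_len:          # keep only segments longer than 2 s
                segments.append((start, t))
            start = None
    if start is not None and len(wrist_y) - start >= min_len:
        segments.append((start, len(wrist_y)))
    return segments
```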
In the training of the modified HFCRBM, both the hidden layers of the CRBM and the FCRBM
have 300 nodes. The correlation parameter for each time frame is computed as the
correlation of the prosody sequence and the motion sequence within a window of ±1/6 seconds.
The prosodic features for the gesture generator at each time frame also cover a window of
±1/6 seconds. We trained the model with the audio and motion capture data of one actor,
and used the other actor's speech as a test case to generate a set of gesture animations.
We applied the criterion described above to extract training and testing data; there
are in total 1140 frames (38 seconds) of training data and 1591 frames (53 seconds) of
testing data. Since the testing data does not have correlation values, we sample correlation
parameters from the training data. We extracted the testing data in the same way as the training
data.
6.4.1 Evaluation
We used our gesture generator to generate gesture animations with the testing data, and
to evaluate the quality of the generated animations, we compared them with two other gesture
animations:
- The original motion capture data of the test data.
- The motion capture data of the test-case actor with respect to a different utterance.
For the second case, we used the same extraction techniques described in subsection 6.3.2
to derive the motion capture and audio data. We hypothesize that the gesture animations
generated by our model will be significantly better than the motion capture data from different
utterances, and that the rating scores for generated animations and actual human gestures
will be similar.
We displayed the three animations side by side, segmented them to the same length,
and rendered them into videos accompanied by the original audio. An example frame of
our video is shown in Figure 6.5. In this example video, the left animation is the motion capture
data with respect to a different utterance, the middle one is the original gesture, and the
right one is the generated gesture. There are a total of 14 clips with lengths of 2 to 7 seconds.
The relative horizontal position of the Original, Generated, and Unmatched cases differs
and is balanced between clips (e.g. Original is on the left 5 times, in the middle 5 times, and on the right
4 times). The presentation of the clips to participants was randomized. Since the motion
capture data with respect to different utterances is real human motion, participants cannot
tell the difference simply based on whether the motion is natural; they have to match
the motion with the speech to do the evaluation. We recruited 20 participants with ages
ranging from around 25 to 55. All participants are familiar with computer animation,
and some of them are animators for virtual humans or experts on human gestures. We
asked participants to rank which gesture animation in the video best matches the speech.
We performed a balanced one-way ANOVA on the ranking results, and the analysis
suggests that at least one sample is significantly different from the other two. We
then applied Student's t-test to test our specific hypotheses. The evaluation results are
shown in Figure 6.6. Comparing the number of times a category was ranked first, the
difference between the original gesture motion and the generated gesture motion is not
significant, and both of them are significantly better than the unmatched gesture motion.
We applied another analysis by assigning 2 points to ranked-first cases and 1 point to
ranked-second cases, and calculated the overall score of each motion. Hypothesis testing
shows that both the generated motion and the original motion are significantly better than
the unmatched gestures. The generated motion receives a better rating than the original
motion, but the difference is not significant. The results suggest that the movement of the
generated gesture animations is natural, and the dynamics of the motions are consistent with
the utterances.
We additionally analyzed the rating score of each generated animation, and the score
of each animation differs little from the overall average score. The least-rated animations
are those with dramatic movement. Human subjects are quite sensitive to the smoothness
of the motion. It is important for gestures to match speech, but the motion will not be
considered natural if it is not smooth enough. This is one of the main reasons that the
CRBM-based HFCRBM is better than the RCRBM-based one for gesture generation. The
subjective evaluation result shows that there is still room for improvement.
Figure 6.4: An example of the generation flow, where gray color indicates that components
are inactive. (a) The model takes motion frames from $t_1$ to $t_4$ as input for the CRBMs. (b)
The FCRBM takes the bottom-up output of the CRBMs and the audio features from $t_1$ to $t_5$ to
generate the next data at $t_5$. (c) The CRBM maps the data generated by the FCRBM to a
motion frame. (d) To generate the motion frame at $t_6$, the generation process includes
the generated frame as part of the new input to perform the next generation loop.
Figure 6.5: The video we used for evaluation. We use a simple skeleton to demonstrate
the gestures instead of mapping the animation to a virtual human, to prevent other factors
from distracting participants.
Figure 6.6: Average ratings for the gesture animations. Dashed lines indicate two sets are
not significantly different; solid lines indicate two sets are significantly different. For the
significant cases, the p-values are all less than $10^{-7}$. (a) Total score: the average scores of the original,
unmatched, and generated animations are 17.4, 7, and 17.6 respectively. (b) Rated first: the average
numbers of rated-first among the original, unmatched, and generated animations are 5.7, 2.15,
and 6.15.
Chapter 7
Evaluations
This chapter first describes the experiments comparing the proposed approach with
state-of-the-art approaches. The chapter also describes an ancillary experiment which
evaluates whether manually annotating the gestural signs is crucial, as opposed to deriving
them with a clustering approach.
7.1 Assessment Experiment for DCNFs
We evaluated the performance of DCNFs on gestural sign prediction with the collected
coverbal gesture dataset described in chapter 4.
7.1.1 Baseline models
The assessment experiment compared DCNFs with several other approaches. We include
CRFs, which are applied in the state-of-the-art work on gesture prediction [Levine
et al., 2010], for comparison. We also compared with second-order CRFs. Additionally,
we include support vector machines (SVMs) [Cortes and Vapnik, 1995] and random
forests [Breiman, 2001], two effective machine learning models. The SVM is an approach
that applies kernel techniques to help find better separating hyperplanes in the data for
classification. The random forest is an ensemble approach which learns a set of decision
trees with bootstrap aggregating for classification. Both approaches show good generalization
in practice. The two existing works that combine CRFs and neural networks,
CNF [Peng et al., 2009] and NeuroCRF [Do and Artieres, 2010], are also evaluated in the
experiment. Finally, the experiment evaluated the performance of DCNFs without using
the sequential relation learned from CRFs (denoted as DCNF-no-edge).
7.1.2 Methodology
The experiment uses the holdout testing method to evaluate the performance of gesture
prediction, in which the data is separated into training, validation, and testing sets. The
selection of the hyper-parameters is based on the training and validation sets, and the final
result is the performance on the testing set. There are a total of 15 videos in the coverbal
gesture dataset and each video corresponds to a different interviewee. We chose the first 8
interviewees (total clip length corresponding to 50.86% of the whole dataset) as the training
set, interviewees 9 through 12 (23.18% of the whole dataset) as the validation set, and the
last 3 interviewees (25.96% of the whole dataset) as the testing set.
7.1.3 Results and discussion
The results are shown in Table 7.1. Both DCNF and DCNF-no-edge outperform the
other models significantly. The similar performance of DCNFs with and without edge
features suggests that the major improvement comes from the exploitation of the deep
architecture. In fact, models that rely mainly on the sequential relation show significantly lower
Models Accuracy(%)
CRF 27.35
CRF second-order 28.15
SVM 49.17
Random forest 32.21
CNF [Peng et al., 2009] 48.33
NeuroCRF [Do and Artieres, 2010] 48.68
DCNF-no-edge 59.31
DCNF (our approach) 59.74
Table 7.1: Results of coverbal gesture prediction.
performance, suggesting that the bottleneck in coverbal gesture prediction lies in realizing
the complex relation between speech and gesture.¹
7.1.4 Handwriting recognition
This dataset [Taskar et al., 2004] contains a set of handwritten words (6877 in total) collected
from 150 human subjects, with an average length of around 8 characters. The prediction
targets are lower-case characters, and since the first character of each word is capitalized, all the first
characters in the sequences are removed. Each word was segmented into characters, and
each character is rasterized into a 16-by-8 image. We applied 10-fold cross-validation
(9 folds for training and 1 fold for testing) to evaluate the performance of DCNFs and
compare the results with other models.
¹ One reason the gestural sign prediction task is difficult is that sometimes the gestures of the
speaker may not be coupled with the utterance content; these cases can make the learning task more
challenging when they appear in the training data and decrease the assessment performance when they appear
in the testing data.
Models Accuracy(%)
CRF 85.8
CRF second-order 93.32
SVM 86.15
Random forest 96.97
CNF 91.11
NeuroCRF [Do and Artieres, 2010] 95.44
DCNF-no-edge 97.21
Structured prediction cascades [Weiss et al., 2012] 98.54
DCNF (our approach) 99.15
Table 7.2: Results of handwriting recognition. Both the results of NeuroCRF and Structured
prediction cascades are adopted from the original reports.
7.1.5 Baseline models
In addition to the models compared in the gesture prediction task, this experiment also compared with the state-of-the-art result previously published using structured prediction cascades (SPC) [Weiss et al., 2012]. The SPC is inspired by the idea of classifier cascades (for example, boosting) to increase the speed of structured prediction. The process starts by filtering possible states at order 0 and then gradually increases the order while considering only the remaining states. While the complexity of a conventional graphical model grows exponentially with the order, SPC's pruning approach reduces the complexity significantly and therefore allows applying higher-order models. This approach achieved the previous state-of-the-art result on the handwriting recognition task. The comparison of DCNFs with SPC, along with other existing models, is shown in Table 7.2.
Figure 7.1: Assessment experiments for motion transitions. (a) Co-articulation rate: the GPLVM-based approach allows 91.46% of co-articulations among all pairs of gestures, while the conventional direct co-articulation approach only allows 2.69%. (b) Transition rate: the transition success rate of our GPLVM-based approach under different transition time limits (0 to 1 second); the conventional approach failed to generate gesture animations for this task.
7.1.6 Results and discussion
In this handwriting recognition task our approach shows an improvement over published results. Compared to the gesture prediction task, the mapping from input to prediction targets is easier to realize in this task, and therefore the sequential information provides a substantial improvement.
7.2 Assessment Experiment for GPLVM-based Motion Synthesis
A motion synthesis approach has to generate gestures based on the specified gestural signs, and therefore its effectiveness lies in whether the resulting animations match the gestures specified by the prediction process. The experiment evaluated the performance of the GPLVM-based approach and compared it with a motion synthesis process that does not attempt to generate transition motions for co-articulations [Levine et al., 2010]. There are two assessment experiments: the first assesses the success rate of both approaches in co-articulating among all pairs of gesture motions, and the second assesses the co-articulation rate of the GPLVM-based approach on gesture generation tasks and the effects of the co-articulation time constraint. The motion capture data applied in both experiments is from a study by [Ennis and O'Sullivan, 2012]. The selected data is a set of motion capture recordings of two females talking to each other face-to-face. In each recording session the speaker is assigned a topic and talks for around a minute while the other person listens. The extracted data contains 8 sessions, which in total contain 8 minutes of audio and gesture motion.
7.2.1 Transition probability between all pairs of gestures
There are 199 gestures in the motion library, and therefore 39,601 possible co-articulation pairs. We applied both the GPLVM-based approach and the conventional approach, which relies on direct co-articulation with a smoothness constraint. The results are shown in Figure 7.1a. The GPLVM-based approach allows 91.46% of co-articulations while the conventional approach allows only 2.69%, which is close to the ratio (2%) reported in [Levine et al., 2010].
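A sketch of how such a pairwise co-articulation rate can be tallied; the `can_transition` test and the gesture library below are placeholders standing in for the GPLVM-based (or direct) transition check and the real 199-gesture motion library.

```python
# Sketch of tallying the pairwise co-articulation rate over a gesture library.
import itertools

def can_transition(src, dst, max_gap=0.5):
    """Placeholder test: accept a transition when the end/start poses are close enough."""
    return abs(src["end_pose"] - dst["start_pose"]) < max_gap

# Placeholder gesture library; the real one holds 199 motion-captured gestures.
library = [{"end_pose": i * 0.01, "start_pose": (i * 7 % 199) * 0.01} for i in range(199)]

pairs = list(itertools.product(library, repeat=2))          # 199 * 199 = 39,601 pairs
rate = sum(can_transition(a, b) for a, b in pairs) / len(pairs)
print(f"co-articulation rate: {100 * rate:.2f}%")
```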
7.2.2 Transition probability for gesture generation tasks
The second assessment experiment aims to evaluate the co-articulation rate of the GPLVM-based approach on gesture generation tasks. The evaluation first gives both approaches a set of motion capture data of human gestures as the basis motion segments, requests both processes to generate gestures based on the specified sequences of gesture types, and then measures the percentage of the generated gestures that match the request.
Figure 7.2: Evaluation results for the overall framework with 95% confidence intervals. (a) A sample frame of the evaluation video. (b) Percentage of votes for which approach looks more natural (our approach vs. Levine et al. 2010). (c) Percentage of votes for which approach better matches the speech (our approach vs. Levine et al. 2010). Both results show that the proposed approach is significantly better than the state-of-the-art.
We took 6 sessions as the training set to build the motion libraries of both approaches, annotated the gestures of the other 2 sessions, and used the derived gesture annotations to evaluate the performance of both approaches. We trained the GPLVM to project the data onto a 10-dimensional space. The relation between the co-articulation time constraint and the success rate is shown in Figure 7.1b. The transition process of the GPLVM-based synthesis achieves a 100% transition rate when the transition time limit is no less than 0.67 seconds.
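The curve in Figure 7.1b can be tallied in the same spirit; in the sketch below the per-transition times are random placeholders, whereas the real values come from the GPLVM-based trajectory search.

```python
# Sketch of computing the transition success rate under different time limits.
import numpy as np

rng = np.random.default_rng(0)
required_times = rng.uniform(0.1, 0.7, size=500)   # seconds needed for each requested transition

for limit in np.arange(0.0, 1.01, 0.2):
    success_rate = 100.0 * np.mean(required_times <= limit)
    print(f"time limit {limit:.2f}s -> success rate {success_rate:.1f}%")
```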
7.3 Subjective Evaluation Study for the Overall Framework
The overall evaluation study showed human subjects gestures generated by the proposed approach and gestures generated by the state-of-the-art approach [Levine et al., 2010]. The study used audio clips from the data collected in [Ennis and O'Sullivan, 2012]. The selected data are 7 audio clips of female performer J, and the length of each audio clip is around 55 to 75 seconds. The compared approach [Levine et al., 2010] uses CRFs to predict kinematic parameters from prosody, which can be understood as a prediction process under the definition of our decomposition framework. Thus, to apply their approach for comparison, the experiment trained CRFs with the same DCAPS training data described in the DCNF section and used the motion synthesis approach without GPLVM-based motion transitions. The animations for both approaches are generated with SmartBody [Thiebaux et al., 2008].

The gestures addressed in this work are those that are contextually dependent on speech but may not by themselves convey definite meanings through their physical forms. Thus, the quality of the animations is best assessed by evaluating their consistency with the accompanying speech and the naturalness of the motion.² For this reason, our study shows animations generated by both approaches in the same video clips and asks participants to choose which gesture looks more natural and which gesture better matches the speech. A sample frame of the video is shown in Figure 7.2a. The study recruited 100 participants on Mechanical Turk, requiring workers to be located outside China, India, and Japan (to filter out a majority of workers who are not native English speakers) and to have completed more than 5,000 tasks with at least 97% of those tasks approved. The evaluation results, shown in Figure 7.2, indicate that the proposed approach is perceived to generate gestures that are significantly (p < 0.05) more natural and that better match the speech than the state-of-the-art approach.

²In other words, the type of gestures addressed in this work is different from the focus of [Bergmann and Kopp, 2009], which aims at generating gestures that convey more apparent semantic content.
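The 95% confidence intervals in Figure 7.2 are intervals on vote proportions; a minimal sketch of such an interval using the normal approximation is given below, with made-up counts rather than the study's actual tallies.

```python
# Sketch of a 95% confidence interval for a vote proportion (counts below are made up).
import math

def proportion_ci(votes_for, total, z=1.96):
    p = votes_for / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

low, high = proportion_ci(votes_for=60, total=100)     # e.g. 60 of 100 pairwise votes
print(f"95% CI: [{100*low:.1f}%, {100*high:.1f}%]")
```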
Figure 7.3: Comparing animations generated by our method and the previous work. (a) Each video displays a pair of gesture animations generated for the same speech audio by different approaches. (b) The percentage of animations voted as best matching the speech; the difference is statistically significant.
7.4 Subjective Evaluation Study Comparing the Synthesis-
Based Framework with the Generation-Based Framework
The above three evaluation experiments compared the proposed approach with state-of-the-art approaches. This section takes a step back and asks whether there is an alternative design for decomposing the problem. This experiment compared the quality of the animations generated by the synthesis-based approach (the framework that uses GPLVM-based motion synthesis) with the generation-based approach (which uses HFCRBMs). The dataset applied in this experiment is from another study created to examine how audio and body motion affect the perception of virtual conversations [Ennis et al., 2010]. The dataset contains speech audio and motion capture of three people having conversations. The experiment used a subset of the data in which each recording contains a person giving a long speech without being interrupted. The training data is the data of male number 1 and the testing data is the data of male number 3. There are 193 seconds of training data and 238 seconds of testing data in total. The motion capture data contains the skeleton of the subjects, and the recorded joint movement is a vector with 69 degrees of freedom. The GPLVM-based approach is trained to project the data onto a 9-dimensional hidden space. The original motion capture data has a frame rate of 120 fps, and in this experiment the data is down-sampled to 15 fps.
For the speech input, as HFCRBMs mainly realize the motion dynamics and motion dynamics correlate mainly with prosody, this experiment focuses on extracting the following prosodic features: normalized amplitude quotient (NAQ), peak slope, fundamental frequency (f0), energy, energy slope, and spectral stationarity [Scherer et al., 2013]. The process also applied an automatic approach to determine the tenseness of the voice at each time frame, which gives the probability of being at low, medium, or high tenseness [Scherer et al., 2013]. The extraction process also determines whether the speaker is speaking based on f0, and for the periods in the speech identified as not speaking, all audio features are set to zero. The resulting audio features have 9 dimensions.
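A sketch of how the 9-dimensional per-frame audio feature vector can be assembled, with all features zeroed on frames identified as silent; the feature arrays below are random placeholders for the actual extracted prosodic features.

```python
# Sketch of assembling per-frame audio features and zeroing silent frames (placeholder data).
import numpy as np

n_frames = 150                                         # e.g. 10 seconds at 15 fps
rng = np.random.default_rng(0)
naq, peak_slope, f0 = rng.random(n_frames), rng.random(n_frames), rng.random(n_frames)
energy, energy_slope, stationarity = rng.random(n_frames), rng.random(n_frames), rng.random(n_frames)
tenseness = rng.dirichlet(np.ones(3), size=n_frames)   # P(low / medium / high tenseness)

features = np.column_stack([naq, peak_slope, f0, energy, energy_slope, stationarity, tenseness])
assert features.shape == (n_frames, 9)

speaking = f0 > 0.2                                    # placeholder voicing decision from f0
features[~speaking] = 0.0                              # silent frames carry no audio information
```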
Since this experiment uses only prosodic features as input signals, we followed the state-of-the-art work [Levine et al., 2010], which addressed using only prosody as input, and used CRFs as the prediction process of the synthesis-based approach in place of the original DCNFs.
The experiment applied an automatic process to segment and classify motions into gesture/non-gesture classes based on the height of both wrists. If either wrist is higher than a certain threshold, then that frame of motion is labeled as "gesture". The threshold is defined by observing the motion data and manually determining a value that best separates gesture/non-gesture behaviors. This automatic classification scheme results in 52 non-gesture and 47 gesture motion segments from the training data, where the length of each segment ranges from 1/15 of a second to 21.8 seconds.
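A sketch of this wrist-height rule; the wrist trajectories and the threshold value below are placeholders, since the real threshold was picked by inspecting the motion capture data.

```python
# Sketch of gesture/non-gesture segmentation from wrist heights (placeholder data and threshold).
import numpy as np

rng = np.random.default_rng(0)
left_wrist_y = rng.uniform(0.7, 1.5, size=900)      # e.g. one minute of motion at 15 fps
right_wrist_y = rng.uniform(0.7, 1.5, size=900)
THRESHOLD = 1.1                                     # hand-tuned height separating rest from gesture

is_gesture = (left_wrist_y > THRESHOLD) | (right_wrist_y > THRESHOLD)

# Group consecutive frames with the same label into segments.
segments, start = [], 0
for t in range(1, len(is_gesture) + 1):
    if t == len(is_gesture) or is_gesture[t] != is_gesture[start]:
        segments.append(("gesture" if is_gesture[start] else "non-gesture", start, t))
        start = t
print(len(segments), "segments")
```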
The experiment trained HFCRBMs with the training data mentioned above, following the same configuration reported in that earlier experiment. The test data are 14 speech audio clips with lengths ranging from 11 to 25 seconds, with an average of 18 seconds. We animated the generated motion on a virtual character with SmartBody [Thiebaux et al., 2008]. Since the motion capture data does not include finger or lip information, to make the virtual character more natural we applied SmartBody's viseme animation mechanism to automatically generate lip movement synchronized with the speech audio and used an idle motion to animate the fingers. For the evaluation, we recruited 48 participants on Mechanical Turk and asked them to make pairwise comparisons to vote for the animation that best matches the speech audio. Each evaluation video displays the animations generated by both algorithms side-by-side, as shown in Figure 7.3a, for a total of 14 videos. The evaluation result, illustrated in Figure 7.3b, shows that our framework is preferred over the previous work: 74.1% of evaluations chose the animations generated by our framework. Pearson's chi-square test shows that the difference is statistically significant (p < 0.01).
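As a sketch of this significance test, the counts below are illustrative values chosen to match the reported 74.1% preference rather than the exact evaluation tallies.

```python
# Sketch of a Pearson chi-square test on pairwise preference counts (illustrative counts).
from scipy.stats import chisquare

votes_ours, votes_baseline = 498, 174        # hypothetical split of the pairwise judgments
chi2, p_value = chisquare([votes_ours, votes_baseline])   # null hypothesis: 50/50 split
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```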
While animations generated by both frameworks appear natural and related to the speech, gesture animations generated by our framework exhibit more active movement. This is because learning HFCRBMs to generate motion from speech is a challenging task, and as a result the derived model tends to generate less active movement: performing dramatic motion can lead to higher error when it mismatches the target gestures than performing smooth, less active movement.
7.5 Assessment Experiment for Comparing Manually-Annotated
Gestural Signs with Automatic Clustering
This section describes an ancillary experiment that examines whether it is necessary to apply manual annotation to derive gestural signs, as opposed to deriving them automatically from human conversation data.

The exploitation of gestural signs requires manually annotating the training data, and a research question is whether these annotations can be derived through an automatic approach. One approach is to apply unsupervised learning algorithms to the motion data to cluster the motion into several groups. The unsupervised learning algorithm learns the patterns of the motion data and distinguishes motions based on the derived patterns; if the clustering results are similar to the manual annotations, the clustering results can be applied to save the annotation effort.
As gestures are consecutive motions, the unsupervised learning algorithm for clustering gestures needs to be capable of realizing sequential relations. HMMs infer hidden states based on the observed variables and the sequential relation of the hidden states, and therefore are a promising approach for clustering gestures. The approach for automatically annotating gestures with HMMs is to apply HMMs to the motion data and use the hidden state inferred for each data point as the annotation result.

One way to evaluate the effectiveness of this automatic approach is to measure the similarity between the clustering results and the manual annotation results. To be more specific, if each cluster can be assigned a unique gestural sign and the motion data in the cluster mostly belong to that same gestural sign, then the approach is effective. The evaluation
experiment uses this effectiveness criterion and conducts the evaluation on the motion capture data [Ennis and O'Sullivan, 2012] described in the previous section. The process annotates the motion data with the defined annotation scheme, runs HMMs on the motion data with 14 hidden states, and analyzes the similarity between the clustering and the annotation results. The similarity measurement relies on assigning a gestural sign to each cluster: the evaluation process first finds an assignment of gestural signs to clusters that provides the highest consistency with the data within the clusters and then reports the similarity value. To be more specific, the process assigns a unique gestural sign to each cluster and calculates how many data points within the cluster have the same gestural sign as the one assigned to the cluster. The process then picks the assignment with the highest overall similarity and uses this value as the similarity measure for the clustering results. This criterion found that 23.11% of clustering results match the manually annotated gestural signs, which is arguably too low for real practice. The assessment result shows that the clustering approach is not practical as a replacement for the manual approach for annotating the gestural signs defined in this thesis.
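A sketch of the cluster-to-sign assignment step described above, using the Hungarian algorithm to pick the one-to-one assignment that maximizes agreement; the label arrays are random placeholders for the inferred HMM states and the manual annotations.

```python
# Sketch of matching HMM clusters to gestural signs and scoring the agreement (placeholder labels).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
hmm_states = rng.integers(0, 14, size=5000)        # inferred hidden state per data point
annotations = rng.integers(0, 14, size=5000)       # manually annotated gestural sign per data point

# Contingency table: how often cluster i co-occurs with gestural sign j.
contingency = np.zeros((14, 14), dtype=int)
np.add.at(contingency, (hmm_states, annotations), 1)

# Hungarian algorithm on the negated counts gives the assignment maximizing total agreement.
rows, cols = linear_sum_assignment(-contingency)
similarity = contingency[rows, cols].sum() / len(hmm_states)
print(f"best-assignment agreement: {100 * similarity:.2f}%")
```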
Chapter 8
Conclusion
8.1 Conclusion
This thesis provides a machine learning framework for learning to generate gestures from speech. The framework defines a set of gestural signs which realize the correlation between utterance content and gestures. The framework consists of two processes: a prediction process, DCNFs, which infers gestural signs from speech, and a synthesis process that exploits GPLVMs to synthesize gestures based on the given gestural signs and performs gesture co-articulation by determining a smooth trajectory within the low-dimensional space of the GPLVM. Several objective assessment experiments and subjective evaluation studies have been performed, and the results show that the framework developed in this thesis outperforms state-of-the-art approaches. This thesis has also proposed a machine learning framework, based on the perspective of generating motions frame-by-frame directly, to compare with the framework addressed in this thesis, which synthesizes motions by selecting motions from the motion library. The evaluation experiments support that the synthesis-based approach outperforms the generation-based approach.
The contributions of this thesis are as follows: (1) This is the first fully machine learning approach that includes both prosody and utterance content as input signals and can therefore generate more than beat gestures. (2) The DCNFs can be generalized to other tasks involving time-series prediction, and have been shown to outperform state-of-the-art approaches on handwriting character recognition. (3) The GPLVM-based motion synthesis preserves the style of the gesture motion and can be applied to animate virtual characters with specific gesture styles; it also supports a more flexible co-articulation that allows the generated gestures to be more consistent with the content of the utterances. (4) The proposed framework decomposes the learning task with gestural signs, which allows the training process to train the prediction process and the synthesis process with two different sets of data and therefore provides a more flexible training process. The defined gestural signs provide a set of gesture annotations that can also be applied to other machine learning approaches for predicting gestures from speech.
8.2 Future Work
A fundamental limitation of this work is the quality of the data for both the prediction and motion synthesis processes. Thus, the key improvement will be improving the data for the training process. Further, this work used linguistic and prosodic features as input signals for the gesture prediction task, and as described in Chapter 3 there are some gestures that cannot be predicted from this input, as there is insufficient information in the prosody and the utterance content. To extend the framework to also realize the prediction of these types of gestures, the process needs to include additional signals such as the mental states and communicative intent of the speakers. In addition to extending the input signals, there is also other information concerning gestures that could be included to expand the prediction targets and improve the gesture generation process. This includes detailed information such as the phase of a gesture (e.g. preparation, stroke), which would improve the framework's ability to generate motion more in line with the speech.
The DCNF developed in this thesis combines CRFs with deep neural networks, and a promising next step for improving the model is to integrate deep convolutional neural networks. As described in Chapter 2, recent advances in deep convolutional neural networks have shown promising results on static recognition tasks such as image recognition and speech recognition. An important next step is to combine DCNFs with deep convolutional neural networks, exploiting sequential information to further improve the prediction accuracy of deep convolutional neural networks.
Appendix A
Additional works toward improving the gestural sign
prediction
This appendix describes some analysis of the gestural sign prediction task and a few attempts at improving the quality of the gestural sign prediction.
A.1 Improve models for predicting sequential dependency
The gestural sign prediction experiment shows that the sequential relation between gestural signs does not help the prediction accuracy. To understand why the sequential relation is not helpful for this dataset, we calculated the transition probabilities between gestural signs; the results are shown in Table A.1. As can be observed in the data, gestural signs are significantly more likely to self-transition, and the probability of transitioning to other signs is low. Thus, the sequential relation between gestural signs is not helpful. In order to exploit the sequential relation to help gestural sign prediction, we developed two algorithms that extend DCNFs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 29.4 1.23 1.5 0.09 0.02 0.06 0.06 0.03 0.57 0.06 0.09 0.59 0.08 2.83
2 1.05 3.67 0.13 0.008 0.004 0.005 0.005 0.003 0.03 0.003 0.02 0.08 0.03 0.38
3 1.38 0.13 5.97 0.008 0.003 0.005 0.01 0.005 0.06 0.001 0.02 0.08 0.008 0.34
4 0.12 0.01 0.01 0.5 0.001 0.003 0.001 0 0.003 0 0.001 0.02 0.001 0.06
5 0.03 0.001 0 0.001 0.12 0.003 0 0 0 0 0.003 0.004 0.001 0.02
6 0.04 0.004 0.001 0.008 0.001 0.33 0 0.001 0.001 0 0.001 0.009 0.004 0.04
7 0.07 0.009 0.003 0 0.001 0 0.19 0 0.003 0 0 0.01 0.001 0.02
8 0.04 0.003 0.004 0.003 0 0.003 0 0.11 0 0 0.001 0.003 0 0.01
9 0.52 0.04 0.07 0.003 0 0.001 0.004 0.001 0.91 0 0.001 0.02 0.003 0.05
10 0.03 0.004 0.008 0.001 0.001 0 0 0 0 0.11 0 0.004 0 0.01
11 0.09 0.01 0.009 0.003 0 0.003 0.001 0.001 0 0.001 0.62 0.02 0.001 0.05
12 0.6 0.05 0.06 0.02 0.009 0.01 0.008 0.003 0.004 0.004 0.01 2.53 0.03 0.4
13 0.07 0.03 0.005 0.005 0 0.001 0 0.003 0 0.003 0.001 0.05 0.62 0.14
14 2.88 0.25 0.4 0.08 0.02 0.03 0.03 0.02 0.06 0.01 0.04 0.34 0.15 36.6
Table A.1: Transition probabilities between gestural signs, in percentage. 1: Rest, 2: Palm face up, 3: Head nod without arm gestures, 4: Wipe, 5: Everything, 6: Frame, 7: Dismiss, 8: Block, 9: Shrug, 10: More-Or-Less, 11: Process, 12: Deictic.Other, 13: Deictic.Self, 14: Beats.
Both extensions aim to realize more complicated sequential information beyond the observed pairwise sequential relation. The two models are latent dynamic deep conditional neural fields (LDDCNFs) and semi-Markov deep conditional neural fields (semi-DCNFs).
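A sketch of how such a transition table can be tabulated from a frame-level label sequence; the sequence below is a random placeholder for the annotated gestural signs, and the counts are normalized over all transitions so that the whole table sums to 100%, which appears to be the convention in Table A.1.

```python
# Sketch of tabulating transition percentages between gestural signs (placeholder label sequence).
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 14, size=10000)            # frame-level gestural sign labels 0..13

counts = np.zeros((14, 14))
np.add.at(counts, (labels[:-1], labels[1:]), 1)     # count each consecutive (from, to) pair
percentages = 100.0 * counts / counts.sum()         # normalize over all observed transitions

print(np.round(percentages, 2))
```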
A.1.1 Latent dynamic deep conditional neural fields
Although the observed sequential relation does not help the prediction, our hypothesis is that there may be a hidden sequential relation that can help make the prediction. Thus, we developed LDDCNFs, which replace the CRF of DCNFs with latent-dynamic conditional random fields (LDCRFs) [Morency et al., 2007]; LDCRFs learn latent variables for the prediction targets (gestural signs in our case) and model the sequential relation among the hidden variables. The structure of LDDCNFs is shown in Figure A.1.
Figure A.1: The structure of the LDDCNF framework. Per-frame inputs x_t (words, POSs, prosody) feed deep neural networks whose outputs connect, through the latent variables h_t of LDCRFs, to the gestural signs y_t; the CRF model of the DCNF is replaced with LDCRFs.
Figure A.2: The structure of the semi-DCNF framework. Per-frame inputs x_t (words, POSs, prosody) feed deep neural networks connected to semi-CRFs over the gestural signs y_t; the CRF model of the DCNF is replaced with semi-CRFs.
A.1.2 Semi-Markov deep conditional neural fields
The reason that most transitions are self-transitions is that when a speaker performs a gesture it usually lasts multiple time units. This implies that instead of using a Markov model to realize the sequential relation, a more reasonable approach is to apply a semi-Markov model, which allows each state to have a variable duration. We adopted this idea and developed semi-DCNFs. The structure of semi-DCNFs is shown in Figure A.2.

To analyze the sequential relation learned by semi-DCNFs, we show the transition potentials learned by semi-DCNFs in Figure A.3.
Figure A.3: Transition potentials learned by semi-DCNFs. Each row represents a gestural sign, listed from top to bottom in the same order as in Table A.1. Each column, from left to right, represents the potential for transitioning to the same gestural sign after time 1 to 100. Brighter colors represent higher potentials and darker colors represent lower potentials.
As observed in the figure, each gestural sign has a high probability of staying at the same gestural sign when transitioning to the next time frame, and both the rest and beat gestural signs have a high tendency to last across multiple time frames.
A.1.3 Preliminary experiment results and discussions
Despite extending DCNFs with more complicated sequential models, both LDDCNFs and semi-DCNFs failed to improve the prediction accuracy. This implies that none of these realizations of the sequential relation helps gestural sign prediction. While these experimental results indicate that the sequential relation among gestural signs cannot improve the prediction quality, it remains an open problem whether there exists a hidden sequential relation that would require a sequential model distinct from the models exploited in this thesis. My hypothesis is that the sequential relation may not be significantly helpful when using our dictionary of gestural signs, as this information is helpful only when there is a constrained sequential relation among gestural signs, which would suggest that there is a constrained sequential relation among the information conveyed during face-to-face communication. The sequential relation among the conveyed information depends on the content being communicated, which in general can have an arbitrary relation.
Gestural sign F1 score
Rest 0.3866
Palm face up 0.0373
Head nod 0.2299
Deictic other 0.0154
Beat 0.5472
Table A.2: The top 5 gestural signs in terms of F1 scores.
Thus, it is unlikely that such constraints exist. To exploit a sequential relation for prediction, a possible approach is to design a new set of prediction targets, instead of the gestural signs defined in this thesis, that captures a certain sequential relation in human behaviors.
A.2 Attempts to improve prediction quality by changing
gestural signs
As improving the sequential model does not lead to an improvement in quality, we also attempted a few approaches that change the gestural signs to see whether they provide a quality improvement.
A.2.1 Prediction quality analysis
To determine which gestural signs are best predicted by DCNFs, we calculated the F1 score for each gestural sign; Table A.2 shows the scores for the top 5 gestural signs.
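A sketch of this per-class scoring; the label arrays are random placeholders for the DCNF predictions and the manual annotations.

```python
# Sketch of per-gestural-sign F1 scores (placeholder predictions and ground truth).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 14, size=5000)
y_pred = rng.integers(0, 14, size=5000)

per_class_f1 = f1_score(y_true, y_pred, average=None, labels=np.arange(14))
for sign, score in sorted(enumerate(per_class_f1), key=lambda x: -x[1])[:5]:
    print(f"gestural sign {sign}: F1 = {score:.4f}")
```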
The gestural sign prediction dataset is composed mainly of rest, head nod, and beat, and therefore one idea is to make some changes to the labels to see whether this can improve the prediction accuracy. Below is a list of approaches we tried that failed to improve the prediction accuracy:
1. Remove rest
We removed data points that contain the "rest" gestural sign, and since the sequential information did not improve the performance we applied DCNF-no-edge to the modified dataset. The model ends up learning mainly the beat and head nod gestural signs.
2. Remove rest and beat
We removed data points that contain the "rest" and "beat" gestural signs and applied DCNF-no-edge to the modified dataset. The model ends up learning mainly the head nod gestural sign.
3. Group infrequent gestural signs into one gestural sign
We assigned "everything", "block", and "more-or-less" the same gestural sign and applied DCNFs to the modified dataset. The model still focuses mainly on "rest" and "beat".
Appendix B
Assessment of extended RBMs on the direct generation
task
Besides HFCRBMs, I have explored several other approaches to using Boltzmann models to learn a gesture generator. Among them, HFCRBMs provide the most promising results. This appendix introduces these other models and describes how and why HFCRBMs outperform them.
B.1 Prosody-gated FCRBMs
A simpler approach is to use FCRBMs alone, without applying CRBMs to infer hidden factors. We can use the same design as the FCRBM of the HFCRBM to build a prosody-gated FCRBM with a Gaussian-valued input layer. But in our experiments the values generated by FCRBMs converge to constants quickly, resulting in a posture without movement. A sample sequence generated by the FCRBM is shown in Figure B.1a.

These results show that the model fails to make a better prediction than converging to certain constants, which have fewer errors on average.
Figure B.1: Sample generation results, where the horizontal axis represents the time step and the vertical axis the rotation value. Each curve corresponds to the rotation value around an axis of a joint. (a) Curves of rotation vectors generated by FCRBMs; the generated values converge to constants within a few steps. The first 6 frames are the initial sequence and are therefore omitted. (b) Generation results of HFCRBMs for the same case. The first 12 frames are the initial sequence and are therefore omitted.
When modeling time series data, if a model cannot learn any temporal relation, the optimal solution is to always predict the mean values of the data. The FCRBM generates the next motion frame with values close to the previous motion frames and converges toward the mean values of the motion data. This implies that the prosody-gesture relation derived by the model is poor and would result in more error than driving the motion sequence toward certain constants, so the model chooses to ignore the audio features and make predictions based on the motion information.
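The claim that the optimal context-free prediction is the data mean follows from the standard least-squares argument; the short derivation below is added for completeness and is not taken from the original text.

```latex
\operatorname*{arg\,min}_{c}\ \mathbb{E}\big[(x-c)^2\big]
  = \operatorname*{arg\,min}_{c}\ \big(\mathbb{E}[x^2] - 2c\,\mathbb{E}[x] + c^2\big),
\qquad
\frac{d}{dc}\big(\mathbb{E}[x^2] - 2c\,\mathbb{E}[x] + c^2\big) = -2\,\mathbb{E}[x] + 2c = 0
\ \Longrightarrow\ c^{*} = \mathbb{E}[x].
```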
The FCRBM is unable to directly model the relation between gestures and prosody, and this failure makes a decomposition step necessary. FCRBMs can learn to generate motions that depend on past motion and simple contextual information, but when motions depend on complex contextual information, a better representation for motion is necessary to reduce the complexity of the learning problem for FCRBMs.
Figure B.2: Sample generation results, where the horizontal axis represents the time step and the vertical axis the rotation value; this is the same case as in Figure B.1. (a) The hierarchical CRBM takes audio features as part of the input of the past visible layers, where u denotes the audio features; u_t does not necessarily contain only the audio features at time t (for example, with a window of size n, u_t includes audio features from t-n to t+n). (b) Curves of rotation vectors generated by HCRBMs. The first 12 frames are the initial sequence and are therefore omitted.
B.2 CRBMs and Hierarchical CRBMs
The CRBM has shown good results in modeling human motion, and it is natural to wonder about its performance in building a gesture generator. We can do so by extending the model to consider additional parameters. Previous work on style-content separation [Taylor and Hinton, 2009] has explored this approach by adding labels as biases of the hidden layer to see whether the extended CRBM can model different walking motions with different style labels. The model failed to learn the effect of the style labels, however, because the information from the past motion is much stronger than that of the labels, which leads the model to rely mainly on the past motion information and ignore the labels.
Although previous work shows that time series data will impede information coming from labels, this phenomenon may not happen in gesture modeling since the labels (audio features) carry much stronger information in this case. Thus, we extended CRBMs to model gestures but with a different approach: our version takes audio features as one of the inputs instead of as hidden layer biases. We explored both single-level and hierarchical models, in which the hierarchical model replaces the FCRBM of the HFCRBM with the modified CRBM. The architecture of the hierarchical CRBM is shown in Figure B.2a. We applied HCRBMs, with both the bottom CRBM and the top CRBM having 300 hidden nodes, to model gestures. The experiment with HCRBMs showed similar results to the FCRBM case. The same phenomenon was also observed in the single-level CRBM case. A sample of the generated results is shown in Figure B.2b.
Neither CRBMs nor HCRBMs were capable of modeling gestures in our experiments. Because the prosodic features carry rich information and comprise a large portion of the input data, one can exclude the possibility that the motion data impedes the prosody. Since I encountered a problem using CRBMs to include prosody information, one possible way to improve the model is to add an additional layer for the prosody input, as in the design of FCRBMs, in which the new layer may find a better representation for the prosody input. But we did not explore this approach in this work.
B.3 CRBMs vs. RCRBMs
The RCRBM, as shown in Figure B.3a, modifies the CRBM by removing the connection A. The visible layer v then becomes conditionally independent of the past visible layers when the values of the hidden layer are given.
Figure B.3: (a) An RCRBM is a CRBM without the connection A, as can be observed by comparing with Figure 2.2a. (b) An HFCRBM with RCRBMs at the bottom.
Figure B.4: Sample results generated by RCRBM-based HFCRBMs. A dramatic change can be observed at the beginning of the curves; this kind of dramatic change can make the gesture animation seem unnatural.
The modeling of the time series then relies only on the activation of the hidden nodes. In the CRBM, the connection A acts as an autoregressive model for the time series data, and the activation of the hidden nodes plays a supplementary role. In the RCRBM, the generation of the next frame depends mainly on the activation of the hidden layer. The activation of the RCRBM's hidden layer therefore provides more information about the time series data than that of the CRBM.
The original HFCRBM, as shown in Figure B.3b, uses RCRBMs as the bottom model, and in this work we changed it to use CRBMs. Building HFCRBMs with RCRBMs allows the top FCRBM to better manipulate the modeling process, which further strengthens the influence of contextual information on the time series data. In the case of CRBM-based HFCRBMs, the CRBM explicitly models the time series data, which emphasizes the temporal relation of the motion data. The FCRBM in an HFCRBM takes the activation of the bottom model's hidden layer as input, and since RCRBMs provide more information about the time series data than CRBMs, the FCRBM in the RCRBM-based HFCRBM has a more influential role in modeling the time series data than that of the CRBM-based HFCRBM. The original HFCRBM has been applied to style-content separation for human walking motion, in which the labels are styles and are static over time [Chiu and Marsella, 2011b]. The success criterion in that application relies mainly on whether the motion correlates with the specified styles, so it is crucial to emphasize the influence of the labels. On the other hand, the gesture modeling task has quite informative labels, which are sequences of audio features and are different at each time step. One criterion for successful gestures is that the motion correlates with the utterances, but it is also important to have movement that relates well to the previous motion. With such strong signals from the labels, the model may reduce the temporal correlation among movements. In our empirical analysis, the gesture motion generated by RCRBM-based HFCRBMs did correlate well with the utterances, but the movements seemed a little more dramatic than real human gestures; an example is shown in Figure B.4. This phenomenon makes the motion less natural. Thus, we use CRBMs as the bottom model to make more use of the motion information, and the model produces smoother and more natural gestures.
Reference List
[Alibalia et al., 2000] Alibalia, M. W., Kitab, S., and Youngc, A. J. (2000). Gesture and the process of speech production: We think, therefore we gesture. Language and Cognitive Processes, 15:593–613.
[Bergmann and Kopp, 2009] Bergmann, K. and Kopp, S. (2009). Increasing the expressiveness of virtual agents: Autonomous generation of speech and gesture for spatial description tasks. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '09, pages 361–368. IFAAMAS.
[Bird et al., 2009] Bird, S., Loper, E., and Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[Boersma, 2001] Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5:341–345.
[Breiman, 2001] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
[Brugman et al., 2004] Brugman, H., Russel, A., and Nijmegen, X. (2004). Annotating multi-media / multimodal resources with ELAN. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pages 2065–2068.
[Calbris, 2011] Calbris, G. (2011). Elements of Meaning in Gesture. Gesture Studies 5. John Benjamins, Philadelphia.
[Cassell, 2000] Cassell, J. (2000). Embodied conversational interface agents. Commun. ACM, 43(4):70–78.
[Cassell and Prevost, 1996] Cassell, J. and Prevost, S. (1996). Distribution of semantic features across speech and gesture by humans and computers. In Workshop on the Integration of Gesture in Language and Speech.
[Cassell et al., 2001] Cassell, J., Vilhjálmsson, H. H., and Bickmore, T. (2001). BEAT: the behavior expression animation toolkit. In SIGGRAPH '01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 477–486, New York, NY, USA. ACM.
[Chiu and Marsella, 2011a] Chiu, C.-C. and Marsella, S. (2011a). How to train your avatar: A data driven approach to gesture generation. In 11th Conference on Intelligent Virtual Agents, pages 127–140.
[Chiu and Marsella, 2011b] Chiu, C.-C. and Marsella, S. (2011b). A style controller for generating virtual human behaviors. In Proceedings of the 10th international joint conference on Autonomous agents and multiagent systems - Volume 1, AAMAS '11.
[Chiu and Marsella, 2014] Chiu, C.-C. and Marsella, S. (2014). Gesture generation with low-dimensional embeddings. In Proceedings of the 13th international joint conference on Autonomous agents and multiagent systems, AAMAS '13.
[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
[Deng et al., 2013] Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. (2013). Recent advances in deep learning for speech research at Microsoft. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[Do and Artieres, 2010] Do, T. and Artieres, T. (2010). Neural conditional random fields. In International Conference on Artificial Intelligence and Statistics (AI-STATS), pages 177–184.
[Ekman and Friesen, 1969] Ekman, P. and Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. In Semiotica, volume 1, pages 49–98. de Gruyter Mouton.
[Elgammal and Lee, 2004] Elgammal, A. and Lee, C.-S. (2004). Separating style and content on a nonlinear manifold. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 478–485. IEEE Computer Society.
[Ennis et al., 2010] Ennis, C., McDonnell, R., and O'Sullivan, C. (2010). Seeing is believing: body motion dominates in multisensory conversations. In ACM SIGGRAPH 2010 papers, SIGGRAPH '10, pages 91:1–91:9, New York, NY, USA. ACM.
[Ennis and O'Sullivan, 2012] Ennis, C. and O'Sullivan, C. (2012). Perceptually plausible formations for virtual conversers. Computer Animation and Virtual Worlds, 23(3-4):321–329.
[Goldin-Meadow et al., 1993] Goldin-Meadow, S., Alibali, M. W., and Church, R. B. (1993). Transitions in concept acquisition: Using the hand to read the mind. Psychological Review, 100(2):279–297.
[Goodfellow et al., 2014] Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014). Multi-digit number recognition from street view imagery using deep convolutional neural networks. Pre-print, arXiv:1312.6082v2.
[Gratch et al., 2014] Gratch, J., Artstein, R., Lucas, G., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J., Devault, D., Marsella, S., Traum, D., Rizzo, A. S., and Morency, L.-P. (2014). The distress analysis interview corpus of human and computer interviews. In Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).
[Grochow et al., 2004] Grochow, K., Martin, S. L., Hertzmann, A., and Popović, Z. (2004). Style-based inverse kinematics. In ACM SIGGRAPH 2004 Papers, SIGGRAPH '04, pages 522–531, New York, NY, USA. ACM.
[Hartholt et al., 2009] Hartholt, A., Gratch, J., Weiss, L., and Team, T. G. (2009). At the virtual frontier: Introducing Gunslinger, a multi-character, mixed-reality, story-driven experience. In Intelligent Virtual Agents, volume 5773 of Lecture Notes in Computer Science, pages 500–501. Springer Berlin Heidelberg.
[Hinton, 2010a] Hinton, G. (2010a). A practical guide to training restricted Boltzmann machines. UTML TR 2010-003, Department of Computer Science, University of Toronto.
[Hinton, 2010b] Hinton, G. E. (2010b). Learning to represent visual input. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1537):177–184.
[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554.
[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Pre-print, arXiv:1207.0580v1.
[Iverson and Thelen, 1999] Iverson, J. M. and Thelen, E. (1999). Hand, mouth and brain: The dynamic emergence of speech and gesture. Journal of Consciousness Studies, 6:19–40.
[Kendon, 1983] Kendon, A. (1983). Gesture and speech: How they interact. In Wiemann, J. M. and Harrison, R., editors, Nonverbal Interaction, pages 13–46. Sage Publications, Beverly Hills, California.
[Kendon, 2000] Kendon, A. (2000). Language and gesture: unity or duality? In McNeill, D., editor, Language and Gesture. Cambridge University Press.
[Kipp, 2004] Kipp, M. (2004). Gesture Generation by Imitation - From Human Behavior to Computer Character Animation. PhD thesis, Saarland University.
[Kopp and Bergmann, 2012] Kopp, S. and Bergmann, K. (2012). Individualized gesture production in embodied conversational agents. In Zacarias, M. and Oliveira, J. V., editors, Human-Computer Interaction: The Agency Perspective, volume 396 of Studies in Computational Intelligence, pages 287–301. Springer Berlin Heidelberg.
[Kovar et al., 2002] Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion graphs. In SIGGRAPH '02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 473–482, New York, NY, USA. ACM.
[Krauss et al., 2000] Krauss, R. M., Chen, Y., and Gottesman, R. F. (2000). Lexical gestures and lexical access: a process model. In McNeill, D., editor, Language and Gesture. Cambridge University Press.
[Lafferty et al., 2001] Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289.
[Lawrence, 2005] Lawrence, N. D. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816.
[Lee et al., 2008] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 873–880. MIT Press, Cambridge, MA.
[Lee and Marsella, 2006] Lee, J. and Marsella, S. (2006). Nonverbal behavior generator for embodied conversational agents. In 6th Conference on Intelligent Virtual Agents, volume 4133 of Lecture Notes in Computer Science, pages 243–255.
[Levine et al., 2010] Levine, S., Krähenbühl, P., Thrun, S., and Koltun, V. (2010). Gesture controllers. In ACM SIGGRAPH 2010 papers, SIGGRAPH '10, pages 124:1–124:11, New York, NY, USA. ACM.
[Levine et al., 2009] Levine, S., Theobalt, C., and Koltun, V. (2009). Real-time prosody-driven synthesis of body language. ACM Trans. Graph., 28:172:1–172:10.
[Levine et al., 2012] Levine, S., Wang, J. M., Haraux, A., Popović, Z., and Koltun, V. (2012). Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics, 31(4):28.
[Marsella et al., 2013] Marsella, S. C., Xu, Y., Lhommet, M., Feng, A. W., Scherer, S., and Shapiro, A. (2013). Virtual character performance from speech. In Symposium on Computer Animation, Anaheim, CA.
[McNeill, 1985] McNeill, D. (1985). So you think gestures are nonverbal? Psychological Review, 92(3):350–371.
[McNeill, 1992] McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. University Of Chicago Press.
[Min and Chai, 2012] Min, J. and Chai, J. (2012). Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Trans. Graph., 31(6):153.
[Morency et al., 2007] Morency, L.-P., Quattoni, A., and Darrell, T. (2007). Latent-dynamic discriminative models for continuous gesture recognition. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8. IEEE.
[Mukai and Kuriyama, 2005] Mukai, T. and Kuriyama, S. (2005). Geostatistical motion interpolation. In ACM SIGGRAPH 2005 Papers, SIGGRAPH '05, pages 1062–1070, New York, NY, USA. ACM.
[Neff et al., 2008] Neff, M., Kipp, M., Albrecht, I., and Seidel, H.-P. (2008). Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Trans. Graph., 27(1):1–24.
[Peng et al., 2009] Peng, J., Bo, L., and Xu, J. (2009). Conditional neural fields. In NIPS, pages 1419–1427.
[rahman Mohamed et al., 2012] rahman Mohamed, A., Dahl, G. E., and Hinton, G. (2012). Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14–22.
[Rickel and Johnson, 2000] Rickel, J. and Johnson, W. L. (2000). Task-oriented collaboration with embodied agents in virtual worlds. In Embodied conversational agents, pages 95–122. MIT Press, Cambridge, MA, USA.
[Rose et al., 1998] Rose, C., Cohen, M. F., and Bodenheimer, B. (1998). Verbs and adverbs: Multidimensional motion interpolation. IEEE Comput. Graph. Appl., 18(5):32–40.
[Scherer et al., 2013] Scherer, S., Kane, J., Gobl, C., and Schwenker, F. (2013). Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification. Computer Speech and Language, 27(1):263–287.
[Stone et al., 2004] Stone, M., DeCarlo, D., Oh, I., Rodriguez, C., Stere, A., Lees, A., and Bregler, C. (2004). Speaking with hands: creating animated conversational characters from recordings of human performance. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 506–513, New York, NY, USA. ACM.
[Taskar et al., 2004] Taskar, B., Guestrin, C., and Koller, D. (2004). Max-margin Markov networks. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.
[Taylor and Hinton, 2009] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In Bottou, L. and Littman, M., editors, Proceedings of the 26th International Conference on Machine Learning, pages 1025–1032, Montreal. Omnipress.
[Taylor et al., 2007] Taylor, G. W., Hinton, G. E., and Roweis, S. T. (2007). Modeling human motion using binary latent variables. In Schölkopf, B., Platt, J., and Hofmann, T., editors, Advances in Neural Information Processing Systems 19, pages 1345–1352. MIT Press, Cambridge, MA.
[Thiebaux et al., 2008] Thiebaux, M., Marsella, S., Marshall, A. N., and Kallmann, M. (2008). SmartBody: behavior realization for embodied conversational agents. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems - Volume 1, AAMAS '08, pages 151–158.
[Torresani et al., 2007] Torresani, L., Hackney, P., and Bregler, C. (2007). Learning motion style synthesis from perceptual observations. In Schölkopf, B., Platt, J., and Hofmann, T., editors, Advances in Neural Information Processing Systems 19, pages 1393–1400. MIT Press, Cambridge, MA.
[Urtasun et al., 2008] Urtasun, R., Fleet, D. J., Geiger, A., Popović, J., Darrell, T. J., and Lawrence, N. D. (2008). Topologically-constrained latent variable models. In Proceedings of the 25th international conference on Machine learning, ICML '08, pages 1080–1087, New York, NY, USA. ACM.
[Wang et al., 2008] Wang, J., Fleet, D., and Hertzmann, A. (2008). Gaussian process dynamical models for human motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):283–298.
[Weiss et al., 2012] Weiss, D., Sapp, B., and Taskar, B. (2012). Structured prediction cascades. Pre-print, arXiv:1208.3279v1.
[Ye and Liu, 2010] Ye, Y. and Liu, C. K. (2010). Synthesis of responsive motion using a dynamic model. Comput. Graph. Forum, 29(2):555–562.
[Yu et al., 2009] Yu, D., Deng, L., and Wang, S. (2009). Learning in the deep-structured conditional random fields. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications.