VIRTUAL EXTRAS: CONVERSATIONAL BEHAVIOR SIMULATION FOR
BACKGROUND VIRTUAL HUMANS
by
Dusan Jan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2012
Copyright 2012 Dusan Jan
Dedication
I would like to dedicate this dissertation to my mom, who at times believed in me even
more than I believed in myself. I have helped you even before I came into this world
and you helped me even after you left.
Acknowledgments
Parts of this thesis have appeared in previous publications (Jan and Traum 2005, 2007;
Jan et al. 2007, 2009, 2011). The work on the cultural model (Jan et al. 2007) was a
collaboration with the University of Texas at El Paso, where the subject study reported in
section 4.5 and the collection of the cross-cultural dialog corpus described in section 5.2
were performed. I would like to thank David Herrera and Bilyana Martinovski for help with
the literature review on cultural variations of conversational behavior. I would also like
to thank those who worked on other parts of the Vigor project and specifically Eric
Chance for creating characters and animations for the background virtual humans in
Second Life. Finally, I would like to thank David Traum and David Novick for valuable
input on formulation of computational models and countless discussions that helped me
on this path.
Parts of this work have been sponsored by the U.S. Army. Statements and opin-
ions expressed do not necessarily reflect the position or the policy of the United States
Government, and no official endorsement should be inferred.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures viii
Abstract x
Chapter 1: Introduction 1
1.1 Motivation and Objectives . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Dialog Simulation 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Conversation Analysis and Organization of Turn-Taking . . . . . . . 10
2.3 Padilha Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Conversational Participation Algorithm . . . . . . . . . . . . . . . . 18
2.5.1 High-level planning . . . . . . . . . . . . . . . . . . . . . . 19
2.5.2 Conversation processing . . . . . . . . . . . . . . . . . . . 20
2.5.3 Claiming a turn . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Starting to speak . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.5 Continuing speaking . . . . . . . . . . . . . . . . . . . . . 24
2.5.6 Tracking participation . . . . . . . . . . . . . . . . . . . . . 24
2.5.7 Responding to others . . . . . . . . . . . . . . . . . . . . . 25
2.6 Evaluation of Initial Dialog Model . . . . . . . . . . . . . . . . . . 26
Chapter 3: Movement and Positioning Simulation 31
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Virtual Crowd Simulations . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Reasons for Movement . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Social Force Model . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Integration into Conversation Simulation . . . . . . . . . . . . . . . 42
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.1 Joining conversation . . . . . . . . . . . . . . . . . . . . . 45
3.6.2 Conversation splitting into two separate conversations . . . . 47
3.6.3 Effect of proxemics . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4: Cultural Modeling 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Cultural Models for Virtual Agents . . . . . . . . . . . . . . . . . . 52
4.3 Human Culture-Specific Behavior . . . . . . . . . . . . . . . . . . 54
4.3.1 Proxemics . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Gaze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.3 Turn Taking and Overlap . . . . . . . . . . . . . . . . . . . 57
4.4 Computational Model . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Evaluation of Literature-Based Cultural Model . . . . . . . . . . . . 63
Chapter 5: Model Refinement and Conversation Structure Model 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Cross-Cultural Dialog Corpus . . . . . . . . . . . . . . . . . . . . . 67
5.3 Conversation Structure Modeling . . . . . . . . . . . . . . . . . . . 70
5.3.1 Interaction Process Analysis . . . . . . . . . . . . . . . . . 70
5.3.2 Distribution of Participation . . . . . . . . . . . . . . . . . 72
5.4 Modifications to Dialog Simulation . . . . . . . . . . . . . . . . . . 74
5.5 Refinement of Cultural Model . . . . . . . . . . . . . . . . . . . . 80
5.6 Cross-Cultural Perception Study of Real Interactions . . . . . . . . 86
5.7 Conversational Task Classification . . . . . . . . . . . . . . . . . . 87
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 6: Implementation and Applications 91
6.1 Simulation Level of Detail . . . . . . . . . . . . . . . . . . . . . . 92
6.2 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Unreal Tournament . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.5 BML and Smartbody . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.6 Vigor Framework and Second Life . . . . . . . . . . . . . . . . . . 106
6.6.1 Behavior Control . . . . . . . . . . . . . . . . . . . . . . . 107
6.6.2 Director . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6.3 Conversation Management . . . . . . . . . . . . . . . . . . 112
6.7 MRE and SASO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.8 SDO Moleno . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.9 Checkpoint Exercise . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Chapter 7: Conclusion 123
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . 125
Bibliography 128
List of Tables
4.1 Mean Interpersonal Distance in Feet . . . . . . . . . . . . . . . . . 54
4.2 Mean Interpersonal Distance in Feet . . . . . . . . . . . . . . . . . 55
4.3 Amount of gaze (%) in triads and dyads . . . . . . . . . . . . . . . 56
4.4 American gaze parameters in table form. . . . . . . . . . . . . . . . 62
5.1 IPA Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Example interaction profile. . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Example progression of difference between desired and actual inter-
action rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Rate difference and reactive values when deciding to speak. . . . . . 80
5.5 Total overlap in seconds . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Amount of turns based on length . . . . . . . . . . . . . . . . . . . 83
5.7 Distribution of total turn holding time . . . . . . . . . . . . . . . . 83
5.8 Gaze distribution for speakers . . . . . . . . . . . . . . . . . . . . . 84
5.9 Gaze distribution for addressees . . . . . . . . . . . . . . . . . . . 84
5.10 Gaze distribution for listeners . . . . . . . . . . . . . . . . . . . . . 84
5.11 American gaze parameters . . . . . . . . . . . . . . . . . . . . . . 85
5.12 Arab gaze parameters . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.13 Mexican gaze parameters . . . . . . . . . . . . . . . . . . . . . . . 86
5.14 Accuracy of conversational task judgments (DM = Decision Mak-
ing, ST = Storytelling) . . . . . . . . . . . . . . . . . . . . . . . . 89
List of Figures
2.1 Conversation Agent Message types . . . . . . . . . . . . . . . . . . 17
2.2 Conversational Agent Attributes . . . . . . . . . . . . . . . . . . . 18
2.3 Afghani civilians engaged in conversation. . . . . . . . . . . . . . . 27
3.1 A sample group positioning. Each circle represents an agent. A
thick border represents that the agent is talking, filled or empty shad-
ing indicates conversation group membership. . . . . . . . . . . . . 37
3.2 Attractive force toward speaker F_speaker . . . . . . . . . . . . . . 37
3.3 Repelling force away from other speakers F_noise . . . . . . . . . . 39
3.4 Repelling force away from agents that are too close F_proximity . . . 39
3.5 Agent’s deviation from circular formation exceeds threshold and
triggers force F_circle . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Example of motion computation: The lower right agent decided to
join the unshaded conversation. He iteratively applies movement in
the direction of local forces. In each iteration the effects of different
component forces may take effect. The thick line indicates the final
destination and path the agent chooses for this planning cycle. . . . 42
3.7 The agent on the left is approaching a conversation. Arrows indi-
cate where the agents will move from now until the simulation sta-
bilizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8 Stable point after the fourth agent joins the conversation. . . . . . . 46
3.9 Agents form in a circle to engage in a single conversation. . . . . . . 47
3.10 After two agents leave the conversation the agents adapt to it by
repositioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.11 Incompatible social zones. . . . . . . . . . . . . . . . . . . . . . . 49
4.1 These are two examples taken from the videos used in the evalu-
ation. Left picture is from the North American model and right
picture from the Arab model. . . . . . . . . . . . . . . . . . . . . . 64
5.1 Arab group performing task 4. . . . . . . . . . . . . . . . . . . . . 68
5.2 Gaze annotation in Anvil. . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Example message exchange when an agent performs an action. . . . 94
6.2 Display of conversation attributes in control panel. . . . . . . . . . . 98
6.3 Cultural Parameters – Gaze . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Cultural Parameters – Proxemics . . . . . . . . . . . . . . . . . . . 100
6.5 Cultural Parameters – Overlap in Turn-Taking . . . . . . . . . . . . 101
6.6 Agent Relationship Selection . . . . . . . . . . . . . . . . . . . . . 101
6.7 Importing Cultural Settings . . . . . . . . . . . . . . . . . . . . . . 102
6.8 Agent Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.9 Virtual humans engaged in conversation. . . . . . . . . . . . . . . . 103
6.10 SDO interacting with a returning user. . . . . . . . . . . . . . . . . 118
6.11 Scene at the school during Federal Virtual Worlds Challenge evalu-
ation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Abstract
When we simulate a large number of virtual humans in virtual worlds we encounter a
point where it is no longer feasible to simulate all of them in full detail. In the case
where some of them never interact with the user it is useful to distinguish between main
virtual humans and background virtual humans. The main purpose of background vir-
tual humans is to engage the user in the immersive environment. Their main objective is
to be believable to the user and allow him to suspend disbelief. To achieve this, the back-
ground virtual humans must be able to perform various behaviors, with conversational
behavior being one of the most commonly required.
In this dissertation we present a framework for believable simulation of conversa-
tional behavior for background virtual humans based only on a small number of param-
eters. It was developed over the course of four major iterations based on computational
models derived from literature review and analysis of video corpus data. It takes into
account the effect of proxemics, gaze, turn-taking and pause on the believability of
conversational behavior, and how context such as culture and conversational task affects
believability.
We report on a number of evaluations performed during the development of the
simulation and show that simulations appear believable to the subjects. We describe the
applications where the simulation has been employed and show what role it can play in
a larger framework covering multiple levels of detail.
Chapter 1
Introduction
1.1 Motivation and Objectives
Immersive computer-based simulations have important applications in both training and
entertainment. In such simulations the user gets to experience simulated worlds, either
representing real environments and situations or fictional ones, for the purpose of learn-
ing about them, acquiring specific skills relevant to the situation or just to experience
an engaging story. One of the advantages of using computer simulations for training is
that it allows a large number of users in potentially distant locations to interact in a
simulated environment that can include emergency scenarios without any risk to the trainees.
While the effect of immersion on the learning is not fully understood, it is seen as a
desirable element that enhances the learning experience and increases the engagement
of trainees (Richards et al. 2008).
One part of immersion is achieved by providing richer sensory input to the users
through the use of panoramic 3D displays, sound effects and haptic interfaces. The
other part has to do with how well the simulation matches with expectations of the
user. This is captured by the concept of believability and is related to suspension of
disbelief. When the simulation represents a real environment, believability is very close
to realism, however the two are not necessarily the same. For example a simulation
might exaggerate a certain element for the purposes of emphasizing a certain training
objective while still remaining believable. People also tend to see complex mechanisms
and intelligence even when there is none, as long as we don’t actively destroy the illusion
and break suspension of disbelief.
Another aspect of this is contextual believability. If the user has no context they
will usually try to explain unexpected behaviors by filling in the blanks. For example, if
one sees a group of people communicating in a strange manner, one might try to explain
that away by assuming they are arguing. However, when the user has the contextual
knowledge of the situation then the expectations of the user will also change and this
will affect believability. If in the previous case the user knew the group was not arguing
then the strange behavior might be enough to break suspension of disbelief. With this
in mind we can then look at various contexts that might affect believability, such as for
example culture or task being performed.
An important part of simulated worlds is the actual environment, but to bring those
worlds to life we often populate them with a number of virtual humans. Virtual humans
are software components that imitate the look and behavior of a human. In addition to
increasing immersion, more realistic virtual human behavior also increases their social
influence on the user and makes the user exhibit social behaviors that typically only
occur between humans (Blascovich 2002). For virtual humans to be believable we want
to provide them with speech recognition, speech synthesis, high-fidelity gestures, facial
expressions and lip synchronization (Rickel et al. 2002). They also have to be able to
reason about their environment and be aware of their role in the world and what effects
different actions will have on the environment and other participants (Ijaz et al. 2011).
Equipping the virtual humans with such capabilities, however, can be very
computationally expensive. When we are dealing with a large number of such virtual humans we
encounter a point where it is no longer feasible to simulate each one of them in full detail.
In order to facilitate efficient use of resources, we distinguish between main and
background virtual humans. The role of the main virtual humans is to interact with the
human user in order to progress the story or provide a learning experience in a virtual
training. Background virtual humans, on the other hand, are not expected to interact
with the user and may not even come in close proximity of the user. Their purpose is
similar to that of extras in motion pictures and television. Their goal is to immerse the
user in the virtual environment, to provide an experience that is similar to what they
could experience in the real world (Gordon et al. 2004).
Background virtual humans still have to exhibit a large range of behaviors. If they
represent civilians, then we usually want them to appear to be going about
their daily routine: going to the shop, meeting friends. If they are performing a specific
job, however, then the behaviors need to be sufficiently different to convey that.
A soldier at a checkpoint, for example, would be inspecting the passersby; a shopkeeper
would stay at his shop selling his wares. In all these cases, when we have virtual humans
interacting with each other, one of the most common behaviors required is
conversational behavior. Conversational behavior refers to the actions the humans perform
while they are involved in a conversation with others. This includes nonverbal behaviors
such as gestures, proxemics, posture, facial expressions and gaze.
Simulation of conversational behavior will be the main topic of this work, specifi-
cally as it relates to background virtual humans and believability. When we are talking
about believability of background virtual humans it is good to first focus on the ele-
ments that are most likely to break immersion. The most common flaw is repetition.
If the virtual humans repeat the exact same thing over and over, that will quickly ruin
the experience. Some variability is usually best; however, care must be taken that the
behavior is not too random. An intermediate point, where the behavior is varied but still
gives the appearance of intent on the part of the virtual human, will usually produce the best
results. Background virtual humans also have to act in accordance with their surround-
ing environment and appropriately react to changes in the environment. In addition we
also need variety of visualizations and behaviors. A crowd composed of the same indi-
viduals with the same behaviors would not be convincing even if one such virtual human
alone would be very realistic (Ulicny and Thalmann 2002).
These requirements are in conflict with the main purpose of background virtual
humans, that of reducing computational complexity. Believability is generally improved
by employing more accurate behavior simulation. At the same time, such simulation can
be more costly in computational resources and development time. A key observation
here is that different aspects of behavior simulation can have varying effects on believ-
ability. Thus it is important to investigate what kind of effect these different components
have on believability so that we can focus our improvement efforts on those components
that will be most effective.
1.2 Thesis
The thesis this dissertation addresses is whether it is possible to generate a believable
simulation of conversational behavior for background virtual humans based only on a
small number of parameters. It looks at the role of turn taking, gaze, pause and overlap,
proxemics and movement in conversation simulation and how that affects believability
in general and specifically contextual believability. When looking at contextual believ-
ability it examines the context of culture and conversational tasks such as storytelling
and decision making.
The simulation should be real-time with support for multiparty conversations with
dynamic adjustments to conversation groups. It should be adaptable to a variety of
rendering environments with various fidelity and range of behaviors. To achieve the
intended balance between random and intentional behavior it should use a combination
of reactive and stochastic elements, parameterizable to adapt the simulation to different
cultural environments, different conversation structures and different individual char-
acteristics. The methods should exploit the fact that the user is not interacting with
the background virtual humans. Taking an example from extras in motion pictures,
it is not important what the group in the background is talking about as long as they
appear believable. This means that the simulation can focus on the appearance of the
conversation and the patterns of interaction, rather than actual information exchange or
communication of internal state.
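The intended balance between reactive and stochastic elements can be illustrated with a minimal sketch. This is not the dissertation's actual algorithm (which is developed in chapter 2); the function and the talkativeness parameter are hypothetical names chosen for illustration:

```python
import random

def decide_to_speak(is_addressed, at_free_trp, talkativeness, rng=random.random):
    """Combine a reactive rule with a stochastic element (illustrative only).

    is_addressed  -- the speaker directed the last utterance at this agent
    at_free_trp   -- a transition relevance point open to any participant
    talkativeness -- hypothetical per-agent probability of self-selecting
    """
    if is_addressed:
        return True                    # reactive: an addressed agent takes the turn
    if at_free_trp:
        return rng() < talkativeness   # stochastic: sometimes self-select
    return False
```

The reactive branch makes the behavior look intentional, while the stochastic branch provides the variability that prevents repetitive, mechanical-looking conversations.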
The contributions of the work presented here are as follows. It provides software
components for running the conversational simulation in a number of rendering envi-
ronments. It provides results of experiments that show which aspects of conversational
behavior make an impact on believability. It shows what role the simulation plays in
the scope of a bigger picture with multiple levels of detail and how the simulation com-
ponents can be reused at different levels of detail. Finally, it provides examples of real
applications where the simulation and its components have been used.
1.3 General Approach
The design of the simulation follows an iterative process of collecting data, formalizing
the models, implementation and testing. The issues identified in testing lead to fur-
ther changes in implementation, refinement of models and collecting of data needed for
creation of richer models.
First we have to obtain data on patterns of interaction in human conversations and their
features. In many cases we can get this information from existing studies that examined
specific aspects of conversational behavior. For the parts not covered by existing
literature, recording and analysis of a human conversation video corpus has to be employed.
The collected data is then used to create formal models of the observed behaviors.
The models must balance between faithful representation of data and efficiency. They
should be as accurate as possible while sacrificing some of the details to maintain real-
time efficiency. At the same time aspects of the conversation that are variable between
individuals and different situations are extracted to form parameters suitable for diver-
sification and adaptation of background virtual humans.
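As a rough sketch of how variable aspects might be extracted into parameters and used to diversify individual agents (the parameter names and the jitter scheme here are assumptions for illustration, not the parameter set defined in later chapters):

```python
import random
from dataclasses import dataclass

@dataclass
class ConversationParams:
    # Hypothetical parameters; the dissertation's actual set differs.
    interpersonal_distance: float   # preferred distance, meters
    gaze_at_speaker: float          # fraction of time listeners gaze at speaker
    pause_before_turn: float        # mean silence before taking a turn, seconds

def individualize(base: ConversationParams, jitter: float,
                  rng: random.Random) -> ConversationParams:
    """Diversify background agents by perturbing shared culture-level parameters."""
    scale = lambda v: v * (1.0 + rng.uniform(-jitter, jitter))
    return ConversationParams(
        interpersonal_distance=scale(base.interpersonal_distance),
        gaze_at_speaker=min(1.0, scale(base.gaze_at_speaker)),
        pause_before_turn=max(0.0, scale(base.pause_before_turn)),
    )
```

Sampling each agent's parameters around a shared base is one simple way to avoid a crowd of identical individuals while keeping every agent within plausible bounds.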
The conversational behavior simulation was implemented as a component of the
virtual human simulation framework at the Institute for Creative Technologies (ICT).
It connects with a graphics engine to access the bodies of virtual humans and control
their appearance and actions. It also integrates with other components of the simula-
tion framework in order to provide information about the status of its virtual humans
to virtual humans controlled by other components. The various parts of the simula-
tion are organized in a modular fashion so that it is easy to plug in different rendering
environments and specify control of virtual humans while not engaged in conversation.
The last step in the design process is evaluation. First internal testing is used to make
sure that the resulting simulations follow the formal model. This is followed by subject
studies after any potential discrepancies are eliminated. These are used to examine
believability of the simulation and to find the most effective area of improvement for
future work.
The simulation underwent four major iteration cycles.
The first implementation was an extension of prior work in group conversation
simulation using autonomous agents. Its focus was on adding support for dynamic
formation of conversation groups, allowing for creation, splitting and joining of
groups as well as entry and exit of conversation participants.
The second iteration cycle focused on a major issue identified in the first step,
which was mobility of virtual humans. Previously all virtual humans were located
in fixed positions. This step added a movement and positioning component that
allows virtual humans to reposition themselves based on a number of social forces.
The third iteration added a cultural model to the simulation. It modeled cultural
variability of proxemics, gaze and pause in turn taking. Subject studies were
used to evaluate whether subjects can identify differences between simulations
of different cultures and how well they can recognize simulations of their own
culture.
In the last iteration we focused on analysis of a cross-cultural dialog corpus of real
human conversations. We used these to refine the existing models and extend the
simulation with a conversation structure model. Previously all simulations rep-
resented free form conversation. The conversation structure model employs turn
distribution and local conversation patterns based on interaction process analysis
to represent conversation structure ranging from storytelling to decision making.
1.4 Organization
The rest of this document is organized as follows. The next four chapters report on the
work accomplished in the various iteration cycles of the simulation. Chapter 2 explains
the core dialog simulation, chapter 3 describes the movement and positioning simula-
tion, chapter 4 describes the cultural model and chapter 5 describes the cross-cultural
dialog corpus analysis and the conversation structure model. Chapter 6 provides the
technical details of the implementation and integration within the scope of a larger
framework as well as describes the applications where the simulation and its parts have
been used. Finally, in chapter 7 the summary of contributions and possibilities for future
work are presented.
While the dissertation was written so that it can be read as a whole, different readers
will find particular segments of more interest. A reader interested in the workings of
the conversational simulation will be most interested in chapter 2 which mainly covers
the turn-taking behavior, chapter 3 that explains the movement algorithm and parts con-
trolling joining and leaving of conversations, section 4.4 that explains the gaze behavior
and sections 6.2-6.4 for further implementation details. A reader interested in behav-
ior simulations in online virtual worlds should look at sections 6.6-6.9 which contain a
detailed description of the Vigor framework used for controlling agents in online virtual
worlds and a brief overview of examples of use for virtual guides and training simu-
lations. Finally, readers interested in new findings of interest to cultural anthropology
should check section 5.2 that describes a cross-cultural corpus of conversational behav-
ior with groups consisting of native speakers of Arabic, American English and Mexican
Spanish, and section 5.5 which contains quantitative analysis of the corpus data as used
for the computational models in conversational simulation.
Chapter 2
Dialog Simulation
2.1 Introduction
The core of the simulation is the modeling of conversational behavior and how this is
applied to the visualization of virtual humans. How this is approached depends largely
on the purpose of the simulation and its requirements. Most work on multiparty interac-
tion has been done in the social sciences, but there have recently been several attempts
at simulating social interaction and effects of social context on virtual human behavior.
Guye-Vuillème (2004) attempts to bridge virtual human simulation studies and the
social sciences. He presents a model of small group interaction and interpersonal
relationship development. It models interpersonal relationships as degrees of liking,
familiarity, trust and commitment. These develop based on interactions between members
that are guided by Interaction Process Analysis (IPA) profiles of individuals. He also
linked this social simulation to a 3D visualization. It simulates proxemic aspects of
interactions and includes a mechanism for selection of displays based on equilibrium
theory. Of particular note is the use of parametric postures to express status differences and
parametric adaptors for display of actions like small movements of the wrists or scratching.
Overall this is very strong work. While it focuses more on the social development of
groups than on turn-taking behavior, the resulting simulation appears to produce
believable visualizations.
Rehm et al. (2005, 2007b) take a similar approach in simulation of a virtual meeting
place where agents engage in social interaction. The behaviors are controlled by a set of
9
theories from the social sciences. They implemented several theories such as IPA theory,
congruity theory, social impact theory and self-attention theory, which can be
individually enabled in the simulation. They also used a statistical language model for language
generation and explored how a user can interact with the agents to get integrated into the
agents’ social network.
In our case, however, we are mainly interested in the believability of the turn
taking and other conversational behavior. Since the conversation simulation is meant for
background characters who are too far away from the main action to hear the content
we can focus on the appearance of conversation and the patterns of interaction, rather
than actual information exchange or communication of internal state. As the user is
not directly involved with background virtual humans, it is not of particular interest
to model social development of virtual humans. What we need instead is an accurate
representation of turn-taking behavior. Padilha provides a computational model of turn
taking based on established theories from sociology and psycholinguistics that serves
as a good basis for the simulation (Padilha and Carletta 2002, 2003; Padilha 2006). We
will first review the findings about conversational behavior, examine the computational
model proposed by Padilha and finally how we expanded the model to be used in our
simulation.
2.2 Conversation Analysis and Organization of Turn-
Taking
Participants in conversation take turns at talk. According to Sacks, Schegloff, and Jef-
ferson (1974), most of the time only one person speaks in a conversation, occurrences
of more than one speaker at a time are common but brief, and transitions from one turn
to the next usually occur with no gap and no overlap, or with a slight gap or overlap.
When the speaker is speaking there are natural points where others can begin their
turn. These are called transition relevance points (TRPs). Units of talk between these
points, termed turn construction units (TCUs), are either sentences, clauses, phrases or
lexical items. Because of the grammatical nature of language, it is possible to iden-
tify these TCUs while the speaker is generating them and in this way predict possible
completion points before they actually occur. TRPs can also be predicted based on par-
alinguistic features such as rhythm, intonation and loudness or by nonverbal behavior.
This is the key reason why the amount of overlap is low.
If the speaker addresses a particular participant with a question, then that person will usually take the turn at the TRP. The addressing is often accomplished by gazing at the addressee or by attaching a vocative. If, on the other hand, the speaker leaves a free TRP, anyone can decide to speak, or the previous speaker may choose to continue talking.
At a free TRP, more than one participant may decide to start talking. In such cases we have overlapped speech, and various factors influence who keeps speaking. Typically, whoever starts first gets the turn. If there is no distinguishable first starter, then whoever appears loudest usually gets the turn (Meltzer et al. 1971). When it is clear who started first, second starters are expected to realize this and stop, but there are cases where this does not hold. In general, the decision to talk or to stop talking in simultaneous speech depends on how eager the participants are to make a contribution and how involved they are in the conversation (Oreström 1983).
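These regularities can be written down as a small decision rule. The sketch below is purely illustrative; the function name and the 0.15-second "distinguishably first" threshold are our own hypothetical choices, not values from the cited studies.

```python
def overlap_winner(starters):
    """Resolve overlapped turn starts: a distinguishably first starter
    keeps the turn; otherwise the loudest speaker does (cf. Meltzer et
    al. 1971). `starters` maps name -> (start_time_s, loudness_db);
    the 0.15 s threshold is a hypothetical value."""
    ordered = sorted(starters.items(), key=lambda kv: kv[1][0])
    (first, (t1, _)), (_, (t2, _)) = ordered[0], ordered[1]
    if t2 - t1 > 0.15:
        return first  # a clear first starter keeps the turn
    # no distinguishable first starter: the loudest wins
    return max(starters, key=lambda name: starters[name][1])

winner = overlap_winner({"A": (0.00, 62.0), "B": (0.05, 70.0)})
```

With the two starts only 50 ms apart, neither is distinguishably first, so the louder speaker B keeps the turn.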
Another case that involves simultaneous speech is interruption, when someone interrupts the current speaker by barging in, either to forcibly take the turn or to clip redundancy because he already has an idea of what the speaker is going to say. These interruptions can have several outcomes: the interrupter may stop after a false start, the speaker may be cut off, or they may continue talking simultaneously. Speakers employ various techniques to keep the floor during simultaneous speech. The talk gets louder, higher in pitch, and changes its tempo. There can be many cut-offs, repetitions and phoneme stretching (Schegloff 1980). In general the duration of simultaneous talk is not long and the
less persistent speaker halts their production. When a third party is involved, such as when both speakers are addressing the same participant, the gaze of that third party plays a large role in influencing who keeps talking. When they are speaking to different parties, the speaker may decide to ignore the interrupter if he perceives the interruption as side talk and is not bothered by it.
While it can happen that more than one participant selects to speak at a TRP, it is also possible that no one does. At this point the previous speaker has the opportunity to continue speaking. The pause before the speaker continues is generally larger than when someone else takes the turn immediately after the speaker finishes. This can be seen as a result of the fact that the others are competing to take the turn and therefore try to start as soon as possible, while the current speaker can take more time if no one takes the turn. If the pause gets even longer, it becomes apparent that the previous speaker does not plan to continue and the floor is open to everyone.
Even though most of the time there is only one speaker, the listeners are not expected to be completely passive. They may nod, smile and express other feedback, for example indicating problems in understanding. These productions, usually called backchannels, are produced in the background and are not meant to take the floor. They include signals of continued attention, exclamations, short questions, sentence completions, brief restatements and clarification requests (Schegloff 1982). Most backchannels appear at TRPs (Oreström 1983), about half of them without any overlap. Listeners are rarely silent for more than 15 seconds; they produce feedback to the speaker in order to show attentiveness and not be inactive for too long.
2.3 Padilha Model
The model proposed by Padilha is not tied to a visualization; instead it provides a computational model of verbal turn-taking that generates symbolic talk matching the aggregate statistics of real conversations. By symbolic talk we mean that the talk is abstract, without any actual content: the model generates who is talking and when, but not what they are talking about. This is achieved with a probabilistic decision scheme driving the various turn-taking decisions. The conversation it models is a free-form, more or less informal discussion in which talk is the only activity. It cannot accommodate any change of participants or any external events, and it ignores any spatial relationships between participants.
The model uses a multi-agent system in which each agent represents a participant in the conversation and operates independently of the other agents. The agents are, however, synchronized. They all share a blackboard representing the environment, where they exchange the behaviors they are performing. This is in addition to their internal states, where they store, for example, who they are currently paying attention to. The simulation proceeds in lockstep: at each time tick, each agent observes what the group produced in the last tick and, based on that information and its internal state, generates a new behavior to be output to the blackboard. All behaviors produced by agents in one time step are considered simultaneous.
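A minimal sketch of this lockstep blackboard cycle might look as follows; the `Agent` class and its trivial `decide` policy are hypothetical stand-ins for Padilha's far richer decision scheme.

```python
import random

class Agent:
    """Hypothetical stand-in for a Padilha-style participant."""
    def __init__(self, name, talkativeness):
        self.name = name
        self.talkativeness = talkativeness

    def decide(self, last_tick):
        # Each agent only sees what the group produced in the previous
        # tick, plus its own internal state.
        if any(behavior == "speak" for _, behavior in last_tick):
            return "listen"
        return "speak" if random.random() < self.talkativeness else "idle"

def run_simulation(agents, ticks):
    blackboard = []  # behaviors produced in the previous tick
    history = []
    for _ in range(ticks):
        # All decisions within one tick are based on the same snapshot,
        # so they count as simultaneous.
        new_tick = [(a.name, a.decide(blackboard)) for a in agents]
        blackboard = new_tick
        history.append(new_tick)
    return history

random.seed(0)
log = run_simulation([Agent("A", 0.9), Agent("B", 0.3)], ticks=5)
```

Note that because every agent reads the same previous-tick snapshot, two agents can start speaking in the same tick, which is exactly the overlapped-start situation the model must then resolve.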
Each agent also has attributes that influence the probabilistic decisions involved in turn taking. They modify, for example, how often an agent talks, how likely it is to generate feedback, and how long the turn units it generates are. These will be explained in more detail when we discuss the actual implementation in our simulation later on.
The behaviors generated include speech, gaze, gesture and posture shifts. Since the
symbolic talk generated has no content, it is important that the simulation also includes
cue behaviors that would otherwise be inferred from syntax, semantics and intonation.
These are used to predict where TRPs will take place and are referred to as pre-TRPs.
Additionally there are symbolic behaviors representing the selection of the next speaker, covering situations in which the current speaker indicates who they want to take the next turn, for example by addressing a question to a specific participant. Backchannels are also differentiated from normal speech, as they are usually short and serve a different purpose, mainly to show attentiveness and provide feedback to the speaker. Padilha was mainly interested in the structural aspects of conversation; the later models omit the nonverbal behaviors, introduce other local speech behaviors such as hesitations, and focus more exclusively on the speech elements. In our case the nonverbal behaviors are very important for visualization, and local features such as hesitations are not necessarily perceptible to distant observers, so we have based our implementation mainly on the earlier models (Padilha and Carletta 2002).
The first visualization of the Padilha model was created by Patel et al. (2004), in which the model was implemented without any major modifications. It used the same blackboard model and linked the abstract behaviors to the animations of characters in an immersive simulation. The main extension was in allowing the attention of agents to be directed toward external events such as explosions or vehicle movement. See also chapter 6 for more information on the MRE project where the simulation was used.
2.4 Simulation Framework
While the initial simulation framework was largely influenced by Padilha and Carletta (2002), it has since evolved into a simulation that is more suitable for the purpose of controlling virtual humans. One of the main limitations of the simulation presented by Padilha was that it only supported one dialog at a time, meaning that all characters participated in the same conversation. When simulating virtual humans embedded in a virtual world we have to allow for more dynamic changes. While we could run multiple conversation simulations and explicitly assign different virtual humans to different conversations, this is still not realistic for the many situations in which humans move around and join or leave conversations. Likewise, even when people stay in the same position (e.g., at a meal or meeting), there are often dynamic splits and realignments into sub-conversations.
The goal of the simulation framework is to control the visualization of background virtual humans. To that end it is important to have a clear understanding of what can be expected of the virtual humans. We assume a rendering application that is able to display virtual humans and provides some atomic actions that virtual humans can perform. The rendering environment used in the first iteration was the Unreal Tournament 2003 game engine (more details in chapter 6), which influenced some assumptions made about what is available to the simulation. In particular, we assume that virtual humans can perform the following basic actions:
- walk to a specific location,
- gaze at a specific location or object,
- play predefined animation sequences.
In addition, we assume that we can obtain relevant information about objects in the virtual world, such as their locations, and that we have a reasonable set of animations available to indicate the different modalities of the conversations we are simulating. To cover the full range of expressiveness, the following animations would be required:
- conversational gestures, such as beat gestures, mainly involving the arm movements one makes while talking,
- feedback behavior, such as nods, shakes and other expressions like laughter or anger,
- posture shifts, moving weight between legs.
At the core of the simulation framework is the multi-agent architecture. The previous implementations used a synchronous model in which all participants of the conversation exchanged their messages on a shared blackboard, with a fixed conversation cycle synchronized across all participants. Padilha identified this as one of the limitations of the model and suggested an asynchronous model as possible future work (Padilha 2006). We have decided to follow up on this and restructure the system into an asynchronous model, because this allows the system to be more reactive and more properly represents the response time of behaviors. It is also more suitable for a large number of agents with several ongoing conversations. To facilitate this, each virtual character is controlled by an agent running in a separate thread, communicating with other agents by broadcasting messages. When an agent receives a new message it can react to it immediately, or it can simply update its internal state and make a decision during its normally scheduled processing.
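The thread-per-agent broadcasting scheme could be sketched like this; `AsyncAgent` and its shared message bus are hypothetical simplifications, not the actual framework code.

```python
import queue
import threading

class AsyncAgent(threading.Thread):
    """Hypothetical sketch: one thread per agent, message broadcast."""
    def __init__(self, agent_name, bus):
        super().__init__(daemon=True)
        self.agent_name = agent_name
        self.bus = bus            # shared list of all agents' inboxes
        self.inbox = queue.Queue()
        self.heard = []           # internal state updated from messages
        bus.append(self.inbox)

    def broadcast(self, msg):
        # Deliver the message to every other agent's inbox.
        for box in self.bus:
            if box is not self.inbox:
                box.put((self.agent_name, msg))

    def run(self):
        while True:
            try:
                sender, msg = self.inbox.get(timeout=0.2)
            except queue.Empty:
                break             # idle: stop this toy example
            # React immediately or just update internal state.
            self.heard.append((sender, msg))

bus = []
a, b = AsyncAgent("A", bus), AsyncAgent("B", bus)
a.start(); b.start()
a.broadcast("begin speaking")
a.join(); b.join()
```

In the real framework the threads would not terminate on an empty inbox; periodic proactive processing (section 2.5) would run alongside the reactive message handling.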
Each agent thus represents a specific virtual human in the simulation and is responsible for generating appropriate behaviors based on the situation. Agents have an internal memory in which they store information about the parts of the world relevant to them. Input to the agents takes the form of messages received from other agents and events from the rendering environment about changes in the virtual world; agents can also query the environment about the state of the world. Similarly, the output of the agents consists of the messages they generate and the commands they issue to the environment to perform basic actions.
Figure 2.1 shows the types of messages the agents can generate. Most of them are associated with visualization and cause some basic action to be executed, although some have no visual representation and serve only for agent coordination (for example, selection of addressee and the pre-TRP signal). Their use and meaning will be explained later in this chapter.
speech:
    begin speaking
    end speaking
    pre-TRP signal
    selection of addressee
    positive or negative feedback
non-verbal:
    nodding
    gestures
    posture shifts
    gaze

Figure 2.1: Conversation agent message types
While in the algorithms of Padilha and Carletta (2002) and Patel et al. (2004) every virtual human was in conversation all the time, our simulation allows situations in which a virtual human is not involved in a conversation at all. From this arises the need for some higher-order planning that decides when agents should join an existing conversation, when they should start a new conversation, and when they should leave a conversation for reasons external to the dialog simulation itself. The dialog simulation does not prescribe the behavior of agents while they are not engaged in conversation. While it only controls the behavior of virtual humans that are involved in conversation, it provides a mechanism for transferring control to a higher level that has access to the additional information needed to make those decisions. Such a reason could be, for example, needing to be at a specific location at a certain time, or simply an indication that the agent got bored or exhausted all topics of conversation and some other activity is now higher in importance.
The behavior of virtual agents is controlled by a set of attributes in a probabilistic
manner as in the previous algorithms (Padilha and Carletta 2002; Patel et al. 2004). Each
of these attributes has a value from 0 to 1. Whenever one of these attributes is tested, a
random number is selected and compared to the attribute value (possibly scaled based
on contingent factors of the conversation). The attributes used are shown in Figure 2.2.
talkativeness: likelihood of wanting to talk
transparency: likelihood of producing explicit positive and negative feedback, and turn-claiming signals
confidence: likelihood of interrupting and of continuing to speak during simultaneous talk
interactivity: the mean length of turn segments between TRPs
verbosity: likelihood of continuing the turn after a TRP at which no one has self-selected

Figure 2.2: Conversational agent attributes
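The attribute test described above, a random draw compared against a possibly scaled attribute value, can be sketched as follows; the particular scaling in the example is a hypothetical illustration of a contingent factor, not a value from our implementation.

```python
import random

def attribute_test(value, scale=1.0, rng=random):
    """Test a [0,1] attribute: succeed when a random draw falls below
    the (possibly scaled) attribute value. `scale` stands in for the
    contingent factors of the conversation mentioned in the text."""
    return rng.random() < min(1.0, value * scale)

# Illustrative (hypothetical) use: a confident agent's chance of
# keeping the floor shrinks the longer the simultaneous talk lasts.
random.seed(1)
confidence = 0.7
overlap_seconds = 3.0
keeps_floor = attribute_test(confidence, scale=1.0 / (1.0 + overlap_seconds))
```

Scaling below 1 makes success less likely and scaling above 1 more likely, which is how situational factors can bias an otherwise fixed personality attribute.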
Each agent also keeps track of information about the other virtual humans. Each agent tracks the gaze of others, whether they are speaking, and how long it has been since they last interacted in the tracking agent's conversation group. Agents also track the composition of their conversation group; conversation groups are not defined externally but are interpreted on the basis of perceived actions. Agents can also misinterpret the actions of others and can have different ideas about the composition of a conversation group.
2.5 Conversational Participation Algorithm
The algorithm can be conceptually separated into two parts: one deals with proactive behaviors while the other deals with reactive behaviors. Both can be interpreted in terms of events, the difference being that the proactive behaviors are driven by periodic events from the scheduler, whereas the reactive behaviors are triggered by external input to the agent. The proactive part consists of a high-level planning event and a conversation processing event that model the behavior explained in the previous sections. The reactive part, on the other hand, deals with claiming turns, responding to others and updating internal state. Each agent runs a separate instance of the algorithm in its own thread, with its own settings for the attributes and its own internal representation of the behaviors of others and of the group composition. What follows is a description of the major events and behaviors.
2.5.1 High-level planning
This part of the algorithm is external to the main conversation algorithm and represents the higher-order planning of the agents. In the first iteration this included only a simple version of the tests for engaging in and disengaging from conversation. The test for leaving a conversation is usually based on external information, but in its basic form it can be a simple chance-based decision with the condition that a certain amount of time has passed since the agent joined the conversation. In our case we used a 2% chance after the conversation had lasted at least 30 seconds. For engaging in conversation we use a test against talkativeness (comparing a random number between 0 and 1 against the talkativeness attribute). If the test succeeds, we either join an existing conversation or start a new one. Starting a new conversation always has some chance even if there is a conversation already going on (in our case we used 20%).
every planning cycle (approx. every 5 sec)
if in conversation
test to leave conversation
else if talkativeness test successful
decide to join existing conversation
or start a new conversation
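The planning cycle above, with the 2% leave chance after 30 seconds and the 20% new-conversation chance, might be reduced to the following sketch; the dict-based agent state and function names are hypothetical.

```python
import random

LEAVE_CHANCE = 0.02      # per cycle, once the conversation lasted 30 s
NEW_CONV_CHANCE = 0.2    # start a new conversation even if one exists

def planning_cycle(agent, now, rng=random):
    """One high-level planning step (run roughly every 5 seconds)."""
    if agent["conversation"] is not None:
        if now - agent["joined_at"] >= 30.0 and rng.random() < LEAVE_CHANCE:
            return "leave"
        return "stay"
    if rng.random() < agent["talkativeness"]:
        # Prefer joining an existing conversation, but keep a 20%
        # chance of starting a new one anyway.
        if agent["existing_conversations"] and rng.random() >= NEW_CONV_CHANCE:
            return "join"
        return "start new"
    return "idle"

random.seed(1)
agent = {"conversation": None, "joined_at": 0.0,
         "talkativeness": 0.8, "existing_conversations": ["group 1"]}
action = planning_cycle(agent, now=0.0)
```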
When embedding this dialog simulation in a larger simulation framework, this part is extended with the planning of actions external to the dialog algorithm. In later iterations it was also expanded to provide a more elaborate way of joining and leaving conversations that includes movement, as described in chapter 3.
2.5.2 Conversation processing
This is the main procedure responsible for the proactive turn-taking behavior of agents involved in conversation. While in reality we make these kinds of decisions continually, we restrict them to every 0.5 seconds, which is slow enough not to be too taxing on resources while still providing a fluid conversation flow.
As mentioned in section 2.4, each agent keeps track of the participants in the conversation and of when they were last active. Monitoring of activity is reactive, based on input received from other agents. Inactivity, however, is a lack of activity and by its nature not reactive; we notice it after the fact. If an agent has been inactive for a long period of time, we remove them from the conversation group. One exception is an agent that is actively gazing at the current speaker but not otherwise producing any actions; agents will treat such an agent as active and not remove them.
Next we distinguish several possible states an agent can be in while in conversation: either no one is currently speaking, the agent is listening to someone, the agent is speaking simultaneously with others, or the agent is speaking alone. Each situation requires different decisions to be made.
If no one is speaking, we are in the situation where no one took the turn at the previous TRP. We either have to decide to start speaking, keep waiting for someone else to start speaking, or leave the conversation. If we have previously decided to start speaking, then we do nothing (it is important that we remember this, as there is some reaction time between when we decide to start speaking and when we actually start speaking, and we do not want to override that decision). Based on the turn-taking systematics, a new speaker is expected to leave some gap before taking the turn, to allow the previous speaker to continue. If enough time has passed and the talkativeness test succeeds, the agent decides to start speaking. He selects a random short time interval (for example, up to 0.75 seconds) representing his preparation and picks someone in the conversation group toward whom the speech will be addressed. Another test, based on the transparency attribute, determines whether the agent will produce a visual indication of planning to take the turn, such as shifting posture. If, on the other hand, we do not decide to start speaking and no one has been speaking for a predefined amount of time, then we have to decide whether we want to leave the conversation.
The next situation we examine is when the agent is listening to someone. If there is only one speaker, the agent only has to continue processing gaze and test whether he wants to interrupt the current speaker. On the other hand, if there is more than one speaker for a long time, the agent has to consider the possibility that there are two conversations going on and that this might remain the case for some time. He selects the speaker he is currently listening to, or a random one if he is not listening to a particular one, and estimates the group composition. The procedure is similar to that used when joining a preexisting conversation: in essence it removes the other speakers and the conversation participants that are gazing at the other speakers. This is a starting estimate, and it relies on subsequent analysis of the behavior of other agents to figure out the correct group composition. For example, if it is just a very long interruption, the agent will notice that the other speaker is still in the same conversation (by interpreting gaze behavior or other feedback) and add them back appropriately. Apart from disambiguating simultaneous speech, the agent also has to decide whether he wants to barge into the discussion. The test is based on the talkativeness and confidence attributes, adjusted to produce a desired rate of interruptions.
The agents also have to disambiguate interruption from split conversation when they are the ones being interrupted. If there is more than one interrupter, we simply treat it as an interruption until either we stop or the number of simultaneous speakers drops to two. As described in the turn-taking organization, the main cue in this interpretation is gaze. If the interrupter is addressing someone who actively listens to them (i.e., gazes at the interrupter) for some time, then the agent interprets that as side talk; he removes them from the conversation and continues talking. In the case of a misinterpretation, the interrupter will eventually fail a confidence test and stop speaking, and by his further actions the agent can determine that he is still in the conversation group. The confidence test to stop speaking is adjusted by how long the simultaneous speech has been going on. This simulates the observation that most simultaneous speech is short, while still allowing for some longer stretches.
When the agent is speaking alone, which is the case most of the time, he has to decide when to produce gestures and whom to look at. In the initial model the speaker periodically selected a new target to look at; the gaze model was later expanded with culturally-specific parameters and is explained in more detail in chapter 4. In addition, we need a fail-safe mechanism for the case where everyone leaves the conversation and no one is listening to the agent anymore. In this case a confidence test is performed to stop the agent from speaking.
every conversation cycle (approx. every 0.5 sec)
remove characters that were inactive for too long
or are too far away
if no one is speaking
test talkativeness to start to speak
if so, start with random interval
select addressee
test transparency to shift posture
if no one was speaking for some time
if talkativeness test fails leave conversation
if listening to someone
if there is more than one speaker for some time
group was split into two or more conversations
keep speaker that I am listening to
remove participants that are attending to others
test talkativeness and confidence to interrupt
if speaking simultaneously
if there is only one additional speaker
and their addressee attends to them
then treat this as a side talk
remove both from conversation
otherwise test confidence to continue speaking
if speaking alone in a turn
decide when to gesture and gaze away
if no one is paying attention to me
if confidence test fails stop speaking
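The four situations handled by each conversation cycle can be captured by a small state classifier; this is a hypothetical reduction of the dispatch logic above, not the actual implementation.

```python
def conversation_state(me, speakers):
    """Classify an agent's situation at the start of a 0.5 s
    conversation cycle; each state leads to a different branch of the
    pseudocode above. `speakers` is the set of currently speaking
    participants as perceived by this agent."""
    if not speakers:
        return "no one speaking"
    if me not in speakers:
        return "listening"
    return "speaking alone" if speakers == {me} else "simultaneous"

states = [conversation_state("A", s)
          for s in (set(), {"B"}, {"A"}, {"A", "B"})]
```

Note that the classification is made from the agent's own perception of who is speaking, so two agents can legitimately classify the same moment differently.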
2.5.3 Claiming a turn
This is the first of several procedures responsible for reactive behavior. Agents decide whether or not to take a turn when they receive a pre-TRP signal (using the talkativeness attribute, or automatically if they were selected by the current speaker). Since the agents do not actually exchange any content and there is no prosodic information, they explicitly send a pre-TRP signal indicating that a TRP is coming shortly. This simulates the ability of conversation participants to estimate when the TRP is coming and to plan their actions to coincide with it. If agents decide they will speak, they also decide (using the transparency attribute) whether to signal their intention with turn-claiming signals, if appropriate.
when receiving pre-TRP signal
test talkativeness to decide to speak
if so, test transparency to make turn claiming signal
if not then test transparency to produce a backchannel
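A sketch of this reactive procedure, with hypothetical names; treating an agent selected by the current speaker as always taking the turn is a simplifying assumption of the sketch.

```python
import random

def on_pre_trp(agent, selected_by_speaker, rng=random):
    """React to a pre-TRP signal: decide whether to speak, then test
    transparency for a turn-claiming signal (if speaking) or a
    backchannel (if not)."""
    will_speak = selected_by_speaker or rng.random() < agent["talkativeness"]
    transparent = rng.random() < agent["transparency"]
    if will_speak:
        return True, "turn-claiming signal" if transparent else None
    return False, "backchannel" if transparent else None

random.seed(3)
agent = {"talkativeness": 0.5, "transparency": 0.9}
decision = on_pre_trp(agent, selected_by_speaker=True)
```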
2.5.4 Starting to speak
Whenever an agent starts to speak, it determines the timing of its turn, including when to send the pre-TRP signal. In addition, we simulate the observation that second starters tend to stop speaking if someone has already started; we do this with a confidence test. If the test fails, the agent stops speaking after a short reaction time.
when starting to speak
if at TRP and someone already started speaking
test confidence to continue speaking
select segment length based on interactivity
2.5.5 Continuing speaking
Sometimes when an agent finishes a segment, no one else takes over. In this case the agent has the option of continuing his own speech beyond what was initially planned.
when you end segment and no one takes turn
test verbosity to continue speaking
2.5.6 Tracking participation
Whenever an agent speaks or gives feedback to someone in a conversation group, they become an active participant of that group as well. This procedure maintains the conversational group and the activity of its members. As mentioned previously, the removal of participants from the group based on inactivity is handled in the main conversation processing.
when receiving input from other characters
if they are signaling to someone in my group
then add them to group (if not already there)
if they are in my group and addressing someone in my group
update last time they were active
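The tracking rules above can be sketched as a small update function over the agent's perceived group; the names and data structures are hypothetical.

```python
def on_signal(my_group, last_active, sender, addressee, now):
    """Update the perceived conversation group when a signal between
    two characters is observed: add senders who address a member of my
    group, then refresh the activity timestamp of in-group senders."""
    if addressee in my_group and sender not in my_group:
        my_group.add(sender)
    if sender in my_group and addressee in my_group:
        last_active[sender] = now
    return my_group, last_active

group, active = on_signal({"A", "B"}, {}, sender="C",
                          addressee="A", now=12.5)
```

Because each agent applies these updates to its own perceived group, two agents can end up with different (and occasionally mistaken) ideas of the group composition, exactly as described in section 2.4.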
2.5.7 Responding to others
This procedure calculates how an agent should respond to the initiation of speech by another. The agent's reaction depends on whether the agent is also speaking and who started first, whether the agent is part of the same conversation as the speaker, and on the attributes of confidence (whether to continue speaking or not), talkativeness (whether to join a conversation), and transparency (whether to show feedback behavior). When someone starts speaking after the agent has already started at a TRP, the confidence test is heavily biased in favor of the agent, but it allows for the possibility that he leaves the turn to the second starter.
when someone starts to speak
if in conversation with me
if at TRP and I already started speaking
test confidence to continue speaking
if not speaking
test transparency to gaze at speaker
if I am not in conversation and they are speaking to me
test talkativeness to join conversation
test transparency to give signals of joining
2.6 Evaluation of Initial Dialog Model
There are many possible ways to evaluate the simulation. One could try to fit the model to observed conversations, as suggested by Padilha and Carletta (2002). One could also test the differences in simulation that result from different sets of characters with different attribute values, e.g., whether they lead to domination of the conversation by a single character or a small set of characters. As suggested in Patel et al. (2004), we decided to test whether the simulation "looks like a conversation" to the viewer.
In the first phase the simulation consisted of a turn-taking simulation in which all virtual humans were placed in fixed positions. The scenario involved 6 Afghani civilians that we repurposed from the Leaders project (Gordon et al. 2004). The characters were designed as background virtual humans, which made them a good fit for our purposes. In the initial condition the characters were not involved in conversation. We recorded several simulations with different character attributes and stored the videos and the internal logs of each agent so that we could later analyze and compare their internal states with responses from the viewers. A screenshot from a conversation simulation is shown in Figure 2.3. In the videos we distributed the appearance of characters and character attributes in a balanced manner in order to minimize any effects that surface characteristics of the bodies might have on the results. We also made one simulation in which characters decided randomly when to start speaking and whom to gaze at, in order to have a baseline for comparison with our algorithm.
The test was composed of three parts. In the first part we asked the participants to view several 30-second clips of simulations and rate how believable they thought each simulation was on a 7-point Likert scale. We also asked them to describe any factors they thought made the conversation less believable. In the instructions we made it clear to the viewers that when judging the believability of the simulation they were to pay most attention to the appropriateness of behavior, particularly gaze and dialog, rather than the animation quality of the characters.

Figure 2.3: Afghani civilians engaged in conversation.
In the second part we asked viewers to view multiple 2-minute clips of simulations. We instructed them to pay attention to only one of the characters (a different character for different clips) and analyze its behavior. Since the attributes used in the algorithm are not all very visible in such a short dialog, we decided to ask viewers about the perceived properties of the characters rather than about the underlying attributes. We asked viewers to judge the following properties on a scale from 1 to 7:

talkative: how often is he talking
1 – almost never talks
4 – talks about as much as everyone else
7 – talks almost all the time
27
predictive: does he give any signals before speaking
1 – never gives any hints that he is about to speak
7 – always indicates that he wants to speak

transparent: is he giving any signals that he is attending to the speaker
1 – seems oblivious to others
7 – always signals understanding of others

interruptive: is he interrupting when others are speaking
1 – always waits for others to finish
7 – jumps into conversation all the time

confident: is he likely to keep talking if others speak at the same time
1 – gives up his turn if someone else starts to speak
7 – never shuts up when others speak
A character's perceived talkativeness is influenced by the talkativeness attribute; predictive and transparent are both influenced by transparency. Confident characters have a high confidence attribute, and interruptiveness is determined by a combination of talkativeness and confidence. We did not ask about verbosity or interactivity because that would require observation of longer segments to obtain significant results.
In the last part we asked viewers to track who they thought was speaking with whom, again for clips 2 minutes in length. We used this data to compare how well the internal state of each agent correlated with what was perceived by the viewer.
The average believability score for our algorithm was 5.3, compared to 3.3 for random behavior. The difference is statistically significant, which indicates that most viewers identified our algorithm as better than random. We found that the highest scores were received by simulations in which either all characters participated in the same conversation or the conversation groups corresponded to the positioning of the characters in the setting. Since the algorithm did not take the positioning of characters into account when deciding about creating new conversations and allowing conversations to split, it was not able to prevent this kind of undesirable behavior from happening.
Part 2 proved to be much more difficult than expected. Not only were there differences between the values predicted by the underlying attributes and the results from viewers, but the values also varied widely between viewers. This suggests that it is hard for humans to judge the personality of a virtual character, probably because of the lack of expressiveness of virtual characters compared to real humans. We suspect it would be hard to grasp the personality of a background character in any case. However, we still think that having a parameterized algorithm has its benefits, since the structure of the dialog changes with different attribute settings.
Results from part 3 showed that what viewers perceived roughly agreed with the internal state of the characters. When a certain group composition held for a longer time, most of the characters and viewers agreed on the current group composition. Most of them correctly differentiated between normal transitions, interruptions and side conversations. However, when side conversations did not last long, the results varied between characters and also between viewers.
From the evaluation it can be seen that it is beneficial to dynamically create behavior
for background virtual humans, as it both removes the labor-intensive work of creating
scripts and improves the believability of the simulations. However, the results also
showed that there is a lot of room for improvement. As already mentioned, the main
element affecting believability that we identified was the positioning of the characters. This
led us to examine what role movement and positioning play in conversations, which is
described in the next chapter.
Chapter 3
Movement and Positioning Simulation
3.1 Introduction
To understand the movement and positioning involved in conversation we have to look
at human conversation in a casual, open setting, such as a party or marketplace. One of
the first things we notice is a tendency for people to cluster into sub-groups involved in
different conversations. These groupings are not fixed, however; people often join
and leave groups and move from one group to another. Groups themselves may
fragment into subgroups, and smaller groups sometimes merge into one larger group.
Participants in these groups adapt their positions and orientations to account for these
circumstances, often without missing a beat or otherwise disrupting their conversations.
To realize this in the simulation we have to understand the forces at play: the various
reasons people might have to reposition themselves and move around. We then have
to represent this in a computational model. For inspiration on modeling of movement we
first turn our attention to virtual crowd simulations. Then we examine the reasons
for movement as observed by anthropologists and social psychologists. Finally, we
describe our simulation model in detail, show how it integrates into the conversation
simulation, and present the test case analysis.
3.2 Virtual Crowd Simulations
Simulations of large numbers of virtual humans have been explored in many different fields,
from physics and computer graphics to architecture and sociology. Most simulations
describe large-scale movement of agents and focus either on the realism of behavioral
aspects or on high-quality visualization, but recently work in the two areas seems to be
converging.
Crowd simulations were pioneered by Reynolds (1987) who simulated the flocking
behavior of animals using ideas from particle systems. Bouvier and Guilloteau (1996)
used a combination of particle systems and transition networks to model human crowds
to support architectural design. Brogan and Hodgins (1997) simulated group behaviors
for systems with significant dynamics using a perception model to guide the movement
of individuals. Helbing and Molnár (1995) presented a pedestrian model based on a
social force model and Still (2000) used mobile cellular automata to simulate pedestrians
in emergency evacuations. From a sociology perspective, McPhail et al. (1992) studied
individual and collective actions in temporary gatherings based on perception control
theory.
Of particular interest are simulations that focus on the modeling of individuals in
crowds that have non-trivial humanlike abilities. Terzopoulos and his colleagues devel-
oped a simulation that uses motor, perceptual, behavioral and cognitive components to
model pedestrians in urban environments such as train stations (Shao and Terzopoulos
2005). The virtual humans that populate the environment are fully autonomous and are
implemented as a hierarchical artificial life model. Complex actions are composed of
primitive reactive behaviors in a bottom-up approach. Besides navigational behavior
routines they also included others, such as sitting down on a seat, watching
a performance, queuing at ticketing areas and even a simple chat behavior. Recently
they have explored the use of decision networks to guide the interactions between vir-
tual humans in selection of behavior routines (Yu and Terzopoulos 2007). They have
employed a social force model for navigation control, but used it for pedestrian
behavior rather than in a conversational setting.
Ulicny and Thalmann (2001, 2002) also explore simulation of crowds using multi-
agent systems with an emphasis on individuals. The agents contain a set of internal
parameters representing psychological and physiological state. They receive events
from the environment and from other agents, and together with internal state these drive
the selection of higher-level complex behaviors. Behaviors are represented as a combination
of rules and finite state machines which use low-level actions provided by the virtual
humans. Strong points of their framework are a modular architecture with a clear separation
of the model and visualization parts, and the ability to mix autonomous and scripted
behavior in order to control the scenario. Their work, however, is very general and does
not focus on conversational behavior.
From these examples we have identified the social force model (Helbing and Molnár
1995) as the most appropriate framework for modeling movement in a conversational
setting. While the basic model applied only to pedestrian motion, we show how
it can be extended by introducing new social forces and how to use it for local
repositioning.
3.3 Reasons for Movement
There are several fields that are interested in navigation and positioning from which we
draw ideas for our simulation. First we have research from anthropologists and social
psychologists such as the classic work on proxemics by Hall (1968) and positioning
by Kendon (1990). Both provide examples of how people position themselves in different
situations. Another factor relevant in a conversational setting is the audibility of
speech. While people are able to selectively listen to several sources at a time, they
will prefer to move away from sound sources that they do not wish to attend to if they
have the option to do so. It is also important to know that people expect similar behavior
in virtual environments as in real life, as shown by Bailenson et al. (2003); Jeffrey and
Mark (2003); Nakanishi (2004). This gives us some basic principles on which to base
the simulation and provides some qualitative expectations that we can use to test the
model.
From these sources we have identified the following reasons why someone engaged
in a conversation would want to shift position:
- one is listening to a speaker who is too far away and/or not loud enough to hear,
- there is too much noise from other nearby sound sources,
- the background noise is louder than the speaker,
- one is too close to others to feel comfortable,
- one has an occluded view or is occluding the view of others.
Any of these factors (or a combination of several) could motivate a participant to
move to a more comfortable location. During the simulation, the speakers can change,
other noise sources can start and stop, and other agents can move around as well. These
factors can cause a variety of motion throughout the course of interactions with others.
In the rest of this section we describe these factors in more detail. Then in the next
section we will develop a formal model of reactions to these factors.
The first reason we consider for repositioning of conversation participants is the
audibility of the speaker. The deciding factor can be either the absolute volume of the speaker
or the relative volume compared to other “noise”. Noise here describes all audio input
that is not speech by someone in the current conversation group. This includes the
speech of agents engaged in other conversations as well as non-speech sounds. When
comparing the loudness of different sources, we take into account that the intensity of
the perceived signal decreases with the square of the distance and that the loudness
of several sources is additive.
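This loudness bookkeeping can be sketched in a few lines. The function below is our illustration; the name and the unit-free volume values are assumptions, not taken from the implementation:

```python
def perceived_level(listener_pos, sources):
    """Total intensity at the listener: inverse-square falloff per source,
    intensities of several sources simply added together.
    `sources` is a list of (volume, (x, y)) pairs."""
    total = 0.0
    for volume, (sx, sy) in sources:
        d_sq = (sx - listener_pos[0]) ** 2 + (sy - listener_pos[1]) ** 2
        total += volume / max(d_sq, 1e-6)  # clamp to avoid division by zero
    return total

# A source twice as far away contributes a quarter of the intensity,
# and the contributions of two sources simply add.
near = perceived_level((0.0, 0.0), [(1.0, (1.0, 0.0))])
far = perceived_level((0.0, 0.0), [(1.0, (2.0, 0.0))])
both = perceived_level((0.0, 0.0), [(1.0, (1.0, 0.0)), (1.0, (2.0, 0.0))])
```

An agent would compare the speaker's perceived level against the summed level of all non-conversation sources computed this way.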
Even when the speaker can be heard over a noise source, if outside disruptions are
loud enough, the group might want to move to a more remote area where they can
interact without interruptions. Each of the participants may decide to shift away from a
noise source, even without an explicit group decision. Of course this may not always be
possible if the area is very crowded.
Another reason for movement is proxemics. Individuals generally divide their personal
space into four distinct zones (Hall 1968). The intimate zone is used for embracing
or whispering, the personal zone for conversation among good friends, the
social zone for conversation among acquaintances, and the public zone for public
speaking. The actual distances the zones span differ for each culture, and their
interpretation may vary with an individual’s personality, which is addressed in chapter 4.
If the speaker is outside the participant’s preferred zone, the participant will move
toward the speaker. Similarly, if someone invades the personal zone of a participant, the
participant will move away.
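As an illustration, a zone test given an inter-personal distance can be sketched as follows. The default thresholds are commonly cited values for Anglo American culture (in meters) and stand in for the per-agent, culture- and personality-dependent values used in the simulation:

```python
def proxemic_zone(distance, thresholds=(0.45, 1.2, 3.6)):
    """Classify an inter-personal distance into Hall's four zones.
    `thresholds` gives the upper bounds (in meters) of the intimate,
    personal and social zones; anything beyond the last bound is public."""
    intimate_max, personal_max, social_max = thresholds
    if distance < intimate_max:
        return "intimate"
    if distance < personal_max:
        return "personal"
    if distance < social_max:
        return "social"
    return "public"
```

An agent whose speaker falls outside his preferred zone would then step closer, while anyone classified into a closer zone than desired triggers a step away.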
The final reason for movement is specific to multiparty conversations. When there
are several people in conversation they will tend to form a circular formation. This
gives the sense of inclusion to participants and gives them a better view of one another
(Kendon 1990).
In terms of qualitative expectations for the model we identified three specific cases.
First, when someone joins a conversation we expect the previous group to make space
for the new participants. Second, when a split happens in a conversation we expect the two
groups to cluster and segregate from each other. Finally, when participants have
incompatible proxemic zones we expect them to constantly readjust their positioning (Scheflen
1975). We will use these in the evaluation to test whether our model produces the expected
results.
3.4 Social Force Model
We present our movement simulation in the context of a social force model. Similar
to movement in crowds, the movement of people engaged in conversation is to a
large extent reactive. The reaction is usually automatic and determined by a person’s
experience rather than planned. It is possible to assign to each person in conversation
a vectorial quantity that describes the desired movement direction. This quantity
can be interpreted as a social force. This force represents the influence of the environment
on the behavior of a conversation participant. It is important to note, however, that
this force does not directly cause the body to move, but rather provides a motivation
to move. We illustrate these forces with figures such as Figure 3.1, where each circle
represents an agent, the different shadings represent members of different conversation
groups, thicker circles represent speakers in that group, and arrows represent forces on
an agent of interest.
We associate a force with each reason for movement:
- F_speaker: attractive force toward a speaker you are listening to
- F_noise: repelling force from outside noise
- F_proximity: repelling force from agents that are too close
- F_circle: force toward circular formation of all conversation participants
Figure 3.1: A sample group positioning. Each circle represents an agent. A thick border
indicates that the agent is talking; filled or empty shading indicates conversation group
membership.
F_speaker is a force that is activated when the speaker is too far from the listener.
This can happen for one of two reasons: either the speaker is not loud enough and
the listener has to move closer in order to understand him, or he is outside the desired
zone for communication. When the agent decides to join a conversation, this is the main
influence that guides the agent to his conversation group, as shown in Figure 3.2.
F_speaker is computed according to equation (3.1), where r_speaker is the location of the
speaker, r is the location of the agent and k is a scaling factor (we are currently using k = 1).

F_speaker = k (r_speaker − r)   (3.1)

Figure 3.2: Attractive force toward speaker F_speaker.
F_noise is a sum of forces away from each source of noise. Each component force is
directed away from that particular source and its size is inversely proportional to the square
of the distance. This means that only sources relatively close to the agent will have
a significant influence. Not all noise is a large enough motivation for the agent to act
upon. The force is only active when the noise level exceeds a threshold or when its
relative value compared to the speaker level in the group exceeds a threshold. Figure 3.3
shows an example of the latter. Equation (3.2) is used to compute F_noise, where r is the
location of the agent, i ranges over all noise sources, V_i is the volume of noise source i
and r_i is the location of noise source i. In our implementation we have used a uniform
volume for all speakers.

F_noise = Σ_i V_i (r − r_i) / ||r − r_i||^3   (3.2)
F_proximity is also a cumulative force. It is a sum of forces away from each agent that
is too close. The force gets stronger the closer the invading agent is. This takes effect
for both agents in the conversation group and other agents. This is the second force that
models proxemics. While F_speaker is activated when the agent is farther than the
desired social zone, F_proximity is activated when the agent moves to a closer zone. Based
on how well the agents know each other, this can be either when the agent enters the
intimate zone or the personal zone. Figure 3.4 shows an example where two agents get
too close to each other. Equation (3.3) is used to compute values for F_proximity, where
the sum ranges over the agents that are too close.

F_proximity = Σ_i (r − r_i) / ||r − r_i||^2   (3.3)
F_circle is responsible for forming the conversational group into a convex, roughly
circular formation. Each agent has a belief about who is currently participating in the
conversation. An agent will compute the center of mass of all these assumed participants
and the average distance from the center. If an agent’s position deviates too much
Figure 3.3: Repelling force away from other speakers F_noise.
Figure 3.4: Repelling force away from agents that are too close F_proximity.
from the average, F_circle gets activated, directed either toward or away from the center of
mass. Equation (3.4) is used to compute F_circle, where N is the number of agents in the
conversation and r_m represents the center of mass. Notice that F_proximity takes care of
spreading out around the circle.

r_m = (1/N) Σ_i r_i

F_circle = r_m + ( (1/N) Σ_i ||r_i − r_m|| ) (r − r_m) / ||r − r_m|| − r   (3.4)
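Equations (3.1)-(3.4) can be sketched directly in code. The following is a hedged Python reconstruction using 2-D tuples; the helper names are ours, the activation conditions are omitted, and degenerate cases (e.g. an agent standing exactly at the center of mass) are not handled:

```python
import math

def sub(a, b): return (a[0] - b[0], a[1] - b[1])
def add(a, b): return (a[0] + b[0], a[1] + b[1])
def scale(v, s): return (v[0] * s, v[1] * s)
def norm(v): return math.hypot(v[0], v[1])

def f_speaker(r, r_speaker, k=1.0):
    # Equation (3.1): attraction toward the speaker.
    return scale(sub(r_speaker, r), k)

def f_noise(r, sources):
    # Equation (3.2): repulsion away from each noise source with
    # inverse-square falloff; `sources` holds (V_i, r_i) pairs.
    f = (0.0, 0.0)
    for v_i, r_i in sources:
        d = sub(r, r_i)
        f = add(f, scale(d, v_i / norm(d) ** 3))
    return f

def f_proximity(r, intruders):
    # Equation (3.3): repulsion from agents that are too close;
    # magnitude grows as the intruder gets closer.
    f = (0.0, 0.0)
    for r_i in intruders:
        d = sub(r, r_i)
        f = add(f, scale(d, 1.0 / norm(d) ** 2))
    return f

def f_circle(r, participants):
    # Equation (3.4): pull toward the point at the average radius from
    # the center of mass, along the agent's current direction.
    n = len(participants)
    r_m = (sum(p[0] for p in participants) / n,
           sum(p[1] for p in participants) / n)
    avg_radius = sum(norm(sub(p, r_m)) for p in participants) / n
    direction = sub(r, r_m)
    target = add(r_m, scale(direction, avg_radius / norm(direction)))
    return sub(target, r)
```

For example, an agent at (2, 0) whose four assumed co-participants form a unit circle around the origin is pulled back toward the point (1, 0) on that circle.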
The situation in Figure 3.5 is an example where an agent decides that he has to adapt
his positioning. Notice that if this agent were not aware of the agent to his left, the force
would not get triggered. This can be a cause of many interesting situations when agents
have different beliefs about who is part of the conversation.
Figure 3.5: Agent’s deviation from circular formation exceeds the threshold and triggers
force F_circle.
As described above, each force has conditions that determine whether it plays an
active role in motivating movement. Since the forces are not actually physically
acting on the agents’ bodies, it is not unreasonable for agents to suppress a certain force. All
the possible causes for movement are always present, but the agents selectively decide
which ones they will act upon in a given situation. This is unlike a kinematics calculation
with physical forces, where all forces are always active. Combining all the conditions
we can define which forces are active according to a simple decision procedure. We can
view this as priorities that the agent uses to decide which conditions are more important
to react to.
In our implementation we use the following priorities:

if speaker is too low
    F = F_speaker + F_proximity
else if noise is louder than speaker
    F = F_speaker + F_noise + F_proximity
else if noise is too loud
    F = F_noise + F_proximity
else if too close to someone
    F = F_proximity
otherwise
    F = F_circle
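The decision procedure can be written as a small dispatch function; the threshold arguments below are illustrative stand-ins for the implementation's actual tuning values:

```python
def select_force(forces, speaker_level, noise_level,
                 min_speaker=1.0, max_noise=2.0, too_close=False):
    """Combine component forces by priority.  `forces` maps the names
    "speaker", "noise", "proximity" and "circle" to 2-D vectors."""
    def total(*names):
        return (sum(forces[n][0] for n in names),
                sum(forces[n][1] for n in names))

    if speaker_level < min_speaker:      # speaker is too low
        return total("speaker", "proximity")
    if noise_level > speaker_level:      # noise is louder than speaker
        return total("speaker", "noise", "proximity")
    if noise_level > max_noise:          # noise is too loud
        return total("noise", "proximity")
    if too_close:                        # too close to someone
        return total("proximity")
    return total("circle")

# Example component forces for one agent.
forces = {"speaker": (1.0, 0.0), "noise": (0.0, 1.0),
          "proximity": (0.5, 0.0), "circle": (0.0, -1.0)}
```

Only the forces selected by the highest-priority matching condition contribute; all others are suppressed for that planning cycle.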
Using the above priorities we have a force defined at each point in space where an
agent could be located. We do not use this for the continuous computation of movement,
but rather use it to compute destination points. In each planning cycle the agents
consider whether they should move. To do this an agent considers his position in the
force field and computes a destination in the direction of the force field. This process is
performed iteratively a constant bound times (unless there is no movement in an earlier
iteration). This is described in the following equations, where r is the initial position,
ε is a scaling factor, and P_bound is the destination for the movement of this planning cycle:

P_0 = r
P_{i+1} = P_i + ε F(P_i)
Destination = P_bound
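The iteration can be sketched as follows; the values of ε and bound are illustrative, and the force field is passed in as a function of position:

```python
def plan_destination(r, force_field, epsilon=0.5, bound=10):
    """Iterate P_{i+1} = P_i + epsilon * F(P_i) at most `bound` times,
    stopping early once the local force vanishes."""
    p = r
    for _ in range(bound):
        f = force_field(p)
        if f == (0.0, 0.0):  # no motivation to move
            break
        p = (p[0] + epsilon * f[0], p[1] + epsilon * f[1])
    return p

# With a field pulling toward the origin, each step halves the distance.
dest = plan_destination((8.0, 0.0), lambda p: (-p[0], -p[1]), epsilon=0.5, bound=3)
```

The returned point is then handed to the character movement action as its destination.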
Once we have computed the destination, we use it as the destination point for the character
movement actions. Character animation and collision avoidance are then handled at
a lower level, separate from the movement planning.
Figure 3.6 shows an example with two separate conversation groups, where one
agent decides to leave the shaded group and join the unshaded conversation. The figure
shows the iterations he performs in his planning cycle and the resulting final
destination.
Figure 3.6: Example of motion computation: The lower right agent decided to join the
unshaded conversation. He iteratively applies movement in the direction of local forces.
In each iteration the effects of different component forces may take effect. The thick
line indicates the final destination and path the agent chooses for this planning cycle.
3.5 Integration into Conversation Simulation
With the addition of movement into our simulation we also expanded the mechanisms
for joining and leaving conversations to make use of positional information. This
replaces the simple algorithm described in section 2.5.1.
every planning cycle (approx. every 5 sec)
plan repositioning
if in conversation
test to leave conversation
else
test to engage in conversation
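The cycle above can be sketched as a single dispatch function; the agent interface used here (plan_repositioning, in_conversation, test_leave, test_engage) is hypothetical naming for illustration, not the implementation's API:

```python
def planning_cycle(agent):
    """One planning cycle, run roughly every five seconds per agent."""
    agent.plan_repositioning()      # social force model calculations
    if agent.in_conversation:
        agent.test_leave()          # possibly disengage
    else:
        agent.test_engage()         # possibly start or join a conversation

class _DemoAgent:
    """Minimal stand-in that records which subroutines were invoked."""
    def __init__(self, in_conversation):
        self.in_conversation = in_conversation
        self.calls = []
    def plan_repositioning(self): self.calls.append("reposition")
    def test_leave(self): self.calls.append("leave")
    def test_engage(self): self.calls.append("engage")

busy, idle = _DemoAgent(True), _DemoAgent(False)
planning_cycle(busy)
planning_cycle(idle)
```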
When embedding this dialog simulation in a larger simulation framework, this part is
extended with planning of actions external to the dialog algorithm. Among other things
it also controls when the agents should join or leave a conversation. To facilitate this,
a number of subroutines are available that implement the default decision making as
shown above. The repositioning refers to the social force model calculations described
in the previous section. The rest of the subroutines are described next.
Engaging in conversation
This subroutine is very simple and its purpose is to have an agent engage in conversation.
The decision is made based on a test against the talkativeness attribute. The agent
then queries the environment about the virtual humans that are in his hearing range or
visibility range (in the actual implementation this information is cached locally). If there
are some virtual humans close by, he engages in conversation with them (either joining
an existing conversation or starting a new one); otherwise he selects a random agent and
moves toward them (using a movement message that triggers the movement basic action;
more details are given in chapter 6).
when testing to engage in conversation
if talkativeness test successful
if there are characters in hearing range
engage in local conversation
else
select a random agent in visibility range
and move toward them
When actually engaging in conversation, the agent first has to decide whether to join
an existing conversation or start a new one. There is always a small chance for the agent to
decide to start a new conversation, but when joining an existing conversation he will give
preference to those that are closer to him.
We base this on opening sequences such as those described by Kendon and Ferber
(1973) and Schiffrin (1977). They describe many possible modifications, such as
misidentification and passing greetings, where people don’t engage in conversation. Here,
however, we focus on the base sequence, which starts with recognition of the other,
a simple distant salutation and an approach to the conversation group. From here on the
agent adjusts his position based on the movement algorithm and starts the conversation
processing.
when engaging in local conversation
randomly select a speaker in hearing range
with bias toward ones that are closer
if there is no one speaking or with small chance otherwise
select a random agent within hearing range
gaze at the agent
perform distant salutation
move into comfortable range
begin conversation processing
else
estimate conversation group composition of selected speaker
gaze at the speaker
perform distant salutation
move into comfortable range
begin conversation processing
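The bias toward closer speakers can be realized, for example, with inverse-distance weights. The dissertation does not specify the exact bias function, so this sketch is one plausible instantiation:

```python
import random

def pick_speaker(agent_pos, speakers, rng=random):
    """Pick one of the (name, position) pairs in `speakers`, with
    probability proportional to the inverse of the distance, so that
    closer speakers are preferred."""
    weights = []
    for _, (sx, sy) in speakers:
        d = ((sx - agent_pos[0]) ** 2 + (sy - agent_pos[1]) ** 2) ** 0.5
        weights.append(1.0 / max(d, 1e-6))
    pick = rng.random() * sum(weights)
    for (name, _), w in zip(speakers, weights):
        pick -= w
        if pick <= 0:
            return name
    return speakers[-1][0]  # guard against floating-point drift
```

A speaker ten times closer is then roughly ten times more likely to be chosen; any other monotonically decreasing weighting would serve the same purpose.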
Disengaging from conversation
The test for leaving a conversation is usually based on external information, but in its
basic form can be a simple chance-based decision with the condition that a certain amount
of time has passed since the agent joined the conversation. Since the agents themselves do
not generate any content, a specific closing greeting message is generated that informs
the other conversation participants that the agent is about to leave. Similar to the opening, a
more elaborate ritual with its associated visual display could be simulated instead. We
also stop the conversation processing responsible for attentiveness to the ongoing
conversation.
when testing to leave conversation
if test to leave conversation successful
send closing greeting message
cancel conversation processing
3.6 Evaluation
We evaluated the movement algorithms against a series of test cases where we know
what behavior to expect from known forces. In this section we present three such cases,
showing that the algorithm has the power to represent several aspects of conversational
positioning.
All the agents use identical default individual parameters, but unlike in a normal
simulation we constrained the grouping dynamics. In a normal situation the agents would
randomly form conversation groups based on their stochastic decisions. Here we
wanted to examine particular scenarios and how the movement algorithm would react
to specific changes in conversation group structure. For this reason we disabled
conversational grouping decisions in the algorithm and triggered the group structure changes
manually from the user interface.
The only variable input to the movement algorithms for different agents is the preference
for proxemics. Each agent has defined values for all zones, but we set all agents
to use the social zone for communicating. The other parameters, such as the thresholds for
hearing a speaker, for noise, and for circular formation, were fixed for these experiments.
3.6.1 Joining conversation
In this test case we have four agents. In the initial condition three agents are engaged in
conversation while the fourth one is away from the scene. We let the simulation run
and at some point give a command to the fourth agent to join the group of three. At
first the agent moves toward the group until he is in a comfortable range, as shown in
Figure 3.7.
Figure 3.7: The agent on the left is approaching a conversation. Arrows indicate where
the agents will move from now until the simulation stabilizes.
Figure 3.8: Stable point after the fourth agent joins the conversation.
At the point at which the fourth agent decides to join the other three, he is the only
one who knows he wants to join the conversation. The other agents know of the presence
of the fourth agent, but have no idea that he would like to join them. The fourth
agent listens for a while, and when he gives a feedback signal the other agents
interpret it as a sign that he wants to join the conversation. As a result the agents
reevaluate their positioning, and one agent decides it would be appropriate to move a step
back to give more space to the new agent. Given more space, the new agent is able to
move into circular formation with the rest of the group without intruding on the personal
zones of the other agents. The stable point of the simulation is shown in Figure 3.8.
3.6.2 Conversation splitting into two separate conversations
In this test case we have six agents. After the initial placement of the agents we issue a
command for all the agents to form one conversation group. As a result they form a
circular formation, as can be seen in Figure 3.9.
Figure 3.9: Agents form in a circle to engage in a single conversation.
We let the agents talk for a while and then give a command to the two agents on
the right side of the group to start a side conversation. After this a complex sequence
of events takes place. Initially the remaining agents still think that those two agents are
part of their conversation group. They have to disambiguate the speech of those two
agents and decide whether this is just an interruption or a split in the conversation. After
a while they realize that those agents are having a separate conversation.
Deciding that the agents on the right have left the conversation leads to a change in
the force field. The agents that were closest to the split are bothered by the noise and
start adjusting by moving away. By doing this they change the shape of the formation, which
causes the farther agents to adapt back into circular formation. At the same time
the agents who split off also move away from the others until they reach a point where all
are satisfied. The point where the simulation stabilized is shown in Figure 3.10.
Figure 3.10: After two agents leave the conversation, the remaining agents adapt by
repositioning.
3.6.3 Effect of proxemics
In this test case we examine the effects when the social zones of the agents are not
compatible. This frequently happens when we have people from different cultures with
a large difference in distances for social zones. An example would be North Americans
compared to Arabs, since Americans prefer a much greater inter-personal distance than Arabs.
Empirical data shows that in many such situations there is a sort of dance, with one
person moving in while another moves away (Scheflen 1975).
Figure 3.11 shows an example of agents with incompatible social zones. The markings
on the ground indicate the minimum and maximum acceptable distance of the social
zone for each agent. We can see that the agent on the left has a much smaller comfortable
distance than the one on the right. In the current position the left agent feels that
the other one is too far away, while the right agent thinks everything is fine. This causes the
left agent to make a step forward. By doing so, however, he steps into the personal zone
Figure 3.11: Incompatible social zones.
of the right agent. Now the left agent is satisfied with the situation, but the right agent
feels uncomfortable and decides to take a step back to keep the other agent out of his
personal zone. If nothing else intervenes, this process can continue, as the agent on the
left “chases” the one on the right out of the marketplace.
3.7 Conclusion
In this chapter we have shown how a movement and positioning component based on
social forces can be used to enhance our conversation simulation. In addition to repositioning
based on structural changes in the groups, we have enhanced the algorithms for
joining and leaving groups so that positioning can be taken into account. This allows
us to form more believable groups and addresses the main deficiency we identified in
the first iteration.
As the last test case shows, however, culture plays an important role in contextual
believability. Even when all the participants within a group are from the same culture,
we expect differences between groups from different cultures. It is thus important
that our simulation represents cultural variation appropriately when used for simulations
that target specific cultural environments, not only from a believability point of view, but
also from a pedagogic perspective when the simulation is used for cultural training
purposes. In the next chapter we look at the effect of culture on proxemics, gaze,
turn-taking and overlap, and expand our simulation with a set of parameters that allows
for cultural differentiation.
Chapter 4
Cultural Modeling
4.1 Introduction
When virtual agents are used for training simulations that are targeting a specific culture,
it is important to take into account the cultural aspect of contextual believability. We
expect the virtual agents to behave and interact with each other in a manner that fits in
the environment. If the user is on a virtual mission in a small town in the Middle East
he may expect to see Arab people on the streets. The experience will be much different
than a similar setting in the suburbs of a western city. There will be many differences in
how people behave in each environment. For instance when people interact with each
other there will be differences in how close to each other they stand and how they orient
themselves. Their gaze behavior will be different and there could even be differences
in turn taking and overlap. It is important then for the virtual agents to behave in a
culturally appropriate manner depending on where the virtual experience is situated.
The cultural model we developed for this purpose focuses on cultural variation of
non-verbal behaviors in conversation. We were interested in factors that play an important
role in face-to-face conversation and are mainly controlled and learned on a rather
subconscious level. As a first step we examined differences in proxemics, gaze and
pauses between turns. While other factors such as gestures also have significant cultural
variation and are salient for the outward appearance of conversation, we restricted our
investigation to factors for which we did not have to alter the art used in the
simulation.
Our aim was to create cultural models based on the relevant literature available
and to test the effects of using the models on the believability of generated simulations. Since
a cultural appropriateness evaluation will in large part depend on the culture of the subject
performing the evaluation, we restricted our investigation to Anglo American,
Spanish-speaking Mexican and Arab culture, based on the subject pool available to us.
We believe that the cultural model framework should still be applicable to other cultures,
but we have not attempted to provide parameter values for cultures other than the ones
mentioned here.
First we will look at other cultural models in the realm of virtual agents. Then we
will review the literature that we used as the basis for our cultural model. Finally, we
will present our computational model and its evaluation.
4.2 Cultural Models for Virtual Agents
There has been an increasing need for culturally adaptive agents, as observed by O’Neill-Brown
(1997). In most cases the agents are built to target a particular culture. When
applying this kind of agent to a new virtual setting, its behavior has to be modified. To
minimize the amount of work required for this modification, the agent architecture can
be designed in a modular fashion so that only the mappings of functional elements to
their culture-specific surface behaviors have to be changed, as is the case for REA (Cassell
et al. 1999). On the other hand, the agents can be made to adapt to different cultural
norms. At one extreme they could observe the environment and decide how to
behave. A more feasible alternative, however, is for the agent’s designer to provide this
information, as is proposed for GRETA (de Rosis et al. 2003). The goal is for the agent
architecture to provide a set of parameters that can be modified in order to generate
behavior that is culturally appropriate for the agent in a given scenario.
Maldonado and Hayes-Roth (2004) describe the process of localizing a virtual character
to a specific culture. They express this in terms of ten qualities: identity, backstory,
appearance, content of speech, manner of speaking, manner of gesturing, emotional
dynamics, social interaction patterns, role and role dynamics. They give an example of
a web-based tutor that has been localized for U.S., Venezuelan and Brazilian culture.
While the variability is very rich, it is specific to a particular individual agent and it is
not clear how it could be reused in another agent.
Taylor et al. (2007) describe an implementation of a knowledge-based cultural model
in Soar based on cultural schema theory and appraisal theory. The agents have several
schemas that describe normative behavior. A schema is a template for a single event or
a sequence of events, including roles and related objects. Schema instances are acti-
vated by relevant events in the environment. For example, entering a restaurant triggers
a restaurant schema, which is a collection of schemas describing how to enter, order food,
eat and exit. The agents then use these schemas to guide their behavior when they evalu-
ate various options they have to accomplish their goals.
Akhter Lipi et al. (2009) describe a data-driven model based on Hofstede cultural
theory to predict posture expressiveness. They use a Bayesian network to link the Hof-
stede cultural dimensions with posture parameters such as spatial extent, rigidness, mir-
roring, frequency and duration. They trained the network on German and Japanese data
and then used the Bayesian network as a decision component in selection of postures
from a given data bank to fit a specific culture.
4.3 Human Culture-Specific Behavior
We will now review existing work on analysis of culture-specific behavior in human
interactions. The behaviors we look at are proxemics, gaze and turn taking and overlap.
We cover general findings and where possible elaborate on cultural differences.
4.3.1 Proxemics
Proxemics relates to the spatial distance between persons interacting with each other,
and their orientation toward each other. Hall writes that individuals generally divide
their personal space into four distinct zones (Hall 1968). The intimate zone is used
for embracing or whispering, the personal zone is used for conversation among good
friends, the social zone is used for conversation among acquaintances and the public
zone for public speaking. In addition to cultural variations in proxemics, there are also
variations based on sex, social status, environmental constraints and type of interaction.
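Hall's four-zone division can be sketched as a simple classifier. This is only an
illustrative sketch; the boundary values (in meters) are the North American figures
adopted later in this chapter, and the function name is an assumption.

```python
# Minimal sketch of Hall's four proxemic zones. Boundary values (meters)
# follow the North American parameters used later in this chapter; they
# are illustrative, not definitive constants.
def proxemic_zone(distance_m, intimate_max=0.45, personal_max=1.2,
                  social_max=2.7):
    """Classify an interpersonal distance into one of Hall's zones."""
    if distance_m <= intimate_max:
        return "intimate"
    elif distance_m <= personal_max:
        return "personal"
    elif distance_m <= social_max:
        return "social"
    return "public"
```

Swapping in culture-specific boundary values yields the same classification scheme
for other cultural models.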
Baxter (1970) observed interpersonal spacing of Anglo-, Black-, and Mexican-
Americans in several natural settings. He classified subjects by ethnic group, age, sex
and indoor/outdoor setting. Results for Anglo- and Mexican-American adults are listed
in table 4.1. While there are differences based on environment and sex combination, it is
clear that those variations are dominated by cultural differences.
Ethnic Group Sex Combination Indoor Adults Outdoor Adults
Anglo M-M 2.72 2.72
Anglo M-F 2.33 2.59
Anglo F-F 2.45 2.46
Mexican M-M 2.14 1.97
Mexican M-F 1.65 1.83
Mexican F-F 2.00 1.67
Table 4.1: Mean Interpersonal Distance in Feet
Watson and Graves (1966) observed 32 male Arab and American college students in
pairs (6 possible combinations) for 5 minutes after 2 minute warm-ups. They found
that Arabs and Americans differed significantly in proxemics, the Arabs interacting
with each other closer and more directly than Americans. They also report that dif-
ferences between subjects from different Arab regions were smaller than for different
American regions. In a similar experiment Watson (1970) studied 110 male foreign stu-
dents between spring of ’66 and ’67 at the University of Colorado. He found that Latin
Americans exhibit less closeness than Arabs, but still interact much closer than Anglo
Americans.
Shuter investigated proxemic behavior in Latin America. He was particularly inter-
ested in changes between different geographical regions. In his study he compared
proxemics of pairs involved in conversation in a natural setting (Shuter 1976). He con-
cluded that interactants stand farther apart and the frequency of contact diminishes as
one goes from Central to South America. Table 4.2 lists the distances recorded in this
study.
Sex Combination Costa Rica Panama Colombia
M-M 1.32 1.59 1.56
M-F 1.34 1.49 1.53
F-F 1.22 1.29 1.40
Table 4.2: Mean Interpersonal Distance in Feet
McCroskey et al. (1977) performed an interesting study investigating whether real
life proxemic behavior translates into expected interpersonal distances when using a
projection technique. Their main goal was to get more data on proxemics as it relates to
differences in subjects that are hard to measure in naturalistic observations. They asked
subjects to place a dot on a diagram of a room where they would prefer to talk with a
person of interest. The results for projection technique were in agreement with findings
of real observations. A similar finding of proxemic behavior translating to a virtual
setting is reported by Nakanishi (2004) in an analysis of proxemics in a virtual conferencing
system.
4.3.2 Gaze
Most available data on gaze concerns dyadic conversations. Kendon (1967) writes that
gaze in dyadic conversation serves to provide visual feedback, to regulate the flow of
conversation, to communicate emotions and relationships and to improve concentration
by restriction of visual input.
Argyle and Cook (1976) provide a number of useful data on gaze measurements in
different situations. While most of it is dyadic, there is some data available for triads.
They compare gaze behavior between triads and dyads as reported in studies by Exline
(1960) and Argyle and Ingham (1972). Table 4.3 shows how the amount of gaze differs
between the two situations (although the tasks and physical conditions in the two studies
were different so group size may not be the only variable). In dyadic conversation people
look nearly twice as much when listening as while speaking.
Triads Dyads
Sex Combination MMM FFF MM FF
Average amount of gaze by individuals 23.2 37.3 56.1 65.7
Looking while listening 29.8 42.4 73.8 77.9
Looking while talking 25.6 36.9 31.1 47.9
Mutual Gaze 3.0 7.5 23.4 37.9
Table 4.3: Amount of gaze (%) in triads and dyads
A study by Weisbrod (1965) looked at gaze behavior in a 7-member discussion
group. He found that people looked at each other 70% of the time while speaking
and 47% while listening. Among other things, he concluded that to look at someone
while he is speaking serves as a signal to be included in the discussion, and to receive a
look back from the speaker signals the inclusion of the other. Kendon (1967) attributes
this reversal of the pattern as compared to dyadic situation to the fact that in multiparty
situation the speaker must make it clear to whom he is speaking.
There is some data available on cultural differences in gaze behavior. In a review
by Matsumoto (2006) he reports that people from Arab cultures gaze much longer and
more directly than do Americans. In general contact cultures engage in more gazing and
have more direct orientation when interacting with others. Rossano et al. (2009) look at
gaze behavior in Italian, Tzeltal and Yélî Dnye and find that in dyadic question-answer
pairs the patterns differ from previous predictions, exhibiting a larger amount of gaze
from speakers and less from addressees.
4.3.3 Turn Taking and Overlap
As was explained in section 2.2, most of the time only one person speaks in a conversation and
the duration of simultaneous speech is not long. However, in actual conversations this is
not always the case. Berry (1994) makes a comparison between Spanish and American
turn-taking styles and finds that the amount of overlap in Spanish conversation is much
higher than predicted by Sacks et al. One of the reasons for this behavior is the presence
of collaborative sequences. These are genuinely collaborative in nature and include
completing another speaker’s sentence, repeating or rewording what a previous speaker
has just said, and contributing to a topic as if one had the turn even though one does not.
Also when simultaneous speech does occur, continued speaking during overlap is much
more common in Spanish conversation.
Stivers et al. (2009) examine overlap in 10 languages drawn from major world lan-
guages and indigenous languages, focusing on question answer pairs and establishing
that the effect should translate to overlap in general. They find that there is a universal
tendency to avoid overlapping talk and to minimize silence between turns. There
is, however, a difference in the average gap between turns. The deviation of a culture’s
mean from the overall mean is within about 250 ms, with Danish having on average a +469
ms delay and Japanese only +7 ms. They also note that within-culture differences
have a similar effect for all of the cultures examined. For example, confirming answers
tend to have a shorter delay than nonconfirming ones. Similarly, answer responses tend to be
faster than nonanswers. In 9 out of 10 cultures the responses were delivered faster if the
speaker was gazing at the addressee while the question was asked.
4.4 Computational Model
The literature described in section 4.3 provides enough information to create a general
framework for a simple computational model. However, in the process of deciding
specific values for the cultural parameters we found that a lot of the needed information
is missing.
Most of the data on proxemics only has information on mean distance between
observed subjects. Information about values for different interaction zones is rare, and
for North Americans most sources agree with the values reported by Hall (1968). Data for Mexican
and Arab cultures is much scarcer. While we did find some information in Spanish
literature on interaction distances for different zones, it was not clear whether they were
reporting values specific to Spanish cultures or just in general.
Literature on cultural differences of gaze and overlap in turn taking is rare and gen-
erally lacks quantitative data (with the exception of Stivers et al. (2009), which unfor-
tunately was not available yet at the time of our initial investigation). The only culture-
specific information we found on gaze indicated that gaze is more direct and longer in
contact cultures. While data on overlap in turn taking suggested that Spanish cultures
allow for more overlap than English, we did not find any comparative analysis for Arab
culture.
To apply the cultural variation to the dialog simulation we made some changes to the
algorithm in order to allow for the needed variations. The ability to provide proxemic
parameters to the simulation is provided by the movement and positioning algorithm
(see chapter 3 for more details). It takes 3 inputs: maximum distance for intimate zone,
maximum distance for personal zone and maximum distance for social zone. Each
agent maintains a belief about its relationship with other agents. They will choose an
appropriate interactional distance based on this information. If at some point during
the conversation an agent gets out of his preferred zone for interaction he will adapt by
repositioning himself. He will do this while balancing requirements for proxemics with
other factors that influence positioning, such as audibility of the speaker, background
noise and occlusion of other participants in the conversation.
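The zone-based distance choice described above can be sketched as follows. The zone
table, relationship labels and function names here are hypothetical illustrations (the
distance values follow the Anglo American parameters used later in this chapter), not
the simulation's actual API.

```python
import random

# Hypothetical table mapping an agent's believed relationship with a
# partner to a (min, max) interaction distance in meters; values follow
# the Anglo American zone boundaries used later in this chapter.
ZONES = {
    "close_friend": (0.0, 0.45),   # intimate zone
    "good_friend": (0.45, 1.2),    # personal zone
    "acquaintance": (1.2, 2.7),    # social zone
}

def preferred_distance(relationship):
    """Pick a comfortable distance inside the zone for this relationship."""
    lo, hi = ZONES[relationship]
    return random.uniform(lo, hi)

def should_reposition(current_distance, relationship):
    """True when the agent has drifted out of its preferred zone."""
    lo, hi = ZONES[relationship]
    return not (lo <= current_distance <= hi)
```

In the full positioning algorithm this preference would be only one term balanced
against audibility, background noise and occlusion.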
To express the differences in gaze behavior and turn taking overlap we had to make
some changes to the simulation. Previously the conversation simulation employed a uni-
form gazing behavior. In order to differentiate gazing behavior of agents that are speak-
ing and listening we designed a probabilistic scheme where agents transition between
different gaze states. We identified 5 different states: 1) agent is speaking and the agent
they are gazing at is looking at the speaker, 2) agent is speaking and the agent they are
gazing at is not looking at the speaker, 3) agent is speaking and is averting gaze or look-
ing away, 4) agent is listening and speaker is gazing at him, 5) agent is listening and
speaker is not gazing at him. In each of the states the agent has a number of possible
choices. For example in state 1 he can choose to keep gazing at the current agent, he can
choose to gaze at another agent or gaze away. Each outcome has a weight associated
with it. If the weights for these 3 outcomes are 6, 2 and 2 respectively, then the speaker
will choose to keep gazing at their current target 60% of the time, in 20% he will pick
a new agent to gaze at and in the remaining 20% he will look away. The decision on
whether to transition between states is performed about every 0.5 seconds of the simula-
tion. In addition to these weights we introduced another modifier based on our informal
observations that makes it more likely for agents to look at agents that are currently
gazing at them.
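The weighted choice in each gaze state can be sketched as below. The weights for the
first state (6, 2, 2) and the modifier for agents currently gazing at the chooser are
taken from the text; the target labels and function name are assumptions for
illustration only.

```python
import random

# Sketch of the probabilistic gaze update described above. Targets that
# are currently looking at this agent get their weight multiplied by a
# "gazing at me" factor, making mutual gaze more likely.
SPEAKER_ATTENDING = {"addressee": 6.0, "random": 2.0, "away": 2.0}

def choose_gaze_target(weights, looking_at_me=frozenset(), factor=1.5):
    adjusted = {t: w * (factor if t in looking_at_me else 1.0)
                for t, w in weights.items()}
    targets = list(adjusted)
    return random.choices(targets, [adjusted[t] for t in targets])[0]
```

With the weights 6, 2 and 2 and no modifier applied, the current addressee is kept
about 60% of the time, matching the example in the text.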
Last, the overlap between turns was redesigned to follow a Gaussian distribution (ten
Bosch et al. 2004), with mean and variation as parameters that can be culturally defined.
Whenever the agent decides to take a turn at a pre-TRP signal (a cue by which the agents
are able to predict when the next transition relevance place will occur), it picks a random
value based on this distribution and uses it to schedule when he will start
speaking.
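A minimal sketch of this sampling step, assuming the Silence parameters of the
American model described below (mean 0.0 s, standard deviation 0.5 s); the function
name is illustrative.

```python
import random

# At a pre-TRP signal the next speaker samples a start offset relative to
# the predicted transition relevance place: negative values overlap the
# current turn, positive values leave a gap. Mean and standard deviation
# are the culturally defined Silence parameters.
def turn_start_offset(mean=0.0, stddev=0.5):
    return random.gauss(mean, stddev)
```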
The parameters defining the cultural variation are represented in XML format with
sections for proxemics, gaze and silence and overlap. The following is an example XML
culture description for the Anglo American model. The distances for proxemics are
expressed in meters. The gaze section starts with a GazingAtMeFactor which specifies
the modifier making it more likely to gaze at someone that is currently looking at the
agent. Following are the distributions for the gaze behavior for each of the 5 gaze states
(Speaker/Attending, Speaker/NonAttending, Speaker/Away, Addressee, Listener). The
choices for which the weights can be specified are: Speaker - agent that is speaking,
Addressee - agent that speaker is gazing at, Random - random conversation participant,
Away - averting gaze or looking away. The last section, Silence, includes the afore-
mentioned parameters of the Gaussian distribution for overlap between turns.
<Culture name="NorthAmerican">
  <Proxemics>
    <Intimate>0.45</Intimate>
    <Personal>1.2</Personal>
    <Social>2.7</Social>
  </Proxemics>
  <Gaze>
    <GazingAtMeFactor>1.5</GazingAtMeFactor>
    <State name="Speaker/Attending">
      <Addressee>6.0</Addressee> <Random>2.0</Random> <Away>2.0</Away>
    </State>
    <State name="Speaker/NonAttending">
      <Addressee>1.0</Addressee> <Random>8.0</Random> <Away>1.0</Away>
    </State>
    <State name="Speaker/Away">
      <Addressee>9.0</Addressee> <Away>1.0</Away>
    </State>
    <State name="Addressee">
      <Speaker>8.0</Speaker> <Random>1.0</Random> <Away>1.0</Away>
    </State>
    <State name="Listener">
      <Speaker>6.0</Speaker> <Addressee>2.0</Addressee>
      <Random>1.0</Random> <Away>1.0</Away>
    </State>
  </Gaze>
  <Silence>
    <Mean>0.0</Mean>
    <StdDev>0.5</StdDev>
  </Silence>
</Culture>
Next Target
State Speaker Addressee Random Away
Speaker Attending 6 2 2
Speaker Non-Attending 1 8 1
Speaker Away 9 1
Addressee 8 1 1
Listener 6 2 1 1
Gazing at me factor = 1.5
Table 4.4: American gaze parameters in table form.
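Loading such a culture description at runtime is straightforward. The sketch below is
a hedged illustration only: the element names mirror the example above but are
reconstructions, and the simulation's actual schema and loading code may differ.

```python
import xml.etree.ElementTree as ET

# Parse a culture description into nested dicts of parameter values.
# Element names here are illustrative, not the simulation's actual schema.
EXAMPLE = """
<Culture name="NorthAmerican">
  <Proxemics>
    <Intimate>0.45</Intimate>
    <Personal>1.2</Personal>
    <Social>2.7</Social>
  </Proxemics>
  <Silence><Mean>0.0</Mean><StdDev>0.5</StdDev></Silence>
</Culture>
"""

def load_culture(xml_text):
    root = ET.fromstring(xml_text)
    return {section.tag: {child.tag: float(child.text) for child in section}
            for section in root}
```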
We tried to back most of the values for cultural parameters with data from the avail-
able literature, but in many cases we had to resort to approximations based on available
qualitative descriptions. For proxemics of North American culture we used the values
reported by Hall, as shown in the example XML above. To overcome the
lack of data for zone distances of Arab and Mexican culture we used differences in
mean distances from reported studies and used them to modify distances for all zones.
For Mexican Spanish we used the values of 0.45 m for intimate, 1.0 m for personal and
2.0 m for social zone, and 0.45 m, 0.7 m and 1.5 m for the respective zones in the Arab model.
To model the more direct gaze of contact cultures we increased the weight corre-
sponding to gaze at the speaker in the Mexican and Arab models. We decided not to
introduce any differences in turn-taking overlap because we did not find any comparative
data for Arab culture.
4.5 Evaluation of Literature-Based Cultural Model
To test whether the cultural model described in this chapter can saliently represent the
cultural differences we conducted a cross-cultural study of the perceptions of non-verbal
behaviors in a virtual world. Native speakers of American English, Mexican Spanish,
and Arabic observed six two-minute silent videos representing multi-party conversation
created by recording the simulation run with different parameters. We also identified
age and sex of the participants and where they attended high school. In the study we
had 20 native English speakers, 22 to 70 years old, who all attended high school in the
US. 20 Arab subjects were in the range from 21 to 48 years old and most attended high
school in the Middle East (Lebanon, Qatar, Syria, Kuwait, Palestine, Morocco, Egypt).
All except one out of 20 Mexican subjects attended high school in Mexico and ranged
from 19 to 38 years old.
While all of the videos had Afghani characters in a Central Asian setting (the same
as for the evaluation in the first iteration described in section 2.6), the parameters of the characters’
non-verbal behaviors were set to values based on the literature for Americans, Mexicans,
and Levantine Arabs. The resulting videos differed mainly with respect to proxemics.
While the Mexican and Arab model had more direct gaze at the speaker, that aspect
was not always easily observable given the location of the camera. Two different videos
for each culture were presented to each observer, and the order of presentations was
balanced across observer groups. The observers were asked to rate the realism with
respect to their culture of the overall video, the characters’ proxemics, the characters’
gaze behaviors, and the characters’ pauses in turn-taking.
The results contained both expected and unexpected elements. Arab subjects judged
the Arab proxemics to be more realistic than both American and Mexican proxemics
(p < 0.01). Arab subjects also judged the Arab videos more realistic overall than the
American videos (p < 0.01). Arab subjects did not judge American proxemics to differ
Figure 4.1: These are two examples taken from the videos used in the evaluation. Left
picture is from the North American model and right picture from the Arab model.
from Mexican proxemics. And judgments of Arab subjects about gaze and pause did not
show significant differences across cultures, which was expected because these param-
eters did not significantly differ across the videos. The judgments of the Mexican and
American subjects did not show differences between any of the cultures with respect to
proxemics or overall realism. When performing a paired t-test on evaluations of indi-
vidual videos, the subjects saw significant differences between some of the individual
videos, even if the differences were not significant when the data for videos represent-
ing the same culture was aggregated. For example, the subjects judged the proxemics of
video “Arab 1” to differ significantly from those of both “American 1” and “Mexican 2”
(p < 0.001). There was suggestive evidence (p < 0.05) that American subjects distin-
guished the proxemics of “Arab 1” from “American 1”, but Mexican subjects apparently
did not perceive these differences (p > 0.59). There is suggestive evidence that Mexican
subjects distinguished “Arab 1” from “Mexican 2” (p < 0.13), but Mexican subjects did
not distinguish the pairs of Arab and Mexican animations.
It is not clear why the differences between cultures were not significant when there
were significant differences in perceptions of the individual videos. One possibility is
that the videos differed from each other along dimensions other than proxemics, gaze
and inter-turn pause length. Possible factors include gesture, coincidental coordination
of gesture among the characters, and limitations of the virtual world, which may have
affected the representations of the different cultural models in different ways. Another
possibility is that the range of possible simulations generated by the stochastic model is
too varied and goes outside the bounds of acceptable behavior. Using a larger number
of shorter movie clips would give an insight into this question.
It appears that while some aspects of cultural differences are salient, there are still
some factors influencing the perceptions that are unexplained. It is important to note
that while the models might not be completely accurate, they still yield perceptible dif-
ferences. People can make different judgments of culture-appropriateness of conversa-
tional behavior of virtual characters, and these differences are related to their cultural
background.
At the conclusion of this iteration we were faced with several questions. Due to
the lack of quantitative data on cultural differences we considered ways of obtaining it.
One decision was to create a cross-cultural video corpus which we could use to refine our
cultural model. Another persistent question was the difficulty of evaluating
cultural appropriateness. It is possible that the lack of perceived differences
was due to the inherent difficulty of the task and not to the simulation specifically.
Finally, the responses to open-ended questions regarding believability and differences
observed in the videos implied that context plays an important role when making such
judgments. We used this in the design of the corpus task, as will be described in the next
chapter about the final iteration.
Chapter 5
Model Refinement and Conversation
Structure Model
5.1 Introduction
While the cultural model described in chapter 4 provided a good starting point, it was
not clear how accurately it represented the actual cultural variations. As a result we
decided to create a cross-cultural dialog corpus representing the three cultures
we investigated in our previous study: Arabic, American English and Mexican
Spanish. This corpus served as the central point for this iteration cycle.
We performed two main lines of work based on this corpus. The first
focused on collecting quantitative data for model refinement and better understand-
ing of the evaluation results from the previous iteration. As part of this we performed a
perception study where subjects evaluated cultural appropriateness of the videos from
the corpus to compare against the evaluation of cultural appropriateness of videos result-
ing from simulation. The second line of work was inspired by comments received in the
evaluation of the previous iteration, where some subjects said that the conversation would
seem more believable if a specific context was assumed, such as, for example, a quarrel.
Since the previous simulation framework assumed a free-form conversation without any
specific context in mind, we have decided to investigate how we could model differ-
ent conversation styles to better address contextual believability. The result of this line
of work was an extension of the simulation with a conversation structure model based on
interaction process analysis (Bales 1976).
Both lines of work were very interconnected and were going on at the same time,
so at times it may be difficult to separate them completely. For ease of understanding,
however, the rest of this chapter is organized as follows. First we report
on the design and implementation of the cross-cultural dialog corpus. Then we will
cover the background work related to our work on conversation structure modeling and
how we modified the simulation to incorporate the conversation structure model. Next is
a report on model refinement based on data analysis, then the results of the cross-cultural
perception study of corpus video data and finally an examination into the difficulty of
determining conversational task from observation.
5.2 Cross-Cultural Dialog Corpus
One of the main purposes in creating the corpus was a better understanding of cultural
influences on components of conversational behavior. Specifically we were interested in
proxemics, turn-taking and gaze behavior. This imposes some constraints on the design
of the corpus and usability of other existing corpora. For example the AMI corpus (Car-
letta et al. 2006) contains multiparty conversations with audio-visual information, but
the participants are seated at a table so the study of proxemics is not possible. The corpus
also does not control for culture, so cross-cultural comparison is not possible. CUBE-G
corpus (Rehm et al. 2007a) on the other hand was specifically designed for examination
of cross-cultural differences in non-verbal conversational behavior, but only examined
dyadic conversation.
With these considerations in mind we created the UTEP-ICT Cross-Cultural Multi-
party Multimodal Dialog Corpus (Herrera 2010; Herrera et al. 2010, 2011). It consists
of video recordings of multiparty interactions representing three different cultures. The
cultural groups represented native speakers of Arabic, American English and Mexican
Spanish. Each session consisted of a group of four participants from the same cultural
background that performed 5 conversational tasks that lasted about 10 minutes each.
The participants were standing and free to move as can be seen in Figure 5.1.
Figure 5.1: Arab group performing task 4.
The tasks were designed in a way that allowed us to examine contextual differences.
We decided to focus on two conversational contexts. In a storytelling context the par-
ticipants are telling stories or personal experiences, while in decision making context
they must reach a consensus on a given topic of discussion. It was our expectation
that the conversations in the different contexts would exhibit different turn-taking patterns.
In addition, two tasks involved a toy as an additional possible point of gaze focus for
the participants, which allowed us to see how gaze behavior changes in the presence of
an additional point of interest beyond the participants themselves.
We wanted the conversations to be as natural as possible. The participants were only
equipped with a wireless microphone and they were free to move around the room as
they wished. The experimenter explained the tasks one at a time and was not present
in the room while the task was being performed by the group. In the first task the
participants were asked to explain their pet peeves. In the second task they were asked
to come to a decision about which movies they had all seen and what were the best and
worst parts. In task 3 the toy was introduced and the participants were asked to come
up with a name for the toy. In task 4 they were asked to tell a story about the toy. And
finally in task 5 they were asked to describe an inter-cultural experience. Of these the pet
peeves, the toy story and inter-cultural experience tasks were examples of storytelling
context while the movies and toy naming were examples of decision making context.
The interactions were recorded from six points of view and partially annotated. In
total there were 12 conversation groups. Due to the size of the annotation task we
restricted the annotations to two 30-second clips for each of the first four tasks. This
allowed us to have two examples each of the storytelling and decision making contexts,
with the toy-present clips split equally between them. The annotations included proxemics, turn-taking
room was recorded. Gaze was annotated as either towards one of the other participants,
towards the toy or away. Turn-taking annotations consisted of time periods when the
participants were speaking, with a distinction between talking and laughing. The annotation
was performed by 3 annotators using ANVIL (Kipp 2001), with an interrater reliability
(Kappa) of at least 0.80. In case of outliers the data was reviewed and corrected if needed.
More details about the corpus and annotations are available in Herrera et al. (2010).
Figure 5.2: Gaze annotation in Anvil.
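The agreement statistic reported for the annotations can be illustrated with a minimal
sketch. This two-rater version of Cohen's kappa is only an illustration of the measure;
the corpus used three annotators, and the exact kappa variant computed for it is not
specified here.

```python
from collections import Counter

# Minimal sketch of Cohen's kappa for two annotators labeling the same
# items (e.g. gaze targets per frame): observed agreement corrected for
# the agreement expected by chance from each rater's label frequencies.
def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b.get(l, 0) for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```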
5.3 Conversation Structure Modeling
The main question we are trying to address here is how to modify the conversation
simulation in order to improve contextual believability for cases where the conversa-
tion participants are involved in a particular conversational task. As we mentioned in
the considerations when designing the corpus tasks, one of the ways we expected the
distinction to show up was in changes in turn taking. With this in mind we examined
Interaction Process Analysis, which is one of the main methodologies for analyzing
small group interactions that provides information on turn taking behaviors in relation
to content function categories. We will first look at a brief overview of the methodology
and then review the available literature on distribution of participation in conversations.
5.3.1 Interaction Process Analysis
Interaction Process Analysis (IPA), developed by Bales (1976), is a prominent
methodology for observation and analysis of small groups. It defines a standard set
of general-purpose categories used to code interactions. It allows aggregation of atomic
interaction units into interaction profiles that can describe the interaction behavior of
groups as a whole and of individual participants, as well as its development over time. There
is a significant amount of standardized empirical data available that describes standard
interaction patterns.
The coding of interactions first breaks down all acts into basic units, each coded
with one of the 12 standard categories. Each interaction participant is assigned a posi-
tive number, while the group as a whole is represented by a zero. By coding a whole
conversation this way we obtain a sequence of timed codings of the form category:
initiator – recipient (the whole group can only appear as recipient). The interaction profiles
are then aggregated based on these codings. The interaction profiles consist of rates for
Positive Reactions
1) Shows solidarity, raises other’s status,
gives help, reward
2) Shows tension release, jokes, laughs,
shows satisfaction
3) Agrees, shows passive acceptance,
understands, concurs, complies
Attempted Answers
4) Gives suggestion, direction, implying
autonomy for other
5) Gives opinion, evaluation, analysis,
expresses feelings, wish
6) Gives orientation, information, repeats,
clarifies, confirms
Questions
7) Asks for orientation, information, rep-
etition, confirmation
8) Asks for opinion, evaluation, analysis,
expression of feeling
9) Asks for suggestion, direction, possible
ways of action
Negative Reactions
10) Disagrees, shows passive rejection,
formality, withholds help
11) Shows tension, asks for help, with-
draws out of field
12) Shows antagonism, deflates other’s
status, defends or asserts self
Table 5.1: IPA Categories
each of the IPA categories, giving an indication of how often a particular category was
used. These can be analyzed and compared with other profiles to investigate different
trends and how various factors influence the distribution of categories. In his work Bales
examined several indexes based on interaction profiles to explore the leadership role in
groups and how different factors influence it.
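The profile aggregation just described can be sketched as follows. The tuple layout and
function name are hypothetical; this is only a minimal illustration of turning a coded
conversation into category rates.

```python
from collections import Counter

# Hypothetical sketch: each IPA coding is a (category, initiator,
# recipient) tuple, with categories 1-12 and recipient 0 standing for the
# group as a whole. The profile gives the rate of each category.
def interaction_profile(codings):
    counts = Counter(category for category, _init, _recip in codings)
    total = len(codings)
    return {cat: counts.get(cat, 0) / total for cat in range(1, 13)}
```

Profiles built this way for different participants, groups or time slices can then be
compared directly.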
Table 5.1 shows the definitions of categories as described by Bales. The cate-
gories are grouped in four groups: positive reactions, negative reactions, questions and
attempted answers. The positive/negative reactions deal with social-emotional problems
while the other two deal with the task problems.
Bales observed a number of regularities in interactions of small groups. The findings
are considered to be general and are found to be true under many different conditions. In
particular he observed a number of chains of actions that occur. For example, questions
(7, 8, 9) are usually followed by answers (4, 5, 6), and answers by agreement
or disagreement (3, 10). He observed several such regularities in terms of repetitions
and alternations.
5.3.2 Distribution of Participation
It has been observed that the level of participation in small groups forms a general
pattern that is dependent on group size, structure and conversational task. The first
results from Bales et al. (1951) suggested that the distribution of participation closely
resembles a harmonic distribution. Stephan (1952) and Stephan and Mishler (1952)
used the data from Bales and similar data from student discussion groups and concluded
that a geometric distribution is a much better fit. He noted, though, that special roles
such as chairman or another person with a special function in the conversation may follow a
distribution distinct from that of the remaining participants. The pattern remains even
when groups are selected to be homogeneous. For example, when groups were formed
only from high or only from low participators from previous groups, the new groups again showed
the same distribution.
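To make the contrast between the two proposed fits concrete, the predicted shares of speaking time can be computed for a small group. The following is a minimal sketch; the geometric ratio of 0.5 is an illustrative value, not one fit to the Bales data:

```python
def harmonic_shares(n):
    """Participation shares proportional to 1/rank (the Bales et al. 1951 fit)."""
    weights = [1.0 / rank for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def geometric_shares(n, r=0.5):
    """Participation shares proportional to r**rank (the Stephan and Mishler fit).
    The ratio r is a free parameter fit to data; 0.5 here is illustrative only."""
    weights = [r ** rank for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Predicted shares for a four-person group, highest participator first.
print([round(s, 3) for s in harmonic_shares(4)])   # [0.48, 0.24, 0.16, 0.12]
print([round(s, 3) for s in geometric_shares(4)])  # [0.533, 0.267, 0.133, 0.067]
```

The geometric fit drops off faster between adjacent ranks, which matches the observation that the top participator dominates more strongly than a harmonic series would predict.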
There have been several attempts to model the participation distribution (Leik 1967;
Goetsch and McFarland 1980; Kadane et al. 1969). Stasser and Taylor (1991) observed
that the speaking turns are clustered and proposed that the turn allocation depends on dif-
ferences in speaking rates, temporary increases in speaking likelihood close to a recent
turn and the competition for turn by the participants. More recent models focusing on
task-oriented conversations have attempted to combine the propensity to speak with the
distribution of task-relevant information among speakers (Stasser and Vaughan 1996;
Bonito 2001). Others attempt to model distribution of turns depending on the relative
status of participants based on expectation states theory (Fişek et al. 1991). The predictions
based on these models, however, tend to be inconclusive (Robinson and Balkwell
1995).
Few studies directly compare turn distributions across the conversational
tasks performed by the group. One such example is the Interactive Systems Lab-
oratories meeting data collection (Burger et al. 2002; Burger and Sloane 2004). They
look at recordings of different kinds of meetings, including project planning, military
block parties, games, chatting and topic discussion. In comparing the data they find
the highest number of turns per minute in discussions and lowest in project planning.
Project planning also has the highest percentage of turns longer than 10 seconds, with
the lowest percentage in chatting. Their analysis of participation distribution does not
look at speaking time or turn distribution directly. Instead they segment the sessions
into sections that consist of a set of dialog moves starting with an initiation followed
by responses until its purpose is fulfilled or abandoned. Next they count, for each
participant, the number of sections in which they participated, ignoring their relative
contribution within each section. In project meetings they found two dominant speakers participating in
over 75% of the sections, with the rest participating in less than 30%. In role-playing
meetings participation was very balanced, with all speakers between 60% and 70%.
In chat and discussion meetings the participation gradually dropped across speakers.
There is also some reported effect of culture on the distribution of turns. Du-Babcock
(2003) shows that in collectivistic cultures the distribution of speaking time is more even
than in individualistic cultures. Also, in heterogeneous groups, participants
from collectivistic cultures tend to participate less than those from individualistic
cultures.
5.4 Modifications to Dialog Simulation
We have two main options for modifying the turn-taking structure of the conversation
in the simulation. The first is turn duration. The initial model used a uniform
turn length calculation based on an interactivity parameter. We can replace this with a
turn length based on the conversation style we are trying to simulate. The other option
is modification of the overall turn distribution of the participants. In the initial model
this was indirectly determined by personality parameters of the agents such as talkativ-
ity. Using IPA interaction profiles as input to the conversation simulation it is possible
to augment the tests for when agents start to speak with the distribution of categories
in the individual profiles. Additionally, this provides a way to represent local struc-
tural differences between different interactions based on patterns of common category
sequences.
To apply this to the generation of turn-taking behavior we assign an interaction pro-
file to each conversation participant. This is used as a basis for the agents to decide
whether to take an action at a particular point in time and what kind of message to pro-
duce. The goal is to generate over the long-term a sequence of actions that matches the
prescribed interaction profiles. In addition to global behavior we want the local char-
acteristics to also follow the desired pattern. For this purpose we transform the local
patterns (for example answers following questions and similar as mentioned in sec-
tion 5.3.1) into rules that temporarily modify the probabilities of particular categories
being generated based on interactions that actually occurred. For example if agent A
addresses agent B with a question this will temporarily increase the probability of agent
B producing a response.
The selection of dialog structure is currently determined externally from the simu-
lation. That is, if the context in which the conversation takes place requires a specific
structure, that information should be provided to the simulation by altering the interac-
tion profiles of the agents. We do not attempt to provide a theory of how such changes
would occur or to make changes automatically and this will remain an area for possible
future work.
An interaction profile prescribes over a long time frame the rate of generation for
each of the IPA categories. To achieve this as conversation goes along the agents keep
track of the desired rate compared to the actual rate of actions they have performed.
To facilitate this the rates in the interaction profile are expressed in percentages of total
conversation time. The main idea is that based on the interaction profile we can compute
expected time in each IPA category for the agent. Additionally we also know what the
agent actually did. If we look at the difference of the two, that is, the desired and the
actual rate of the IPA categories, we want this difference to be as close to zero as possible.
For this purpose the agents store this difference in their information state. They do not
compute this by tracking each of the factors separately and recording the whole history
of interaction, but by only storing the final difference and modifying it iteratively. Thus
in each conversation cycle of the simulation the value is increased by cycle length times
the percentage prescribed in the interaction profile and if the agent is speaking in that
turn the length of the cycle is subtracted from the category that is being performed.
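The iterative update described above can be sketched as follows; the `RateTracker` class name and the category keys are illustrative, not taken from the thesis implementation:

```python
# Sketch of the per-cycle rate-difference update: each cycle adds
# cycle_length * prescribed rate for every category, and subtracts the
# cycle length from the category actually being performed when speaking.
class RateTracker:
    def __init__(self, profile):
        # profile: dict mapping IPA category -> fraction of total talk time
        self.profile = profile
        self.diff = {cat: 0.0 for cat in profile}  # desired minus actual, seconds

    def update(self, cycle_length, speaking_category=None):
        for cat, rate in self.profile.items():
            self.diff[cat] += cycle_length * rate
        if speaking_category is not None:
            self.diff[speaking_category] -= cycle_length

    def total(self):
        return sum(self.diff.values())
```

Only the running differences are stored; no interaction history is kept, matching the iterative formulation above.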
The sum of these values can then be used to modify the tests when deciding when
someone should start speaking and the individual values to decide which IPA category to
produce if the agent decides to speak. Since the values represent the difference between
desired and actual rate, if the sum is positive that means that so far the agent generated
less than desired amount of actions and we should thus increase the probability of trying
to take the turn. The opposite holds if the sum is negative. We can control how strongly
the interaction profiles influence the basic model by modifying a conversion parameter
used to transform pure time based difference to adjustment of probability (the total result
capped between 0 and 1) as well as a decay factor applied in each cycle to the current
value that can be used to make deviations made far in the past less influential. Care
should be taken when assigning interaction profiles to make sure that the summed percentage
rates across all participants do not exceed 1, because in the model most of the time
only one participant is speaking. If the summed rates exceed that
value then inevitably some agents will be below their desired participation rate and we
will see a very high rate of simultaneous speech and interruptions caused by agents
trying to fulfill the unnatural distribution prescribed to them.
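The conversion from accumulated time difference to an adjusted speaking test might look like this; the function names and the 0.99 decay factor are assumptions, while the 0.01 conversion rate follows the worked example later in this chapter:

```python
def adjusted_speaking_probability(talkativeness, total_diff, conversion=0.01):
    """Adjust the base talkativeness test by the accumulated difference
    between desired and actual speaking time (seconds), capping to [0, 1]."""
    p = talkativeness + conversion * total_diff
    return max(0.0, min(1.0, p))

def decay(diff, factor=0.99):
    """Per-cycle decay so deviations far in the past lose influence.
    The 0.99 factor is an assumed illustrative value."""
    return {cat: value * factor for cat, value in diff.items()}

# Seven seconds below the desired rate nudges a 0.5 talkativeness up slightly.
print(adjusted_speaking_probability(0.5, 6.95))  # 0.5695
```

With a conversion rate of 0.01, roughly fifty seconds of accumulated deviation would be needed to push a 0.5 talkativeness to certainty, so the base model's randomness is preserved.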
In addition to this the agents also store a reactive value for each IPA category that
represents local modification to the overall rate designed to influence local behavior.
Since this is used as a reactive tool we make the modifications at the point of receiving
the pre-TRP signal. Based on the IPA category being generated by the current speaker
the various local behavior rules set a new modification value to one or several IPA cate-
gories. This is achieved by adjusting the values in a way that the overall sum of reactive
values is always 0 (when the rule is not balanced, such as when the rule only has an
increase without a decrease, all the remaining categories are adjusted in the opposite
direction to maintain zero sum). We keep the sum at zero so that the local changes only
affect choice of category selected and not the probability of selecting when to speak.
Additionally, since we want this to have a local effect, we use a high decay value so that
the modifications retain meaningful values only from the time of the pre-TRP to the TRP and
the selection of the next speaker.
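A sketch of the zero-sum reactive update, assuming unit magnitudes for increases and decreases (the actual magnitudes used in the simulation are not specified here):

```python
def apply_reactive_rule(reactive, increases, decreases, amount=1.0):
    """Apply a local rule at pre-TRP time while keeping the reactive values
    zero-sum: any imbalance between increases and decreases is spread in the
    opposite direction over the remaining categories."""
    for cat in increases:
        reactive[cat] += amount
    for cat in decreases:
        reactive[cat] -= amount
    imbalance = amount * (len(increases) - len(decreases))
    rest = [c for c in reactive if c not in increases and c not in decreases]
    if imbalance and rest:
        for cat in rest:
            reactive[cat] -= imbalance / len(rest)
    return reactive
```

For an unbalanced rule such as "ask for opinion (8) increases gives opinion (5)", the single increase is offset by a small decrease spread over the other eleven categories, so the sum stays at zero and only category choice, not the decision to speak, is affected.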
The rules based on Bales’ observations are as follows:
1. If the category being generated is social-emotional positive (1, 2, 3) then increase
reactive value for social-emotional positive (1, 2, 3)
2. If the category being generated is social-emotional negative (10, 11, 12) then
increase reactive value for social-emotional negative (10, 11, 12)
3. If the category being generated is task oriented (4 - 9) then increase reactive value
for task oriented (4 - 9)
4. If the category being generated is shows solidarity (1) then increase reactive value
for shows solidarity (1) and reduce reactive value for shows antagonism (12)
5. If the category being generated is shows tension release (2) then increase reactive
value for shows tension release (2) and reduce reactive value for shows tension
(11)
6. If the category being generated is agrees (3) then increase reactive value for agrees
(3) and reduce reactive value for disagrees (10)
7. If the category being generated is gives suggestion (4) then increase reactive value
for agrees (3) and increase reactive value for disagrees (10)
8. If the category being generated is gives opinion (5) then increase reactive value
for agrees (3) and increase reactive value for disagrees (10)
9. If the category being generated is gives orientation (6) then increase reactive value
for agrees (3) and increase reactive value for disagrees (10)
10. If the category being generated is ask for orientation (7) then increase reactive
value for gives orientation (6)
11. If the category being generated is ask for opinion (8) then increase reactive value
for gives opinion (5)
12. If the category being generated is ask for suggestion (9) then increase reactive
value for gives suggestion (4)
13. If the category being generated is disagrees (10) then increase reactive value for
disagrees (10) and reduce reactive value for agrees (3)
14. If the category being generated is shows tension (11) then increase reactive value
for shows tension (11) and increase reactive value for shows tension release (2)
15. If the category being generated is shows antagonism (12) then increase reactive
value for shows antagonism (12) and reduce reactive value for shows solidarity
(1)
Positive Reactions
1) 0.007
2) 0.051
3) 0.085
Attempted Answers
4) 0.036
5) 0.210
6) 0.148
Questions
7) 0.037
8) 0.024
9) 0.005
Negative Reactions
10) 0.046
11) 0.030
12) 0.016
Table 5.2: Example interaction profile.
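The fifteen rules above can be encoded compactly as data; a sketch where each entry is (trigger categories, categories to increase, categories to decrease), with category numbers as in Table 5.1:

```python
# Bales-derived local rules. A rule fires when the current speaker's
# pre-TRP category is in its trigger set. Encoding is illustrative.
POSITIVE = (1, 2, 3)
NEGATIVE = (10, 11, 12)
TASK = (4, 5, 6, 7, 8, 9)

LOCAL_RULES = [
    (POSITIVE, POSITIVE, ()),   # rule 1: positive begets positive
    (NEGATIVE, NEGATIVE, ()),   # rule 2: negative begets negative
    (TASK, TASK, ()),           # rule 3: task-oriented begets task-oriented
    ((1,), (1,), (12,)),        # rule 4: solidarity suppresses antagonism
    ((2,), (2,), (11,)),        # rule 5: tension release suppresses tension
    ((3,), (3,), (10,)),        # rule 6: agreement suppresses disagreement
    ((4,), (3, 10), ()),        # rule 7: suggestion invites (dis)agreement
    ((5,), (3, 10), ()),        # rule 8: opinion invites (dis)agreement
    ((6,), (3, 10), ()),        # rule 9: orientation invites (dis)agreement
    ((7,), (6,), ()),           # rule 10: asks for orientation
    ((8,), (5,), ()),           # rule 11: asks for opinion
    ((9,), (4,), ()),           # rule 12: asks for suggestion
    ((10,), (10,), (3,)),       # rule 13: disagreement suppresses agreement
    ((11,), (11, 2), ()),       # rule 14: tension also invites its release
    ((12,), (12,), (1,)),       # rule 15: antagonism suppresses solidarity
]

def triggered_rules(category):
    """Return the (increase, decrease) pairs fired when `category` is generated."""
    return [(inc, dec) for trig, inc, dec in LOCAL_RULES if category in trig]
```

For example, a pre-TRP of category 8 (ask for opinion) fires rule 3 (task-oriented categories reinforce each other) and rule 11 (gives opinion becomes more likely).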
The local rules will only be effective if we have distinct animations available to
convey the different IPA categories. If these are not available we can either collapse the
categories into the four broad groups or even remove them completely, in which case
we are left with a single value representing the overall difference between desired and
actual participation rate.
We will now show an example of how the difference between desired and actual
interaction rate would progress for an agent with an interaction profile shown in Table
5.2. We will show changes in one second increments, with the agent initially not speaking,
receiving a pre-TRP of category 8 (ask for opinion) at 10 seconds, at which point
he decides to take the turn when the current speaker ends. At 12 seconds the agent starts
speaking with category 5 (gives opinion). The initial condition is zero values for all rate
differences. The progression of rate differences is shown in Table 5.3.
Cat 1 2 3 4 5 6 7 8 9 10
1 0.007 0.014 0.021 0.028 0.035 0.042 0.049 0.056 0.063 0.07
2 0.051 0.102 0.153 0.204 0.255 0.306 0.357 0.408 0.459 0.51
3 0.085 0.17 0.255 0.34 0.425 0.51 0.595 0.68 0.765 0.85
4 0.036 0.072 0.108 0.144 0.18 0.216 0.252 0.288 0.324 0.36
5 0.21 0.42 0.63 0.84 1.05 1.26 1.47 1.68 1.89 2.1
6 0.148 0.296 0.444 0.592 0.74 0.888 1.036 1.184 1.332 1.48
7 0.037 0.074 0.111 0.148 0.185 0.222 0.259 0.296 0.333 0.37
8 0.024 0.048 0.072 0.096 0.12 0.144 0.168 0.192 0.216 0.24
9 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05
10 0.046 0.092 0.138 0.184 0.23 0.276 0.322 0.368 0.414 0.46
11 0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3
12 0.016 0.032 0.048 0.064 0.08 0.096 0.112 0.128 0.144 0.16
Cat 11 12 13 14 15 16 17 18 19 20
1 0.077 0.084 0.091 0.098 0.105 0.112 0.119 0.126 0.133 0.14
2 0.561 0.612 0.663 0.714 0.765 0.816 0.867 0.918 0.969 1.02
3 0.935 1.02 1.105 1.19 1.275 1.36 1.445 1.53 1.615 1.7
4 0.396 0.432 0.468 0.504 0.54 0.576 0.612 0.648 0.684 0.72
5 2.31 2.52 1.73 0.94 0.15 -0.64 -1.43 -2.22 -3.01 -3.8
6 1.628 1.776 1.924 2.072 2.22 2.368 2.516 2.664 2.812 2.96
7 0.407 0.444 0.481 0.518 0.555 0.592 0.629 0.666 0.703 0.74
8 0.264 0.288 0.312 0.336 0.36 0.384 0.408 0.432 0.456 0.48
9 0.055 0.06 0.065 0.07 0.075 0.08 0.085 0.09 0.095 0.1
10 0.506 0.552 0.598 0.644 0.69 0.736 0.782 0.828 0.874 0.92
11 0.33 0.36 0.39 0.42 0.45 0.48 0.51 0.54 0.57 0.6
12 0.176 0.192 0.208 0.224 0.24 0.256 0.272 0.288 0.304 0.32
Table 5.3: Example progression of difference between desired and actual interaction
rate.
Let's
look in more detail at what happens at the 10-second mark when the agent is deciding to
speak. Table 5.4 shows the rate difference and reactive values at that point. When we
receive the pre-TRP signal, the rules 4-9 → 4-9 and 8 → 5 get triggered, based on
which we adjust the reactive values. When the agent is deciding to speak we look at the
sum of the total differences of all categories to determine whether to adjust the normal
speaking rate in a positive or negative direction. In our case at 10 seconds the total is 6.95,
which indicates that the agent has so far been speaking about seven seconds less than desired.
Cat Rate Difference Reactive Value Total
1 0.07 -0.26 -0.19
2 0.51 -0.26 0.25
3 0.85 -0.26 0.59
4 0.36 0.08 0.44
5 2.1 1.17 3.27
6 1.48 0.08 1.56
7 0.37 0.08 0.45
8 0.24 0.08 0.32
9 0.05 0.08 0.13
10 0.46 -0.26 0.20
11 0.3 -0.26 0.04
12 0.16 -0.26 -0.10
Table 5.4: Rate difference and reactive values when deciding to speak.
Our
default talkativeness attribute is 0.5 and we do not want to overwhelm the default algo-
rithm to the point where we are too rigid and lose all randomness. With this in mind
our chosen conversion rate is 0.01 for every second of rate difference (meaning almost a
minute of deviation would be required to force the agent to always try to speak). In our
case this means that the speaking test against talkativeness would be adjusted from 0.5
to 0.5695. Since the test succeeded, the next step is selecting the category to generate.
We only consider categories with positive values and compute the share belonging to
each of them. In our case we see that the chance of selecting category 5 was 3.27 / 7.24
= 0.45.
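The arithmetic of this example can be reproduced directly from the tables. The sketch below uses the rates from Table 5.2 and the reactive values from Table 5.4; with the rounded table values the denominator comes out 7.25 rather than the 7.24 in the text, but the selection probability still rounds to 0.45:

```python
# Reproducing the worked example from Tables 5.2 and 5.4.
profile = {1: 0.007, 2: 0.051, 3: 0.085, 4: 0.036, 5: 0.210, 6: 0.148,
           7: 0.037, 8: 0.024, 9: 0.005, 10: 0.046, 11: 0.030, 12: 0.016}

# Rate differences after 10 seconds of not speaking: 10 * rate per category.
rate_diff = {cat: 10 * rate for cat, rate in profile.items()}
total = sum(rate_diff.values())
print(round(total, 2))        # 6.95 seconds below the desired rate

# Adjusted speaking test: 0.5 talkativeness + 0.01 per second of difference.
adjusted = 0.5 + 0.01 * total
print(round(adjusted, 4))     # 0.5695

# Category selection over positive totals (rate difference + reactive value).
reactive = {1: -0.26, 2: -0.26, 3: -0.26, 4: 0.08, 5: 1.17, 6: 0.08,
            7: 0.08, 8: 0.08, 9: 0.08, 10: -0.26, 11: -0.26, 12: -0.26}
totals = {cat: rate_diff[cat] + reactive[cat] for cat in profile}
positive = {cat: v for cat, v in totals.items() if v > 0}
p_cat5 = positive[5] / sum(positive.values())
print(round(p_cat5, 2))       # 0.45
```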
5.5 Refinement of Cultural Model
Now we turn our attention to the analysis of cross-cultural corpus data and how it was
used to refine the model parameters. First we briefly report on the cross-cultural com-
parison of the data performed by Herrera (2010). For proxemics he found that Mexicans
stood closest, followed by Arabs and Americans, with American distances significantly
different from those of Mexicans and Arabs. For turn-taking there were no significant
differences in the amount of talking and pause. Similarly, for gaze there were no
significant differences between gaze toward a person and gaze away, though it
is possible that more complex measures could reveal differences.
Here instead we focus on extracting the data for the purpose of computational mod-
eling. For proxemics the data collected does not allow us to compare the boundaries
of different interactional zones. Instead, as before, we rely on mean distances
and infer the zones from them. The value is computed by taking the minimum
spanning tree over the positions of the participants and averaging the link lengths. For
Americans we get the mean value of 1.22 meters, 1.04 meters for Arabs and 0.95 meters
for Mexicans. The distances are on the shorter side of the social zone as is expected for
multiparty conversation. For example a similar study by Herrera for dyadic data shows
a 1.66 meters average distance for Americans which puts it more in the middle of the
social zone. Since the data for Americans agrees with the data from Hall we leave that as
is and use it as the basis for modifying the zone distances for the other two cultures. For
this purpose we perform a simple linear compression of zone distances. For Americans
we thus have 0.45 meters for intimate zone, 1.2 meters for personal zone and 2.7 meters
for social zone. For Arabs we have 0.38 meters for intimate zone, 1.02 meters for per-
sonal zone and 2.30 meters for social zone. Finally, for Mexicans we have 0.35 meters
for intimate zone, 0.93 meters for personal zone and 2.10 meters for social zone. The
main distinction from the initial model is in the Arab parameters which initially were
expected to stand much closer based on literature review. This is a bit unexpected as we
found in the evaluation that Arab subjects preferred the closer proxemics parameters in
the old model.
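The linear compression amounts to scaling Hall's American zone boundaries by the ratio of observed mean distances; a quick check of the reported numbers:

```python
# Linear compression of the American zone boundaries by the ratio of
# observed mean conversational distances (values as reported in the text).
AMERICAN_ZONES = {"intimate": 0.45, "personal": 1.2, "social": 2.7}  # meters
MEAN_DISTANCE = {"American": 1.22, "Arab": 1.04, "Mexican": 0.95}    # meters

def compressed_zones(culture):
    scale = MEAN_DISTANCE[culture] / MEAN_DISTANCE["American"]
    return {zone: round(d * scale, 2) for zone, d in AMERICAN_ZONES.items()}

print(compressed_zones("Arab"))     # {'intimate': 0.38, 'personal': 1.02, 'social': 2.3}
print(compressed_zones("Mexican"))  # {'intimate': 0.35, 'personal': 0.93, 'social': 2.1}
```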
Next we look at turn taking and overlap. First we look at the offset between speakers
at transitions. We get a mean offset of 0.268 seconds for Americans, 0.250 seconds for
Arabs and 0.260 seconds for Mexicans. The differences are marginal and suggest
that omitting cultural differences in inter-turn pause from the initial model was
acceptable. The main deviation is that the averages favor pause slightly more
than in our initial model. We also look at the amount of total overlap, turn length and
turn distribution between speakers. This was not in the initial cultural model, but will
allow us to make distinctions between storytelling and decision making tasks and their
effect on conversation structure. The amounts for overlap in seconds are in total duration
out of a 30 second clip and are shown in Table 5.5. For American and Mexican data
there does not seem to be any pattern: the variation between individual clips is too
high for the differences to be significant (p > 0.75). For Arab data we have suggestive
evidence (p < 0.09) that there is less total overlap in storytelling tasks.
Culture/Task Storytelling Decision Making
American 4.75 5.10
Arab 2.92 4.89
Mexican 5.35 4.87
Table 5.5: Total overlap in seconds
For turn duration we segment turn lengths into bins for turns lasting less than 2
seconds, turns between 2 and 5 seconds, turns between 5 and 10 seconds, turns between
10 and 20 seconds and turns between 20 and 30 seconds. Table 5.6 shows the
average number of turns of each length observed in a 30 second segment.
In general the number of shorter turns is higher in decision making tasks
(p < 0.004). Storytelling tasks on the other hand have a higher number of turns
longer than 10 seconds (if we combine the last two bins, p < 0.06). In general Americans
tend to have more long turns than Arabs and Mexicans (p < 0.12).
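The binning itself is straightforward; a sketch with hypothetical durations (the bin edges follow the text, the sample values are made up):

```python
import bisect

# Bin edges in seconds; bins are <2, 2-5, 5-10, 10-20 and 20-30 seconds.
BIN_EDGES = [2, 5, 10, 20]
BIN_LABELS = ["0-2", "2-5", "5-10", "10-20", "20-30"]

def bin_turns(durations):
    """Count turn durations per length bin."""
    counts = {label: 0 for label in BIN_LABELS}
    for d in durations:
        counts[BIN_LABELS[bisect.bisect_right(BIN_EDGES, d)]] += 1
    return counts

# Hypothetical turn durations, for illustration only.
print(bin_turns([0.8, 1.5, 3.0, 7.2, 12.0, 25.0]))
```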
Culture/Task 0 - 2 2 - 5 5 - 10 10 - 20 20 - 30
American storytelling 3.62 2.31 0.68 0.62 0.06
American decision making 6.06 2.87 0.87 0.25 0
Arab storytelling 5.43 2.37 0.62 0.31 0.12
Arab decision making 7.68 2.81 0.5 0.18 0.1
Mexican storytelling 5.43 1.93 0.93 0.25 0.06
Mexican decision making 7.87 2.31 0.43 0.18 0.06
Table 5.6: Amount of turns based on length
The final part of turn taking looks at turn distribution. Here we look at how the
turns are distributed between participants in terms of overall time one is holding the
turn, ordered from highest to lowest participant. The data is shown in Table 5.7. Again
we see the highest participant holding the floor significantly longer in storytelling tasks
(p < 0.006). The distribution pattern in general follows the one observed in the literature we
described in section 5.3.2, but it is clear that the task has an effect on the distribution. It
is not certain whether this is a local effect due to restricting the annotations to 30 second
segments or if this would translate to longer interaction periods.
Culture/Task 1st 2nd 3rd 4th
American storytelling 17.44 5.64 1.52 0.21
American decision making 11.72 6.93 3.38 0.72
Arab storytelling 17.45 4.92 1.86 0.34
Arab decision making 13.51 5.74 3.07 0.71
Mexican storytelling 14.71 5.00 2.65 0.59
Mexican decision making 12.31 5.12 2.70 0.96
Table 5.7: Distribution of total turn holding time
Finally, we look at gaze data. We analyzed both conditional distributions and aggre-
gate amount of gazing at different targets and used this to fit the parameters of the
probabilistic computational model. For more details on the parameters of the computa-
tional model of gaze refer to section 4.4. Here we only used the two tasks that did not
involve the toy since our model does not account for external targets other than looking
away. Similar to the computational model we look at changes in half second intervals.
First, the tables 5.8-5.10 show aggregate nonconditional distributions and compare the
values across cultures. We see that speakers look most of the time at participants that
are looking back at them, with Arabs having the highest amount of mutual gaze. Mexicans
in general have a higher chance of looking away. Addressees in general look
at the speaker a lot more than at the other participants. Gaze distribution for listeners
is based only on samples where there was a speaker present. This is required because
when the speaker is not present the computational model will fall back to looking at a
random participant. The data shows that addressees are slightly more likely to look at
the speaker than other listeners. This can most likely be explained by other listeners
having two distinguished targets of interest as opposed to only one for addressees.
Culture Attending Non-Attending Away
American 0.55 0.23 0.22
Arab 0.67 0.20 0.13
Mexican 0.47 0.30 0.23
Table 5.8: Gaze distribution for speakers
Culture Speaker Other Away
American 0.76 0.10 0.14
Arab 0.81 0.09 0.10
Mexican 0.64 0.19 0.17
Table 5.9: Gaze distribution for addressees
Culture Speaker Addressee Other Away
American 0.56 0.17 0.09 0.18
Arab 0.64 0.14 0.07 0.15
Mexican 0.53 0.16 0.09 0.22
Table 5.10: Gaze distribution for listeners
The final computational parameters as used in the cultural model are shown in
tables 5.11-5.13, following the same table format as shown in section 4.4. Addressee
and listener data are the same as above. For speaker gaze on the other hand the model is
based on conditional probabilities, the condition being the current state of the speaker’s
gaze target. The data in Table 5.8 showed the overall distribution of the distinct states
while here we have transition probabilities between the states (computed by collecting
only samples for a particular previous state).
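The conditional speaker-gaze model can be sampled as follows, using the American values from Table 5.11. Reading the two-entry Speaker Away row as addressee/away probabilities is my interpretation of the table layout, and the state and target names are illustrative:

```python
import random

# Conditional next-target distributions for the speaker, keyed on the
# state of the current gaze target (American parameters, Table 5.11).
SPEAKER_TRANSITIONS = {
    "attending":     {"addressee": 0.75, "random": 0.17, "away": 0.08},
    "non_attending": {"addressee": 0.79, "random": 0.13, "away": 0.08},
    "away":          {"addressee": 0.27, "away": 0.73},
}

def next_speaker_gaze(current_state, rng=random):
    """Sample the speaker's next gaze target given the current target's state."""
    dist = SPEAKER_TRANSITIONS[current_state]
    targets = list(dist)
    weights = [dist[t] for t in targets]
    return rng.choices(targets, weights=weights)[0]
```

Sampling in half-second intervals, as the model does, a speaker whose addressee is attending keeps gazing at them three times out of four.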
Next Target
State Speaker Addressee Random Away
Speaker Attending 0.75 0.17 0.08
Speaker Non-Attending 0.79 0.13 0.08
Speaker Away 0.27 0.73
Addressee 0.76 0.10 0.14
Listener 0.56 0.17 0.09 0.18
Gazing at me factor = 1.65
Table 5.11: American gaze parameters
Next Target
State Speaker Addressee Random Away
Speaker Attending 0.82 0.15 0.03
Speaker Non-Attending 0.73 0.19 0.08
Speaker Away 0.48 0.52
Addressee 0.81 0.09 0.10
Listener 0.64 0.14 0.07 0.15
Gazing at me factor = 4.56
Table 5.12: Arab gaze parameters
When comparing to the initial model parameters we notice that addressee and lis-
tener parameters were in general a relatively good fit. Speaker gaze distribution however
is quite different from what was initially proposed. First of all the speaker is much more
likely to keep looking at the existing target than to switch. Specifically, when looking
at someone that is not reciprocating gaze we initially assumed that the speaker would
very likely switch gaze, but this was not found to be the case. It appears that using a
conditional distribution for the speaker gaze does not provide much benefit. A simpler
model where the distinction between the state of the current gaze target was not taken into
account would probably do equally well.
Next Target
State Speaker Addressee Random Away
Speaker Attending 0.74 0.16 0.10
Speaker Non-Attending 0.73 0.18 0.10
Speaker Away 0.36 0.64
Addressee 0.64 0.19 0.17
Listener 0.53 0.16 0.09 0.22
Gazing at me factor = 1.51
Table 5.13: Mexican gaze parameters
The cultural modifications seem to have been
justified to some extent. The higher chance of gazing at the speaker is present in the Arab
data; the Mexican data, however, does not follow what we predicted from the literature. In general
Mexican gaze patterns seem to be much closer to those of Americans than Arabs as
initially predicted.
5.6 Cross-Cultural Perception Study of Real Interactions
In order to better understand the results in section 4.5 we have performed a similar eval-
uation where we showed the participants clips from the conversations collected in section 5.1 and
asked them to rate the cultural appropriateness of conversational behavior. The survey
showed 24 videos, using the two toy tasks for Arab and American speakers, selecting
both 30 second clips for 3 groups from each combination. The subjects were asked to
rate how appropriate for their culture were the different features of conversation, includ-
ing overall conversational behavior, physical closeness, body movements, the way peo-
ple looked at each other and pauses and overlap. Finally, they were asked to select
whether the conversation more closely resembled storytelling or decision making. Ini-
tially we planned on using masked video and audio to avoid cultural biases, but based
on preliminary testing we found that test subjects reported it to be very hard to make
decisions under those conditions. With that in mind we decided to show unobstructed
videos with sound.
The evaluation performed was limited, having only 20 total subjects, 10 from Amer-
ican culture, 4 from Arab, 3 from Mexican and 4 other. The results were unexpected
and confirmed this to be a hard task even when sound was present. American subjects
on average rated both Arab and American videos almost the same on all categories.
The only meaningful difference was for Arab subjects who rated Arab videos overall
on average 4.9 compared to 4.1 for American videos. Similar differences were found
in Arab subject judgments of the other categories, but this could be just coincidental
since we only had 4 subjects. What is interesting is that in both evaluation of simula-
tions and here in evaluation of real interactions, Arab subjects were more perceptive to
differences while the judgments of the American and Mexican subjects did not show
differences between any of the cultures. One explanation for this could be in the way
our Arab subjects were recruited. While they have spent their childhood in their native
country they were all in the US for some time prior to the evaluations. It is possible that
the exposure to different cultures made them more aware of cultural differences which
resulted in more distinctive evaluations than for the other subjects.
5.7 Conversational Task Classification
While it was suggested that subjects viewing videos of conversation simulation consid-
ered context as part of determining believability, it was not clear whether they were really
able to pick up on the actual task being performed. We have already seen that cultural context
presents a very difficult task in observation of conversational behavior, so it would not
be surprising if viewers had similar problems determining conversational
task context.
First we look at the annotated data from the cross-cultural corpus and see whether it
is possible to predict the conversational task from given data. For this purpose we used
the Weka data mining software (Hall et al. 2009) to construct a classifier that would
predict the conversational task based on the collected data. We had available all data on
proxemics, gaze and turn taking that were described in section 5.5. In addition we also
added turn distribution as percentage of total speaking time and speaking time relative to
highest speaker as features (the difference coming from overlap). The task of classify-
ing the conversational task based on this data seems to be a relatively hard problem. We
have attempted several classification approaches, including decision trees, SVMs, neural
networks and Bayesian networks, combined with a number of feature selection and boosting techniques.
Quality of the classifiers was assessed using leave one out cross-validation. The best
results were obtained when reducing the feature set to just turn data and using decision
tree classifier (the J48 algorithm in Weka), which gave an accuracy of 71.875%. Other combi-
nations in general resulted in accuracy in the range of 60-70%. Of the available features
the most predictive ones are the number of long turns and share of the turn holding time
of the top and third speaker. This shows that it will probably be hard for humans to
tell the tasks apart. At the same time it confirms our decisions in conversation struc-
ture modeling that turn length and turn distribution should be modeled if we want to
represent differences present in various conversational tasks.
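The evaluation protocol itself is simple to sketch. The thesis used Weka's J48 decision tree; the following illustrative equivalent substitutes a toy 1-nearest-neighbour classifier on a single hypothetical feature (made-up values), since the point here is leave-one-out cross-validation rather than the particular classifier:

```python
def leave_one_out_accuracy(X, y, fit, predict):
    """Leave-one-out cross-validation: train on all-but-one sample, test
    on the held-out one. `fit` and `predict` are hooks standing in for
    any classifier."""
    correct = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

# Toy stand-in classifier: 1-nearest neighbour on one feature
# (e.g. number of turns shorter than 2 seconds per 30-second clip).
def fit_1nn(train_X, train_y):
    return list(zip(train_X, train_y))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

# Hypothetical per-clip feature values and task labels, illustrative only.
X = [3.6, 6.1, 5.4, 7.7, 5.4, 7.9, 4.0, 7.0]
y = ["ST", "DM", "ST", "DM", "ST", "DM", "ST", "DM"]
print(leave_one_out_accuracy(X, y, fit_1nn, predict_1nn))  # 0.875
```

With 32 annotated clips, leave-one-out is a reasonable choice since every clip is scarce training data; each fold trains on 31 clips and tests on the remaining one.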
We now look at the actual perceptions from the cross-cultural perception study in
previous section. Since the videos included sound, we expected that determining con-
versational task would be easier in cases where subjects were familiar with the spoken
language. We assume that most subjects were familiar with English, but only Arabs to
be familiar with Arabic, which should give us an idea of how much language familiarity
88
helps in making the decision and how much can be decided just based on para-linguistic
features. Table 5.14 shows the average accuracy based on subject culture and the conver-
sational task and culture of the video. We see that American subjects are slightly better
at determining conversational task on American videos while Arabs are better at Arab
videos as expected. It is slightly unexpected that Arab subject accuracy on American
videos was just as bad as American accuracy on Arab videos. Another surprising result
is that the accuracy in general is not higher compared to the results from the classifier
approach, especially when the subjects were familiar with the spoken language. Look-
ing at scores on individual video clips reveals that there were both cases of very high
agreement as well as mixed scores. There were also some cases where subjects had a
high agreement that the conversational task was different from the one given to the par-
ticipants in the video. This means that an overall decision making task can have features
of storytelling on a local scale and the other way around. With this in mind we checked
the cases where the classifier's predictions disagreed with the human subjects, but we
found no discernible pattern: there were both cases where the classifier predicted correctly
and human subjects did not, and the other way around.
Subject Culture   Arab DM   Arab ST   American DM   American ST
American            0.50      0.58        0.70          0.59
Mexican             0.58      0.50        0.68          0.80
Arab                0.57      0.84        0.52          0.56
Other               0.71      0.57        0.90          0.52
Table 5.14: Accuracy of conversational task judgments (DM = Decision Making, ST =
Storytelling)
5.8 Conclusion
This concludes the last iteration of the conversational simulation. We have success-
fully improved the models based on concrete empirical data and added an additional
component that allows finer control to target specific conversation contexts. The
main limitation is that the simulation relies on the context being externally provided,
with explicitly specified interaction profiles for the participants. Potential improvements
in this area would include constructing a database of contextual specifications with a
more automated assignment of profiles to the participants.
It is also important to try to understand the findings from the perception study of
real conversational behaviors. While the evidence is not conclusive at this point, it is
likely that targeting contextual believability may not be the best avenue if the objective
is purely to improve believability. We have seen that in many cases the subjects were
not able to distinguish between cultural contexts to the extent where it would make a
significant impact on believability. It may be that if the context is not specified, viewers
will fill in the blanks to maintain suspension of disbelief in the presence of deviations
from what they expected. On the other hand, if the differences between contexts are
relatively small, the impact on believability may not be significant. At the same time, if
the objective is not purely to improve believability but has, for example, also an
educational component, then the increased realism might still be important even if the
effect on believability is not significant.
At this point we have finished the exploration of the conversational simulation itself
and will turn our focus to the larger picture. We will look at the applications where the
simulation was used and the various frameworks in which it was embedded. One of
the main open topics that we will examine is simulation level of detail. While our
simulation targets background virtual humans, it becomes inevitable that at some point
we will want to cross that line. The work we will explore looks at various ways this
could be achieved and how certain subunits of the conversation simulation can be
employed at other levels of detail.
Chapter 6
Implementation and Applications
The core dialog simulation described in the previous chapters currently has a Java and
a C# implementation and is designed to be independent of the rendering environment.
When the simulation is embedded in an actual application, the base classes are extended
to provide the scenario-specific logic. This includes integration of the dialog simulation
into the larger simulation, including transfer of control of the virtual humans when not
in conversation, and implementation of rendering-environment-specific commands that
control the creation of virtual humans and the representation of their behavior.
The simulation and its parts have been embedded in several applications. Some used
the whole simulation, either as part of a regular setup or as an optional component. In
other cases only a specific subset of the simulation was used, for example the dynamic
movement simulation. In the following we briefly describe the nature of these
applications and what part the simulation played in them.
In Mission Rehearsal Exercise (MRE) (Hill et al. 2003) and SASO (Traum et al.
2005, 2008) the simulation was used for its primary purpose of simulating background
characters (using Unreal Tournament as a rendering environment, as described in sec-
tion 6.4). Staff Duty Officer Moleno (SDO) (Jan et al. 2009) and the Checkpoint Exer-
cise (Jan et al. 2011), however, can be seen as attempts at bridging the level-of-detail
divide from the side of full conversational characters towards background characters.
They are both based on the Vigor framework described in section 6.6. Here we do not
use the full background simulation, since the characters have to interact with the users.
Instead, the characters use language processing to be able to converse with the users
and use just the gaze model and proxemics for positioning while interacting with them.
The Checkpoint Exercise can also use the full background simulation, but this is an
experimental feature not usually enabled in user experiments, since we cannot control
whom the users will interact with.
We begin this chapter with a review of work on the role of level of detail in sim-
ulations in general (section 6.1). We continue with a more detailed description of the
architectural design of the simulation (section 6.2), the user interface (section 6.3) and
some examples of integration into the ICT virtual human framework (sections 6.4, 6.5
and 6.6). Finally, we examine the MRE, SASO, SDO Moleno and Checkpoint Exercise
applications to provide a better understanding of the setups in which the simulation and
its parts were used (sections 6.7, 6.8 and 6.9).
6.1 Simulation Level of Detail
We can employ several approaches to cope with the problem of limited resources for
simulations. We can optimize graphical and simulation algorithms, but we cannot opti-
mize beyond some limit. After that point we have to resort to simplification. The goal is
to create approximation algorithms, which degrade the quality of simulation, but main-
tain the illusion of believability. Such algorithms are referred to as level of detail (LOD)
algorithms and can apply to both graphical and simulation algorithms (Brom et al. 2007).
Graphical LOD deals with simplification of the rendering of the world. It consists
of simplification of geometry, animation and gestures. For example, facial expressions
are not very important for background characters if the user will not be able to see them.
Simulation LOD on the other hand concerns the AI of virtual humans and virtual world
dynamics. In the context of large scale world simulations this can refer to simulation
of whole cities, simplifying the behavior of virtual humans that the user does not see.
On another scale it is applied to virtual humans that are visible on the scene in the
background, but cannot interact with the user.
The ALOHA framework is one of the first attempts at applying LOD techniques to
simulation algorithms. Initially it provided a unified framework for geometric and ani-
mation LOD algorithms (Giang et al. 2000). It used a knowledge base that controls the
level of detail for each object. Based on these rules the LOD resolver provided the nec-
essary information to a motion controller and a geometric controller to render the objects
at the appropriate level of detail. O’Sullivan et al. (2002) describe how this framework
was expanded with a behavior controller. They examine the concept of conversational
level of detail. They used a nonverbal behavior generation toolkit, BEAT (Cassell et al.
2001), to produce appropriate nonverbal behaviors closely synchronized with speech.
The system uses a number of behavior generation rules which can be activated based on
LOD criteria. Their work is unique in that it provides a nonverbal generation model that
could be used at different levels of detail, but it appears more suitable for main embodied
conversational agents, as it focuses on speech and animation coordination.
Gordon et al. (2004) use a similar approach to the one presented in this disserta-
tion. They used two classes of virtual humans. For the main characters they used
scripted agents that communicate with the user and play major character roles in the
story of a virtual training simulation. Background virtual humans on the other hand are
guided by autonomous agents. While they are not essential to the story, they provide a
more immersive environment necessary to engage the user in the situation. While their
autonomous agents are very basic, they acknowledge the need for employing different
simulation levels of detail in virtual training simulations.
Brom (2005, 2007) explores simulation level of detail in more general terms, focus-
ing on LOD algorithms as they relate to simulating large worlds consisting of simulated
virtual agents. The GAL framework he proposes extends several techniques such as
smart objects, role-passing and level of detail at the behavioral level.
6.2 Software Architecture
The simulation is built on top of an asynchronous multi-agent framework. Each agent
representing a virtual human is a MinorCharacter object that is assigned a sep-
arate thread to be used for performing computations related to the simulation. The
Environment object serves as shared data for the agents and is responsible for
caching information from the rendering environment and transmission of all commu-
nication between the agents and the rendering environment as shown in figure 6.1.
Figure 6.1: Example message exchange when an agent performs an action.
The Environment object is responsible for determining what information is avail-
able to the agents. The agents query the Environment object for which other agents
are in their visibility or hearing range and where those agents are relative to them when
making decisions about joining conversations and adjusting their positioning. The
Environment object has access to all the agents and stores their positioning informa-
tion in order to answer those queries. When the simulation is connected to a rendering
environment, these queries can be passed through to take advantage of additional infor-
mation about the world.
The Environment object keeps track of all agents involved in the simulation and
so is also responsible for adding agents to and removing them from the simulation.
It also provides the wiring necessary for communication with the rendering environment,
including instantiation of characters in the rendering environment and control of their
appearance.
At the root of the messaging system is the scheduling of actions by the agents. Each
agent has a list of messages that it has to process. The messages are organized in a
heap data structure and ordered by message timestamp. There are several types of mes-
sages employed by the agents, including action messages, input messages, conversation
messages and planning messages.
Action messages represent actions that agents plan to execute in the future. These
include greetings, gestures, posture shifts, gaze and movement: any action that the agent
can execute that is perceived in the world. Whenever an agent decides to perform an
action, it generates an action message with a timestamp of when it should be executed
(either the current time or some time in the future when we simulate reaction times) and
passes it to its scheduler. When it is time to execute the action message, the agent sends
it through the Environment object, where it gets relayed to the rendering environment
to perform the actual visual realization and to other agents in the form of an input
message.
The input messages represent the sensory input of agents. Each action is associated
with either visual or auditory input. When an agent sends an action message to the
Environment object to be broadcast, the Environment object determines which
agents can perceive the action and passes them the input message, which is then added
to the message heap of those agents. When the message is processed, the agent updates
its internal state and performs other actions as required by the conversational algorithm.
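As an illustrative sketch of this perception filtering (the class and field names here are hypothetical; the actual Environment object also answers visibility queries and relays to the rendering environment):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the Environment relays an action as input messages
// to the agents that can perceive it.
class Environment {
    static class AgentStub {
        final String name;
        final double x, y;
        final List<String> inputs = new ArrayList<>(); // received input messages
        AgentStub(String name, double x, double y) { this.name = name; this.x = x; this.y = y; }
    }

    final List<AgentStub> agents = new ArrayList<>();

    // Relay an auditory action from `source` to every other agent within
    // `hearingRange`; a visual action would use a visibility test instead.
    void broadcast(AgentStub source, String action, double hearingRange) {
        for (AgentStub a : agents) {
            if (a == source) continue;
            double d = Math.hypot(a.x - source.x, a.y - source.y);
            if (d <= hearingRange) a.inputs.add(action); // becomes an input message
        }
    }
}
```

In the real system the input message would then be pushed onto each perceiving agent's message heap rather than a plain list.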
Besides action scheduling, messages are also used for other mental processing of the
agents that requires scheduling. A group of conversation messages is used for the dialog
simulation, such as scheduling of conversation cycle monitoring and signaling for
pre-TRP, speech-end and continuation tests. Planning messages are used to schedule
repositioning updates and decisions about joining and leaving conversations. In addi-
tion, derived classes can extend the message class to define new types of messages
that might be needed to facilitate scheduling of other behaviors external to the dialog
simulation.
The simulation allows both real-time execution and pausing and resuming of the
simulation in order to examine its internal state at any point in time. Each agent
executes the following loop in its thread:
repeat
    while there are more messages scheduled to process at the moment
        process next scheduled message
    estimate time until the next message requires processing
    sleep for that amount of time
All the functions related to message passing are synchronized so that only one thread
can modify the data at a time. Most of the time the agent idles and only operates when
there are messages that need processing. When a new message is received from
the Environment, the thread is awakened to check whether new messages need processing.
This allows for reactive behavior of agents that respond to events as they happen.
Since most calculations are instantaneous, the responses generated by the agents usually
incorporate some delay time to simulate response time.
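Assuming Java's built-in monitors for the synchronization, the loop and its wake-up logic might be sketched as follows (class and method names are illustrative, not the actual implementation):

```java
import java.util.PriorityQueue;

// Hypothetical sketch of the per-agent scheduling loop: a message heap
// ordered by timestamp, processed on the agent's own thread.
class MinorCharacter implements Runnable {
    static class Msg implements Comparable<Msg> {
        final long time; final String type;
        Msg(long time, String type) { this.time = time; this.type = type; }
        public int compareTo(Msg o) { return Long.compare(time, o.time); }
    }

    private final PriorityQueue<Msg> heap = new PriorityQueue<>();
    final java.util.List<String> processed = new java.util.ArrayList<>();
    private volatile boolean running = true;

    // Called when a new message arrives; wakes the agent's thread.
    synchronized void schedule(Msg m) { heap.add(m); notifyAll(); }
    synchronized void shutdown() { running = false; notifyAll(); }

    // One pass of the loop body: process everything due at `now` and
    // return the delay until the next message (-1 if the heap is empty).
    synchronized long step(long now) {
        while (!heap.isEmpty() && heap.peek().time <= now) {
            processed.add(heap.poll().type);
        }
        return heap.isEmpty() ? -1 : heap.peek().time - now;
    }

    public void run() {
        while (running) {
            synchronized (this) {
                long delay = step(System.currentTimeMillis());
                try {
                    if (delay < 0) wait();             // idle until a new message arrives
                    else wait(Math.max(1, delay));     // sleep until the next message is due
                } catch (InterruptedException e) { return; }
            }
        }
    }
}
```

Because `wait` releases the monitor, `schedule` can insert messages while the agent is idle, which gives the reactive behavior described above.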
Another important feature of the agent architecture is the information state. Each
agent has to keep information about the other agents. To facilitate this, each agent
can associate a number of key-value pairs with any object. For example, with each
agent they associate what that agent is gazing at, whether it is speaking, when it last
interacted with the group and whether it is moving. When a message is processed,
it is passed to the appropriate handler based on the message type. Input messages are
passed to the input handler, and the first thing it does is update the internal state. For
example, when the input message corresponds to an action of beginning to speak, it
sets the speaking flag to true for the source of that input message. It then further
processes the message as required by the simulation.
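A minimal sketch of such a key-value information state (the class and key names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: arbitrary key-value pairs associated with any
// observed object, as used for the agent's information state.
class InformationState {
    private final Map<Object, Map<String, Object>> state = new HashMap<>();

    void set(Object about, String key, Object value) {
        state.computeIfAbsent(about, k -> new HashMap<>()).put(key, value);
    }

    Object get(Object about, String key) {
        Map<String, Object> m = state.get(about);
        return m == null ? null : m.get(key);
    }
}
```

On receiving a speech-start input message from another agent, the handler would call something like `set(otherAgent, "speaking", true)` before continuing with the conversational algorithm.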
Processing of action messages is special, since it may require visual representation.
The handler for action messages first processes the message according to the simulation,
then sends the corresponding input message to the Environment class and finally
calls a virtual function corresponding to performing the actual physical action. In the
base implementation this function is empty and is expected to be implemented in the
derived classes to trigger the commands necessary to produce the desired visual
representation.
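This is essentially the template-method pattern, which can be sketched as follows (class and method names are illustrative, and the log strings stand in for the real processing steps):

```java
// Hypothetical sketch of action-message handling: the base class processes
// and relays the message, while visual realization is deferred to a
// rendering-specific subclass hook.
abstract class MinorCharacterBase {
    final StringBuilder log = new StringBuilder();

    final void handleAction(String action) {
        log.append("simulate:").append(action).append(";"); // update simulation state
        log.append("relay:").append(action).append(";");    // send input message via the Environment
        perform(action);                                     // rendering-specific hook
    }

    // Empty in the base implementation; derived classes trigger the
    // commands needed for the visual representation.
    void perform(String action) { }
}

class UnrealCharacter extends MinorCharacterBase {
    @Override void perform(String action) {
        log.append("render:").append(action).append(";");    // e.g. play an animation
    }
}
```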
6.3 User Interface
Similar to the design of the rest of the framework, the user interface is also extensible to
incorporate information present in specific implementations. Its main purpose is to
expose and allow manipulation of the attributes defining the behavior of each agent. It
displays the agents' internal information state and allows control of the execution of the
simulation.
The control panel is composed of several sections (see Figure 6.2). At the
top is the menu, which allows loading and saving of agent configurations and importing
of cultural settings. There are two main tabs. The Characters tab controls the
information about the agents, and the Execute tab contains controls for playing, pausing
and resuming the simulation. The Execute tab also contains functionality for manipulating
dialog structure by allowing manual overrides that direct agents whom to join in
conversation.
Figure 6.2: Display of conversation attributes in control panel.
The Characters tab consists of three parts. At the very top are controls for selecting,
adding and removing agents. In the middle are several panels that allow manipulation
of agent attributes. At the bottom is a textual representation of the agent's information
state. It can display either the current state or a log of inputs and action decisions made
by the agent.
Figure 6.2 shows the controls for setting the attributes that affect turn-taking behav-
ior. Cultural parameters are controlled in three separate tabs corresponding to prox-
emics, gaze and overlap in turn-taking as shown in figures 6.3, 6.4 and 6.5.
Each agent can also be assigned a type or relationship with other agents as shown
in Figure 6.6. This is used by the positioning algorithm to determine which zone to use
when evaluating proxemics.
A scene specification is scenario dependent, but at a minimum it includes a list of
virtual humans with their conversational attributes, their cultural settings and, normally,
information about their visual representation. The cultural parameters can be saved in
separate culture definition files and imported into the simulation as shown in Figure 6.7.
Finally, to enable debugging, each agent can be set to generate a log of all input it
received, all actions it performed and optionally any additional debugging information.
These log files are generated per agent so that it is possible to examine what happened
from the perspective of each agent. Figure 6.8 shows an example log from an agent.
6.4 Unreal Tournament
To demonstrate its use, the simulation was embedded in the ICT Virtual Human
framework. The framework consists of a large number of distributed components that
communicate using the Virtual Human Messaging Library (VHMsg) (see
http://vhtoolkit.ict.usc.edu/index.php/VHMSG), which serves as an API wrapper
for the underlying message broker (initially this was the Elvin server (Segall and Arnold
1997); currently Apache ActiveMQ (Snyder et al. 2011) is used). One component of the
framework is the rendering environment, which was the Unreal Tournament 2003 game
engine in the initial implementation. There are several other components, such as speech
Figure 6.3: Cultural Parameters – Gaze
Figure 6.4: Cultural Parameters – Proxemics
Figure 6.5: Cultural Parameters – Overlap in Turn-Taking
Figure 6.6: Agent Relationship Selection
Figure 6.7: Importing Cultural Settings
Figure 6.8: Agent Log
recognition, speech generation and agents controlling the main virtual humans in the
virtual training simulations.
Figure 6.9: Virtual humans engaged in conversation.
The dialog simulation connects with the other components of the framework in order
to be able to control its virtual humans and respond to potential queries about the virtual
humans it controls. The main communication is with the Unreal Tournament game
engine. It sends commands to instantiate the virtual humans and control their behavior
and listens for messages from the game engine about positioning and movement updates
for its virtual humans.
The commands it sends are high level. For example, a gaze command will
only specify the type of command (i.e. gaze), the name of the virtual human for which
it is intended and the identification of the target object. The actual behavior is implemented
inside Unreal Tournament in the form of UnrealScript functions. In the game engine
each virtual human (called a pawn) is derived from a base class xPawn provided by the
game engine, which supports the basic atomic actions needed by the simulation. It can
be directed to walk to a specific location, direct its gaze and play predefined animation
sequences. In this project we did not create any new art assets; we used existing art
from other projects at ICT. This means that some of the animations were not ideally
suited to our needs, but it provided a good starting point to test our implementation.
Each pawn is controlled by a controller class. The pawn is the physical representa-
tion, while the controller determines its actions. The controller runs a central loop that
evaluates actions that were queued for execution. For example, if the action being exe-
cuted is moving to a goal, the controller acquires a target and moves toward it until it
reaches the destination. When a message is sent from the dialog simulation, it is received
by the game engine and passed to the appropriate function based on the message type.
This function is then responsible for generating the needed action and queuing it with
the pawn controller for execution. For example, if the agent wants the virtual human to
generate a gesture, it will send a command to play a predefined animation sequence. On
the game engine side the command is received and an action for playing the animation
is submitted to the pawn controller.
6.5 BML and Smartbody
Instead of creating a new specific implementation for each possible rendering environ-
ment, another possibility is to use an intermediary representation for virtual human
behaviors that can be realized by already existing solutions. One such intermediary
representation is the Behavior Markup Language (BML), which is part of the SAIBA
framework that encourages sharing and collaboration in the generation of natural mul-
timodal output for embodied conversational agents (Kopp et al. 2006). BML is based
on XML and is used to represent high-level virtual human behaviors such as gaze and
gestures. Smartbody is one of the BML realizers responsible for executing the
behaviors specified in the BML (Thiebaux et al. 2008). While Smartbody was available
at the time of the initial implementation of the background conversation simulation, it
did not support all the required behaviors, such as locomotion. Another drawback is that
Smartbody is mainly aimed at controlling high-fidelity virtual humans, which from the
level-of-detail perspective might be too resource intensive. Smartbody currently sup-
ports three rendering environments: Gamebryo, Unity and Ogre (Unreal Tournament is
no longer supported). It is responsible for scheduling and synchronization of the behav-
iors, which are processed by individual controllers. There are controllers for posture,
locomotion, gaze and others, and they generally correspond to individual BML com-
mands.
The main benefit of the BML representation is that it provides a clean mapping of the
simulation actions. The first is the movement start action, which is mapped to the
locomotion BML behavior. The locomotion behavior accepts two parameters, target
and manner. Target is the desired final location of the character, either as a global-frame
vector or a character identifier when used for approaching other characters. Manner is
WALK when used for longer-distance movement or STRAFE for local repositioning
during conversation. The simulation keeps track of the identifiers assigned to the
behaviors and uses this in combination with progress feedback reported by the realizer.
When a block-end feedback is received for the locomotion behavior, it is converted to a
movement end message for the character in the conversation simulation. The gaze action
is mapped to either the gaze or the headOrientation BML behavior. The gaze behavior
is used when gazing at a specific target and can include an optional parameter to limit
gaze to only the eyes, head, shoulder or waist, but by default the whole range of
movement is allowed to be used by the realizer. The headOrientation behavior is used
when the character is averting gaze or looking away; this is achieved by specifying the
orientation parameter with the value down, up, left or right. The final action required is
for indicating when a character is speaking. First, when a character starts speaking, we
can generate lip movements, if desired, by randomly generating viseme commands. In
addition, we generate gesture actions using the gesture BML behavior, referring to the
animations available for the specific virtual human for various hand movements.
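For illustration, a behavior block following the mapping above might look like this; the attribute names follow the parameters just described, but the exact element syntax varies between BML versions and realizers, so this fragment is indicative only:

```xml
<bml character="Agent1">
  <!-- approach another character (manner WALK for distance, STRAFE for local repositioning) -->
  <locomotion id="move1" target="Agent2" manner="WALK"/>
  <!-- gaze at a specific target; an optional modality parameter can limit it to eyes/head/shoulder/waist -->
  <gaze id="gaze1" target="Agent2"/>
  <!-- avert gaze by specifying an orientation instead of a target -->
  <headOrientation id="avert1" orientation="down"/>
</bml>
```

The `id` attributes are what the simulation tracks in order to match the realizer's progress feedback (such as block-end) back to its own messages.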
6.6 Vigor Framework and Second Life
The Vigor Framework is a multi-agent framework designed for developing virtual characters
that are embodied in virtual worlds such as Second Life. It is an example of a frame-
work that can cover avatars across different levels of detail, from characters that interact
with users to those that mainly serve as a backdrop, where the background conversation
simulation described in this dissertation fits.
The autonomous avatars connect and interact with Second Life through the C#
library LibOpenMetaverse. This library implements the Second Life client-server pro-
tocol and allows a computer agent to connect to the Second Life virtual world in the
same way a human user would connect using the Second Life client. In addition,
the Vigor framework also has an implementation for Active Worlds, which
was used for the Staff Ride Guide project (more details about this work can be found
in Roque et al. (2011)).
The two main components of a Vigor agent are the behavior control system and the
conversation management component. The main goal in the creation of the behavior
control system was to allow for a specification of behaviors that is flexible and modular,
with many reusable components. The system is described in more detail in subsec-
tion 6.6.1. The behaviors can be autonomous or they can be assigned by a director
component, as explained in subsection 6.6.2. The conversation management compo-
nent keeps track of the active conversations and handles the processing of user input.
The understanding of user input works at the level of surface text, using cross-language
information retrieval techniques to learn the best output for any input from a training set
of linked questions and answers (Leuski and Traum 2010). While this component is not
used directly by the background conversational simulation, it is explained in more detail
in subsection 6.6.3 for completeness.
Vigor agents also provide basic actions and events that can be used to create behav-
iors. Provided events include information about avatars moving in the world and the
actions they are performing. These can be used as triggers and conditions for behavior
control. The basic action set includes looking at specified targets and any other action a
Second Life user could perform, such as interacting with objects. It also includes basic
navigation skills that allow an agent to move to specified locations while avoiding col-
lisions with nearby avatars. Higher-level movement actions like path planning and
guiding are implemented as reusable behaviors. This allows the movement behaviors
to be used as part of more complex behaviors without reimplementing all the details.
6.6.1 Behavior Control
The behavior control system has many similarities to the ABL system used in
Façade (Mateas and Stern 2002). Each agent has a root active behavior that it pro-
cesses as part of its main computational cycle, along with movement computation. The
behaviors are written in C# with the help of basic building blocks. In general, a behavior
defines a process that operates over time, either synchronously with the main
computation cycle or asynchronously. In both cases the behavior is responsible
for reporting its state in each computational cycle. A behavior can either be in progress
or it can be complete, with an indication of success, failure or error. In the case of success
the behavior completed its objective. Failure represents a situation where the behavior
did not complete the desired task, for example if some of its requirements were not met.
Error, on the other hand, represents an unexpected exception in processing that cannot
be recovered from. The reporting of this behavior state is the primary way of
coordinating different behaviors.
The hierarchical structure is achieved through three main behavior building blocks:
sequential behavior, selector behavior and parallel behavior. Sequential behavior defines
an AND branch in a behavior tree. It executes all provided behaviors sequentially; it
results in failure if any of the child behaviors fails and completes with success if all
behaviors succeed. Selector behavior represents an OR branch in the behavior tree. It
sequentially tries to execute the behaviors in the provided sequence until one of them
succeeds, in which case the overall selector behavior completes with success. On the
other hand, if all child behaviors result in failure, then the selector behavior also results
in failure. Parallel behavior defines a parallel branch in the behavior tree. Here all the
behaviors in the provided sequence are executed in parallel. Based on its parameters, it
completes when one of the child behaviors completes or when all of them complete
(with the same success status). In all cases, if any behavior in the child sequence returns
an error, the parent behavior terminates with an error as well.
The behavior control system also provides a number of decorator behaviors that are
used to modify other behaviors. The counter behavior is used to repeat a behavior several
times; it can be configured to complete when the child behavior completes with success
or with failure. When it reaches its maximum count it completes with the success
or failure state of the last execution. A similar decorator is the timer behavior. Here,
instead of executing the child behavior a specified maximum number of times, we execute
the behavior up to a specified time limit; completion properties can be specified similarly.
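As a sketch, the composite and counter behaviors described above can be expressed as follows; this is a simplified synchronous model (the actual C# system also reports an in-progress state each cycle and includes parallel and timer behaviors), and all names are illustrative:

```java
import java.util.List;

// Completion statuses as described above; "in progress" is omitted in
// this synchronous sketch.
enum Status { SUCCESS, FAILURE, ERROR }

interface Behavior { Status run(); }

// AND branch: fails (or errors) on the first non-succeeding child,
// succeeds if all children succeed.
class SequentialBehavior implements Behavior {
    private final List<Behavior> children;
    SequentialBehavior(List<Behavior> children) { this.children = children; }
    public Status run() {
        for (Behavior b : children) {
            Status s = b.run();
            if (s != Status.SUCCESS) return s; // propagates FAILURE or ERROR
        }
        return Status.SUCCESS;
    }
}

// OR branch: succeeds on the first succeeding child, fails if all fail.
class SelectorBehavior implements Behavior {
    private final List<Behavior> children;
    SelectorBehavior(List<Behavior> children) { this.children = children; }
    public Status run() {
        for (Behavior b : children) {
            Status s = b.run();
            if (s != Status.FAILURE) return s; // propagates SUCCESS or ERROR
        }
        return Status.FAILURE;
    }
}

// Decorator: repeats the child up to maxRuns times, stopping early when the
// configured completion status is reached; otherwise reports the status of
// the last execution.
class CounterBehavior implements Behavior {
    private final Behavior child;
    private final int maxRuns;
    private final Status stopOn;
    CounterBehavior(Behavior child, int maxRuns, Status stopOn) {
        this.child = child; this.maxRuns = maxRuns; this.stopOn = stopOn;
    }
    public Status run() {
        Status s = Status.FAILURE;
        for (int i = 0; i < maxRuns; i++) {
            s = child.run();
            if (s == stopOn || s == Status.ERROR) return s;
        }
        return s;
    }
}
```

Because `Behavior` is a single-method interface, leaf behaviors can be supplied as lambdas, which keeps scenario-specific trees compact.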
Several basic behavior building blocks are used to control behavior tree branching
based on various triggers. A simple condition behavior can be used either as a precon-
dition in selector behaviors or as a blocking condition in sequential behaviors. The con-
ditions can use simple boolean lambda expressions or they can be based on event
triggers or user input classification. A condition can, for example, represent whether
there are any avatars in the vicinity, whether someone provided specific information in
conversation, or any other arbitrary condition that is computable based on available data.
Besides these generic building blocks we also have a number of navigation behav-
iors. The navigation behaviors access the navigation map of the agent, which is represented
as a graph of walkways that the agent has access to. The basic walk-to behavior uses this
navigation map to construct a path to a specified location using the A* algorithm (Hart
et al. 1968). For the final leg to the target, the agent moves directly to the target loca-
tion under the assumption that the navigation map is designed so that obstacles are
accounted for in the graph. If the target is too far away from the navigation map, it is
treated as inaccessible, in which case the behavior results in failure.
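The path-construction step can be sketched as a compact A* over the walkway graph, with straight-line distance as the heuristic; the graph representation and names here are illustrative, not the actual implementation:

```java
import java.util.*;

// Hypothetical sketch of the walk-to path search over a walkway graph.
// Nodes are 2D points; edges list the reachable neighbor indices.
class WalkTo {
    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Returns the node sequence from start to goal, or null if the goal is
    // not reachable through the navigation map (the behavior then fails).
    static List<Integer> aStar(double[][] nodes, Map<Integer, List<Integer>> edges,
                               int start, int goal) {
        double[] g = new double[nodes.length];          // best known cost to each node
        Arrays.fill(g, Double.POSITIVE_INFINITY);
        g[start] = 0;
        int[] parent = new int[nodes.length];
        Arrays.fill(parent, -1);
        // open-set entries are {f = g + heuristic, node}, ordered by f
        PriorityQueue<double[]> open = new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));
        open.add(new double[]{dist(nodes[start], nodes[goal]), start});
        while (!open.isEmpty()) {
            int u = (int) open.poll()[1];
            if (u == goal) {                            // reconstruct the path
                LinkedList<Integer> path = new LinkedList<>();
                for (int v = goal; v != -1; v = parent[v]) path.addFirst(v);
                return path;
            }
            for (int v : edges.getOrDefault(u, List.of())) {
                double alt = g[u] + dist(nodes[u], nodes[v]);
                if (alt < g[v]) {                       // found a shorter route to v
                    g[v] = alt;
                    parent[v] = u;
                    open.add(new double[]{alt + dist(nodes[v], nodes[goal]), v});
                }
            }
        }
        return null;
    }
}
```

The straight-line heuristic is admissible for walkway graphs whose edge costs are Euclidean distances, so the first time the goal is popped the path is optimal.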
On top of the basic move behavior we built the guiding behavior, which takes as
input a group of avatars to be guided to a destination. The behavior can be
customized with parameters describing when and how long to wait for avatars that
fall behind and with custom procedures that provide further explanations to guided
avatars that are having problems. Two simpler behaviors allow for following
other avatars and waiting in lines.
The modular nature of the behaviors provides many opportunities for reuse. This
way the scenario-specific behaviors can focus on the specific tasks they have to
perform.
The background conversation simulation is implemented as a behavior for each of
the characters involved. It behaves in an asynchronous manner and contains a link to
the main computational thread where the conversational process computations are per-
formed. The environment required for the simulation is directly tied to the Vigor frame-
work and provides information such as positioning of the characters as well as events
required for notification of movement completion.
It is the hierarchical nature of behaviors that allows for the implementation of multiple
levels of detail at the simulation level. In the example of inspecting a checkpoint, the
inspection behavior can have several implementations at various levels of detail depending on
whether the user is close enough to see the results. At a low level of detail all the agents
have to do is communicate at the agent level, without any visual representation. At the
high level, however, we want all the details of actually exchanging communication in the
virtual world, including all animations and other behaviors that accompany it. In the case of
more complex behaviors we can decide whether to skip the subbehaviors and just
apply the end result, or whether to go through all the steps involved.
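A minimal sketch of how a hierarchical behavior might select between levels of detail, assuming a hypothetical distance threshold; the class and the step lists are illustrative, not from the actual checkpoint behavior:

```python
HIGH_DETAIL_RANGE = 30.0  # run full animations only when the user can see them

class InspectCheckpoint:
    def __init__(self, agent, user_distance):
        self.agent = agent          # agent state, here just a dict
        self.user_distance = user_distance

    def run(self):
        if self.user_distance <= HIGH_DETAIL_RANGE:
            return self.run_high_detail()
        return self.run_low_detail()

    def run_low_detail(self):
        # Skip the subbehaviors and just apply the end result at the agent level.
        self.agent["inspected"] = True
        return ["agent-level message exchange"]

    def run_high_detail(self):
        # Go through every step, including animations and in-world communication.
        steps = ["walk to post", "greet guard", "exchange report", "salute"]
        self.agent["inspected"] = True
        return steps
```

Both branches leave the agent in the same end state; only the visible steps differ.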
6.6.2 Director
The director component is mainly in use when the roles and behaviors of the avatars
have to change as a result of changes in the environment or to progress a story, as is
the case in the checkpoint exercise (see section 6.9). While agents usually perform
actions from their local perspective based on their internal goals, the director represents
an agency that has a bird's-eye view of the complete situation. It acts with the goals
of the overall application in mind and can modify the internal goals of the agents
and their behaviors to achieve an overall effect. It is especially essential in interactive
storytelling, where it is responsible for maintaining the story arc.
In the context of background conversation simulation the director could serve as a
means for providing transitions between levels of detail. Based on the actions of the user
it could for example determine that the group of characters currently being simulated at
the level of background conversation should change to some other more detailed simu-
lation. This would involve disabling the background conversational behavior, replacing
it with the new behavior and transferring any needed information state to make the tran-
sition as seamless as possible.
In the context of the checkpoint exercise the director component is responsible for the
execution of story vignettes. The main goal is to provide a stream of interesting experiences
for the trainees without overloading them, while keeping the stories consistent. There are a
number of avatars that come through the checkpoint just to provide traffic and baseline
experience for the trainees. In addition to these the director can start vignettes that pro-
vide more complex stories involving several avatars where the trainees have to question
them more thoroughly and possibly investigate the situation in the village. The direc-
tor periodically checks how many vignettes are active and starts new ones to keep the
trainees busy. Since some avatars can be used in several vignettes it is important that we
keep track of which vignettes are currently active and only start those where all needed
resources are available.
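The scheduling loop described above might look as follows; the vignette names, casts, and target count are made-up examples, not the actual exercise data:

```python
MAX_ACTIVE = 2  # how many vignettes the director tries to keep running

# Each vignette needs a set of avatars; some avatars appear in several vignettes.
VIGNETTES = {
    "pregnant_woman": {"fatima"},
    "diversion": {"amir", "layla", "boy"},
    "smuggler": {"amir"},  # shares an avatar with "diversion"
}

def start_vignettes(active):
    """Start new vignettes up to MAX_ACTIVE, skipping those whose avatars are busy."""
    busy = set()
    for v in active:
        busy |= VIGNETTES[v]
    started = list(active)
    for name, cast in VIGNETTES.items():
        if len(started) >= MAX_ACTIVE:
            break
        if name not in started and not (cast & busy):
            started.append(name)
            busy |= cast
    return started
```

Because "smuggler" reuses an avatar from "diversion", it can only start once that vignette has finished and released its cast.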
After selecting an available vignette there are several steps the director has to perform.
First it has to decide which version of the vignette to create. This is either completely
random or preselected for controlled experiments. Depending on the story it then
assigns inventory to all the involved avatars and determines how they respond to a physical
search and the metal detector. It then assigns the behaviors for the actors. In most cases this
involves coming to the checkpoint, waiting in line and answering questions until they are
let through. In cases where coordination between several avatars is involved, the director
uses behavior conditions that depend on shared synchronization elements. Finally, the
director has to assign a knowledge base to the avatars. The knowledge the avatars have
depends on the vignette involved and the version of the story. This is accomplished by
assigning a classifier to the avatars and is explained in more detail in the next section.
6.6.3 Conversation Management
Most conversations in virtual worlds such as Second Life take place in chat with some
exchanges using instant messages. All the avatars use a typing animation and simulate
the time needed to type a message. Since any avatar could potentially be played by a
real user, all interactions between avatars are treated as if human users were controlling
them. Whenever an avatar receives a message over chat he tries to interpret whether he
was the intended recipient. We currently rely mostly on proxemics and body orientation
to determine the appropriate recipient and do not use any contextual cues other than
in some very specific situations. The agent queries all the avatars in proximity of the
speaker and compares them. It first compares the angle between speaker gaze direction
and position of the avatar. An avatar that is closer to the gaze will be a more likely
recipient. When several avatars are within a certain angle of each other the decision is
based on distance, with the closest avatar selected as the recipient. The users view the
environment from a third-person perspective, and we found the most success using
an angle about the size of the field of view subtended by the user's body. Based on this
comparison the avatar decides whether he was the intended recipient or he just overheard
something intended for someone else. In most cases the agent will only respond when
he is the intended recipient, but he can use a separate classifier to interpret the overheard
messages and can react to those based on who the speaker and recipient are and the
classification of the message. This can, for example, be used to have civilians that are
waiting in line at the checkpoint react if someone is being inspected in an inappropriate
way.
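The recipient heuristic can be sketched as below; the tie-breaking angle is a hypothetical stand-in for the field-of-view threshold mentioned above:

```python
import math

ANGLE_TIE = math.radians(15)  # avatars within this angle of each other compete on distance

def angle_to(speaker_pos, gaze_dir, avatar_pos):
    """Angle between the speaker's gaze direction and the direction to an avatar."""
    dx, dy = avatar_pos[0] - speaker_pos[0], avatar_pos[1] - speaker_pos[1]
    a = math.atan2(dy, dx) - math.atan2(gaze_dir[1], gaze_dir[0])
    return abs(math.atan2(math.sin(a), math.cos(a)))  # wrap to [0, pi]

def intended_recipient(speaker_pos, gaze_dir, avatars):
    """avatars: dict name -> (x, y). Gaze angle decides; distance breaks near-ties."""
    ang = {n: angle_to(speaker_pos, gaze_dir, p) for n, p in avatars.items()}
    best = min(ang.values())
    # All avatars within the tie threshold of the best angle compete on distance.
    tied = [n for n in avatars if ang[n] - best <= ANGLE_TIE]
    return min(tied, key=lambda n: math.dist(speaker_pos, avatars[n]))
```

An agent would call this with the speaker's position and gaze and the nearby avatars, then check whether the returned name is its own.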
The agents used in the checkpoint exercise use the NPCEditor for text classifica-
tion (Leuski and Traum 2010). This allows the author to specify a list of example user
inputs and outputs that should be produced by the avatars and link them. This informa-
tion is used to train the classifier so it can select an appropriate response for previously
unseen user input. Each avatar can have more than one classifier and the classifiers
can be organized in an inheritance hierarchy. For example each character can have a
classifier for a general knowledge base that is independent of the situation and two clas-
sifiers for each vignette, one for each version, that inherit from the general classifier and
provide additional knowledge required in each case.
We have used a mixed classification approach where some of the user inputs are
classified to dialog acts and some directly to text. This authoring choice mainly depends
on how dynamic the response has to be and whether it is different based on information
state of the agents. If the agent should always respond in a particular way then a simple
text to text classification is sufficient. If the agent has a few options of what to say then
all the responses are listed and annotated with conditions. Such responses are then inter-
preted by the agent and the appropriate answer is selected based on current information
state. This can be used for example to only reveal certain information after some con-
dition is met. While in theory we could create a separate classifier for each information
state combination, this would quickly become infeasible. The final option is mapping
from text to dialog act. This is used either for generic categories such as greetings,
thanks or confirmations, or in situations where the generated response depends on more
complex conditions.
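The interpretation of condition-annotated responses might be sketched as follows; the condition format and example texts are illustrative assumptions:

```python
def select_response(candidates, state):
    """candidates: list of (condition, text); condition is an information-state key
    or None for an unconditional fallback. Returns the first response whose
    condition holds in the current information state."""
    for condition, text in candidates:
        if condition is None or state.get(condition, False):
            return text
    return None

# Example: only reveal certain information after a condition is met.
candidates = [
    ("searched_house", "The smuggler hides in the cellar."),
    (None, "I know nothing about any smuggler."),
]
```

Here the first answer is only produced once the `searched_house` flag has been set in the agent's information state.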
In addition to the generated text the classification output also contains annotations
for actions and simple text replacement templates. Everything that requires some pro-
cessing by the agent is enclosed in square brackets. These square bracket commands
can be used to represent preprocessing commands, text template commands or action
commands.
When an agent receives a response with a single command, such as in the case of text to
dialog act mapping, this is interpreted as a preprocessing command. Based on the preprocessing
instruction in the brackets the agent expands the dialog act and generates output
text based on the current information state. An example of this would be “[GREETING]”.
When this preprocessing command is received, the agent can inspect its information state
to determine when he last greeted the avatar he is interacting with, as well as their
relationship, to decide on a proper response. This can include further text templates and
action commands that are used in further post-processing. After this possible preprocessing,
the text template expansion takes place. This can be simple text replacement,
such as the name of the agent or the current time. It can also contain conditional evaluations, in
which case text is generated only if some conditions are true. For example “[FEMALE
I am from this village.][MALE Here.]” will generate a different response depending on
the sex of the avatar that asked the question.
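A sketch of this conditional template expansion, assuming tags map to boolean flags in the information state (the tag set and matching rule are simplified):

```python
import re

# Matches square-bracket commands such as [GREETING] or [FEMALE some text].
TEMPLATE = re.compile(r"\[([A-Z]+)( [^\]]*)?\]")

def expand(text, state):
    """Expand conditional tags: keep the bracketed text when the tag corresponds
    to a true flag in the information state, drop the whole command otherwise."""
    def sub(match):
        tag, body = match.group(1), (match.group(2) or "").lstrip()
        return body if state.get(tag.lower(), False) else ""
    return TEMPLATE.sub(sub, text)
```

With the information state indicating the asker's sex, only the matching branch survives expansion.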
At this point the agent has the final generated response, which it outputs either in chat
or over instant message, depending on the conversation from which the input was
received. Finally, the agent performs another pass over the response text to parse
the action commands. These can either include actions to be performed in the world such
as walking to a specific location or instantiation of a specific behavior, or can include
commands that instruct the agent to modify its information state. In the case of “Of
course.[GUIDE AhmedHouseOutside]” for example the agent confirms the instruction
and then performs the behavior of guiding the avatar to Ahmed’s house.
Finally, the classification responses can be annotated for context resolution. The
author can mark the salient context words with curly brackets or use an action command
that modifies the active context of the agent in case the context is apparent but does not
appear explicitly in text. This context can then be used with context text template in
combination with context resolution classification. A response in the classifier can be
marked as requiring context resolution. This can be the case for example if ambiguous
references such as pronouns are used. If a response is marked for context resolution
then the agent will send the question to the classifier again after it performs its post-
processing which allows the classifier to respond as if the user used direct references.
In the case of “No, but my {son} and his friends know more about the local boys than I
do.” son is marked as the active context. An alternative when the context is not explicitly
mentioned is to use an information state modifying command such as “He is right over
there.[SETCONTEXT Ahmed]”. When pronoun resolution is required, the question is then
rephrased, such as “Where does [CONTEXT] live?”
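The context resolution pass can be sketched as follows; the marker syntax follows the examples above, while the class structure is an illustrative assumption:

```python
import re

class ContextState:
    def __init__(self):
        self.context = None  # the currently salient entity, if any

    def process_response(self, text):
        """Record salient context from a response and return the cleaned text."""
        m = re.search(r"\{(\w+)\}", text)          # {curly} marks a salient word
        if m:
            self.context = m.group(1)
        m = re.search(r"\[SETCONTEXT (\w+)\]", text)  # explicit context command
        if m:
            self.context = m.group(1)
        # Strip the annotations before showing the text to the user.
        return re.sub(r"\[SETCONTEXT \w+\]", "", re.sub(r"[{}]", "", text)).strip()

    def resolve(self, question_template):
        """Rephrase a context-dependent question using the active context."""
        return question_template.replace("[CONTEXT]", self.context or "")
```

The rephrased question is then re-sent to the classifier as if the user had used a direct reference.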
Another form of context resolution appears when we have several vignettes going
on at the same time. A particular classifier will only contain information about a
specific vignette. If we wanted a single classifier to handle all cases then we would need
a classifier for each combination of vignette versions, since each vignette comes in a
dangerous and a non-dangerous variant. This would be infeasible, so we use classifier
tossing instead. Here we provide responses only for the user input relevant to the specific
vignette, and for all other user input we just classify which vignette it is related to.
This way, if the user asks a question that cannot be handled by the active classifier, the
agent can query the director to find out which version of the vignette is currently running
and can then pass the question to the appropriate classifier.
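Classifier tossing might be sketched as below; the classifier contents and the director query are illustrative stand-ins:

```python
# One classifier per (vignette, version) pair; contents are made-up examples.
CLASSIFIERS = {
    ("diversion", "safe"): {"what is in the backpack": "Medicine for a sick girl."},
    ("diversion", "dangerous"): {"what is in the backpack": "Contraband."},
    ("pregnant_woman", "safe"): {"why the fake id": "She wants to flee the country."},
}

# The active classifier only labels out-of-scope questions with their vignette.
TOSS = {"why the fake id": "pregnant_woman"}

def answer(active, question, director_version):
    """active: (vignette, version) of the current classifier; director_version
    maps a vignette name to the version the director is currently running."""
    if question in CLASSIFIERS[active]:
        return CLASSIFIERS[active][question]
    target = TOSS.get(question)
    if target is None:
        return None  # genuinely unknown question
    # Toss: ask the director which version is running, then re-classify there.
    return CLASSIFIERS[(target, director_version(target))].get(question)
```

Only the director knows which variant of each vignette is live, so the agent defers that lookup to it.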
6.7 MRE and SASO
The Mission Rehearsal Exercise (MRE) (Hill et al. 2003) was the first application to use
the background conversation simulation. The goal of MRE is to teach critical decision-
making skills to small-unit leaders in the U.S. Army. When the user arrives at the scene
in a small Balkan town, he finds out that one of his soldiers was involved in an accident
with a civilian vehicle. The goal of the user is then to figure out how to deal with this new
situation in light of his previous instructions. While he is dealing with the situation at
the crossroad, a crowd has started to gather in the background. Those furthest away are
controlled by simple scripted behaviors while the ones closest to the scene are controlled
by the background conversation simulation.
The main benefit of the background conversation simulation in this case compared to
using scripted behaviors is that it provides a lot of variability while scripts would quickly
become repetitive unless a significant amount of effort was put into creating long and
elaborate scripts. Additionally, the agents are also able to react to events taking place in
the foreground.
In SASO-ST (Traum et al. 2005), the user plays the role of a local military com-
mander that must negotiate with a doctor in order to persuade him to move the clinic to
a safer location. The later SASO-EN scenario (Traum et al. 2008) extends this with a
three-party negotiation between the captain, doctor and the local elder. The negotiation
takes place at a cafe in the market of the town where the clinic is located. The back-
ground conversation simulation is used as an optional component of the scenario and is
used to control the characters moving around in the market. Unlike the MRE scenario
where the group is static, here the background characters freely move around town and
spontaneously form conversation groups as they meet in the market.
6.8 SDO Moleno
Lt Moleno was a junior officer who watched over two islands in Second Life where users
could find information about the US Army and participate in activities such as a quiz, a
helicopter ride, and a parachute jump. As a real staff duty officer would, he patrolled the
area to make sure everything was ok. Since this was primarily a tourist destination site,
he was also equipped to interact with visitors and give them information about the island
as well as giving a guided tour as he went through his rounds. The knowledge domain
of the SDO primarily covered the information about the two islands. He also knew of
knowledge sources for other information, such as facts about the Army; if people asked
about these, he could refer the users to those sources. If someone asked a
question that was out of his domain of expertise he would promise to try to find the answer
and relay the question to a remote human monitor.
One of the main features of the SDO was to assist the users in a proactive way.
Unlike question-answering agents such as those of Leuski et al. (2006) and Artstein et al. (2008),
where the agent just answers questions when asked, the SDO actively sought out users
who might need help, similar to Weitnauer et al. (2008).
The SDO kept a user model for everyone he met. This model persisted between
sessions. He classified users in several categories that were used in deciding how to
interact with them. In addition, the SDO stored further information about the
users to help with decisions, such as the last time the user was greeted and the time of the
last interaction. He also tracked some real-time parameters such as online
status, away-from-keyboard (AFK) status, whether they were typing a message, and their
location.
The SDO could be in one of several states which determined his behavior. He could
be idle, following a path, approaching a user avatar, engaged in conversation, guiding
users or waiting for users.
Figure 6.10: SDO interacting with a returning user.
When the SDO was idle he would perform routine rounds around the islands, checking
that everything was in order. When he detected an avatar that he was not aware of he
would approach its location to investigate. For identified avatars the SDO would
evaluate how important it was to approach them. In general he would check on users no
more often than about every 5 minutes to see if they needed any help. This applied only
to new users; once users were classified as advanced he would no longer check
on them. In addition, he would not approach an avatar he had identified as AFK.
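The approach decision just described can be summarized in a small sketch; the field names are assumptions, while the 5-minute interval and the advanced/AFK rules follow the text:

```python
CHECK_INTERVAL = 5 * 60  # seconds between check-ins on a new user

def should_approach(user, now):
    """user: dict with 'known', 'category', 'afk', 'last_checked' fields;
    now: current time in seconds."""
    if not user.get("known"):
        return True  # unidentified avatar: approach to investigate
    if user.get("afk"):
        return False  # never approach an avatar marked away from keyboard
    if user.get("category") == "advanced":
        return False  # advanced users are no longer checked on
    return now - user.get("last_checked", 0) >= CHECK_INTERVAL
```

The idle behavior would evaluate this for each visible avatar on every round.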
When approaching an avatar the SDO would first move to an appropriate distance
from the avatar before engaging. Chat in Second Life is only visible within a limited range.
In addition, the SDO tried to follow proxemics norms as they translate from the real world into
virtual worlds (Jan and Traum 2007; Friedman et al. 2007). Once he reached the desired
distance he addressed the user based on aspects of his information state, including the
user model. If this was the first time he was interacting with the user he would introduce
himself, give a calling card that enabled users to send instant messages to him, and offer
assistance. For returning users he would greet them if he had not seen them for some
time (more than 3 hours) or just ask if everything was ok. A similar greeting behavior was
performed when the user started the interaction, as opposed to the SDO seeking them out.
When guiding users around the island he would make sure everyone in the group
stayed with him. If someone fell too far behind he would first wait a bit, and if they
did not come he would send them an instant message offering a teleport in case they got
lost. During a guided tour, if anyone in the group started to type, the SDO would stop,
wait to see what they had to say, and enter conversation mode. During a conversation,
if no one took a turn for a while, he would try to resume the guided tour if one was pending.
The SDO was operational for about a year and a half during which he interacted with
over 5000 unique visitors and recorded over 50000 lines of dialog. More information
on analysis of the collected data can be found in Robinson et al. (2010).
6.9 Checkpoint Exercise
The Checkpoint Exercise involves a team of two soldier avatars, stationed at a check-
point outside of a Middle Eastern desert village. In the current setup one of the avatars is
controlled by the human trainee and the other by a virtual human (Grunt Rumble); potentially
the second avatar could also be controlled by a human player. Multiple indigenous
villagers (avatar agents) approach the checkpoint to enter the village. The task of the
team is to make sure that no illegal or dangerous goods make it into the village. The
teammates take turns inspecting the village visitors, examining their possessions and
identifications. They can question the visitors about their business in the village. If a
visitor’s story raises suspicions, the team may decide to seek confirmation by sending
one of the members into the village to investigate. The team can either allow the visitor
into the village or detain him based on the outcome of the interview and investigation.
Figure 6.11: Scene at the school during Federal Virtual Worlds Challenge evaluation.
Besides normal villagers, the exercise consists of a number of different story
vignettes, each with two possible outcomes (chosen at random) that could be revealed
after investigation. The stories were designed to be nonlinear (sandbox style), with
multiple characters reused across multiple unfolding narratives. This means that the
ordering of specific tasks in the story was not enforced. Instead the characters are only
given motivations that provide for an interesting setup, but how the narrative resolves is
entirely dependent on how the user interacts with the characters and the environment.
The pregnant woman scenario consists of a female who comes to the checkpoint
in suspiciously bulging clothes. Examination of her belongings reveals a fake ID and
a suspicious story. Thorough investigation inside the village reveals that either she is
lying to cover up the fact that she is secretly pregnant out of wedlock, and is seeking
to flee the country, or she is not pregnant and is smuggling plastic explosive under her
clothing.
The diversion scenario involves a group of people who walk up to the checkpoint
at the same time. As soon as a soldier begins inspecting them, a woman accuses the
soldier of stealing her ring and the group becomes irate. While the villagers are furiously
hurling accusations at the checkpoint soldiers, a small boy runs through the checkpoint
carrying a backpack. After investigation in the village, it is revealed that the group
was intentionally creating a distraction for the boy so he could sneak through. The
alternate outcomes are that the boy was either sneaking medicine through for his friend’s
sick sister, afraid that it would be seized by the foreigners, or the boy was sneaking
contraband through the checkpoint.
6.10 Conclusion
In this chapter we examined how the conversational simulation fits into the larger
framework. We reviewed other work on simulation levels of detail, of which this is a particular
example. The Vigor framework we described is an example of a simulation framework
that supports the concept of simulation levels of detail. What enables the mixing of different
simulation fidelities is the hierarchical nature of the behaviors. We have also seen how the
director component could be used for transitioning between different levels of detail.
While we have not worked specifically on the transitioning process, we have examined
several virtual humans that operate at different levels of detail. It is our expectation that
using a compatible simulation framework at different levels of detail will enable a
more seamless transition.
In addition to exploring the concept of levels of detail, this chapter serves as a
demonstration of how the simulation and its parts were used in actual applications. We
have described some of the implementation details and functionality of the user interface,
as well as how the integration with different rendering environments was performed.
Finally, we described the MRE, SASO, SDO Moleno and Checkpoint Exercise projects
that were used as the test bed for work in this dissertation.
Chapter 7
Conclusion
7.1 Summary
This dissertation examined the effects of various components of a multiparty dialog
simulation on the believability of the generated simulation. This is an important question
because if we know what will have a bigger impact we can focus the computational
resources and development time on those specific features for more overall effect.
In addition, when developing the dialog simulation we were trying to follow the
requirements for background virtual human simulations. In order to achieve real-time
performance we exploited the fact that the user is not interacting with the background
virtual humans. When designing the algorithms employed in the simulation this was
a constant driving force to simplify whenever the resulting change did not have a big
impact on believability or variability. Chapter 2 describes how the requirements influ-
enced the design of the base simulation model, focusing on appearance of conversation
and the patterns of interaction, rather than actual information exchange or communica-
tion of internal state. It shows how variability is achieved through conversation attributes
of the agents and how it attempts to achieve believability by modeling the conversational
behavior based on theories from sociolinguistics and discourse analysis. The main con-
tribution compared to previous work was the design of an asynchronous model of turn-
taking behavior in small groups that allowed dynamic changes in group structure such
as joining and leaving of conversation groups as well as splits in conversation.
In terms of effects on believability the initial evaluation of the simulation identi-
fied that the believability was hindered when the agents did not appropriately take into
account their positioning. This resulted in the work presented in chapter 3 on the
movement and positioning algorithm. The algorithm is based on a social force model and
models the behavior observed by social psychologists. By allowing the agents to
move, and by taking positioning into account when forming conversation groups, the
simulation produces more believable conversational groupings in which agents do not talk to
each other across large distances without regard for other agents around them.
Another aspect of believability we inspected was how humans judge cultural appro-
priateness of various behaviors. Chapter 4 described the computational model of
culture-specific behavior in multiparty conversation that was developed based on lit-
erature review to model conversational groups representing Anglo American, Spanish-
speaking Mexican and Arab culture. When evaluating the believability of generated
simulations we found that proxemics has the biggest effect in terms of cultural appro-
priateness while we did not find a significant effect for gaze and overlap in turn-taking.
Finally, in chapter 5 we refined the model based on a cross-cultural dialog corpus we
developed and extended the simulation with a conversation structure model. We have
also examined the perception of cultural appropriateness on videos from the corpus and
found that this task appears just as hard on real human recordings as for judgments of
cultural appropriateness on videos of simulated conversations. The impact of cultural
modeling on overall believability is thus not completely clear and remains an open ques-
tion for the future.
The rest of the contributions in chapter 6 are related to implementation and explo-
ration of levels of detail. We describe the inner workings of the simulation software
and how it was integrated with rendering environments such as Unreal Tournament and
Second Life. We showed how the base simulation can be used in its primary role of
background conversational simulation in the case of MRE and SASO. SDO Moleno and
Checkpoint Exercise on the other hand show how components of the simulation such as
gaze behavior and movement can be used at other levels of detail where agents interact
with the user.
The movement algorithm described here is also applicable in other situations. For
example, Pedica and Vilhjálmsson (2008) adjust the model for the automated
movement of human avatars involved in conversations in online virtual worlds.
Also of interest is Lakshika et al. (2012), who attempt to provide automated believability
evaluation trained on human judgments and use it to drive genetic algorithms that tune
the force parameters in their model.
Section 6.6 provided a detailed description of the behavior simulation framework for
agents in online virtual worlds. The reader that is interested in more results in this area
should complement this with Jan et al. (2009, 2011), Roque et al. (2011) and Robinson
et al. (2010), which include a lot more details regarding the application and evaluation
side of that work.
Section 5.5 described the cultural variations from the cross-cultural dialog corpus in
the context of the cultural model in conversational simulation. Readers that are inter-
ested in further information about the corpus and further cross-cultural analysis of the
data should check Herrera et al. (2010, 2011) and Herrera (2010).
7.2 Limitations and Future Work
The dialog simulation described in this dissertation is ideally suited for simulating vir-
tual humans that don’t interact with the user of the application. As described in sec-
tion 6.1, it fits a particular simulation level of detail. In a fully interactive environment
where the user is free to walk around and interact with everyone we cannot guarantee
that a particular virtual human will remain at the same simulation level of detail. This
was, for example, very evident in the Checkpoint Exercise that we examined in chapter 6.
With this in mind, it would be productive to examine possibilities for elevating the LOD
dynamically.
First, as the user gets closer to the agents he will come into their hearing range. The
current simulation is silent, and the user will not perceive it as believable when he is close
enough that he would expect to hear the conversation. It would make sense for agents
that are close enough to actually generate speech or use pre-recorded voice clips. One
potential problem would be coherence, as the underlying simulation does not take
content into account. One possibility would be to assign a topic id and an IPA category to
each voice clip. The simulation would have to be augmented so that agents generate the
topic id in addition to the IPA category; a corresponding voice clip would then be
randomly selected and played.
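Such a topic-id and IPA-category augmentation might be sketched as follows; the clip inventory and category names are invented for illustration:

```python
import random

# Hypothetical clip inventory: (topic id, IPA category, clip file).
CLIPS = [
    ("weather", "gives_opinion", "weather_opinion_01.wav"),
    ("weather", "gives_opinion", "weather_opinion_02.wav"),
    ("weather", "asks_question", "weather_question_01.wav"),
    ("market", "gives_opinion", "market_opinion_01.wav"),
]

def pick_clip(topic, ipa_category, rng=random):
    """Randomly select a clip matching the generated topic id and IPA category."""
    matching = [c for t, cat, c in CLIPS if t == topic and cat == ipa_category]
    return rng.choice(matching) if matching else None
```

The simulation would emit the (topic, category) pair it already decides on, and this lookup would supply the audible realization.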
As the user gets even closer, the agents would have to take him into account when
reasoning about their positioning. For this purpose the user could be represented in the
simulation in the same way as all the other agents, except that he would not be controlled
by the simulation. If the user actually engaged in conversation, one possibility would be
to categorize his speech into topic ids and IPA categories and simply send a message
to the agents in the simulation as if he were one of them. Another way would be to elevate
the virtual human to be controlled by a more complex agent, such as the ones used for
tactical question answering (Traum et al. 2007) and similar agents based on the Vigor
framework. The applications described in chapter 6 serve as a good starting point, but
further work is required in expanding this area of research.
Another limitation comes from the scope of the simulation. The simulation focuses
on behavioral control mainly in terms of timing and high-level behavior selection,
but does not explicitly attempt to define the behavior primitives; instead we just assume
they are provided by the rendering environment. It is clear that how the gestures and
other animations look, and how the characters visually appear, has an important effect on
their believability. This area falls under the scope of computer graphics, and it would be
interesting to explore how much the visual appearance itself contributes
to believability compared to high-level dynamics.
In the area of cultural modeling we have only focused on three specific cultures.
While the general framework should be applicable for most cultures, the data is not
available for all of them. Based on our findings, we would recommend that, in the case of
limited resources, one focus on collecting proxemics data, as it had the biggest
effect on believability.
Finally, work on conversation structure modeling is just in its infancy. It currently
requires explicit assignment of interaction profiles, which have to be adjusted when the
conversation group structure changes. There are possibilities to automate the creation
of profiles and their assignment to some degree, but it is currently not clear what form
of input would best drive the modification and assignment of profiles. Most likely
this will become clearer after this subsystem is used in a variety of scenarios,
which will allow us to see which parts of the assignment process are general and which are
scenario specific.
Bibliography
A. Akhter Lipi, Y. Nakano, and M. Rehm. A parameter-based model for generat-
ing culturally adaptive nonverbal behaviors in embodied conversational agents. In
C. Stephanidis, editor, Universal Access in Human-Computer Interaction. Intelligent
and Ubiquitous Interaction Environments, volume 5615 of Lecture Notes in Computer
Science, pages 631–640. Springer Berlin / Heidelberg, 2009.
M. Argyle and M. Cook. Gaze and Mutual Gaze. Cambridge University Press, 1976.
M. Argyle and R. Ingham. Gaze, mutual gaze, and proximity. Semiotica, 6(1):32–50,
1972.
R. Artstein, J. Cannon, S. Gandhe, J. Gerten, J. Henderer, A. Leuski, and D. Traum.
Coherence of Off-Topic Responses for a Virtual Character. In 26th Army Science
Conference, Orlando, Florida, 2008.
J. N. Bailenson, J. Blascovich, A. C. Beall, and J. M. Loomis. Interpersonal distance
in immersive virtual environments. Personality and Social Psychology Bulletin, 29:
819–833, 2003.
R. Bales. Interaction Process Analysis: A Method for the Study of Small Groups. Uni-
versity of Chicago Press, 1976.
R. Bales, F. Strodtbeck, T. Mills, and M. Roseborough. Channels of communication in
small groups. American Sociological Review, 16(4):461–468, 1951.
J. C. Baxter. Interpersonal spacing in natural settings. Sociometry, 33(4):444–456,
December 1970.
A. Berry. Spanish and American turn-taking styles: A comparative study. Pragmatics
and Language Learning, monograph series, 5:180–190, 1994.
J. Blascovich. A theoretical model of social influence for increasing the utility of col-
laborative virtual environments. In Proceedings of the 4th international conference
on Collaborative virtual environments, pages 25–30. ACM, 2002.
J. Bonito. An information-processing approach to participation in small groups. Com-
munication Research, 28(3):275, 2001.
E. Bouvier and P. Guilloteau. Crowd simulation in immersive space management. In M. Göbel, J. David, P. Slavik, and J. J. van Wijk, editors, Virtual Environments and Scientific Visualization ’96, pages 104–110. Springer-Verlag Wien, April 1996.
D. Brogan and J. Hodgins. Group Behaviors for Systems with Significant Dynamics.
Autonomous Robots, 4(1):137–153, 1997.
C. Brom. GAL: A framework for large artificial environments inhabited by human-like
intelligent virtual agents. Technical report, Charles University, 2005.
C. Brom, O. Sery, and T. Poch. Simulation level of detail for virtual humans. In
Pelachaud et al. (2007), pages 1–14.
S. Burger and Z. Sloane. The ISL meeting corpus: Categorical features of communicative group interactions. In Proc. ICASSP-2004 Meeting Recognition Workshop, 2004.
S. Burger, V. MacLaren, and H. Yu. The ISL meeting corpus: The impact of meeting type on speech style. In Seventh International Conference on Spoken Language Processing, 2002.
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner. The AMI meeting corpus: A pre-announcement. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction, volume 3869 of Lecture Notes in Computer Science, pages 28–39. Springer Berlin / Heidelberg, 2006.
J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan. Embodiment in conversational interfaces: Rea. Proceedings of the SIGCHI conference on Human factors in computing systems: the CHI is the limit, pages 520–527, 1999.
J. Cassell, H. H. Vilhjálmsson, and T. Bickmore. BEAT: the behavior expression animation toolkit. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 477–486, New York, NY, USA, 2001. ACM.
F. de Rosis, C. Pelachaud, and I. Poggi. Transcultural believability in embodied agents: a matter of consistent adaptation. In Agent Culture: Designing Human-Agent Interaction in a Multicultural World. Lawrence Erlbaum Associates, 2003.
B. Du-Babcock. A comparative analysis of individual communication processes in small group behavior between homogeneous and heterogeneous groups. In Proceedings of the 68th Association of Business Communication Convention, Albuquerque, New Mexico, USA, pages 1–16, 2003.
R. V. Exline. Explorations in the process of person perception: Visual interaction in relation to competition, sex, and N affiliation. Meeting of the American Psychological Association, 1960.
M. Fişek, J. Berger, and R. Norman. Participation in heterogeneous and homogeneous groups: A theoretical integration. American Journal of Sociology, pages 114–142, 1991.
D. Friedman, A. Steed, and M. Slater. Spatial social behavior in Second Life. In Pelachaud et al. (2007), pages 252–263.
T. Giang, R. Mooney, C. Peters, and C. O’Sullivan. ALOHA: Adaptive Level Of Detail for Human Animation: Towards a new framework. Eurographics 2000 short paper proceedings, pages 71–77, 2000.
G. Goetsch and D. McFarland. Models of the distribution of acts in small discussion
groups. Social Psychology Quarterly, pages 173–183, 1980.
A. Gordon, M. van Lent, M. V. Velsen, P. Carpenter, and A. Jhala. Branching storylines in virtual reality environments for leadership development. In D. L. McGuinness and G. Ferguson, editors, AAAI, pages 844–851. AAAI Press / The MIT Press, 2004.
A. Guye-Vuillème. Simulation of nonverbal social interaction and small groups dynamics in virtual environments. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, Switzerland, 2004.
E. T. Hall. Proxemics. Current Anthropology, 9(2/3):83–108, April 1968.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2):100–107, July 1968.
D. Helbing and P. Molnár. Social force model for pedestrian dynamics. Phys. Rev. E, 51(5):4282–4286, May 1995.
D. Herrera. Gaze, turn-taking and proxemics in multiparty versus dyadic conversation
across cultures. PhD thesis, The University of Texas at El Paso, 2010.
D. Herrera, D. Novick, D. Jan, and D. R. Traum. The UTEP-ICT cross-cultural mul-
tiparty multimodal dialog corpus. In Multimodal Corpora Workshop: Advances in
Capturing, Coding and Analyzing Multimodality (MMC 2010), Valletta, Malta, May
2010.
D. Herrera, D. Novick, D. Jan, and D. R. Traum. Dialog behaviors across culture
and group size. In Proceedings of Human-Computer Interaction International 2011
(HCII’11), Orlando, Florida, USA, July 2011.
R. Hill, J. Gratch, S. Marsella, J. Rickel, W. Swartout, and D. Traum. Virtual humans in the mission rehearsal exercise system. Künstliche Intelligenz, 4(03):5–10, 2003.
K. Ijaz, A. Bogdanovych, and S. Simoff. Enhancing the believability of embodied conversational agents through environment-, self- and interaction-awareness. In Proceedings of ACSC. Citeseer, 2011.
D. Jan and D. Traum. Dynamic movement and positioning of embodied agents in multi-
party conversations. In Proceedings of the Workshop on Embodied Language Process-
ing, pages 59–66, Prague, Czech Republic, June 2007. Association for Computational
Linguistics.
D. Jan and D. R. Traum. Dialog simulation for background characters. In
T. Panayiotopoulos, J. Gratch, R. Aylett, D. Ballin, P. Olivier, and T. Rist, editors,
IVA, volume 3661 of Lecture Notes in Computer Science, pages 65–74. Springer,
2005.
D. Jan, D. Herrera, B. Martinovski, D. Novick, and D. R. Traum. A computational
model of culture-specific conversational behavior. In Pelachaud et al. (2007), pages
45–56.
D. Jan, A. Roque, A. Leuski, J. Morie, and D. R. Traum. A virtual tour guide for virtual worlds. In Z. Ruttkay, M. Kipp, A. Nijholt, and H. H. Vilhjálmsson, editors, IVA, volume 5773 of Lecture Notes in Computer Science, pages 372–378. Springer, 2009.
D. Jan, E. Chance, D. Rajpurohit, D. DeVault, A. Leuski, J. Morie, and D. R. Traum. Checkpoint exercise: Training with virtual actors in virtual worlds. In H. H. Vilhjálmsson, S. Kopp, S. Marsella, and K. R. Thórisson, editors, IVA, volume 6895 of Lecture Notes in Computer Science, pages 453–454. Springer, 2011.
P. Jeffrey and G. Mark. Navigating the virtual landscape: coordinating the shared use
of space. Designing information spaces: the social navigation approach, pages 105–
124, 2003.
J. Kadane, G. Lewis, and J. Ramage. Horvath’s theory of participation in group discus-
sions. Sociometry, pages 348–361, 1969.
A. Kendon. Some functions of gaze-direction in social interaction. Acta Psychol (Amst),
26(1):22–63, 1967.
A. Kendon. Spatial Organization in Social Encounters: the F-formation System, pages
209–237. Cambridge University Press, 1990.
A. Kendon and A. Ferber. A description of some human greetings. Comparative ecology
and behaviour of primates, pages 591–668, 1973.
M. Kipp. Anvil: a generic annotation tool for multimodal dialogue. In Seventh European Conference on Speech Communication and Technology, 2001.
S. Kopp, B. Krenn, S. Marsella, A. Marshall, C. Pelachaud, H. Pirker, K. Thórisson, and H. Vilhjálmsson. Towards a common framework for multimodal generation: The behavior markup language. In Intelligent Virtual Agents, pages 205–217. Springer, 2006.
E. Lakshika, M. Barlow, and A. Easton. Fidelity and complexity of standing group conversation simulations: A framework for the evolution of multi agent systems through bootstrapping human aesthetic judgments. In Evolutionary Computation (CEC), 2012 IEEE Congress on, pages 1–8, June 2012.
R. Leik. The distribution of acts in small groups. Sociometry, pages 280–299, 1967.
A. Leuski and D. R. Traum. NPCEditor: A tool for building question-answering characters. In Proceedings of The Seventh International Conference on Language Resources and Evaluation (LREC), 2010.
A. Leuski, R. Patel, D. Traum, and B. Kennedy. Building effective question answering
characters. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue,
pages 18–27, 2006.
H. Maldonado and B. Hayes-Roth. Toward Cross-Cultural Believability in Character
Design. Agent Culture: Human-Agent Interaction in a Multicultural World, 2004.
M. Mateas and A. Stern. A behavior language for story-based believable agents. IEEE Intelligent Systems, pages 39–47, 2002.
D. Matsumoto. Culture and Nonverbal Behavior. The Sage Handbook of Nonverbal
Communication, pages 219–235, 2006.
J. C. McCroskey, T. J. Young, and V . P. Richmond. A simulation methodology for
proxemic research. Sign Language Studies, 17:357–368, 1977.
C. McPhail, W. Powers, and C. Tucker. Simulating Individual and Collective Action in
Temporary Gatherings. Social Science Computer Review, 10(1):1, 1992.
L. Meltzer, W. Morris, and D. Hayes. Interruption outcomes and vocal amplitude:
Explorations in social psychophysics. Journal of Personality and Social Psychology,
18(3):392–402, 1971.
H. Nakanishi. Freewalk: a social interaction platform for group behaviour in a virtual
space. Int. J. Hum.-Comput. Stud., 60(4):421–454, 2004.
P. O’Neill-Brown. Setting the stage for the culturally adaptive agent. Proceedings of
the 1997 AAAI Fall Symposium on Socially Intelligent Agents, Menlo Park, CA: AAAI
Press. AAAI Technical Report FS-97-02, pages 93–97, 1997.
B. Oreström. Turn-taking in English conversation. CWK Gleerup, 1983.
C. O’Sullivan, J. Cassell, H. Vilhjálmsson, J. Dingliana, S. Dobbyn, B. McNamee, C. Peters, and T. Giang. Levels of detail for crowds and groups. Computer Graphics Forum, 21(4):733–741, 2002.
E. Padilha and J. Carletta. A simulation of small group discussion. Proceedings of
EDILOG 2002: Sixth Workshop on the Semantics and Pragmatics of Dialogue, pages
117–124, 2002.
E. Padilha and J. Carletta. Nonverbal behaviours improving a simulation of small group
discussion. Proc. 1st Nordic Symp. on Multimodal Comm, pages 93–105, 2003.
E. G. Padilha. Modelling Turn-taking in a Simulation of Small Group Discussion. PhD
thesis, University of Edinburgh, 2006.
J. Patel, R. Parker, and D. R. Traum. Simulation of small group discussions for middle
level of detail crowds. Army Science Conference, 2004.
C. Pedica and H. Vilhj´ almsson. Social perception and steering for online avatars. In
Intelligent Virtual Agents, pages 104–116. Springer, 2008.
C. Pelachaud, J.-C. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé, editors. Intelligent Virtual Agents, 7th International Conference, IVA 2007, Paris, France, September 17-19, 2007, Proceedings, volume 4722 of Lecture Notes in Computer Science, 2007. Springer.
M. Rehm, E. André, and M. Nischt. Let’s Come Together: Social Navigation Behaviors of Virtual and Real Humans. In Intelligent Technologies for Interactive Entertainment: First International Conference, INTETAIN 2005, Madonna di Campiglio, Italy, November 30-December 2, 2005, Proceedings, 2005.
M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, T. Nishida, and H. Huang. The CUBE-G approach: coaching culture-specific nonverbal behavior by virtual agents. In Organizing and learning through gaming and simulation: proceedings of ISAGA, page 313, 2007a.
M. Rehm, B. Endrass, and M. Wissner. Integrating the user in the social group dynamics
of agents. Proceedings of Social Intelligence Design (SID), 2007b.
C. W. Reynolds. Flocks, herds and schools: A distributed behavioral model. SIGGRAPH
Comput. Graph., 21(4):25–34, 1987.
D. Richards, N. Szilas, M. Kavakli, and M. Dras. Impacts of visualisation, interaction
and immersion on learning using an agent-based training simulation. research and
development, 28:45, 2008.
J. Rickel, S. Marsella, J. Gratch, R. Hill, D. Traum, and W. Swartout. Toward a new
generation of virtual humans for interactive experiences. Intelligent Systems, IEEE,
17(4):32–38, 2002.
D. Robinson and J. Balkwell. Density, transitivity, and diffuse status in task-oriented
groups. Social Psychology Quarterly, pages 241–254, 1995.
S. Robinson, A. Roque, and D. R. Traum. Dialogues in context: An objective user-
oriented evaluation approach for virtual human dialogue. In 7th International Con-
ference on Language Resources and Evaluation (LREC), Valletta, Malta, May 19-21
2010.
A. Roque, D. Jan, M. G. Core, and D. R. Traum. Using virtual tour behavior to build
dialogue models for training review. In IVA, pages 100–105, 2011.
F. Rossano, P. Brown, and S. Levinson. Gaze, questioning and culture. Conversation
Analysis: Comparative Perspectives. Cambridge University Press, Cambridge, pages
187–249, 2009.
H. Sacks, E. Schegloff, and G. Jefferson. A Simplest Systematics for the Organization
of Turn-Taking for Conversation. Language, 50(4):696–735, 1974.
A. E. Scheflen. Micro-territories in human interaction. In A. Kendon, R. M. Harris, and
M. R. Key, editors, World Anthropology: Organization of Behavior in Face-to-Face
Interaction, pages 159–173. Mouton, Paris, 1975.
E. Schegloff. Preliminaries to preliminaries: “Can I ask you a question?”. Sociological Inquiry, 50(3-4):104–152, 1980.
E. Schegloff. Discourse as an interactional achievement: Some uses of “uh huh” and other things that come between sentences. Georgetown University Round Table on Languages and Linguistics, Analyzing discourse: Text and talk, pages 71–93, 1982.
D. Schiffrin. Opening encounters. American Sociological Review, 42(5):679–691, Octo-
ber 1977.
B. Segall and D. Arnold. Elvin has left the building: A publish/subscribe notification
service with quenching. In Proceedings of AUUG97, pages 3–5. Brisbane, Australia,
1997.
W. Shao and D. Terzopoulos. Autonomous pedestrians. In SCA ’05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 19–28, New York, NY, USA, 2005. ACM Press.
R. Shuter. Proxemics and tactility in Latin America. Journal of Communication, 26(3):46–52, 1976.
B. Snyder, D. Bosnanac, and R. Davies. ActiveMQ in action. Manning, 2011.
G. Stasser and L. Taylor. Speaking turns in face-to-face discussions. Journal of Person-
ality and Social Psychology, 60(5):675, 1991.
G. Stasser and S. Vaughan. Models of participation during face-to-face unstructured
discussion. Understanding group behavior: Consensual action by small groups, 1:
165–192, 1996.
F. Stephan. The relative rate of communication between members of small groups.
American Sociological Review, 17(4):482–486, 1952.
F. Stephan and E. Mishler. The distribution of participation in small groups: An expo-
nential approximation. American Sociological Review, 17(5):598–608, 1952.
G. K. Still. Crowd Dynamics. PhD thesis, Warwick University, 2000.
T. Stivers, N. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann,
F. Rossano, J. De Ruiter, K. Yoon, et al. Universals and cultural variation in turn-
taking in conversation. Proceedings of the National Academy of Sciences, 106(26):
10587, 2009.
G. Taylor, M. Quist, S. Furtwangler, and K. Knudsen. Toward a hybrid cultural cognitive
architecture. In CogSci Workshop on Culture and Cognition, Nashville, TN, Cognitive
Science Society, 2007.
L. ten Bosch, N. Oostdijk, and J. de Ruiter. Durational Aspects of Turn-Taking in Spontaneous Face-to-Face and Telephone Dialogues. In Sojka et al. [SKP04], pages 563–570, 2004.
M. Thiebaux, A. Marshall, S. Marsella, and M. Kallmann. SmartBody: Behavior Real-
ization for Embodied Conversational Agents. In Proceedings of Autonomous Agents
and Multi-Agent Systems (AAMAS), 2008.
D. Traum, A. Roque, A. Leuski, P. Georgiou, J. Gerten, B. Martinovski, S. Narayanan, S. Robinson, and A. Vaswani. Hassan: A Virtual Human for Tactical Questioning. In S. Keizer, H. Bunt, and T. Paek, editors, Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 75–78, Antwerp, Belgium, 2007.
D. Traum, S. Marsella, J. Gratch, J. Lee, and A. Hartholt. Multi-party, multi-issue, multi-
strategy negotiation for multi-modal virtual agents. In Intelligent Virtual Agents,
pages 117–130. Springer, 2008.
D. R. Traum, W. Swartout, S. Marsella, and J. Gratch. Fight, flight, or negotiate: Believable strategies for conversing under crisis. In Proceedings of the Intelligent Virtual Agents Conference (IVA), pages 52–64. Springer-Verlag Lecture Notes in Computer Science, September 2005.
B. Ulicny and D. Thalmann. Crowd simulation for interactive virtual environments and
VR training systems. Computer Animation and Simulation 2001: Proceedings of the
Eurographics Workshop in Manchester, 2001.
B. Ulicny and D. Thalmann. Towards interactive real-time crowd behavior simulation.
Computer Graphics Forum, 21(4):767–775, 2002.
O. M. Watson. Proxemic Behavior: A Cross-cultural Study. Mouton, 1970.
O. M. Watson and T. D. Graves. Quantitative research in proxemic behavior. American
Anthropologist, 68(4):971–985, August 1966.
R. M. Weisbrod. Looking behavior in a discussion group. Unpublished paper, Depart-
ment of Psychology, Cornell University, 1965.
E. Weitnauer, N. M. Thomas, F. Rabe, and S. Kopp. Intelligent agents living in social
virtual environments - bringing max into second life. In IVA, volume 5208 of Lecture
Notes in Computer Science, pages 552–553. Springer, 2008.
Q. Yu and D. Terzopoulos. A decision network framework for the behavioral animation
of virtual humans. Proceedings of the 2007 ACM SIGGRAPH/Eurographics sympo-
sium on Computer animation, pages 119–128, 2007.
Abstract
When we simulate a large number of virtual humans in virtual worlds we reach a point where it is no longer feasible to simulate all of them in full detail. When some of them never interact with the user, it is useful to distinguish between main virtual humans and background virtual humans. The main purpose of background virtual humans is to engage the user in the immersive environment: their chief objective is to be believable, allowing the user to suspend disbelief. To achieve this, background virtual humans must be able to perform various behaviors, with conversational behavior being one of the most commonly required.

In this dissertation we present a framework for believable simulation of conversational behavior for background virtual humans based on only a small number of parameters. It was developed over the course of four major iterations, based on computational models derived from a literature review and analysis of video corpus data. It takes into account the roles of proxemics, gaze, turn-taking and pause in the believability of conversational behavior, and how context such as culture and conversational task affects believability.

We report on a number of evaluations performed during the development of the simulation and show that the simulations appear believable to subjects. We describe the applications in which the simulation has been employed and show what role it can play in a larger framework covering multiple levels of detail.
Asset Metadata
Creator: Jan, Dusan (author)
Core Title: Virtual extras: conversational behavior simulation for background virtual humans
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/23/2012
Defense Date: 08/06/2012
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: behavior simulation, believability, conversational behavior, cultural differences, levels of detail, OAI-PMH Harvest, virtual humans, virtual worlds
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Traum, David (committee chair), Gratch, Jonathan (committee member), Koenig, Sven (committee member), Moore, Granville Alexander (committee member)
Creator Email: djan@usc.edu, dusan_jan@hotmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-90997
Unique identifier: UC11290301
Identifier: usctheses-c3-90997 (legacy record id)
Legacy Identifier: etd-JanDusan-1163.pdf
Dmrecord: 90997
Document Type: Dissertation
Rights: Jan, Dusan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA