BODY POSE ESTIMATION
AND GESTURE RECOGNITION
FOR HUMAN-COMPUTER INTERACTION SYSTEM
by
Chi-Wei Chu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2008
Copyright 2009 Chi-Wei Chu
Table of Contents
List Of Tables iv
List Of Figures v
Abstract vii
Chapter 1: Introduction 1
1.1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.0.1 Environmental Limitations . . . . . . . . . . . . 3
1.2.0.2 Feature extraction in IR images . . . . . . . . . . 4
1.2.0.3 3D Feature Reconstruction from Multiple Views . 5
1.2.0.4 Human Pose Estimation and Tracking . . . . . . 7
1.2.0.5 HCI integration . . . . . . . . . . . . . . . . . . . 8
1.3 Overview of the Approaches . . . . . . . . . . . . . . . . . . . . . 8
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Natural Gesture Analysis . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2: Related Works 14
2.1 Human Posture Tracking in 2D and 3D Space . . . . . . . . . . . 14
2.2 Tracking using Hidden Markov Model . . . . . . . . . . . . . . . . 15
2.3 Tracking using Particle Filtering . . . . . . . . . . . . . . . . . . . 17
2.4 Other Pose Estimation/Tracking methods . . . . . . . . . . . . . 19
Chapter 3: Feature Extraction from IR Images 21
3.1 Silhouette Segmentation . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 2D Axis Points Extraction . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Body Parts Segmentation . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 4: 3D Feature Reconstruction 27
4.1 Visual Hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Volumetric Approximation . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Shape Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 3D Axis Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Reconstruction from Volumetric Data . . . . . . . . . . . . 37
4.4.2 Reconstruction from 2D Body Axis . . . . . . . . . . . . . 45
Chapter 5: Pose and Gesture Recognition 46
5.1 Posture Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.1 Decomposing Postures . . . . . . . . . . . . . . . . . . . . 50
5.1.2 Selection of Atoms and Learning . . . . . . . . . . . . . . 53
5.2 Gesture Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.1 Parallel Hidden Markov Model. . . . . . . . . . . . . . . . 57
5.2.2 Factorial Hidden Markov Model . . . . . . . . . . . . . . . 59
5.3 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 6: Pose and Gesture Estimation 70
6.1 Articulated Rigid Body Model . . . . . . . . . . . . . . . . . . . 70
6.1.1 3D Axis Based Model Configuration . . . . . . . . . . . . 73
6.1.2 Model Scaling by 3D Height . . . . . . . . . . . . . . . . . 76
6.2 2D-3D Iterative Closest Point Tracking . . . . . . . . . . . . . . . 79
6.2.1 Iterative Closest Point (ICP) Method . . . . . . . . . . . . 79
6.2.2 2D-3D ICP . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2.3 3D Model Projection . . . . . . . . . . . . . . . . . . . . . 83
6.2.4 2D Iterative Closest Points . . . . . . . . . . . . . . . . . . 86
6.2.5 Convert 2D Transformation to 3D: . . . . . . . . . . . . . 89
6.3 Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.1 Observation Probability . . . . . . . . . . . . . . . . . . . 95
6.3.2 Proposal Functions . . . . . . . . . . . . . . . . . . . . . . 96
6.3.3 Particle Refinement . . . . . . . . . . . . . . . . . . . . . . 98
6.4 Rest State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 7: Natural Gesture Analysis 111
7.1 Motion Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.1.1 Conditional Random Field . . . . . . . . . . . . . . . . . . 113
7.1.2 Latent Dynamic Conditional Random Fields . . . . . . . . 114
7.1.3 Pre-computed Feature Vector . . . . . . . . . . . . . . . . 116
7.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 8: Conclusion 121
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
References 123
List Of Tables
5.1 Example of Matching Pursuit Estimation . . . . . . . . . . . . . . 49
5.2 Identification rate for the 6 gestures considered using the proposed
FHMM formulation. We show results on 5 people, where only per-
son 1 was used for training. . . . . . . . . . . . . . . . . . . . . . 67
5.3 Identification rate of 6 gestures over 5 people, using traditional
HMM formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Mean and standard deviation of joint errors, measured in centimeters . . 104
6.2 Average diameter of 3D marker clouds . . . . . . . . . . . . . . . 108
6.3 Mean and standard deviation of pointing errors, with 2D marker
offset, measured in degrees . . . . . . . . . . . . . . . . . . . . . . 108
7.1 Confusion matrix of CRF classification . . . . . . . . . . . . . . . 119
7.2 Confusion matrix of LDCRF classification . . . . . . . . . . . . . 120
List Of Figures
1.1 Virtual Reality Training Systems . . . . . . . . . . . . . . . . . . 2
1.2 Floor plan of the virtual reality theater, equipped with two frontal
cameras below the screen, one overhead camera on the ceiling and
one side camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 An example IR image. The image is blurry, and most of the textures
are washed out . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Example IR images and Background Segmentation . . . . . . . . 6
1.5 Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 The overall pipeline of the approach . . . . . . . . . . . . . . . . . 11
1.7 Example Tracking Result . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Edge-Based Shadow Removal . . . . . . . . . . . . . . . . . . . . 23
3.2 Scan Lines and Body Axis Points . . . . . . . . . . . . . . . . . . 24
3.3 Body Axis Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Silhouette and Polygonal Approximation . . . . . . . . . . . . . . 28
4.2 Polygon Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 3D Visual Hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 3D Voxel volumetric approximation . . . . . . . . . . . . . . . . . 31
4.5 Cylindrical control points. . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Shapes of the postures, 3D points are sampled points of the visual
hull. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 A captured human volume in Euclidean space (top) and its pose
invariant intrinsic space representation (bottom). . . . . . . . . . 38
4.8 Partitioning of the pose invariant volume (top), its tree structured
principal curves (middle), and project back into Euclidean space
(bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.9 3D Feature Construction . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 The resulting weight matrix M of the matching pursuit decomposition
of each pair of 30 postures. . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Singular values of M. . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Rank 1 approximation of M . . . . . . . . . . . . . . . . . . . . . 55
5.4 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Factorial Hidden Markov Model . . . . . . . . . . . . . . . . . . . 62
6.1 Articulated Body Model . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 (Left) Kinematic models for all frames. (Right) Joints for a nor-
malized kinematic model . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Example NSS Body Model Configuration . . . . . . . . . . . . . . 77
6.4 ICP Process, white lines in 2D models indicate closest point pairs 82
6.5 Back-project 2D rotation to 3D . . . . . . . . . . . . . . . . . . . 90
6.6 Example Tracking Sequences, One frame per Row. The tracking
can automatically recover from an error . . . . . . . . . . . . . . . 103
6.7 Manually mark joint locations. Red points represent 2D markers.
Yellow points represent 3D marker . . . . . . . . . . . . . . . . . 105
6.8 Body Joint Errors. Without pixel offset . . . . . . . . . . . . . . . 106
6.9 Pointing angle error, in degrees . . . . . . . . . . . . . . . . . . . 108
6.10 Poses that will likely result in tracking error . . . . . . . . . . . . 110
7.1 Sequential Conditional Random Field . . . . . . . . . . . . . . . . 113
7.2 Latent-Dynamic Conditional Random Field . . . . . . . . . . . . 115
Abstract
In this thesis we present an approach for a visual communication application
in a dark, theater-like interactive virtual simulation training environment. Our
system visually estimates and tracks the body position, orientation and limb
configuration of the user. The system uses a near-IR camera array to capture
images of the trainee from different angles in the dimly lit theater. Image
features such as silhouettes and intermediate silhouette body axis points are then
segmented and extracted from the image backgrounds. 3D body shape information,
such as 3D body skeleton points and visual hulls, can be reconstructed from these
2D features in multiple calibrated images.
For body pose estimation, we propose a particle-filtering based method that
fits an articulated body model to the observed image features. Currently we focus
on estimating the pose of the upper body. From the fitted articulated model we
can derive information used by our HCI system, such as the position on the screen
where the user is pointing. We use current graphics hardware to accelerate the
processing so the system is able to work in real-time. The system serves as
part of a multi-modal user-input device in the interactive simulation.
Chapter 1
Introduction
1.1 Overview
In this thesis, we present a visual human body pose tracking system for a virtual
training environment. Training humans for demanding tasks in a simulated
environment is becoming increasingly important, not only to save costs but also
to reduce risk when training for hazardous tasks. Traditionally, simulated training
has focused on physical tasks such as target shooting or piloting an aircraft.
However, advances in synthetic agent design and storytelling allow
us to construct environments that simulate scenarios providing training for
cognitive, decision-making tasks. A key issue then becomes the modalities by
which the human trainee communicates with the characters in the synthetic
environments. Speech is one natural modality, but visual communication is
also important for a seamless human-computer interaction (HCI) interface. Visual
communication can consist of explicit gestures, such as pointing, but also of more
(a) The virtual reality surrounding theater, showing the "Doctor without borders" scenario
(b) The Flatworld environment: a simulated war-torn room with a window view
Figure 1.1: Virtual Reality Training Systems
subtle "body language" communication based on body postures and facial
expressions. Our body tracking system provides such a communication interface,
coupled with other modalities such as speech or facial expressions.
We consider that a synthetic training environment must be immersive to be
effective. In our setting the trainee does not wear a head-mounted display;
instead, the user is positioned on a stage facing a large screen that displays a
3D-rendered virtual environment, such as a war-zone city street or a field hospital.
This is a bit like being in a movie hall, except that the characters respond to our
actions. Two examples of our environments are shown in Fig.1.1. The floor plan
of one of the environments, the virtual reality theater, is shown in Fig.1.2.
Figure 1.2: Floor plan of the virtual reality theater, equipped with two frontal cameras
below the screen, one overhead camera on the ceiling, and one side camera.
1.2 Issues
In designing the visual-sensing, gesture-based human-computer interaction system
for our environment, we encountered several issues that must be resolved. We
explain each issue and possible solutions as follows:
1.2.0.1 Environmental Limitations
The following limitations make visual sensing very challenging:
• The environment is very dark; even worse, the illumination fluctuates rapidly
as the scenes on the screen change.
• The trainees need not be stationary but can walk around in a limited area
for natural responses.
• The sensing system must be passive and not interfere with other processes
such as aural communications or the displays.
• A further restriction is that the trainees should not have to be prepared extensively,
such as by wearing special clothing with markers or other sensors (e.g. magnetic
sensors).
• Responsiveness is also an important factor in designing the system; it must
be able to operate in real-time to be useful as a user input channel.
1.2.0.2 Feature extraction in IR images
We solve the illumination difficulty of the environment by illuminating the scene
with infrared (IR) lights placed strategically around the stage and imaging it with
normal cameras with near-IR filters placed in front of their lenses. This allows
images to be acquired in the dark, flickering environment without interfering with
other functionalities. However, several issues arise with the use of IR images:
• The images captured in the near-IR spectrum lack most of the common
image features seen in a visible-spectrum image, such as color and texture,
and there is little interior detail visible. The only usable feature is the
pixel-blob silhouette segmented from the background by motion, as shown
in Fig.1.3.
• Even this pixel-blob segmentation is noisy. The image intensity is washed
out and the grayscale range is limited, so the observed user can easily become
indistinguishable from the background. The image is also grainy and noisy
at the pixel level, as shown in Fig.1.4.
Figure 1.3: An example IR image. The image is blurry, and most of the textures are
washed out
Some useful information can still be deduced from the silhouettes if we apply a
priori knowledge of the shape geometry of the human body.
1.2.0.3 3D Feature Reconstruction from Multiple Views
We base our method on using multiple cameras in the environment. The reasons
behind this decision are:
(a) Raw image from multiple cameras (b) Pixel blobs from silhouette Segmentation
Figure 1.4: Example IR images and Background Segmentation
• Using multiple cameras allows us to gather more usable information to
compensate for the poor quality and the lack of features of the images.
• Some information is lost during the camera perspective projection, such as
depth and occlusion. We use multiple cameras to partially compensate for
these deficiencies and to make acquisition of 3-D position and pose easier.
With multiple calibrated cameras, we can therefore reconstruct 3D information
from 2D images, such as 3D body shape or skeleton points. One issue here is
feature matching: given a feature in one image, find the corresponding feature in
another image, and reconstruct 3D information from the matching pair. Noise,
false-positive detections and missed detections of the image features all result
in inaccurate 3D feature reconstruction.
1.2.0.4 Human Pose Estimation and Tracking
Gesture-based HCI can roughly be divided into two categories: pose/gesture
recognition and pose/gesture estimation. Recognition deduces qualitative
information about the user's action, such as "He is walking" or "He is waving
his arm". However, since the trainee is interacting with virtual entities in the
virtual world, it is necessary to infer 3D geometry information from the body
motions, such as the virtual person the trainee is facing, or the object/place he is
pointing at. The estimation derives such quantitative information.
For pose estimation, we currently focus on estimating upper body poses,
as actions of the lower body are rarely relevant in virtual interactions. One of the
essential applications is pointing-direction estimation, as it directly specifies the
virtual entity the trainee is interacting with. It is necessary to fit a 3D body
model to detected or derived 2D or 3D features. However, the human body has
a large number of degrees of freedom (DOF), which poses several problems:
• The tracking must be generic enough to work on different users, without
re-training the system to work with them.
• Even with the use of multiple cameras, some ambiguities of the joint config-
urations are still unavoidable.
• The estimation of high-dimensional variables is very computationally expensive.
It is difficult to achieve the real-time performance needed for HCI.
1.2.0.5 HCI integration
Finally, the visual sensing we try to achieve is not a stand-alone component but
an integral module of an HCI system.
• The gesture system can incorporate both lower-level observations (images)
and information from the parent training system to recognize gestures. For
example, the user is more likely to interact with salient virtual entities than
unimportant ones.
1.3 Overview of the Approaches
To resolve the issues above and achieve visual sensing for HCI in the virtual training
environment, our approaches are organized as follows:
• 2D Image Feature Extraction: We segment the foreground silhouette
from the image using Gaussian pixel distributions. We then extract shape
information, such as the axis points, from the silhouettes. In combination
with multiple views, this information is sufficient for pose estimation
tasks.
• 3D Feature Reconstruction: Given the calibrated camera array and the
image features in each camera, we can reconstruct the 3D features. We can
construct a 3D visual hull and voxel volume data from the silhouettes. We can
also construct 3D body axis points from the body volume data and from the
2D body axes. This allows us to estimate the pose in the 3D world (Fig.1.5).
(a) IR Images (b) Silhouettes and 2D Arm Axis Points
(c) 3D Visual Hull and Arm Axis Points
Figure 1.5: Image Features
• Posture and Gesture Recognition: We decompose the 3D shape
descriptor into a linear combination of atom postures. We use a
discrete Hidden Markov Model and its variations to find the most likely
gesture given the posture sequence. Since the shape descriptor is scaled to size,
it is invariant to different users.
• Pose Estimation & Tracking: We define a 3D body articulated model
anduseaniterativeconvergencemethodcalled2D-3DICP(IterativeClosest
Point) method to fit the model to 2D and 3D feature data. To improve
the robustness of the 2D-3D ICP, we integrate it with Particle Filtering
method.Specialproposalfunctionsandsimplerandomwalkareusedtoguide
the fitting in the Particle Filter. By using multiple hypothesis method, the
Particle Filter can explore different solutions simultaneously. This results in
morerobustestimationsgiventhenoisyimagesandtheambiguitiesbrought
by camera projections.
The flow chart is shown in Fig.1.6.
1.4 Contributions
We have successfully developed a real-time body pose estimation and tracking
system. This system can track the body configurations of the user in a virtual
training environment with reasonable accuracy and performance. The system is
deployed to the ICT VR theater and Flatworld combat simulation environment.
It allows the user of the training system to interact with the simulated scenario
without using input devices. In an example scenario, a virtual terrorist will start
shooting at the user if the user stands in the open, and will cease the attack
if the user crouches. The user can then point at the position of the terrorist to
guide the friendly attack helicopter.
Figure 1.6: The overall pipeline of the approach
Figure 1.7: Example Tracking Result
1.5 Natural Gesture Analysis
An ongoing aspect of our research is analyzing the natural gestures of the user
when he/she is interacting with virtual people in the training environment.
Unlike semantic gestures, which convey meaning on their own, natural gestures
are performed unintentionally. We segment and classify the hand motions of the
user and transmit them to the HCI module for further contextual analysis.
The rest of the thesis is organized as follows. Chapter 2 gives a short survey of
various other works dealing with pose and gesture recognition and estimation.
Chapter 3 describes the extraction of image features from IR images. Chapter 4
discusses the reconstruction of 3D features from 2D images. Chapter 5 focuses
on the posture and gesture recognition system. Chapter 6 describes the pose
estimation system. The natural gesture analysis is presented in Chapter 7. Finally,
Chapter 8 concludes this thesis.
Chapter 2
Related Works
2.1 Human Posture Tracking in 2D and 3D Space
Various methods have been proposed for the estimation and tracking of human
body postures. Many approaches recognize the postures directly from 2D images,
as single camera image sequences are easier to acquire. Some of them try to fit
body models (3D or 2D) into single-view images [38] [10] [24] [47], or classify
postures by image features [19,49]. There are two main difficulties those methods
must overcome. One is the loss of depth information: it is difficult to tell whether
body parts are oriented toward or away from the camera. These systems
must either maintain both hypotheses or regularize the estimate with human body
kinematic constraints. The other problem comes from self-occlusion: body parts that
are not seen in the image cannot be estimated. Roughly one third of the degrees
of freedom of the human model are unobservable due to motion ambiguities and
self-occlusions.
To compensate for these ambiguities due to 2D acquisition, several approaches
rely on using multiple cameras. These approaches use an array of two or more
cameras or 3D body scanners to capture the human shape and motion from different
views. Some of these approaches extract 2D image features from each camera
and use these features to search for, or update the configuration of, a 3D body
model [28] [16] [26] [63]. Others introduce an intermediate step of reconstructing
the 3D shape of the human body. The characterization of the human pose is then done
by fitting a kinematic model to the 3D shape information [33] [15] [12] [43], or
by using a shape descriptor for classification [60]. Many of the methods above
make an implicit assumption that the image quality is stable and high enough to
extract useful image features. They utilize features such as motion flow, color and
the texture of skin and clothing to improve the classification or model-fitting accuracy.
Such features are not available in our environment.
2.2 Tracking using Hidden Markov Model
Identifying posture is only the first step toward recognizing gestures. A human
gesture can be modeled as a sequence of temporal transitions between postures. One
of the most commonly used tools to model such transitions is the Hidden Markov
Model. If the state space of the Hidden Markov Model is discrete and each
posture corresponds to one state, this allows the use of the Forward-Backward
Algorithm [48] to compute the likelihood of an observation and the Viterbi
Algorithm [56] to derive the most likely hidden state sequence given the observation
sequence. Since Yamato [62] used discrete HMMs to recognize image sequences of
six tennis strokes, (discrete) HMMs have been widely used by researchers to model
and recognize the temporal or spatial transitions of gestures. [46] combined an HMM
and a neural network to recognize hand gestures. Wilson [61] expanded the
traditional HMM to the Parametric HMM to recognize parameterized gestures. Deng [31]
used an HMM to segment the starting and ending points of gestures within
continuous gesture sequences.
However, in the case of tracking an articulated body model, the state is modeled
as a continuous space of degree-of-freedom values, and discrete HMM methods
are unsuitable for such problems. If the prior distribution of these states is
Gaussian, and the state transition and observation functions of the HMM
are linear functions with Gaussian noise, a Kalman Filter (KF) [32] can be used
for tracking. If the transition and observation functions are non-linear with
Gaussian noise, the Extended Kalman Filter (EKF) can be used
instead. However, the EKF approximates the non-linear
functions with their first-order partial derivatives (Jacobian), so it usually
performs poorly if the functions are highly non-linear. An articulated human body
model usually has 10 to 30 degrees of freedom, depending on the problem
being solved. This makes the probability distribution very high-dimensional and
highly non-linear.
2.3 Tracking using Particle Filtering
One method that has gained importance in recent years is the Particle Filter (PF),
also called the Sequential Monte Carlo method. Section 6.3 gives an introduction
to Particle Filter methods. A Particle Filter uses a set of data samples,
called particles, to represent the posterior distribution of the hidden states. Particle
Filters have been widely used in visual object tracking and robotics. The Particle Filter
works well in modeling uncertainties, as it can represent multi-modal distributions
or distributions without analytical forms. It maintains multiple hypotheses of the
current state configuration via the use of particles, and is thus more
robust than single-hypothesis methods such as the Kalman Filter or gradient descent.
Doucet et al. [17,18] give the mathematical derivations of the Particle Filter, proofs of
convergence, error bounds, and optimal choices of functions and parameters.
The generic Particle Filter method is robust and widely used: [8] uses the
CONDENSATION algorithm to track a 2D kinematic model in monocular video, and [23]
also used the CONDENSATION method to track hand gestures in 2D images. However,
the original Particle Filter method has drawbacks in real-time problems. Methods
such as CONDENSATION [27] require the number of particles to be exponential in the
number of degrees of freedom to robustly track human motion, which makes them
computationally expensive and inefficient. Various human model tracking methods
introduce modifications to the original Particle Filter to reduce the computational
complexity while attempting to increase the tracking accuracy. Bernier et al. [4]
combine a Particle Filter and Markov Network belief propagation to track 3D
articulated pose in real-time. The Markov Network propagates beliefs by sets of
weighted samples; the resulting system corresponds to a set of Particle Filters, one
for each limb. In [53], Schmidt et al. use a Kernel Particle Filter to avoid a large
number of particles. Their method also uses mean-shift to move particles to
high-weight areas, further reducing the particle number. [34] combines the Particle
Filter method and the Markov Chain Monte Carlo (MCMC) method to track a 28-DOF
human model. Each particle in their method is represented as a Markov Chain,
consisting of a set of samples itself, and explores the state space using the MCMC
method. Thus the method requires fewer particles. It also requires fewer iterations
than pure MCMC, as multiple chains are exploring simultaneously. In [58], Wang et al.
give a modular analysis of different alternative Particle Filter methods. The paper
surveys each method and experimentally evaluates their performance in tracking a 2D
articulated human model in monocular image sequences.
2.4 Other Pose Estimation/Tracking methods
In addition to the Particle Filter, several other approaches have been used
for the human pose tracking task. An alternative to the Particle Filter for
pose estimation is the Markov Chain Monte Carlo (MCMC) method. Instead
of propagating the particles as a time sequence, the MCMC method explores
the solution space with a Markov Chain. Mun-Wai [37] uses data-driven MCMC to
find the best match between images and an articulated human model using both
kinematic and appearance proposal functions. [7] combines the modeling of both
the image segmentation and the human pose estimation problem as a Markov Random
Field (MRF), and uses dynamic graph cuts to solve the MRF.
The methods discussed above use a probabilistic approach: they treat the estimation
of human pose from images as a probabilistic problem. Some other methods
use a deterministic approach; that is, the estimation from image to posture
is a fixed function. In [33], Kehl et al. use stochastic meta descent, a variant of
the gradient descent method. The articulated model is iteratively fit to volumetric
data reconstructed from multiple views. Demirdjian [15] uses the ICP (iterative
closest point) method to fit a 3D articulated model to 3D shape data reconstructed
from stereo images. At each iteration, the closest matches between model points
and data points are found, and the model is reconfigured to match the data points.
The deterministic methods, however, have a disadvantage in robustness. They
are prone to fitting to a local maximum and thus can fail to find globally optimal
solutions. And once tracking is lost, the subsequent image frames are likely to lose
track as well, unless an error recovery scheme is devised.
Chapter 3
Feature Extraction from IR Images
As we have shown in Fig.1.4, the only feature we can extract from the grayscale
infrared images is the silhouette pixel blob. However, by analyzing the shape
information in conjunction with the a priori knowledge of the human body shape,
we can still deduce geometry information from the pixel blobs.
3.1 Silhouette Segmentation
In order to extract the pixel blob from the image, we model the intensity of each
pixel in the images as a Gaussian distribution.
P_c ≈ N(μ_c, σ_c)
where c ∈ {r, g, b} for color images or c = {intensity} for greyscale images.
The mean μ and the variance σ of the Gaussian distribution are learned during
a background training phase prior to each HCI session. In the training phase, a
sequence of image frames without any foreground objects (users) is collected by
the system. The mean μ and variance (or standard deviation) σ of the intensity
of each pixel are computed from this training set.
After the training phase, each pixel of new incoming images is compared
against its corresponding distribution. The ratio of the difference to the
variance is computed as:
d_c = (P_c − μ_c) / σ_c
If the difference ratio d_c is larger than a certain variance threshold, the pixel is
classified as a foreground object pixel. To smooth the boundaries of the pixel blobs, a
series of dilations and erosions is applied to the silhouette image. Pixel blobs that
are too small are considered noise and are removed.
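As a concrete illustration, the following is a minimal sketch of this per-pixel Gaussian background model, assuming NumPy/OpenCV and greyscale frames; the threshold k, the morphology kernel and the minimum blob area are illustrative values, not the ones used in our system.

```python
import cv2
import numpy as np

def train_background(frames):
    """Fit a per-pixel Gaussian from background-only greyscale frames."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    mu = stack.mean(axis=0)
    sigma = stack.std(axis=0) + 1e-6        # avoid division by zero
    return mu, sigma

def segment_foreground(frame, mu, sigma, k=3.0, min_area=200):
    """Mark pixels whose deviation ratio d = (P - mu) / sigma exceeds k."""
    d = (frame.astype(np.float32) - mu) / sigma
    mask = (np.abs(d) > k).astype(np.uint8) * 255
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # erosion then dilation
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # smooth blob boundaries
    # drop connected components too small to be part of the user
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < min_area:
            mask[labels == i] = 0
    return mask
```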
This method, however, segments the shadows of the user as well as his/her
actual body. To effectively distinguish the shadows from the user, we diffuse the IR
lights when setting up the lighting conditions. This can be achieved by reflecting
the IR lights off diffuse materials such as the screen, instead of directly illuminating
the user with IR spot-lights. The shadows cast by the IR lighting thus have blurry
boundaries, and edge properties can be incorporated to eliminate shadow regions.
To remove the shadow:
1. Canny edge detection is applied to the input images. The distributions
of the edge pixels are trained the same way as the background pixels.
(a) Background Edges (b) Edges and Silhouettes
(c) Segmented Edges and Silhouettes (d) Shadow Removed
Figure 3.1: Edge-Based Shadow Removal
2. The edge pixels of new incoming images are segmented against the trained
background edges. The segmented edge pixels are dilated to close potential
gaps.
3. The silhouette image is scanned by horizontal, vertical and diagonal scan
lines. If the silhouette pixel is enclosed by two edge pixels on the scan line,
it is classified as an actual body pixel. Otherwise it is classified as a shadow
pixel and is discarded.
An example is shown in Fig.3.1.
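The scan-line enclosure test of step 3 can be sketched as follows; this is a simplified, hypothetical helper that handles only the horizontal direction and assumes binary masks for the silhouette and the segmented foreground edges.

```python
import numpy as np

def keep_edge_enclosed(silhouette, edges):
    """Keep silhouette pixels enclosed by edge pixels along each horizontal scan line.

    silhouette, edges: 2D binary masks. A full implementation would repeat this
    for the vertical and both diagonal scan directions and combine the results.
    """
    sil = silhouette.astype(bool)
    edg = edges.astype(bool)
    out = np.zeros_like(sil)
    for y in range(sil.shape[0]):
        xs = np.where(edg[y])[0]
        if xs.size < 2:
            continue                       # no enclosing edge pair on this line
        left, right = xs.min(), xs.max()
        row = sil[y].copy()
        row[:left] = False                 # pixels with no edge to their left
        row[right + 1:] = False            # pixels with no edge to their right
        out[y] = row
    return out
```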
(a) Vertical Scan (b) Horizontal Scan
(c) Diagonal Scan (d) Diagonal Scan
Figure 3.2: Scan Lines and Body Axis Points
3.2 2D Axis Points Extraction
The 2D geometry information we use is the symmetry axes of the silhouettes.
This enables us to segment the blob into body parts, such as the torso and
the arms. The 2D axis points and their 3D reconstructions are then used to guide
the fitting of an articulated body model. Several techniques exist to find
skeleton axis points, such as medial axis points [39] or generalized cylinders [64].
Here we scan the silhouettes with arrays of scan lines in four different directions
(horizontal, vertical and two diagonal directions) and intersect the scan lines with
the silhouettes. For each intersection line segment, the middle point is extracted
as an axis point. The line segment length is also stored as the
"diameter" of the corresponding axis point. Each scan direction results in a
set of scanned axis points. An example is shown in Fig.3.2. For each pair of
axis points, if their scan lines are adjacent and the intersection line segments
overlap, the two points are considered "neighbors". We can infer a body axis
graph from this neighborhood information.
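A minimal sketch of the horizontal scan-line pass, assuming a binary silhouette mask; the vertical and diagonal passes can be obtained by running the same routine on transposed or rotated masks.

```python
import numpy as np

def horizontal_axis_points(silhouette):
    """Return (x, y, diameter) axis points from horizontal scan lines.

    For each scan line, every run of silhouette pixels contributes its midpoint
    as an axis point and its length as the point's 'diameter'.
    """
    points = []
    sil = silhouette.astype(bool)
    for y in range(sil.shape[0]):
        row = sil[y]
        padded = np.concatenate(([False], row, [False]))
        diff = np.diff(padded.astype(np.int8))
        starts = np.where(diff == 1)[0]     # run begins at this column
        ends = np.where(diff == -1)[0]      # run ends just before this column
        for s, e in zip(starts, ends):
            points.append(((s + e - 1) / 2.0, float(y), float(e - s)))
    return points
```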
3.3 Body Parts Segmentation
Given our task of estimating arm pointing directions, we want to find the possible
arm segments in the body axis graph. Since the area in which the user can move
is limited, the projected images of the user's arms are restricted to certain
regions of the images, depending on the camera positions. The images of
the arms also have smaller diameters (they are thinner) than other body parts. Thus, we
select segments of the graph that lie in predefined image regions and have diameters
below a certain threshold.
The torso of the user should remain relatively upright in normal simulation
activities; thus we select axis points produced by scan lines that are horizontal with
respect to the floor plane. Since the body forms the central bulk of the pixel
blob, graph segments with scan diameters above the threshold are further selected
as the torso. The highest body axis points are selected as the head. Fig.3.3 illustrates
an example of arm and torso segmentation.
(a) Arm 2D Axis
(b) Torso 2D Axis
Figure 3.3: Body Axis Points
Chapter 4
3D Feature Reconstruction
Since we capture images from multiple calibrated cameras placed at different
angles around the stage, we can reconstruct 3D geometry information from 2D
features by incorporating multiple-view geometry. These include 3D visual hulls,
shape descriptors and 3D axis points, which can be used to guide the recognition
and estimation of 3D human poses.
4.1 Visual Hull
Given a set of 2D silhouette images of the human body from different angles, we can
approximate the 3D shape of the original object by reconstructing the visual hull.
We compute the polygonal approximation of the silhouettes and then construct
the 3D polyhedral approximation of the visual hull [36,42]. This method is fast
and allows us to achieve real-time reconstruction. The 3D blob shape is used as
a proposal function to estimate the floor position of the person in the pose tracking
described later.
Figure 4.1: Silhouette and Polygonal Approximation
The visual hull reconstruction procedure works as follows:
1. First, the segmented body silhouettes {S_m, m = 1...M} of the projected
images are approximated by 2D polygons (shown in Fig.4.1).
2. Assume we have a polygonal approximation of image S_m, and one of the 2D
polygon edges E_m^i of polygon S_m has two end points P_b and P_e:
(a) Back-project P_b and P_e as two 3D lines L_b, L_e, which intersect at the
camera focal point C_m. L_b, L_e and C_m form a 3D plane Z.
(b) Iterate through all other images S_n, n ≠ m. Back-project their 2D polygon
vertices and intersect the back-projected lines with Z. This forms a
3D planar polygon G_n on Z for each image S_n. Intersect and clip all
of the 3D planar polygons G_n and the 3D lines L_b, L_e. This results
in one or more disjoint 3D planar polygon patches (Fig.4.2).
3. Repeat Step 2 and iterate through all polygon edges in all images. The set
of 3D polygon patches forms the polyhedral approximation of the 3D body
shape.
Figure 4.2: Polygon Clipping
An example of the visual hull reconstruction is shown in Fig.4.3. The visual
hull may visually look like a single surface, but in fact it consists of a set of
independent, disjoint polygon patches; no neighboring information between polygons
is preserved.
Figure 4.3: 3D Visual Hull
4.2 Volumetric Approximation
While the visual hull represents the geometry of the body surface, the voxel-based
volume data captures the volume of the body. The implementation is
derived from the work of [52] for real-time volume capture; however, several other
approaches are readily available. The capture approach is a basic brute-force
method that checks each element of a voxel grid for inclusion in the body volume.
In our approach, we divide the capture space into a 3D voxel grid.
The camera calibration parameters allow us to pre-compute a look-up table
mapping each voxel to pixel locations in each camera. First, we iterate through all
foreground silhouette pixels of all cameras. For each pixel, the reference counts of
the corresponding voxels are incremented. If the count of a voxel exceeds a threshold
(usually equal to the number of cameras), the voxel is marked as part of the
body volume. One set of volume data is collected for each frame.
The voxel-based volumetric reconstruction provides both volume and surface
information, but the resolution is limited by the size of the voxel grid: a smaller
voxel size results in higher resolution but increases the construction time.
Figure 4.4: 3D Voxel volumetric approximation
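A simplified sketch of the occupancy test follows. The thesis pre-computes a voxel-to-pixel look-up table and iterates over silhouette pixels; this hedged version projects voxel centers directly with assumed 3x4 projection matrices, which illustrates the same idea.

```python
import numpy as np

def carve_voxels(silhouettes, projections, grid_points, threshold=None):
    """Mark voxels occupied when they project inside the silhouette of enough cameras.

    silhouettes: list of binary HxW masks, one per camera.
    projections: list of 3x4 camera projection matrices (world -> pixel).
    grid_points: (N, 3) array of voxel center coordinates.
    """
    if threshold is None:
        threshold = len(silhouettes)           # roughly the number of cameras
    homog = np.hstack([grid_points, np.ones((grid_points.shape[0], 1))])
    counts = np.zeros(grid_points.shape[0], dtype=np.int32)
    for mask, P in zip(silhouettes, projections):
        pix = (P @ homog.T).T                  # project all voxel centers
        uv = pix[:, :2] / pix[:, 2:3]
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        hit = np.zeros(grid_points.shape[0], dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0
        counts += hit.astype(np.int32)
    return counts >= threshold                 # boolean occupancy per voxel
```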
4.3 Shape Descriptor
The number of vertices of the polyhedral visual hull depends highly on the polygonal
approximation of the silhouette, and the vertices are often not uniformly
distributed on the surface. We therefore uniformly sample points within the polygon
triangles of the visual hull. We define the shape of an object as the set of sampled
surface 3D points P = {P_i}_{i=1...N}. We compute a bounding reference shape C_R that
bounds the visual hull and is centered at the centroid of the point cloud. In our
approach, we use a cylindrical shape, as depicted in Fig.4.5; however, other shapes
such as spheres could also be used. A set of uniformly sampled reference points
Q = {Q_j}_{j=1...M} on the reference cylinder is defined. A coordinate system is
defined for each reference point: it is centered on the point and tangent to the
reference cylinder. For each point P_i on the visual hull and reference point Q_j,
we compute the relative coordinate P_iQ_j. This relative coordinate is encoded in
a spherical coordinate system, that is, P_iQ_j = (r, θ, φ). The radius r is normalized
to [0, 1] with respect to the size of the cylinder. For each reference point
Q_j, we construct a K×L×P binned spherical distribution. Each bin (r_k, θ_l, φ_p)
stores the number of silhouette points N_j(k,l,p) projected onto that bin. The
histograms over all reference points are then summed:
N′(k,l,p) = Σ_j N_j(k,l,p)    (4.1)
The bin values are then normalized with respect to the largest value:
N(k,l,p) = N′(k,l,p) / max_{k,l,p} N′(k,l,p)    (4.2)
Figure 4.5: Cylindrical control points
The descriptor of a shape P, Desc(P), is represented as a vector of K×L×P
dimensions, recording the normalized value of the bins. The derived shape
descriptor is invariant to the scale of the visual hull, as the descriptor is normalized
by the size of the reference cylinder. It is also translation invariant, since the
reference cylinder is placed on the centroid. Rotating the posture (and thus the
visual hull) around the spinal axis is equivalent to a cyclic permutation of the reference
points around the cylinder, resulting in an unchanged global descriptor N(k,l,p).
Thus the rotation invariance of the shape descriptor is also guaranteed. We also
leverage another key property of this shape descriptor: additivity. That is, the
descriptor of a composite posture is approximately the additive combination of
the composed sub-postures. Assume we are given a set of fixed reference points Q and
two sets of surface points S_1 and S_2. Since the unnormalized descriptor records
the number of points lying in each bin, the unnormalized descriptor of the
union of the two point sets satisfies:
Desc(S_1 ∪ S_2) = Desc(S_1) + Desc(S_2)    (4.3)
This summation property cannot be applied to the posture descriptor directly, because
the sample point set of the composite posture is not the union of the two sets of
elementary postures, as there are overlapping parts such as the torso. Assume we
have two posture shapes P_1 and P_2, and the posture P_12 which is the combination
of P_1 and P_2, as shown in Fig.4.6.
(a) P_1 (b) P_2 (c) P_12
Figure 4.6: Shapes of the postures; the 3D points are sampled points of the visual hull.
The shape descriptors of these three postures satisfy the following relationships:
Desc(P_1) = Desc(arm(P_1)) + Desc(torso(P_1))    (4.4)
Desc(P_2) = Desc(arm(P_2)) + Desc(torso(P_2))    (4.5)
Desc(P_12) = Desc(left arm(P_12)) + Desc(right arm(P_12)) + Desc(torso(P_12))    (4.6)
Since the descriptor is scale and rotation invariant, each body part will have
similar descriptor values:
Desc(torso(P_12)) ≅ Desc(torso(P_1)) ≅ Desc(torso(P_2))    (4.7)
Desc(P_12) ≅ Desc(arm(P_1)) + Desc(arm(P_2)) + Desc(torso(P_12))
          ≅ Desc(P_1) + Desc(P_2) − Desc(torso(P_12))    (4.8)
However, the descriptor is a global shape descriptor and does not separate between
different body parts. We use a "resting posture" representing the "stand still"
posture, and use it to compensate for the overlapping torso part:
Desc(P_12) ≅ Desc(P_1) + Desc(P_2) − Desc(torso(P_12))    (4.9)
Each posture descriptor is implicitly normalized with respect to the number of
sample points, so the descriptor values will not be biased by the sampling rate. Thus
the relationship between the elementary and composite descriptors is more accurately
represented by a "bin-occupancy" operator rather than a summation. This property
will be exploited to decompose complex postures into a set of simple basic postures.
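A minimal sketch of the binned spherical descriptor of Eqs. (4.1)-(4.2), assuming NumPy arrays of sampled surface points and cylinder reference points; for brevity it bins relative coordinates in a single global frame rather than the per-reference-point tangent frames described above, and the bin resolution is illustrative.

```python
import numpy as np

def shape_descriptor(points, ref_points, cyl_radius, bins=(4, 8, 8)):
    """Binned spherical histogram summed over cylinder reference points.

    points:     (N, 3) surface points sampled from the visual hull.
    ref_points: (M, 3) reference points sampled on the bounding cylinder.
    cyl_radius: scale used to normalize the radial coordinate to [0, 1].
    bins:       (K, L, P) resolution in (r, theta, phi).
    """
    hist = np.zeros(bins, dtype=np.float64)
    for q in ref_points:
        rel = points - q                                   # relative coordinates P_i Q_j
        r = np.linalg.norm(rel, axis=1)
        theta = np.arccos(np.clip(rel[:, 2] / np.maximum(r, 1e-9), -1, 1))
        phi = np.arctan2(rel[:, 1], rel[:, 0]) + np.pi     # map to [0, 2*pi]
        r_n = np.clip(r / cyl_radius, 0.0, 1.0)
        h, _ = np.histogramdd(                             # counts N_j(k, l, p)
            np.column_stack([r_n, theta, phi]),
            bins=bins,
            range=[(0, 1), (0, np.pi), (0, 2 * np.pi)],
        )
        hist += h                                          # sum over reference points, Eq. (4.1)
    return hist / hist.max()                               # normalize as in Eq. (4.2)
```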
4.4 3D Axis Points
We experimented with two methods of 3D body axis reconstruction. The first one uses
3D body volumetric data; the second one uses 2D body axes from multiple images.
4.4.1 Reconstruction from Volumetric Data
We use the Nonlinear Spherical Shells (NSS) approach for extracting 3D body axis
points from a Euclidean-space volume of points. For NSS, we assume that the
nonlinearity of rigid-body kinematic motion is introduced by rotations about the joint
axes. By removing these joint nonlinearities, we can trivially extract skeleton
curves.
Several works on manifold learning techniques have produced methods capable
of uncovering nonlinear structure in spatial data. These techniques include
Isomap [29], Kernel PCA [3], and Locally Linear Embedding [51]. Isomap works
by building geodesic distances between data point pairs on an underlying spatial
manifold. These distances are used to perform a nonlinear PCA-like embedding into
an intrinsic space, a subspace of the original data containing the underlying spatial
manifold. Isomap, in particular, has been demonstrated to extract meaningful
nonlinear representations for high-dimensional data such as images of handwritten
digits, natural hand movements, and a pose-varying human head.
Figure 4.7: A captured human volume in Euclidean space (top) and its pose-invariant
intrinsic space representation (bottom).
The NSS procedure works in three main steps:
1. removing pose-dependent nonlinearities from the volume by transforming
the volume into an intrinsic space using Isomap;
2. dividing and clustering the pose-independent volume such that principal
curves are found in intrinsic space;
3. projecting the points defining the intrinsic-space principal curves into the original
Euclidean space to produce a skeleton curve for the volume.
Isomap is applied in the first step of the NSS procedure to remove pose
nonlinearities from the set of points comprising the captured human in Euclidean
space. We use the implementation provided by the authors of Isomap (available at
http://isomap.stanford.edu/). This implementation is applied directly to the volume
data. Isomap requires the user to specify only the number of dimensions of
the intrinsic space and how to construct local neighborhoods for each data point.
Because dimension reduction is not our aim, the intrinsic space is set to have 3
dimensions. Each point determines other points within its local neighborhood
using k-nearest neighbors or an epsilon sphere with a chosen radius.
The application of Isomap transforms the volume points into a pose-independent
arrangement in the intrinsic space. The pose-independent arrangement is similar
to a Da Vinci pose in 3 dimensions (Fig.4.7). Isomap can produce the Da Vinci
point arrangement for any point volume with distinguishable limbs.
The next step in the NSS procedure is processing the intrinsic-space volume for
principal curves. The definition of principal curves can be found in [22] or [2]:
self-consistent smooth curves that pass through the middle of a d-dimensional data
cloud, or nonlinear principal components. While smoothness is not our primary
concern, we are interested in placing a curve through the middle of our Euclidean-space
volume. Depending on the posture of the human, this task can be difficult in
Euclidean space. However, the pose-invariant volume provided by Isomap makes
the extraction of principal curves simple, due to properties of the intrinsic-space
volume. Isomap provides an intrinsic-space volume that is mean-centered at the
origin and has limb points that extend away from the origin.
Points on the principal curves in intrinsic space can be found by the following
sub-procedure (Fig.4.8); a minimal code sketch follows the list:
1. partitioning the intrinsic-space volume points into concentric spherical shells;
2. clustering the points in each partition;
3. averaging the points of each cluster to produce a principal curve point;
4. linking principal curve points with overlapping clusters in adjacent spherical
shells.
Figure 4.8: Partitioning of the pose-invariant volume (top), its tree-structured principal
curves (middle), and the projection back into Euclidean space (bottom).
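A minimal sketch of steps 1-3, using scikit-learn's Isomap in place of the Stanford implementation and DBSCAN in place of the sweep-and-prune clustering; the shell count and clustering parameters are illustrative, and the linking and refinement steps are omitted.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cluster import DBSCAN

def principal_curve_points(volume_points, n_shells=15, eps=0.1):
    """Embed the volume with Isomap, slice it into concentric spherical shells,
    and return one principal-curve point per cluster in each shell."""
    # Step 1 of NSS: remove pose nonlinearities with a 3D Isomap embedding
    embedded = Isomap(n_neighbors=10, n_components=3).fit_transform(volume_points)
    embedded -= embedded.mean(axis=0)           # mean-center the intrinsic volume

    radii = np.linalg.norm(embedded, axis=1)
    edges = np.linspace(0.0, radii.max() + 1e-9, n_shells + 1)
    curve_points = []                            # (intrinsic-space point, shell index)
    for s in range(n_shells):
        in_shell = (radii >= edges[s]) & (radii < edges[s + 1])
        if not in_shell.any():
            continue
        shell = embedded[in_shell]
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(shell)
        for lab in set(labels) - {-1}:           # -1 marks noise points
            curve_points.append((shell[labels == lab].mean(axis=0), s))
    return curve_points
```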
Clustering used for each partition was developed from the one-dimensional
sweep-and-prune technique, described by [30], for finding clusters bounded by
axis-aligned boxes. This clustering method requires specification of a separating
distance threshold for each axis rather than the expected number of clusters.
The result from the principal curves procedure is a set of points defining the
principal curves linked in a hierarchical tree-structure. These include three types
of indicator nodes: a root node located at the mean of the volume, branching
nodes that separate into articulations, and leaf nodes at terminal points of the
body.
The final step in the NSS procedure projects the intrinsic-space principal curve
points onto a skeleton curve in the original Euclidean space. We use Shepard's
interpolation [54] to map principal curve points onto the Euclidean-space volume,
producing skeleton curve points. The skeleton curve is formed by reapplying the
tree-structured linkages of the intrinsic-space principal curves to the skeleton curve
points.
The skeleton curve found by the NSS procedure will be indicative of the
underlying spatial structure of the Euclidean-space volume, but may contain a few
undesirable artifacts. We handle these artifacts using a skeleton curve refinement
procedure. The refinement procedure first eliminates noise branches in the skele-
ton curve that typically occur in areas of small articulation, such as the hands and
feet. Noise branches are detected as branches with depth under some threshold.
A noise branch is eliminated through merging its skeleton curve points with a
non-noise branch. The refinement procedure then eliminates noise for the root
of the skeleton curve. Shell partitions around the mean of the body volume will
be encompassed by the volume (i.e., contain a single cluster spread across the
shell). The skeleton curve points for such partitions will be roughly located near
the volume mean. These skeleton curve points are merged to yield a new root
to the skeleton curve. The result is a skeleton curve having a root and two or
more immediate descendants. The minor variations in the topology of the skele-
ton curve are then eliminated by merging adjacent branching nodes. These are
two skeleton points on adjacent spherical shells with adjacent clusters that both
introduce a branching of the skeleton curve. The branches at these nodes are
assumed to represent the same branching node. Thus, the two skeleton points are
merged into a single branching node.
This method, however, relies on accurate reconstruction of 3D volumetric data.
In the presence of noisy volume data, the 2D axis reconstruction is more suitable.
(a) 3D Visual Hull
(b) Arm Axis Points
Figure 4.9: 3D Feature Construction
4.4.2 Reconstruction from 2D Body Axis
Another way to reconstruct the 3D body axes is to use the 2D body axes from Section
3.2. After we extract the 2D body axis points of the arms, torso and head, their 3D
counterparts can also be constructed. Assume we have two 2D axis segments, S_i
and S_j, from two different images I_i and I_j. For each point P_ik in S_i, we compute the
epipolar line L_j^ik that maps from P_ik to I_j. Let point P_j^ik be the intersection
point of L_j^ik and the edges of S_j. We can reconstruct a 3D point from P_ik and P_j^ik,
given the calibration information of the two cameras. Special care must be taken
when any pair of two cameras and a body axis segment are co-planar: in this
case the epipolar lines of the points from the first image align with the body axis
segment in the second image, and the resulting intersection points will be incorrect.
Both 2D silhouette information and 3D shape information can be used to estimate
3D body pose configurations.
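A hedged two-view sketch of this reconstruction using OpenCV, assuming the fundamental matrix and the projection matrices are available from calibration; the axis-segment representation and parameter names are illustrative, not those of our system.

```python
import cv2
import numpy as np

def reconstruct_axis_points(axis_i, seg_j, F_ij, P_i, P_j):
    """Triangulate 3D axis points from 2D axis points in image i and a 2D axis segment in image j.

    axis_i: (N, 2) axis points in image i.
    seg_j:  (p0, p1) endpoints of the corresponding 2D axis segment in image j.
    F_ij:   fundamental matrix mapping points of image i to epipolar lines in image j.
    P_i, P_j: 3x4 projection matrices of the two calibrated cameras.
    """
    pts = axis_i.reshape(-1, 1, 2).astype(np.float64)
    lines = cv2.computeCorrespondEpilines(pts, 1, F_ij).reshape(-1, 3)  # a*x + b*y + c = 0
    p0, p1 = np.asarray(seg_j, dtype=np.float64)
    d = p1 - p0
    matches_i, matches_j = [], []
    for (a, b, c), pi in zip(lines, axis_i):
        denom = a * d[0] + b * d[1]
        if abs(denom) < 1e-9:          # epipolar line parallel to the segment: degenerate case
            continue
        t = -(a * p0[0] + b * p0[1] + c) / denom
        if 0.0 <= t <= 1.0:            # intersection lies on the segment
            matches_i.append(pi)
            matches_j.append(p0 + t * d)
    if not matches_i:
        return np.empty((0, 3))
    X = cv2.triangulatePoints(P_i, P_j,
                              np.asarray(matches_i, dtype=np.float64).T,
                              np.asarray(matches_j, dtype=np.float64).T)
    return (X[:3] / X[3]).T            # homogeneous -> Euclidean 3D points
```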
Chapter 5
Pose and Gesture Recognition
In many HCI systems, it is sometimes sufficient to characterize the current posture and
gesture using a certain pre-defined posture dictionary. But recognizing arbitrary
human body postures is a challenging task, as it has to take into account the
variability across people in executing the same posture.
5.1 Posture Recognition
One way to recognize the posture is to use the shape descriptor directly with data
classifiers such as support vector machines (SVMs) to recognize different postures.
We collected training data for all postures to be recognized, and trained an SVM
classifier for each pair of them. New input posture descriptors are fed to those
SVMs and the best-fitting posture is selected. The advantage of using SVM
classifiers is the high tolerance to noise and the recognition accuracy; however, this
method requires training the SVM on every posture to be accounted for. Using an
SVM for each pair of postures results in a large number of SVMs, although a
hierarchical classification can be used to reduce the number of SVMs. The additivity
property of the shape descriptor suggests that the recognition of complex body
postures can be achieved by recognizing a subset of elementary postures called
atoms. Thus, we propose a method for decomposing an arbitrary input posture as
a weighted sum of atoms. Such a decomposition can be used to represent and
recognize a large dictionary of complex postures from a small set of atoms.
A common representation of signals relies on the adaptive approximation technique.
Such approaches seek to find the representation of a function f as a weighted
sum of elements from an overcomplete dictionary. Given a signal f and a redundant
dictionary D as a collection of signals D = {g_γ}_{γ∈Γ}, these techniques seek
to decompose the original signal f as a linear combination:
f ≅ Σ_{γ∈Γ} α_γ g_γ,   α_γ ∈ ℝ    (5.1)
The optimal approximation [α_γ] is the one that results in the weighted sum
that most closely resembles the original function. Various methods have been proposed
to find the optimal decomposition, such as the method of frames [14], best
orthogonal basis [13] and basis pursuit [11]. Each method places different
constraints on the [α_γ] vector, such as minimizing its L1 or L2 norm, and solves for the
weights accordingly. The original function can thus be represented by a series of
weight parameters.
However, we want to use such an approximation for feature selection
instead of data compression. We intend each posture f to be approximated
by as few elements of the dictionary as possible, with each element closely
matching the local properties of f. Finding the optimal approximation over a
redundant dictionary was proved to be an NP-complete problem [45]. However, the
matching pursuit (MP) algorithm proposed in [40] avoids such complexity. The
MP algorithm assumes that the input signal f and the basic elements, called atoms, in
the dictionary D = {A_i}_{i=1...M} are all in a Hilbert space. It also assumes that
all atoms are normalized to unit length. The MP algorithm is iterative; starting
with R_0 = f, at the n-th iteration:
1. Choose the atom that maximizes the absolute value of the scalar product
with the previous residue:
A_n = argmax_{A_i ∈ D} |⟨R_{n−1}, A_i⟩|    (5.2)
which is equivalent to minimizing the magnitude of the current residue R_n.
⟨·,·⟩ is the scalar product operator.
2. Compute the residue R_n of the current iteration:
R_n = R_{n−1} − ⟨R_{n−1}, A_n⟩ A_n    (5.3)
Table 5.1: Example of Matching Pursuit Estimation
The algorithm continues until either the magnitude of the residue is small
enough, or a certain number of iterations N have been executed. The approximation
parameter is
α_n = ⟨R_{n−1}, A_n⟩    (5.4)
and the function f can be approximated as
f ≅ Σ_{n=1}^{N} ⟨R_{n−1}, A_n⟩ A_n + R_N    (5.5)
The main advantage of the MP algorithm compared to other methods is its efficient
computation: instead of solving a global optimization, MP uses a greedy
method at each step, choosing the element that most reduces the
residue.
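A minimal sketch of the basic MP iteration (Eqs. 5.2-5.5), assuming the signal and the atoms are unit-norm NumPy vectors; the stopping criteria are illustrative.

```python
import numpy as np

def matching_pursuit(f, atoms, n_iter=5, tol=1e-3):
    """Greedy matching pursuit: decompose f over unit-norm atoms (rows of `atoms`)."""
    residue = f.astype(np.float64).copy()
    chosen, weights = [], []
    for _ in range(n_iter):
        scores = atoms @ residue                    # scalar products <R_{n-1}, A_i>
        i = int(np.argmax(np.abs(scores)))          # Eq. (5.2)
        alpha = scores[i]                           # Eq. (5.4)
        residue = residue - alpha * atoms[i]        # Eq. (5.3)
        chosen.append(i)
        weights.append(alpha)
        if np.linalg.norm(residue) < tol:
            break
    return chosen, weights, residue                 # f ~= sum alpha_n A_n + R_N, Eq. (5.5)
```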
5.1.1 Decomposing Postures
Posture data, however, has one major difference compared to the atoms used in signal
or image decomposition: all the posture atoms have a large overlapping part at
the torso, legs and head. The densities of the bins around the torso are usually
much higher than the densities corresponding to the arms and hands. This makes the
original matching pursuit unstable for use in the posture decomposition process.
Indeed, after the first iterations, bins in the torso part of the residue will have large
negative values due to repeated subtraction, and subsequent iterations will then
focus on compensating the negative bins instead of trying to fit the actual arm
and hand features as depicted in [].
We propose a modified matching pursuit algorithm suitable for decomposing
a posture f into a set of atoms. Assume we have a dictionary of postures
D = {A_i}_{i=1...M}, where the first atom A_0 is the resting posture, i.e. the common
denominator of all postures. The input posture f and the atoms are normalized
to unit length. The modified MP decomposition then estimates the residues R_n:
1. Start with:
R_0 = f − ⟨f, A_0⟩ A_0    (5.6)
2. At the n-th iteration, choose the atom A_n such that
A_n = argmax_{A_i ∈ D} overlap(R_{n−1}, A_i)    (5.7)
3. Compute the residue R_n:
R_n = overlap_diff(R_{n−1}, A_n)    (5.8)
The overlap(f,g) function returns the number of non-empty bins where f and
g overlap. At each iteration, the algorithm selects the atom that has the
most overlapping bins with the previous residue. The overlap_diff(f,g) function
removes from f the bins where f and g overlap.
By removing the resting posture A_0 before applying matching pursuit, we
prevent other atoms from absorbing the torso values, which are considered
non-features since they do not characterize the variations among atoms. This keeps
the weights of the atoms balanced. We also operate only on the overlapping
property instead of the bins' density. This ensures that any noise in the torso
part that was not subtracted in the first step will not interfere with the atom
selection process in subsequent iterations, as it would in the original MP algorithm.
After N iterations, we compute the weight of each atom as follows:
α_0 = ⟨f, A_0⟩    (5.9)
α_n = ⟨R_{n−1}, A_n⟩,   n = 1...N    (5.10)
This results in a weight vector {α_i}_{i=0...N}, whose dimension is far smaller than that of the shape descriptor.
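The following is a minimal sketch of the modified decomposition just described, assuming the shape descriptors are non-negative histogram vectors and atoms[0] is the resting posture A_0; the bin-emptiness threshold eps is an illustrative assumption.

import numpy as np

def overlap(f, g, eps=1e-6):
    """Number of bins that are non-empty in both f and g."""
    return int(np.sum((f > eps) & (g > eps)))

def overlap_diff(f, g, eps=1e-6):
    """Remove from f the bins where f and g overlap (Eq. 5.8)."""
    out = f.copy()
    out[(f > eps) & (g > eps)] = 0.0
    return out

def decompose_posture(f, atoms, n_iter=2):
    """atoms: list of unit-norm descriptors; atoms[0] is the resting posture A_0."""
    f = np.asarray(f, dtype=float)
    weights = [float(f @ atoms[0])]                 # Eq. (5.9)
    residue = f - weights[0] * atoms[0]             # Eq. (5.6)
    selected = [0]
    for _ in range(n_iter):
        # Eq. (5.7): pick the atom with the most overlapping bins with the residue
        n = max(range(1, len(atoms)), key=lambda i: overlap(residue, atoms[i]))
        weights.append(float(residue @ atoms[n]))   # Eq. (5.10)
        selected.append(n)
        residue = overlap_diff(residue, atoms[n])
    return selected, weights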
After estimating the weights of the decomposition of the posture f into the set of selected atoms A_i, it is straightforward to recognize the composition of the posture. First, a posture threshold is applied to the weights to eliminate atoms decomposed from noise. Then the two atom postures other than A_0 with the largest weights are selected and constitute the primary and secondary atom postures, ordered by their lexical precedence. Note that the resting posture A_0 is always present and has the largest weight, as it absorbs the bins at the torso part. It is possible that after thresholding, none or only one atom is left. Such a posture is classified as the resting posture or as the corresponding single atom, respectively.
Frequently, the posture f corresponds to multiple instances of the same atom. For example, the posture representing two arms up in a symmetric configuration corresponds to twice the contribution of the atom representing one arm. The proposed matching pursuit algorithm identifies such situations automatically by analyzing the estimated weights. By the additive property, the densities of the arm part of symmetric postures are about twice the densities of the corresponding one-arm posture, and these higher densities are reflected in the values of the weights, which are computed from the scalar product of density vectors. To solve the multiple-instance problem, we collect training data for each one-atom elementary posture, decompose it with the corresponding atom, and record the average weight α'_i. After decomposing an arbitrary posture, we compare the weight of each atom, α_i, to the corresponding α'_i. If the ratio of the two weights exceeds a certain instance threshold, that atom is considered to be of multiple instances and the primary and secondary atoms are marked to be the same atom.
5.1.2 Selection of Atoms and Learning
An essential element in the posture decomposition is the selection of atom postures. The atoms in the posture dictionary must be discriminative; otherwise similar atoms will compete with each other in the matching pursuit process, resulting in low weight values distributed over multiple atoms. The choice of the histogram bin resolution also affects the discriminative power of the atoms.

Figure 5.1: The resulting weight matrix M of matching pursuit decomposition of each pair of 30 postures.
To select the most distinctive atoms, we collected 30 different arbitrary postures. For each pair of postures P_i and P_j, we ran the matching pursuit algorithm using P_j as the atom to decompose P_i, and recorded the resulting weights in a 30×30 symmetric matrix M, shown in Fig.5.1. This matrix is then decomposed by Singular Value Decomposition (SVD) and its lower-rank approximation is computed. From the lower-ranked matrix we selected the five atoms corresponding to the largest eigenvalues; these allow us to extract the corresponding elementary postures. These atoms, depicted in Fig.5.3, serve as atoms in the posture decomposition process. This selected set of postures generates a dictionary of a total of C(5,2)+5 = 15 recognizable composite and elementary postures.
Figure 5.2: Singular values of M.
Figure 5.3: Rank 1 approximation of M
5.2 Gesture Recognition
The posture recognition method presented in this chapter provides an efficient description for gesture analysis. It parses continuous variations of the postures into occurrences of the elementary postures or atoms available in the dictionary. In this section we present a formulation of a Hidden Markov Model (HMM) relying on the primary/secondary decomposition of arbitrary postures for gesture recognition.
The Hidden Markov Model assumes that the state of the system q_t at time t is one of a set of states S = {S_1, S_2, ..., S_N}. It also defines a set of observation symbols O = {O_1, O_2, ..., O_M}; a state transition probability matrix A = {a_ij}, a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N; a state-observation probability matrix B = {b_jk}, b_jk = P(O_k | S_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M; and finally the probabilities of the initial states: π = {π_i}, π_i = P(q_1 = S_i), 1 ≤ i ≤ N. An example is shown in Fig.5.4.
Given the elements above, an HMM is often represented as (A, B, π). HMMs have been widely used by researchers to model and recognize the temporal or spatial transitions of gestures. In such gesture recognition methods, each gesture is defined as an HMM with a different state transition matrix A, and each state is defined as a posture or associated with a motion profile. Those methods seek to find the most probable model among the available gestures that describes the observed posture sequence. These approaches, however, require an extremely large state space as the number of gestures to be modeled increases. To address this limitation, we propose formulations of HMMs based on atoms instead of complete postures.

Figure 5.4: Hidden Markov Model
5.2.1 Parallel Hidden Markov Model
The first alternative HMM we use is the Parallel Hidden Markov Model. That is, the state space is defined by the atoms, S = {A_0, A_1, ..., A_N}, and the set of observations corresponds to the decomposition of each posture into a primary and a secondary atom: O = {(A_i^p, A_j^s)}, 0 ≤ i, j ≤ N. The set of observations represents all possible decomposed pairs of primary/secondary atoms. The transition matrix A and the observation matrix B are also defined based on atoms: A = {a_ij}, a_ij = P(q_{t+1} = A_j | q_t = A_i), 0 ≤ i, j ≤ N, and B = {b_jk}, b_jk = P(O_k | A_j), 0 ≤ j ≤ N, 0 ≤ k ≤ M. In our framework we assume the initial state is always the resting posture: π = {π_i}, π_0 = 1, π_i = 0 for 1 ≤ i ≤ N. However, this practical assumption does not reduce the scope of the proposed approach, since we can easily detect the resting posture.
We assume that the transitions and observations of the primary/secondary atoms in a gesture are independent. Let G be the dictionary of all pre-defined gestures. Then each gesture in G consists of two HMMs instead of one, one for the primary atom and one for the secondary atom, each with its own transition matrix. The considered HMM is then defined by:

G = \{m_i\}, \quad m_i = (m_i^p, m_i^s), \quad i = 1...\|G\|

m_i^p = (A_i^p, B, \pi), \quad m_i^s = (A_i^s, B, \pi)

where A_i^p and A_i^s are the primary and secondary transition matrices of gesture i.
The input sequence of T postures is decomposed into a sequence of primary/secondary atom compositions:

d = (v_1, v_2, ..., v_T), \quad \text{where } v_t = (A_t^p, A_t^s)

d^p = (A_1^p, A_2^p, ..., A_T^p), \quad d^s = (A_1^s, A_2^s, ..., A_T^s)
For every input vector d, we want to find the most probable gesture in the dictionary by solving:

m = \arg\max_{m_i \in G} p(m_i | d)    (5.11)

Assuming that the prior probabilities of all gestures are equal, the problem becomes finding the gesture model that has the highest likelihood for a given observation:

m = \arg\max_{m_i \in G} p(d | m_i)    (5.12)
And the likelihood is defined as:

p(d | m_i) = p(d^p | m_i^p) \times p(d^s | m_i^s)    (5.13)
The probability can be computed by the forward algorithm. By parsing gestures into their respective atoms, we reduce the original HMM model with a state space of possible size O(N^2) into two HMM models, each with a state space of size O(N).
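A minimal sketch of this scoring scheme is shown below: each gesture holds one transition matrix per chain, the two observed atom streams are scored independently with the standard forward algorithm, and the two likelihoods are multiplied as in Eq. (5.13). The container layout and the symbol-indexed emission matrices are assumptions (a log-domain or scaled forward pass would be needed for long sequences).

import numpy as np

def forward_likelihood(obs, A, B, pi):
    """Standard forward algorithm; obs is a sequence of observation-symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

def recognize_gesture(d_primary, d_secondary, gestures):
    """gestures: dict name -> ((A_p, B_p, pi_p), (A_s, B_s, pi_s)). Eqs. (5.11)-(5.13)."""
    def score(model):
        (A_p, B_p, pi_p), (A_s, B_s, pi_s) = model
        return forward_likelihood(d_primary, A_p, B_p, pi_p) * \
               forward_likelihood(d_secondary, A_s, B_s, pi_s)
    return max(gestures, key=lambda name: score(gestures[name]))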
5.2.2 Factorial Hidden Markov Model
Another alternative HMM, the Factorial Hidden Markov Model (FHMM), also provides a nice solution to the dictionary complexity problem, since every posture can be decomposed into a linear combination of atoms. The state of such a dynamic system at time t can be represented by M state variables S_t = (S_t^1, S_t^2, ..., S_t^M) instead of only one variable. Each variable S_t^m can take on K_m discrete values. For simplicity we assume K_m = K for all m, that is, S_t^m ∈ {a_1^m, a_2^m, ..., a_K^m}. Each variable has its own prior probabilities for the different values:

\pi^m = \{\pi_k^m\}, \quad \pi_k^m = P(S_1^m = a_k^m), \quad 1 \leq k \leq K, \; 1 \leq m \leq M
The system is called factorial because the state space is the cross-product of the M variables. However, without constraining the state transitions of the multiple variables, the state of the system can take K^M arbitrary values. Such a system is equivalent to an ordinary HMM with K^M states. By placing constraints on the conditional interrelations between variables, the possible states of this system can be regularized and the system can be analyzed more easily.
A particular model we consider here is one in which the dynamics of each variable are independent and decoupled from the other variables:

p(S_t | S_{t-1}) = \prod_{m=1}^{M} p(S_t^m | S_{t-1}^m)
In other words, each variable has its own state transition matrix:

P^m = \{p_{ij}^m\}, \quad p_{ij}^m = p(a_j^m | a_i^m), \quad 1 \leq i, j \leq K
We assume that each state variable S_t^m is a K×1 vector, where the k-th element, corresponding to the current state, takes the value 1 and all other elements take the value 0. The observation is a sequence of vectors Y = (Y_1, Y_2, ..., Y_T). The observation Y_t at time step t is a D×1 vector of continuous values, and the probability density of Y_t given the underlying M state variables is a D-dimensional Gaussian distribution:

p(Y_t | S_t) = \frac{1}{\sqrt{|\Sigma| (2\pi)^D}} \exp\!\left( -\frac{1}{2} (Y_t - \mu_t)^T \Sigma^{-1} (Y_t - \mu_t) \right)

where

\mu_t = \sum_{m=1}^{M} W^m S_t^m
Each weight matrix W^m is a D×K matrix, and Σ is the D×D covariance matrix. The parameters of such an FHMM system can be represented as Θ = {π^m, P^m, W^m, Σ}_{m=1...M}. An example of such a system is illustrated in Fig.5.5.
Figure 5.5: Factorial Hidden Markov Model

The parameters of the FHMM can be trained by the Baum-Welch algorithm from a set of observation sequences. A benefit of FHMM over the traditional HMM is the lower complexity of training the system. Assuming the length of the training sequence is T, training the traditional HMM with K^M states has a time complexity of O(T K^{2M}) for each training sequence. Exact inference using the Baum-Welch algorithm for FHMM has complexity O(T M K^{M+1}) instead. In [20] the authors proposed several approximation techniques that can speed up the parameter estimation process, such as Monte Carlo approximation or structured variational approximation. In our experiment, we used exact inference for training the FHMM. In [50] the authors proposed a variation of the FHMM described previously, the Triangulated FHMM, for annotating the activity of an actor in a 2D video sequence. Another HMM extension that formulates multiple state variables is the Coupled HMM (CHMM) [41]. The difference between FHMM and CHMM is that CHMM models the observation as multiple streams, each conditioned on one state variable, while FHMM models a single observation stream conditioned on the combination of variables. Each state variable in CHMM is also conditioned on the values of all variables at the previous time step, instead of transitioning independently as in FHMM. Thus FHMM is more suitable for our problem, as discussed below. In [57] the authors use the Parallel HMM (PaHMM) to recognize American Sign Language. Like CHMM, PaHMM assumes the observation can be separated into different channels and uses independent HMMs to model the state behind each channel.
The multiple-chain formalism of FHMM is particularly attractive for our gesture recognition system. First, it formulates the system state as multiple variables; thus the decomposition into primary and secondary atoms can be taken into account, as we can model each atom as a variable. Second, it assumes that the transitions of the different variables are independent. This may reduce the system's capability to track specific combinations of atom postures, but it greatly reduces the system complexity. Third, the observation model formulates the observed vector as the combination of state variables. The mean of the Gaussian, μ_t, is a linear function of the underlying atom states. This fits the additive property of the shape descriptor:

Desc(P_{12}) \cong Desc(P_1) + Desc(P_2) - Desc(Resting)
and implies that the value of each histogram bin in the descriptor can also be expressed as a linear function of the composing atom postures. Thus, the FHMM observation fits well with the shape descriptor formalism. The matching pursuit algorithm linearly projects the descriptor onto the basis defined by the atoms; the weight vector can therefore be viewed as a linear function of the descriptor, and of the atoms as well. We adapt in this section the FHMM formalism for recognizing gestures of the upper body. The shape descriptors derived from the visual hull provide a good geometric representation of the posture; however, using the shape descriptors directly as observation vectors in an FHMM system is unsuitable. The descriptors are of high dimensionality, usually in the thousands. The vectors are also very sparse, resulting in singular or near-singular covariance matrices if used as-is in the FHMM framework. By applying the matching pursuit algorithm to the descriptor, we model the shape explicitly as the linear sum of the atoms, while reducing the representation to a more compact form suitable for the FHMM framework.
We define one FHMM for each gesture, with two underlying state chains per FHMM, corresponding to the transitions of the states (postures) of the left and right arms. Since the shape descriptor is rotation invariant and does not distinguish between postures of the left or right arm, we refer to the two state variables as primary/secondary states. The state of each of the two variables can be one of the K atoms available in the dictionary. We use the K×1 weight vector resulting from the matching pursuit algorithm as the observation vector; in this case we have D = K.

By modeling two separate chains, we make the assumption that the primary/secondary state transitions are independent. Thus gestures that differ only in the synchronicity of the two atoms cannot be distinguished. For example, two arms pointing forward at the same time, or one arm pointing after the other, are considered the same gesture.
Assume we have a gesture dictionary G containing a set of trained FHMMs, each corresponding to a defined gesture. For every observed weight vector sequence Y, we want to find the most probable gesture in the dictionary by solving:

m = \arg\max_{m_i \in G} p(m_i | Y) = \arg\max_{m_i \in G} p(Y | m_i) \, p(m_i)

Assuming that the prior probabilities of all gestures are equal, the problem becomes finding the gesture model that has the highest likelihood for a given observation:

m = \arg\max_{m_i \in G} p(Y | m_i)
The likelihood of an observation sequence, given a particular FHMM, can be computed by a forward algorithm. The forward algorithm uses dynamic programming to efficiently compute the probability of the observation sequence given each of the possible hidden state sequences, and returns the likelihood p(Y | m_i) by summing all of these probabilities.
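For the two-chain case used here, exact inference can be sketched by flattening the K×K joint state space and running an ordinary forward pass with the Gaussian emission of Section 5.2.2, whose mean is the sum of the two chains' columns of W. The parameter shapes and the use of scipy are assumptions of this sketch, not part of the original implementation.

import numpy as np
from scipy.stats import multivariate_normal

def fhmm_likelihood(Y, P1, P2, W1, W2, Sigma, pi1, pi2):
    """p(Y | model) for a two-chain FHMM with K states per chain.
    P1, P2: KxK transitions; W1, W2: DxK weight matrices; Sigma: DxD covariance."""
    K = P1.shape[0]
    A = np.kron(P1, P2)                    # joint transition of the two independent chains
    pi = np.kron(pi1, pi2)                 # joint prior over the K*K flattened states
    # mu for joint state (i, j) is W1[:, i] + W2[:, j]
    means = (W1[:, :, None] + W2[:, None, :]).reshape(W1.shape[0], K * K).T
    emis = np.array([[multivariate_normal.pdf(y, mean=m, cov=Sigma) for m in means]
                     for y in Y])
    alpha = pi * emis[0]
    for t in range(1, len(Y)):
        alpha = (alpha @ A) * emis[t]
    return float(alpha.sum())

# Gesture recognition then picks the argmax over the trained models, e.g.:
# best = max(models, key=lambda g: fhmm_likelihood(Y, *models[g]))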
5.3 Experiment Results
We have used the proposed 3D shape descriptor and shape decomposition technique to decompose users' postures while performing specific gestures. The experimental environment consists of four synchronized cameras, allowing real-time image extraction, silhouette segmentation, and 3D human body visual hull reconstruction at 12 frames per second. In the experiment we used a cylindrical reference shape of 3 vertical by 16 horizontal reference points to infer the shape descriptor. Using fewer control points along the angular direction would impede the rotation invariance of the shape descriptor, while incorporating more control points does not significantly improve the performance but requires more computation. We chose the bin resolution to be 24(r)×24(θ)×24(ϕ), as shape histograms with lower resolution cannot capture the variation in shapes necessary to distinguish postures and gestures. We defined 6 different gestures: (1) one arm raising up and pointing in a direction horizontally; (2) two arms flapping like wings; (3) directing traffic, one arm pointing while the other waves in a direction; (4) one arm raising up overhead; (5) flapping only one arm; (6) two arms pointing left and right diagonally. Each of these gestures consists of transitions of the 5 atom postures. The proposed approach allows the recognition of gestures across people without a user-specific training data set. To evaluate the recognition rate across persons, we collected video sequences of each of the six gestures from five people; 10 to 14 video sequences were collected for each person and each gesture. Each sequence is about 400 frames long. The FHMMs for the corresponding 6 gestures were formulated using two chains, and the state variable of each chain had 5 states corresponding to the 5 atoms. The 5×1 weight vector returned by matching pursuit was used as the observation at each frame. The FHMMs were trained only on the sequences of the first person; therefore, we expect the recognition rates to improve as we broaden the set of people in the training data. The average recognition rates of the different gestures are shown in Table 5.2.
Gesture   Person 1   Person 2   Person 3   Person 4   Person 5
   1        45         50         30.77      23.08      50.0
   2        87.5       87.5       83.3       88.9       90.0
   3        100        100        25.0       87.5       100
   4        100        92.3       75.0       80.0       100
   5        76.6       66.6       83.3       100        85.7
   6        100        100        55.5       100        100

Table 5.2: Identification rate for the 6 gestures considered using the proposed FHMM formulation. We show results on 5 people, where only person 1 was used for training.
For comparison, we also tested the gesture recognition with traditional HMMs. The HMMs were formulated with a single state variable having 5^2 = 25 states, counting all possible combinations of primary/secondary atoms. The results are shown in Table 5.3. The resulting identification rate of FHMM is comparable to HMM. The differences reside in the nature of the two models. The benefit of HMM is the ability to model the transitions of the combined value of the states. On the other hand, it requires exact synchronization of the two atoms as in the training data to correctly classify the gesture. Any mismatch in the synchronization of the analyzed gesture and the trained gesture, such as one arm raising more slowly than the other, will drastically affect the classification results, which is the reason for the low recognition rates of some gestures in the experiments. In the FHMM formulation each variable transits independently, which allows recognizing gestures where the primary and secondary transitions are not synchronized with the training data; this also prevents the FHMM formulation from distinguishing two gestures that differ only in the synchronization of atoms. Finally, the computational complexity of FHMM also scales better with the increase of the atom dictionary size. We encourage the reader to review the submitted video illustrating the gestures considered in this experiment and highlighting the performance of the gesture recognition system.
Gesture   Person 1   Person 2   Person 3   Person 4   Person 5
   1        37.5       50         76.9       44         80.0
   2        87.5       87.5       100        55.5       100
   3        41.6       41.6       0          37.5       66.6
   4        80.0       87.5       75         10         91.6
   5        88.8       88.8       100        100        100
   6        22.2       22.2       11.1       7.8        0

Table 5.3: Identification rate of 6 gestures over 5 people, using the traditional HMM formulation.
Chapter 6
Pose and Gesture Estimation
In the previous chapter, we investigated the inference of qualitative properties of body postures and gestures. However, in a Human-Computer Interaction system, one of the basic operations is interacting with virtual entities. For example, the user often needs to specify which entity (virtual person, object or location in the virtual world) he or she is interacting with, via pointing or facing, or needs to describe the shape of an object or the trajectory of a motion with hand movements. In those cases it is necessary to estimate the geometric information of the pose and gesture of the user. We achieve this by defining an articulated model of the human body and fitting this model to our observed images.
6.1 Articulated Rigid Body Model
Currently, we focus on the task of estimating the upper body configuration of the user, especially pointing poses. The motion of the lower body, such as the stance of the legs, is of little interest in our HCI application. With the estimated body configuration, we can deduce the virtual entity (virtual person or object) that the trainee is interacting with.

Figure 6.1: Articulated Body Model
For our body tracking task, we define a body model as shown in Fig. 6.1. We model seven segments and six joints of the articulated model:

• Lower Body: This segment includes the legs and the hip. The legs are represented as a single segment, since the user will be mostly standing in one place during the VR interaction.

• Upper Body: Includes the chest and the abdomen. The upper body links with the lower body via the waist joint. The stance of the upper body, such as leaning or bowing, may be of particular interest in VR training, since those stances may play a specific role in interactions with (virtual) people of certain cultures, such as the Middle East.

• Head: The head joins the upper body via the neck joint. The head segment is important in calculating the pointing direction; in our application, the pointing direction is derived from the center of the head (presumably the rough position of the eyes) to the tip of the pointing arm. It is also important for integrating with the facial expression modality, since the facial detection module may need the head position in the 2D image to initialize facial tracking.

• Upper and Lower Arms: Tracking the arms may be the most important part of the body estimation system, as they are most frequently used for interaction with virtual objects. The arms are linked by the shoulder and elbow joints.
The degrees of freedom (DOF) of the model include:

• The 3D position of the body (x, y, z) in the world coordinate system. We assume that in the world coordinate system the XY plane is the floor and the Z axis points up.

• The 3D rotation of the body, modeled as ZXY Euler angles.

• The joint rotations of the waist, neck, shoulders and elbows. Since we assume axial symmetry of the arm and head segments, the neck and arm joint angles can be represented as an azimuth angle θ and a zenith angle ϕ, but for generality we model all of them as ZXY Euler angles.

This gives a total of 21 degrees of freedom (DOF) for the model. The shapes of the body segments are modeled as 3D ellipsoids.
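One possible way to hold this model in code is sketched below; the segment names, default ellipsoid radii, and the flat Euler-angle storage are illustrative assumptions rather than the actual implementation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Segment:
    name: str
    parent: Optional[str]                  # None for the root (lower body)
    radii: tuple                           # (a, b, c) ellipsoid semi-axes, in meters
    euler_zxy: list = field(default_factory=lambda: [0.0, 0.0, 0.0])  # joint rotation

@dataclass
class BodyModel:
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])   # 3 DOF
    euler_zxy: list = field(default_factory=lambda: [0.0, 0.0, 0.0])  # 3 DOF body rotation
    segments: dict = field(default_factory=dict)

def default_model():
    """Seven ellipsoid segments linked by six joints; radii are placeholder values."""
    m = BodyModel()
    for name, parent, radii in [
        ("lower_body",  None,          (0.15, 0.15, 0.45)),
        ("upper_body",  "lower_body",  (0.18, 0.12, 0.30)),
        ("head",        "upper_body",  (0.10, 0.10, 0.12)),
        ("l_upper_arm", "upper_body",  (0.05, 0.05, 0.15)),
        ("l_lower_arm", "l_upper_arm", (0.04, 0.04, 0.14)),
        ("r_upper_arm", "upper_body",  (0.05, 0.05, 0.15)),
        ("r_lower_arm", "r_upper_arm", (0.04, 0.04, 0.14)),
    ]:
        m.segments[name] = Segment(name, parent, radii)
    return m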
One problem arises here: the body tracking system should generalize to all kinds of users. A single model adapts poorly to users of different heights, sizes, or body proportions. Applying a body model of fixed size and proportion to different people will result in arbitrary, and often undesirable, body configurations. We need a way to dynamically adjust the geometry of the model, such as link lengths and segment sizes, to the person currently using the system. In the next two sections we introduce methods to configure the geometry of the rigid body model.

Currently we focus on tracking only a single user. Segmentation of multiple users is made more difficult in IR images due to the lack of color or texture.
6.1.1 3D Axis Based Model Configuration
In this section, we describe the application of Nonlinear Spherical Shells (NSS) to configure the articulated body model. This method can adapt the proportion of each individual body segment. The model and motion capture (MMC) procedure automatically determines a common kinematic model and joint angle motion from a volume sequence in a two-pass process.
Figure 6.2: (Left) Kinematic models for all frames. (Right) Joints for a normalized
kinematic model
In the first pass, the procedure applies NSS independently to each frame in the volume sequence. From the skeleton curve and volume of each frame, a kinematic model and posture is produced that is specific to the frame. A second pass across the frame-specific kinematic models is used to produce a single normalized kinematic model with respect to the frames in the volume sequence.

The described NSS procedure is capable of producing skeleton curve features in a model-free fashion. The skeleton curve is used to derive a kinematic model for the volume in each frame. First, we consider each branch (occurring between two indicator nodes) as a kinematic link. The root node and all branching nodes are classified as joints. Each branch is then segmented into smaller kinematic links based on the curvature of the skeleton curve. This division is performed by starting at the parent indicator node and iteratively including skeleton points until the corresponding volume points become nonlinear. Nonlinearity is tested by applying a threshold to the skewness of the volume points with respect to the line between the first and last included skeleton points. When nonlinearity occurs, a segment, representing a joint placement, is set at the last included skeleton point. The segment then becomes the first node in the determination of the next link, and the process iterates until the next indicator node is reached. The lengths of these segments, relative to the length of the whole branch, are recorded in the branch. The specific kinematic models derived from the volume sequence may have different branch lengths, and each branch may be divided into a different number of links.
In the second pass, a normalization procedure is used across all frame-specific models to produce a common model for the sequence. For normalization, we aim to align all specific models in the sequence and look for groupings of joints. The alignment method we used iteratively collapses two models in subsequent frames, using a matching procedure to find correspondences. The matching procedure uses summed error values of the minimum squared distance between branch parents, the difference between branch angles, and the difference between branch lengths. The normalization procedure finds the mapping that minimizes the total error value. We have also begun to experiment with a simpler alternative alignment procedure. This procedure uses Isomap to align by constructing neighborhoods for each skeleton point that consider its intra-frame skeleton curve neighbors and corresponding points on the skeleton curve in adjacent frames.

Once the specific kinematic models are aligned, clustering on each branch is performed to identify joint positions. Each branch is normalized by averaging the length of the branch and the number of links in the branch. The locations of the aligned joints along the branch form a 1D data sequence. An example is shown in Fig.6.2 for a branch with an average number of joints rounded to three. To identify the joint clusters, we used a clustering method that estimates the density of all joint locations and places a joint cluster where peaks in the density are found.

From the normalized body model we can infer the proportion of each body segment in Fig.6.1.
6.1.2 Model Scaling by 3D Height
A simpler method is to scale the articulated body model proportionally to the estimated height of the user. The height can be inferred from the maximum height of the 3D body shape data (visual hull or volumetric data). It can also be inferred from the height of the 3D torso axis points. The assumption is that human body proportions tend to vary little among average-sized people.
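A minimal sketch of this scaling, re-using the model structure sketched in Section 6.1, is shown below; the reference height is an assumed constant.

def scale_model_by_height(model, observed_height, reference_height=1.75):
    """Scale every ellipsoid segment by the ratio of observed to reference height."""
    ratio = observed_height / reference_height
    for seg in model.segments.values():
        seg.radii = tuple(r * ratio for r in seg.radii)
    return model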
Figure 6.3: Example NSS Body Model Configuration. (a) Original Body Pose; (b) Volumetric Data; (c) Pose Invariant Volume; (d) Derived Body Model.
The Nonlinear Spherical Shells method has the advantage of model flexibility. It can estimate the length of each individual body segment and is therefore able to configure the body model proportions more precisely than the height-based scaling method. The disadvantage of NSS processing is its computational complexity, as the Isomap processing is very slow.

The height-based scaling method may be too simplistic to adapt to users who are very thin or obese, but its advantage is computational efficiency, which allows real-time processing. In our experiments we found that the height scaling method is generally sufficient to adapt the body model to average users.
It is impractical to re-configure or scale the body model at every image frame. The process introduces extra computational overhead, and in the case of height-based body scaling, the inferred height would be incorrect if the user crouches or raises the arms. It is more practical to configure the model only at the start of body tracking.

The tracking system detects the presence of the user in the scene from the sum of the observation weights of the particles. If the weight sum is below a certain threshold, most likely due to a lack of segmented silhouettes in the image, the system classifies the room as empty; otherwise the space is classified as occupied. When the user first enters the scene, the tracking system configures the body model based on the inferred 3D geometry information. If a user enters the scene in a stance other than normal walking, such as crouching or with an arm raised, the system may scale the model incorrectly.
6.2 2D-3D Iterative Closest Point Tracking
Since we designed our system under the assumption that the 2D silhouette data, and subsequently the 3D shape information, is noisy and lossy, we cannot perform body pose estimation directly on the 3D data. Here we propose an iterative convergence method that uses the 2D silhouettes to update the 3D body model.
6.2.1 Iterative Closest Point (ICP) Method
In [15], Demirdjian et al. used the Iterative Closest Point (ICP) method to match a body model to 3D shape data created from stereo vision. The original (3D) ICP method [5,25] works as follows.
Assume we have a set of model points Q = {q_i, i = 1...N} and data points P = {p_i, i = 1...N}, where each data point p_i is paired with its closest model point q_i. We want to find a rotation R and translation T that minimize the error term

E = \sum_{i=1}^{N} \| e_i \|^2, \quad e_i = p_i - R(q_i) - T
To find the optimal T, we first express P and Q relative to their centroids. Let

\bar{p} = \frac{1}{N} \sum_i p_i, \quad \bar{q} = \frac{1}{N} \sum_i q_i

and

p'_i = p_i - \bar{p}, \quad q'_i = q_i - \bar{q}, \quad i = 1...N

The optimal rotation matrix can be derived by maximizing the term

E_M = \sum_{i=1}^{N} p'_i \cdot [R(q'_i)]    (6.1)

Finding the optimal rotation R is non-trivial, as the rotation matrix must satisfy the orthonormality constraint. The optimal translation T can then be derived by

T = \bar{p} - R(\bar{q})
6.2.2 2D-3D ICP
The original 3D ICP works well with reasonably accurate 3D data. However, since the background in the images can be cluttered and the IR image is featureless, the silhouettes, and subsequently the 3D visual hull constructed from them, can be very noisy. It is not uncommon for an arm to be missing from the visual hull in some frames due to segmentation errors. An ICP tracking method using the faulty 3D data may converge to an arbitrary configuration and has no way to correct itself. We propose a new tracking method that applies ICP to the 2D data instead, and then integrates the 2D tracking results into the 3D model. The method is outlined below.
Assume the model is properly initialized. At each time frame t:

1. The 3D body model is projected into each image to form a 2D rigid articulated model. Fig.6.4(a), 6.4(b) shows an example of this projection.

2. The 2D model in each image is then fit to the corresponding silhouette contour using 2D ICP. See Fig.6.4(c).

3. The 2D rigid transformation that fits the 2D model to the contour is then back-projected to 3D space. A 3D rotation and translation of each body segment is inferred from the back-projection.

4. The 3D rigid transformations inferred from the multiple camera images are then averaged, and the 3D body model is updated by the average transformation. See Fig.6.4(d).

5. Repeat from step 1. Terminate if the error computed by 2D ICP is below a threshold, or if a fixed number of iterations have been performed.
Each step in the method is described in more detail below.
Figure 6.4: ICP process; white lines in the 2D models indicate closest point pairs. (a) 3D Model Before ICP; (b) 2D Model Before ICP; (c) 2D Model After ICP; (d) 3D Model After ICP.
6.2.3 3D Model Projection
At the 3D-to-2D projection step, the 3D body model is projected into each image to form a 2D articulated rigid body model. The segments of these 2D models are 2D ellipses.

An advantage of using ellipsoids for body segments is that the perspective projection of an ellipsoid forms an ellipse, and this mapping can be represented in analytical form. Assume that we have a 3D ellipsoid body segment S_k. A 3D point is represented in homogeneous coordinates as X = [x, y, z, 1]^T. An ellipsoid centered at the origin and aligned with the coordinate axes can be represented as

\frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2} = 1 \;\equiv\; X^T \,\mathrm{diag}\!\left(\frac{1}{a^2}, \frac{1}{b^2}, \frac{1}{c^2}, 0\right) X = 1
After an arbitrary rotation R and translation T, the ellipsoid can be represented as

X^T (M^{-1})^T \,\mathrm{diag}\!\left(\frac{1}{a^2}, \frac{1}{b^2}, \frac{1}{c^2}, 0\right) M^{-1} X = 1

where

M = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \quad\text{and}\quad M^{-1} = \begin{bmatrix} R^T & -R^T T \\ 0 & 1 \end{bmatrix}

Let

A_0 = (M^{-1})^T \,\mathrm{diag}\!\left(\frac{1}{a^2}, \frac{1}{b^2}, \frac{1}{c^2}, 0\right) M^{-1}

The ellipsoid can then be represented as

X^T A_0 X = 1    (6.2)
Assume we want to project the ellipsoid S_k into camera c. Let

• \vec{e} = [x_e, y_e, z_e, 1]^T be the focal point of camera c,

• P be the 3×4 projection matrix of camera c,

• P^+ be the 4×3 pseudo-inverse matrix of P,

• the 4 rows of P^+ be the 3-dimensional row vectors P_1^+, P_2^+, P_3^+, P_4^+, respectively.
For any pixel U = [u, v, 1]^T in the image, we want to find its back-projection line L_U. First, we compute the pseudo back-projection point as

X_0 = P^+ U = \begin{bmatrix} P_1^+ U \\ P_2^+ U \\ P_3^+ U \\ P_4^+ U \end{bmatrix} \equiv \begin{bmatrix} P_1^+ U / P_4^+ U \\ P_2^+ U / P_4^+ U \\ P_3^+ U / P_4^+ U \\ 1 \end{bmatrix}

X_0 is an arbitrary 3D point lying on the back-projection line and might be behind the camera. The (non-unit) direction vector of the back-projection line is

\vec{v} = X_0 - \vec{e} \equiv \begin{bmatrix} P_1^+ U - x_e P_4^+ U \\ P_2^+ U - y_e P_4^+ U \\ P_3^+ U - z_e P_4^+ U \\ 0 \end{bmatrix} \equiv P^* U

where

P^* = \begin{bmatrix} P_1^+ - x_e P_4^+ \\ P_2^+ - y_e P_4^+ \\ P_3^+ - z_e P_4^+ \\ 0 \end{bmatrix}

The 3D back-projection line can be defined as

L_U \equiv X_p = \vec{e} + \tau \vec{v}    (6.3)
If the back-projection line L_U is tangent to the surface of the ellipsoid, then the pixel U lies on the projected ellipse. To determine whether L_U is tangent to the ellipsoid, we substitute X_p from (6.3) into (6.2). This results in a quadratic equation in the variable τ:

\vec{v}^T A_0 \vec{v} \,\tau^2 + 2 \vec{e}^T A_0 \vec{v} \,\tau + \vec{e}^T A_0 \vec{e} - 1 = 0    (6.4)
Being tangent to S_k means that L_U and S_k intersect at only one point, so (6.4) has only one solution, which means

(2 \vec{e}^T A_0 \vec{v})^2 - 4 (\vec{v}^T A_0 \vec{v})(\vec{e}^T A_0 \vec{e} - 1) = 0    (6.5)
Rewriting (6.5), we obtain the closed-form representation of the projected ellipse s_k:

U^T \left[ P^{*T} A_0 \vec{e}\, \vec{e}^T A_0 P^* - \left( \vec{e}^T A_0 \vec{e} - 1 \right) P^{*T} A_0 P^* \right] U = 0    (6.6)
Using Eq.6.6, we can easily deduce each 2D ellipse segment of the 2D body model. Also, given the ellipse equation, we can quickly compute a set of points uniformly distributed along the contour of the ellipse, together with their tangent normals. This is useful in the next step.
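A small numpy sketch of this projection is given below: it builds P* from the pseudo-inverse and the focal point, then assembles the 3×3 conic matrix of Eq. (6.6) so that pixels U = [u, v, 1]^T on the projected ellipse satisfy U^T C U = 0. Variable names are assumptions.

import numpy as np

def p_star(P_plus, e):
    """P*: rows P_i^+ - e_i * P_4^+ for i = 1..3, last row zero.
    P_plus is the 4x3 pseudo-inverse; e = [xe, ye, ze, 1] is the focal point."""
    out = np.zeros((4, 3))
    out[:3] = P_plus[:3] - np.outer(e[:3], P_plus[3])
    return out

def projected_ellipse_conic(A0, e, P_star_mat):
    """3x3 conic matrix of Eq. (6.6) for the ellipsoid X^T A0 X = 1 seen from focal point e."""
    s = float(e @ A0 @ e) - 1.0
    return (P_star_mat.T @ A0 @ np.outer(e, e) @ A0 @ P_star_mat
            - s * (P_star_mat.T @ A0 @ P_star_mat))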
6.2.4 2D Iterative Closest Points
Given a projected 2D body model and a 2D silhouette in each camera, we would like to find the set of 2D rigid transformations that best aligns the model to the image data. Iterative Closest Point (ICP) is a standard 3D registration algorithm [6]: given a set of 3D data and a 3D model, ICP estimates the rigid motion between the data and the model. A 2D variation of ICP is proposed here.

Assume we have a 2D rigid model consisting of K ellipses. First we uniformly select N points Q = {q_i | i = 1...N} on the contours of the ellipses s_k, k = 1...K. Let \bar{q} = avg_i(Q) be the average of the points in Q.
Second, for each point q_i, we find the nearest point p_i on the silhouette contour. Let P = {p_i | i = 1...N} and let \bar{p} = avg(P) be the mean coordinate of all p_i. Finding the nearest point, however, requires computing the Euclidean distance between every pair of model and contour pixel points, and this brute-force procedure is computationally expensive. To simplify the computation, we approximate the silhouette contour by 2D polygons (Fig.4.1). Instead of computing and comparing the distance between q_i and all contour pixels, the nearest point can now be found by computing the shortest distance between q_i and each polygon edge segment. In our camera setup, in an image of 320×240 resolution, a human body silhouette contour usually contains 800-1000 pixels. The silhouette can be approximated by 20-30 polygon edges with reasonable accuracy. This greatly improves the performance, as real-time tracking is one of the foremost requirements. When searching for nearest points, the normals of the ellipse contour and the polygon are also taken into consideration: for each pair of polygonal contour edge and ellipse point q_i, if the angle between their contour normals is larger than a certain threshold, this pair is ignored in the search for the nearest point of q_i. We set the angle threshold to π/2 in our implementation.
After finding the nearest point pairs, we want to estimate the 2D rigid transformation M that best matches the model to the data. We use the sum of squared errors as the error function:

E = \sum_i \| p_i - M q_i \|^2

The transformation M consists of a 2D rotation by angle θ and a translation T:

M = \begin{bmatrix} R_\theta & T \\ 0 & 1 \end{bmatrix}
where R_θ is the 2D rotation matrix of angle θ. The optimization can be performed the same way as in 3D ICP. The only difference is that, when maximizing the term E_M in Eq.6.1, we can simply take the partial derivative of E_M with respect to the rotation angle θ and set

\frac{dE_M}{d\theta} = 0
The rotation angle θ that minimizes the error E is estimated by

\theta = \arctan\!\left( \frac{\sum_i w_i \left( p_{i,y}\, q_{i,x} - p_{i,x}\, q_{i,y} \right)}{\sum_i w_i \left( p_{i,x}\, q_{i,x} + p_{i,y}\, q_{i,y} \right)} \right)    (6.7)
where w_i is an exponential weight:

w_i \cong e^{-|p_i - q_i|^2}    (6.8)

This weight decreases as the distance between nearest point pairs increases. With this weighting, outlier points interfere less in computing the transformation. The 2D translation T is estimated by

T = \bar{p} - R_\theta(\bar{q})    (6.9)
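The closed-form fit of Eqs. (6.7)-(6.9) can be sketched as follows for matched point pairs; arctan2 is used instead of arctan for quadrant robustness, which is a small departure from the formula as written.

import numpy as np

def fit_2d_rigid(p, q):
    """p, q: (N, 2) arrays of matched contour (data) and ellipse (model) points."""
    w = np.exp(-np.sum((p - q) ** 2, axis=1))                       # Eq. (6.8)
    num = np.sum(w * (p[:, 1] * q[:, 0] - p[:, 0] * q[:, 1]))
    den = np.sum(w * (p[:, 0] * q[:, 0] + p[:, 1] * q[:, 1]))
    theta = np.arctan2(num, den)                                     # Eq. (6.7)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    T = p.mean(axis=0) - R @ q.mean(axis=0)                          # Eq. (6.9)
    return theta, R, T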
6.2.5 Convert 2D Transformation to 3D
Now we have a 2D model in each image, and a derived 2D transformation that best matches the 2D model to the 2D silhouettes. Here we propose a method to integrate those 2D transformations from multiple images into a single 3D transformation that best matches the 3D model to the observed data.

Assume that for each camera c and for each body segment S_k, a 2D rotation θ and translation T were found in the previous step. The 2D transformation is converted to 3D as follows.
Let the 3D point J_k be the joint of S_k, and let L_k be the line connecting J_k and the focal point \vec{e}; L_k is also the projection ray of J_k. The 2D rotation is converted to 3D as a rotation of S_k around L_k by the angle θ. An example is shown in Fig.6.5. T is converted to a 3D displacement on a plane parallel to the image plane. For each body segment, the 2D ICP tracking applied to the multiple camera images may produce different 3D transformations. Body model updates are inferred from a combination of these transformations.

Figure 6.5: Back-project 2D rotation to 3D
We integrate the 3D rigid transformations by averaging them. Averaging the translations is straightforward:

\bar{T} = \mathrm{avg}_c(T_c)

where c is the camera index. For rotations, there are several ways to compute the average [21]. Since our 3D model segments rotate around an axis, a 3D rotation can be conveniently represented by a unit quaternion. Assume the rotation axis line L_k has directional unit vector

v_k = (v_x, v_y, v_z), \quad \| v_k \| = 1

Then the rotation around L_k by angle θ can be represented by the quaternion q_c:

q_c = \cos(\theta/2) + \sin(\theta/2) \left( v_x i + v_y j + v_z k \right)
The average rotation is then given by:

\bar{q} = \frac{\sum_c q_c}{\left| \sum_c q_c \right|}    (6.10)

The quaternions in Eq.(6.10) must also satisfy \langle q_i, q_j \rangle \geq 0 for any two quaternions. This can be achieved by flipping the signs of some of the quaternions, since -q_c and q_c represent the same rotation.
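A short sketch of this averaging step is given below: per-camera axis-angle rotations are turned into quaternions, sign-aligned so that the pairwise scalar products are non-negative, summed and re-normalized as in Eq. (6.10).

import numpy as np

def axis_angle_to_quat(axis, theta):
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate([[np.cos(theta / 2.0)], np.sin(theta / 2.0) * axis])

def average_quaternions(quats):
    quats = np.asarray(quats, dtype=float)
    # Flip signs so every quaternion lies in the same hemisphere as the first one;
    # -q and q represent the same rotation.
    signs = np.where(quats @ quats[0] < 0.0, -1.0, 1.0)
    q_sum = (quats * signs[:, None]).sum(axis=0)
    return q_sum / np.linalg.norm(q_sum)          # Eq. (6.10)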
The averaged rotation is applied to each 3D body joint. The translation is applied only to the full body and not to individual segments. This averaged rotation is relative to the world coordinate system; to re-configure the body model, the rotation must first be expressed in the local joint coordinate system and then converted to ZXY Euler angles.

Since the shapes of the 2D models change with every iteration due to re-projection, the 2D-3D ICP method is not guaranteed to converge. However, empirically we find that 2-3 iterations are sufficient to provide good seeds to the subsequent particle filter process. The 2D-3D ICP is less prone to segmentation errors than pure 3D ICP using the 3D visual hull. For example, a lost limb in one silhouette will remove the limb from the visual hull, which is formed by the intersection of the silhouette data; a tracking method using the 3D shape will likely fail in such a case. By contrast, the 2D-3D ICP uses the union of all silhouettes, and the effect of one faulty segmentation is reduced by the averaging step.
6.3 Particle Filtering
The 2D-3D ICP method, like other iterative convergence methods, relies heavily on the initial configuration of the model. An initial body model configuration that differs too much from the correct one may converge to an arbitrary pose. Also, if the tracking fails at one time instant, it is unlikely to correct itself in the ICP process. To automate the track initialization process and improve robustness, we integrate a Particle Filtering method with the 2D-3D ICP.

Assume x_t is the vector of joint angles of the body model and y_t is the captured image information at time t. We want to find the best x_t:

x_t = \arg\max_{x_t} p(x_t | y_{1:t})    (6.11)
given the sequence of images y_{1:t}. However, it is difficult to derive the posterior probability function directly. One reason is that the dimensionality of x_t is high (21 DOFs plus their derivatives); another is that the camera projection from x_t to y_t is not a linear process. One method that is often used to estimate the posterior probability is the Sequential Monte Carlo method, also known as Particle Filtering [1,18]. Instead of deriving the actual analytical form of the posterior, the Particle Filter estimates the distribution with a set of data samples called particles. Each particle represents one possible user state (position and joint angles).

Just as in posture recognition, we make the assumption that our system is a Hidden Markov Model, i.e. the state x_t depends only on the state x_{t-1} of the previous time instant, and the captured images y_t result only from the current user state x_t. Only y_t is observable.
p(x_t | x_{0:t-1}) = p(x_t | x_{t-1})    (6.12)

p(y_t | x_{0:t}) = p(y_t | x_t)    (6.13)
The original generic Particle Filter includes three steps: Initialization, Weighted Sampling and Resampling. Many alternative methods have been proposed to improve the generic Particle Filter. Some hybrid approaches add an additional Refine-Prediction step after the importance sampling [59]; these methods adjust the sample set either to increase the sample diversity or to increase sample importance. We include the Refine-Prediction step in our method too, and may incorporate other improved methods in the future. Our Particle Filter works as follows:
• Initialization: at the beginning, N data samples α_0^i, i = 1...N, are drawn from a prior distribution p(x_0).

• Weighted sampling: at each time step t, for each particle α_{t-1}^i at time t-1, a new particle is sampled according to the density of the proposal function:

  \tilde{\alpha}_t^i \sim q(\alpha_t^i | y_{1:t}, \alpha_{0:t-1}^i) = q(\alpha_t^i | y_t, \alpha_{t-1}^i)    (6.14)

  The weight w_t^i of each particle \tilde{\alpha}_t^i is evaluated by

  w_t^i = \frac{p(y_t | \tilde{\alpha}_t^i) \, p(\tilde{\alpha}_t^i | \alpha_{t-1}^i)}{q(\tilde{\alpha}_t^i | y_t, \alpha_{t-1}^i)}    (6.15)

  and all w_t^i, i = 1...N, are normalized so that they sum to one.

• Refine-Prediction: a subset of the particles from the previous step is randomly selected. The 2D-3D ICP is utilized as the Refine-Prediction step, which improves sample importance by moving samples toward nearby local optima.

• Resampling: N new particles α_t are sampled from the set \{\tilde{\alpha}_t^1, ..., \tilde{\alpha}_t^N\} with replacement, with probability proportional to the particle weight:

  p(\alpha_t^i = \tilde{\alpha}_t^i) = w_t^i    (6.16)
Thus at the end of each iteration (time frame), the system consists of N samples representing N different possible body pose configurations. The details of each step of the Particle Filtering are explained below.
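A high-level sketch of one filtering step is given below. The proposal functions, the ICP refinement and the observation model are passed in as callables whose signatures are assumptions; for brevity the weight is shown for the simplified case where the chosen proposal equals the transition model, so the correction ratio in Eq. (6.15) cancels.

import numpy as np

def particle_filter_step(particles, image, proposals, proposal_probs,
                         observation_prob, refine_icp, refine_fraction=0.2, rng=None):
    rng = rng or np.random.default_rng()
    n = len(particles)
    # Weighted sampling: each particle is propagated by a randomly chosen proposal.
    proposed = [proposals[rng.choice(len(proposals), p=proposal_probs)](p, image)
                for p in particles]
    # Refine-Prediction: push a random subset toward a local optimum with 2D-3D ICP.
    for i in rng.choice(n, size=int(refine_fraction * n), replace=False):
        proposed[i] = refine_icp(proposed[i], image)
    # Observation weights (Eq. 6.17), normalized to sum to one.
    w = np.array([observation_prob(image, p) for p in proposed], dtype=float)
    w = w / w.sum() if w.sum() > 0 else np.full(n, 1.0 / n)
    # Resampling with replacement, proportional to weight (Eq. 6.16).
    idx = rng.choice(n, size=n, replace=True, p=w)
    return [proposed[i] for i in idx], w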
6.3.1 Observation Probability
The observation probability p(y_t | α_t^i) is evaluated by projecting the shape model of particle α_t^i into each of the 2D images. Assume P_c^i is the projected shape of α_t^i in image c, c = 1...C, and S_c is the silhouette in image c. Then the probability is estimated as follows:

p(y_t | \alpha_t^i) = \prod_{c=1}^{C} \frac{\| P_c^i \cap S_c \|}{\| P_c^i \cup S_c \|}    (6.17)

The \|\cdot\| operator returns the area of the image shape. This observation function maximizes the overlapping area between the projected image of the particle and the silhouette, while penalizing the non-overlapping regions. The observation probabilities of the different cameras are multiplied, since the cameras capture images independently of each other.
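On boolean mask images, Eq. (6.17) can be sketched directly as an intersection-over-union product across cameras; the mask-based interface is an assumption (the actual system evaluates the ellipse equations on the GPU, as described later in this section).

import numpy as np

def observation_probability(projected_masks, silhouette_masks):
    """Eq. (6.17): product over cameras of |P ∩ S| / |P ∪ S| on boolean (H, W) masks."""
    prob = 1.0
    for proj, sil in zip(projected_masks, silhouette_masks):
        inter = np.logical_and(proj, sil).sum()
        union = np.logical_or(proj, sil).sum()
        prob *= inter / union if union > 0 else 0.0
    return prob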
6.3.2 Proposal Functions
The proposal function q(\alpha_t^i | y_{1:t}, \alpha_{t-1}^i) is an important part of the Particle Filter method. Theoretically, if the proposal function is defined as the posterior,

q(\alpha_t^i | y_t, \alpha_{t-1}^i) = p(\alpha_t^i | y_t, \alpha_{t-1}^i)    (6.18)

it minimizes the variance of the particle weights w_t^i and the estimation error. But it is usually not possible to sample directly from the posterior distribution. Often q() is defined as the transition function, q(\alpha_t^i | y_t, \alpha_{t-1}^i) = p(\alpha_t^i | \alpha_{t-1}^i), but this has the drawback of ignoring the current observation when drawing new samples. In our approach, we define a set of different proposal functions, Q = {q_j}. When drawing new samples in the weighted sampling step, proposal functions are randomly selected from this set and samples are drawn according to their probability distributions. These functions may sample new values for different sets of degrees of freedom and thus may not be mutually exclusive. The probability of each function being selected is set empirically (manually). The proposal functions we use are:
Random Walk function: The proposal function is simply defined as the transition probability:

q(\alpha_t^i | y_t, \alpha_{t-1}^i) = p(\alpha_t^i | \alpha_{t-1}^i)    (6.19)

The new observation is not taken into account when spawning new samples. The system models not only the current value of each degree of freedom, but also its first- and second-order time derivatives (velocity and acceleration). When sampling α_t^i from α_{t-1}^i, the new positions or rotation angles and their velocities are derived from the old values and the elapsed time difference, and Gaussian noise is then added to the new values and their derivatives. A cap on velocity and acceleration needs to be applied to prevent the samples from drifting too wildly.
Body Position Function: This function estimates the new body position. It randomly samples points from the 3D torso axis points and uses their X and Y coordinates as the new body position. Another option is to uniformly sample points on the surface of the visual hull and compute the average position as the new body position, though this position can easily be biased by the arms and by shadow volumes that were not removed in the shadow removal process.
Arm Direction Function: The 3D arm axis points are used to estimate the shoulder and elbow joint angles in the new samples. For each arm, one of the 3D axis line strips on the same side of the body is randomly selected, and the shoulder and elbow joint angles are altered to align the body model with the axis lines. However, there may be no 3D axis or shape information available, either because the user is not performing a pointing gesture at all, because the extended arm is occluded by the body due to viewing angle limits, or because there is severe error in the pixel segmentation. In such cases the body position and arm direction methods have no effect and make no modifications to the corresponding DOFs.
Default Pose Function: A set of pre-defined poses, such as scarecrow and
standing straight, are added to the particle set. The Body Position Function is
applied to those default poses to adjust body locations.
Shape Descriptor & Posture Recognition Function: The posture recognition system described in Section 5.1 can also serve as a proposal function. New particles are spawned based on the recognized posture.
In the future, other proposal functions that take advantage of different image features can be introduced into this framework to expand the capability of the system.
6.3.3 Particle Refinement
After the importance sampling step, a set of particles is randomly selected for further Prediction-Refinement. The 2D-3D ICP method is applied to converge those particles to match the body silhouettes. The advantage of this additional refinement step, compared to the generic Particle Filter, is the reduction in the number of particles: instead of relying on a large swarm of particles trying to hit the best match, a small number of particles can be guided towards the local optima.
Body Joint Constraints: For each arm joint, the joint angle can be con-
verted from ZXY Euler angle to azimuth angle θ and zenith angle ϕ. The con-
straint is represented as a hard value cap of the two angles. During both im-
portance sampling and particle refinement, if the resulting model configuration
violates the joint constraints, the new model configuration is discarded and the
old one is kept.
The Particle Filter not only provides the advantage of estimating a non-linear posterior probability, it also increases the robustness of the tracking system. Our silhouette observation is noisy and unstable given the environment constraints. The Particle Filter maintains multiple hypotheses of the current user state and is thus able to keep tracking the user's pose in the absence of usable image features, and to recover quickly from erroneous states when correct silhouettes are available. In our experiments, the segmentation of the arm may occasionally fail and cause the model to swing wildly, but it always recovers after only 2 to 3 frames. The system also does not need manual initialization. If the system finds that the observation densities of the particles are all near zero, it assumes that the stage is empty and does not output model configurations. The Particle Filter still keeps the particles moving around via the random walk function; when a user enters the stage from any location, the proposal functions quickly converge the particles around the user and tracking starts automatically.
As our system is intended to be used as an on-line human-computer interaction interface, it must be able to operate in real-time. To improve computational efficiency, we use an approximation of the body part shapes and also use the Graphical Processing Unit (GPU) available in a normal desktop computer.

The most computationally intensive part of the system is the inference of the observation probability p(y_t | α_t^i). It requires projecting (rendering) the body shape model into each camera image for each particle and comparing it with the silhouettes. Instead of actually performing the rendering and rasterization process, we directly derive the analytical form of the mapping between the 3D ellipsoid body parts and the projected 2D ellipses, as shown in Eq.6.6. Silhouette pixels are compared against those ellipse equations instead of rendered body images.

Since such computation is highly parallel in nature, we utilize the GPU hardware to compute the ellipsoid mapping and the pixel evaluation. This yields a performance increase of more than an order of magnitude (see the results section).
6.4 Rest State
The tracking system uses the silhouettes as its basic visual information. When the user's arms are wrapped close to the body, the silhouettes are visually nearly indistinguishable, and the 2D-3D ICP may converge the arms arbitrarily around, or even into, the torso; the tracking result is meaningless in such situations. To address this problem, we use a Support Vector Machine (SVM) [9] to classify the pose into "rest" and "non-rest" states for each arm. The 6-dimensional feature vector used by the SVM consists of the 3D positions of the elbow and wrist joints in the upper body coordinate system. One SVM is trained for the right arm; the coordinates of the joints of the left arm are mirrored before applying the SVM classification. An arm classified as being in the rest state is forced to a resting-down state regardless of the tracking result.
6.5 Integration
In each time frame, the particle with the highest weight in the sampling step is selected, and the articulated body configuration represented by this particle is used to estimate the pointing direction. The actual direction the user is pointing in, however, is not the direction from the shoulder to the tip of the arm. Instead, the vector that originates at the estimated 3D coordinate of the eyes and passes through the tip of the arm is used as the pointing direction. Currently the eye coordinate is defined as the center of the 3D head shape; in the future, a face tracking module may be integrated to provide a more accurate eye position estimate. Intersecting this vector with the screen yields the estimated screen coordinate the user is pointing at.
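For a flat projection screen, this last step reduces to a ray-plane intersection, sketched below; the plane parameterization is an assumption, and a curved theater screen would need a different intersection routine.

import numpy as np

def pointing_target(eye, hand_tip, screen_point, screen_normal):
    """Intersect the eye-to-hand ray with the screen plane; returns a 3D point or None."""
    eye = np.asarray(eye, dtype=float)
    direction = np.asarray(hand_tip, dtype=float) - eye
    denom = float(np.dot(screen_normal, direction))
    if abs(denom) < 1e-9:
        return None                                    # ray parallel to the screen
    t = float(np.dot(screen_normal, np.asarray(screen_point) - eye)) / denom
    return eye + t * direction if t > 0 else None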
The visual body tracking system connects to the virtual environment training system via a network. The current pointing direction in 3D space is constantly updated. The VR system then transforms the real-world 3D space coordinates into virtual-world coordinates and recognizes the virtual entity the user is pointing at.
6.6 Experiment Results
We have set up the system in three different environments (see Fig.1.1). One is used in a virtual-reality theater with a curved surrounding screen (Fig.1.2), another is in a set simulating a war-torn building with projection screens as walls and doors, and the last one is the experimental space in our laboratory.

We use four to five synchronized IR firewire cameras connected to a 2.8 GHz dual-core, dual-CPU Xeon workstation equipped with a GeForce 8800GTX graphics card. The system runs at 12 frames per second on average with 50 particles; this speed is acceptable for simple interactions with the environment. Without the use of the GPU, the processing speed is 3-4 seconds per frame; thus, the GPU gives an improvement of 15 to 20 times. More particles provide better estimation robustness and accuracy but slow down the performance. Some frames from a tracking sequence are shown in Fig.6.6. An example tracking error occurred in the second frame, but the method recovered quickly.
The rest-state SVM uses a radial basis function. The SVM parameters are trained using a 5-fold cross-validation method on 2000 manually labeled frames. The classification has 98.95% accuracy.
Figure 6.6: Example tracking sequences, one frame per row. The tracking can automatically recover from an error.
Joint            Mean   Std    Outliers   Total
Root             6.88   2.55   0          2992
Waist            3.43   1.42   1          2991
Neck             3.48   1.36   7          2993
Left Shoulder    5.57   2.68   26         3002
Left Elbow       6.97   2.62   11         2996
Left Wrist       5.05   2.41   41         3002
Right Shoulder   5.42   1.94   13         2992
Right Elbow      7.06   2.60   0          2972
Right Wrist      5.93   2.80   40         2973

Table 6.1: Mean and standard deviation of joint errors, measured in centimeters
To test quantitative accuracy, we collected video sequences of users performing various gestures to interact with the virtual system. We collected and annotated a total of about 3000 frames. To annotate the ground truth of the body configuration, we manually marked the pixel coordinates of each joint in each of the images. The 3D coordinates of the body joints can then be estimated from those 2D markers by triangulation using the camera calibration information, as shown in Fig.6.7. Nine joints are annotated: root (feet location), waist, neck, and the left and right shoulders, elbows and wrists. This approach is time-consuming and hence the amount of data on which we could evaluate is limited, but still meaningful. We compute tracking errors as the Euclidean distance between joints in the tracked body model and the annotated ground truth. Outlier points, identified by an abnormally large error of a joint, are removed by Chauvenet's criterion. The histograms of joint errors are shown in Fig.6.8, and the statistics of the joint errors are shown in Table 6.1.
Figure 6.7: Manually marked joint locations. Red points represent 2D markers; yellow points represent 3D markers.
As we can see from the error histograms, there is a constant bias in the error distribution. Several factors may contribute to this error:

1. Tracking error: This is the most obvious. Our body estimation is only an estimate of the posterior probability.

2. 2D marker error: The error might lie in the ground "truth" itself. First, the exact joint location might be difficult to see in the image, especially when the joint is occluded by other body parts. Second, it is difficult for a human to pick points at the pixel resolution level. Thus the manually marked 2D joint coordinates are only rough visual estimates and may not represent the exact correct coordinates.

Figure 6.8: Body joint errors, without pixel offset. (a) Root Joint; (b) Waist Joint; (c) Neck Joint; (d) Left Shoulder; (e) Right Shoulder; (f) Left Elbow; (g) Right Elbow; (h) Left Wrist; (i) Right Wrist.
3. 3D marker error: Since the 3D markers are derived from their 2D counterparts, the errors in the annotated 2D markers will cascade into the 3D markers. Even if the coordinates of the 2D markers are correct, there might be numerical errors in the camera calibration data that introduce inaccuracies in the triangulation. Finally, the body model used in tracking may not correspond exactly to the actual body proportions, which could introduce a constant bias in the error terms.
To estimate the effect of 2D marker error on the 3D marker error, we offset
the pixel coordinates of the 2D markers. A 2D marker is then represented not by just
one point, but by a cluster of points consisting of the manually picked point and
its neighboring pixels. The 3D marker is then also represented by a cluster of
3D points, estimated from every pair of cameras and every pair of 2D
marker points. We then compute the diameter of the enclosing sphere of this 3D
marker cluster, as shown in Table 6.2. As we can see, a large portion of the joint
errors may be attributed to the error of marker annotation. The errors of the waist
and neck are also high because it is difficult to manually pick exactly matching
points on different images for those body parts. A sketch of this cluster-diameter computation is given below.
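The function names, the use of NumPy, the simple linear two-view (DLT-style) triangulation, and the approximation of the enclosing-sphere diameter by the largest pairwise distance in the cloud are our own assumptions for illustration, not the exact implementation used in the system.

import itertools
import numpy as np

def triangulate_pair(P1, P2, x1, x2):
    """Linear (DLT-style) triangulation of one 3D point from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def marker_cloud_diameter(projections, clusters):
    """projections: list of 3x4 matrices, one per camera.
    clusters: list of 2D point arrays (the picked pixel plus its neighbors), one per camera.
    Returns the diameter of the triangulated 3D point cloud."""
    points = []
    for (i, Pi), (j, Pj) in itertools.combinations(enumerate(projections), 2):
        for xi in clusters[i]:
            for xj in clusters[j]:
                points.append(triangulate_pair(Pi, Pj, xi, xj))
    points = np.asarray(points)
    # Enclosing-sphere diameter approximated by the largest pairwise distance.
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).max()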
An important application of our body pose estimation system is the estimation
of pointing direction. The trainees in the virtual environment use pointing gestures to
identify the virtual entities with which they intend to interact. We derive the direction
of pointing as the line extending from the center of the head (roughly the position of
the eyes) to the end of the hand. The errors of pointing direction, estimated as angles
between the pointing vectors, are shown in Fig. 6.9 and Table 6.3.
Joint Cloud diameter
Waist 13.7297
Neck 10.5212
Left Shoulder 4.330
Left Elbow 4.61136
Left Wrist 4.33646
Right Shoulder 4.25
Right Elbow 5.32013
Right Wrist 4.13857
Table 6.2: Average diameter of 3D marker clouds
Figure 6.9: Pointing angle error, in degrees: (a) left hand, (b) right hand
Hand Mean Std Outliers Total
Right hand 3.20 2.32 75 2974
Left hand 2.58 2.03 56 3001
Table 6.3: Mean and standard deviation of pointing errors, with 2D marker offset,
measured in degrees
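A minimal sketch of this pointing-error computation, under the assumption that the head center and hand end positions are available from both the tracked model and the annotated ground truth, could look as follows; the function names are ours.

import numpy as np

def pointing_vector(head_center, hand_end):
    """Pointing direction as the unit vector from the head center (roughly the eyes)
    to the end of the hand."""
    v = np.asarray(hand_end, dtype=float) - np.asarray(head_center, dtype=float)
    return v / np.linalg.norm(v)

def pointing_angle_error_deg(head_est, hand_est, head_gt, hand_gt):
    """Angle, in degrees, between the estimated and ground-truth pointing vectors."""
    u = pointing_vector(head_est, hand_est)
    w = pointing_vector(head_gt, hand_gt)
    cos_angle = np.clip(np.dot(u, w), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))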
The errors do not include body poses in the rest state: if an arm is labeled as
being in the rest state in the target data, the corresponding joints and segments are
not included when computing the average error. We think that these error numbers
are acceptable in an interactive environment where feedback is available to the
user from the reactions of the agents in the environment; however, we have not
performed the integrated experiments.
The occurrence of large tracking errors (outliers) is not a random process.
Instead, it most often occurs at certain body poses that result in self-occlusion,
as shown in Fig. 6.10. In the first example, the right lower arm is aligned with the
upper arm in the side-view camera, and the silhouette of the arm is consumed by
the torso in the frontal-view cameras. In the second example, the right arm occludes the
left arm. Any movement of one arm may change the body model configuration of
the other arm due to the 2D-3D ICP process. Increasing the number of cameras
and the variety of viewing angles can reduce such errors, but will also increase
the computation requirements.
Jitter, representing rapid fluctuations of the estimated pose, can occur because
our proposal functions do not take motion smoothness into account when
spawning new particles. We could gain some smoothness by backward filtering,
but this would introduce additional latency into the system; we have not explored
this trade-off from the user's perspective.
Figure 6.10: Poses that are likely to result in tracking errors
Chapter 7
Natural Gesture Analysis
We expanded the posture/gesture recognition and pose tracking systems to allow
for more complex human-computer interaction in the virtual training environment.
In the earlier parts, we assumed that the users intentionally communicate
with the computer and convey information to the system via body poses and gestures.
In a virtual training system, it is often desirable to also evaluate the physical and
mental conditions of the trainees by observing their unintentional body stances
and/or movements. Such unintended body motions are usually subtle and
less drastic, and are usually more difficult to detect and recognize.
However, body language (sometimes called natural body language)
cannot be analyzed without context. The context of the communication can only
be provided by the content provider of the virtual training system.
The short-term goal of the project is to provide an encoding of body motion
that the VR interface system can use to further deduce human gestures. The
encoding should:
• Segment salient motion from the tracking sequences.
• Classify the motion from a set of basic motion primitives.
• Provide the quantitative parameters of the classified motion primitives.
We’ve collected sequences of video of people performing natural gestures and
annotated the video frames. The algorithm we chose to classify the motion labels
are Conditional Random Field (CRF) and Latent-Dynamic Conditional Random
Field (LDCRF) [44].
7.1 Motion Encoding
In our preliminary motion encoding system, we defined 9 different labels to de-
scribe the motion of human arms. Those labels include:
• REST: This indicates that the arm is in a lowered position.
• PAUSE: The arm is temporarily motionless in mid-air.
• LEFT: The hand moves right-to-left in front of the subject.
• RIGHT: The hand moves left-to-right in front of the subject.
• UP: The hand rises up in front of the subject.
• DOWN: The hand drops down in front of the subject.
• FORWARD: The hand moves forward, away from the subject's body.
• BACKWARD: The hand moves toward the subject's body.
• OTHER: All other motions that do not fall into the above categories.
There are several methods capable of classifying motion encodings
from the body tracking results. We chose to use the Conditional Random Field and
Latent-Dynamic Conditional Random Field methods for this task. CRF and
LDCRF model more complex sequential data interactions than the HMM does, and
theoretically provide better accuracy in segmenting and classifying sequential
data [44].
7.1.1 Conditional Random Field
The conditional random field (CRF) [35] is a graphical model used to represent
random variables and their dependencies. Although the graph can have an arbi-
trary layout, the most common form is a sequential chain, as shown in Fig. 7.1.
Figure 7.1: Sequential Conditional Random Field
In this graphical form, the vertices y_0, ..., y_N represent random label variables at each time
step n, and the vertex X = {x_0, x_1, ..., x_N} is the entire sequence of observations.
Given a particular observation sequence X and label sequence Y = {y_0, ..., y_N},
the probability of the graph can be written in the form:
p(X,Y) = \frac{1}{Z} \prod_{n} \exp\Big( \vec{\lambda} \cdot \vec{t}(y_{n-1}, y_{n}, X, n) + \vec{\mu} \cdot \vec{s}(y_{n}, X, n) \Big)    (7.1)
where Z is the normalization term, \vec{t}(y_{n-1}, y_{n}, X, n) is the transition feature
function, and \vec{s}(y_{n}, X, n) is the state feature function; each returns a feature
vector. The parameters \vec{\lambda} and \vec{\mu} are weight vectors. These parameters can be
estimated from a set of training data by maximizing the joint probability p(X,Y).
Given the estimated (trained) parameter vectors \vec{\lambda} and \vec{\mu} and an observation sequence
X, we can also infer the maximum-likelihood label sequence Y = {y_0, ..., y_N} using
the Viterbi algorithm [55].
The CRF has several advantages over Hidden Markov Models. HMMs often
assume independence of observations, i.e. that the observation at a particular
time instance depends only on the system state at that time. By relaxing this
restriction, the CRF can model more complex interactions between features and long-
range dependencies.
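To make the inference step concrete, the following is a minimal sketch of Viterbi decoding for such a linear-chain model. It assumes the feature scores have already been evaluated into per-frame state scores and a (here stationary) transition score matrix; the normalization term Z cancels in the argmax, so it can be ignored. The variable names are our own.

import numpy as np

def viterbi_decode(state_scores, trans_scores):
    """Maximum-likelihood label sequence for a linear-chain model.
    state_scores: (N, L) array, state_scores[n, y]  ~ mu . s(y, X, n)
    trans_scores: (L, L) array, trans_scores[y', y] ~ lambda . t(y', y, X, n)
    Scores are log-potentials. Returns the best label sequence of length N."""
    N, L = state_scores.shape
    delta = np.zeros((N, L))              # best log-score ending in each label
    backptr = np.zeros((N, L), dtype=int)
    delta[0] = state_scores[0]
    for n in range(1, N):
        cand = delta[n - 1][:, None] + trans_scores + state_scores[n][None, :]
        backptr[n] = cand.argmax(axis=0)
        delta[n] = cand.max(axis=0)
    # Backtrack from the best final label.
    path = [int(delta[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(backptr[n, path[-1]]))
    return path[::-1]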
7.1.2 Latent Dynamic Conditional Random Fields
The Latent-Dynamic Conditional Random Field (LDCRF) was proposed in [44]. In addition
to the graphical structure of the CRF, it introduces a layer of hidden variables
h = {h_0, ..., h_N}, as shown in Fig. 7.2. Each hidden variable h_n associated with the
label variable y_n takes its value from a discrete set H_{y_n}, i.e. h_n ∈ H_{y_n}.
Figure 7.2: Latent-Dynamic Conditional Random Field
The posterior probability of the sequence is defined as
p(Y \mid X, \theta) = \sum_{H} p(Y \mid H, X)\, p(H \mid X, \theta) = \sum_{H:\,\forall h_{n} \in \mathcal{H}_{y_{n}}} p(H \mid X, \theta)    (7.2)
and the probability p(H | X, θ) over hidden-state sequences is defined as

p(H \mid X, \theta) = \frac{1}{Z} \prod_{n} \exp\Big( \vec{\lambda} \cdot \vec{t}(h_{n-1}, h_{n}, X, n) + \vec{\mu} \cdot \vec{s}(h_{n}, X, n) \Big)    (7.3)
The parameters θ = (\vec{\lambda}, \vec{\mu}) can be estimated from training data, as in the CRF.
Inference in this model consists of finding the label sequence Y, given an observation sequence X
and trained parameters θ*, that maximizes the probability:
Y^{*} = \arg\max_{Y}\; p(Y \mid X, \theta^{*}) = \arg\max_{Y} \sum_{H:\,\forall h_{n} \in \mathcal{H}_{y_{n}}} p(H \mid X, \theta^{*})    (7.4)
This can be computed by either belief propagation or the Viterbi algorithm [44].
Compared to the regular CRF model, the LDCRF can additionally capture
internal sub-structure by incorporating hidden states.
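In practice, a common simplification used in [44] is to label each frame by summing the per-frame marginal probabilities of the hidden states belonging to each label and picking the label with the largest sum. The sketch below illustrates this with a standard forward-backward pass over positive potentials; the function names and the rescaling scheme are our own assumptions.

import numpy as np

def ldcrf_frame_marginals(node_scores, trans_scores):
    """Forward-backward over hidden states.
    node_scores: (T, S) positive potentials per frame and hidden state.
    trans_scores: (S, S) positive transition potentials between hidden states.
    Returns (T, S) per-frame posterior marginals p(h_t = s | X)."""
    T, S = node_scores.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = node_scores[0]
    alpha[0] /= alpha[0].sum()                 # rescale to avoid underflow
    for t in range(1, T):
        alpha[t] = node_scores[t] * (alpha[t - 1] @ trans_scores)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans_scores @ (node_scores[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def ldcrf_decode(node_scores, trans_scores, label_of_state):
    """Assign a label to each frame by summing hidden-state marginals per label set."""
    gamma = ldcrf_frame_marginals(node_scores, trans_scores)
    labels = sorted(set(label_of_state))
    label_prob = np.stack(
        [gamma[:, [s for s, l in enumerate(label_of_state) if l == lab]].sum(axis=1)
         for lab in labels], axis=1)
    return [labels[i] for i in label_prob.argmax(axis=1)]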
7.1.3 Pre-computed Feature Vector
The most important task in using the CRF and LDCRF is to define feature functions
\vec{t} and \vec{s} that sufficiently capture the motion properties we wish to encode. The
features we used include (a sketch of assembling such a feature vector is given after this list):
1. Positions: The positions of the elbow and hand joints in the shoulder coordinate
system. These positions are represented in both Euclidean and cylindrical coordinate systems.
2. Velocity: The velocity of the hand and elbow joints, again in Euclidean and
cylindrical representations. The velocity features include a unit-length direction
vector and a scalar velocity value.
3. Rest State Classification: The margin value returned by the trained support
vector machine depicted in Section 6.4.
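The following is a minimal sketch of how such a per-arm feature vector might be assembled. The exact assignment of coordinate systems to joints, the treatment of the shoulder frame as a pure translation (ignoring torso rotation), and the function names are our assumptions for illustration.

import numpy as np

def to_cylindrical(p):
    """Cylindrical (radius, azimuth, height) representation of a 3D point."""
    x, y, z = p
    return np.array([np.hypot(x, y), np.arctan2(y, x), z])

def frame_features(shoulder, elbow, hand, prev_elbow, prev_hand, rest_margin, dt=1.0 / 30):
    """Per-frame feature vector for one arm: joint positions in the shoulder
    coordinate system (Euclidean and cylindrical), joint velocities (unit
    direction plus scalar speed), and the rest-state SVM margin."""
    feats = []
    for joint, prev in ((elbow, prev_elbow), (hand, prev_hand)):
        rel = np.asarray(joint) - np.asarray(shoulder)    # shoulder-centered position
        feats.extend(rel)                                  # Euclidean coordinates
        feats.extend(to_cylindrical(rel))                  # cylindrical coordinates
        vel = (np.asarray(joint) - np.asarray(prev)) / dt
        speed = np.linalg.norm(vel)
        direction = vel / speed if speed > 1e-6 else np.zeros(3)
        feats.extend(direction)                            # unit-length direction
        feats.append(speed)                                # scalar velocity
    feats.append(rest_margin)                              # margin from the rest-state SVM
    return np.asarray(feats)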
7.2 Experiment Results
To test the motion encoding, we edited short cartoon clips and showed them to
test subjects. We asked them to re-tell the story depicted in the cartoon using
speech and hand gestures. Those gesture motions are recorded and the body
configuration data resulting from our body pose estimation and tracking system
is used as input for the motion encoding system. The data includes several short
sequences of motions. The total length is about 8000 frames at a 30 frames-per-second
recording rate.
We randomly select half of the motion sequences to be used as the training data set;
the rest are used as testing data. We used the CRF package provided by [44].¹
The window size of the CRF model is set to 10, i.e., the feature vector of each frame
includes the features of the current frame and the features of the 10 neighboring frames
before and 10 frames after. Another parameter is the regularization term, which
penalizes the total likelihood by the Euclidean norm of the parameter vector; this
prevents the model from overfitting to the training data. The regularization term
is set to 0.5 empirically. The confusion matrix of the testing data is shown in
Table 7.1. Each row is the target label of each frame and each column is the
recognition result.

¹ Dr. Louis-Philippe Morency provided advice on the use of CRF, LDCRF, and the methodology of collecting natural human gesture data.
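As an illustration of this windowing, the sketch below stacks each frame's pre-computed features with those of its 10 preceding and 10 following frames; the edge-padding strategy at sequence boundaries and the function name are our own assumptions, not details taken from the CRF package.

import numpy as np

def windowed_features(frame_feats, window=10):
    """Stack the features of each frame with those of the `window` frames before
    and after it (sequence edges are padded by repeating the first/last frame)."""
    frame_feats = np.asarray(frame_feats)                     # (N, D)
    padded = np.pad(frame_feats, ((window, window), (0, 0)), mode="edge")
    return np.stack(
        [padded[n:n + 2 * window + 1].ravel() for n in range(frame_feats.shape[0])]
    )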
For comparison, we also experimented with LDCRF, again using the package
provided by [44]. The window size and regularization terms are the same as in
the CRF model. The number of hidden states per label is set to 2. The results are shown
in Table 7.2. As we can see, there is a significant accuracy improvement when using
LDCRF over CRF.
However, the accuracy shown above is measured by the labeling of each frame,
which does not necessarily reflect the actual motion encoding accuracy. In an actual
application, it is the labeling of the sequences that matters, not the individual
frames. In our experiments, each motion sequence (LEFT, UP, etc.) is relatively
short, about 10-15 frames. The recognition output often results in shifting
of the label sequence boundaries, and those boundary biases contribute to the error
above.
REST PAUSE UP DOWN LEFT RIGHT FORWARD BACKWARD OTHER
REST 0.962 0.0040 0.016 0.004 0.002 0.006 0.006
PAUSE 0.858 0.013 0.013 0.013 0.004 0.007 0.009 0.083
UP 0.005 0.055 0.745 0.044 0.012 0.012 0.002 0.126
DOWN 0.017 0.142 0.708 0.020 0.002 0.020 0.092
LEFT 0.168 0.003 0.633 0.029 0.168
RIGHT 0.108 0.006 0.003 0.738 0.026 0.120
FORWARD 0.047 0.018 0.006 0.018 0.778 0.13
BACKWARD 0.167 0.0395 0.500 0.295
OTHER 0.003 0.065 0.038 0.013 0.037 0.024 0.042 0.011 0.767
Table 7.1: Confusion matrix of CRF classification
REST PAUSE UP DOWN LEFT RIGHT FORWARD BACKWARD OTHER
REST 0.9940 0.0020 0.0020 0.0020
PAUSE 0.0003 0.8678 0.0183 0.0085 0.0272 0.0164 0.0041 0.0152 0.0421
UP 0.0061 0.1216 0.7766 0.0410 0.0015 0.0046 0.0486
DOWN 0.0515 0.8566 0.0037 0.0037 0.0092 0.0754
LEFT 0.1098 0.0029 0.8613 0.0029 0.0231
RIGHT 0.0456 0.0028 0.8689 0.0085 0.0741
FORWARD 0.1871 0.0117 0.0292 0.6667 0.1053
BACKWARD 0.0022 0.9978
OTHER 0.0018 0.0643 0.0563 0.0055 0.0365 0.0121 0.0146 0.0055 0.8034
Table 7.2: Confusion matrix of LDCRF classification
Chapter 8
Conclusion
In this thesis, we presented research on posture and gesture recognition, real-
time human body pose estimation and natural gesture analysis using multiple
cameras. We set up a calibrated camera array to capture the human body motion
from multiple angles. 2D image features such as silhouettes and body axis points
are extracted from each camera image. 3D features such as the visual hull, the 3D body
axis and the shape descriptor are then reconstructed using calibration information.
Our posture and gesture recognition system incorporates a Matching Pursuit
method to decompose shape descriptors generated from the visual hull. By decomposing
a shape into a combination of postures from the atom dictionary, the method
can greatly reduce the dimensionality of the posture data. This decomposition
also allows us to use recognition methods such as the Factorial Hidden Markov Model
(FHMM) for the gesture recognition task, which offers performance improvements
over the traditional HMM approach.
Our body pose estimation method combines a Particle Filter and the proposed 2D-
3D Iterative Closest Points method to fit the articulated body model to the 2D
and 3D image features. This system is fast and can run in real time. It is robust,
can automatically start tracking a person without manual initialization, and can
rapidly recover from tracking errors. This body pose estimation system is
integrated into a virtual training system and serves as a vision-based, gesture-based
human-computer interaction interface.
Our preliminary research in natural gesture analysis resulted in a motion encoding
system. The system takes the body pose estimation result and tries to
segment it into pre-defined motion label sequences. These motion encodings can
be further fed into a virtual training system, which can provide more in-depth
analysis using contextual information.
8.1 Future Work
In our virtual training HCI project, the body pose and gesture system is only one
module of the HCI system. Other modalities such as facial gestures, facial expressions,
hand gestures and audio inputs are all part of the multi-modal HCI system.
The integration may not only be vertical, where each module works independently
and sends its result to the parent system, but also horizontal, where each module
uses information returned by other modules to improve its performance.
For example, the facial gesture recognition system has to use a very high-
definition camera so that the image can cover the whole working area while providing
enough image resolution of the face region for detection. Our body tracking
system can estimate the location of the user's face in 3D space and in the 2D image space
of a calibrated camera. This information can be utilized by the facial gesture system
to avoid scanning the whole high-resolution image to search for faces. On the
other hand, our face detection system can detect the direction of the user's face. Such
information can be used by the pose estimation system to more accurately estimate
the pointing directions.
References
[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, February 2002.
[2] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger. Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(3):281–297, 2000.
[3] B. Scholkopf, A. J. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[4] O.BernierandP.CheungMonChan. Real-time3darticulatedposetracking
using particle filtering and belief propagation on factor graphs. In BMVC06,
page I:7, 2006.
[5] Berthold K. P. Horn, Hugh M. Hilden, and Shahriar Negahdaripour. Closed-form solution of absolute orientation using orthonormal matrices. Journal of the Optical Society of America, 5:1127–1135, July 1988.
[6] Paul J. Besl and Neil D. McKay. A method for registration of 3-d shapes.
IEEE Trans. Pattern Anal. Mach. Intell., 14(2):239–256, 1992.
[7] M. Bray, P. Kohli, and P.H.S. Torr. Posecut: Simultaneous segmentation
and 3d pose estimation of humans using dynamic graph-cuts. In ECCV06,
pages II: 642–655, 2006.
[8] Tat-Jen Cham and J.M. Rehg. A multiple hypothesis approach to figure
tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Com-
puter Society Conference on, volume 2, pages –244, 1999.
[9] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] Z. Chen and H.-J. Lee. Knowledge-guided visual perception of 3-D human gait from a single image sequence. IEEE Transactions on Systems, Man and Cybernetics, 22:336–342, Mar/Apr 1992.
[11] S. S. Chen and D. L. Donoho. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[12] C. W. Chu, O. C. Jenkins, and M. J. Mataric. Markerless kinematic model and motion capture from volume sequences. In CVPR03, pages II: 475–482, 2003.
[13] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best
basis selection. IEEE Trans. on Information Theory, 38(2):713–718, 1992.
[14] I Daubechies. Time-frequency localization operators: a geometric phase
space approach. In Information Theory, IEEE Transactions on, volume 34,
pages 605–612, July 1988.
[15] David Demirdjian and Trevor Darrell. 3-D articulated pose tracking for untethered deictic reference. In ICMI '02: Proceedings, page 267, Washington, DC, USA, 2002. IEEE Computer Society.
[16] J. Deutscher, A. Blake, and I.D. Reid. Articulated body motion capture by
annealed particle filtering. In CVPR00, pages II: 126–133, 2000.
[17] A. Doucet, S. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering, volume 10 of Statistics and Computing. Springer, 2000.
[18] Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo Methods in Practice. Springer.
[19] Ahmed Elgammal and Chan-Su Lee. Inferring 3D body pose from silhouettes using activity manifold learning, 2004.
[20] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Mach. Learn., 29(2-3):245–273, 1997.
[21] Claus Gramkow. On averaging rotations. Int. J. Comput. Vision, 42(1-2):7–
16, 2001.
[22] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502–516, 1989.
[23] T. Heap and D. Hogg. Wormholes in shape space: tracking through dis-
continuous changes in shape. In Computer Vision, 1998. Sixth International
Conference on, pages 344–349, January 1998.
[24] P. Horain and M. Bomb. 3d model based gesture acquisition using a single
camera. In WACV’02. Proceedings. Sixth IEEE Workshop on, pages 158–
162, 2002.
[25] Berthold K. P. Horn. Closed-form solution of absolute orientation using unit
quaternions. Journal of the Optical Society of America, 4:629–642, April
1987.
[26] C. Hu, Q. Yu, Y. Li, and S.D. Ma. Extraction of parametric human model
for posture recognition using genetic algorithm. In AFGR00, pages 518–523,
2000.
[27] M. Isard and A. Blake. CONDENSATION – conditional density propagation for visual tracking. IJCV, 29(1):5–28, August 1998.
[28] T. Izo and W.E.L. Grimson. Simultaneous pose estimation and camera cali-
bration from multiple views. In Non-Rigid04, page 14, 2004.
[29] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[30] J. D. Cohen, M. C. Lin, D. Manocha, and M. K. Ponamgi. I-COLLIDE: An interactive and exact collision detection system for large-scale environments. In Symposium on Interactive 3D Graphics, pages 189–196, 1995.
[31] J. W. Deng and H. T. Tsui. An HMM-based approach for gesture segmentation and recognition. In ICPR '00: Proceedings of the International Conference on Pattern Recognition, volume 3, page 3683, Washington, DC, USA, 2000. IEEE Computer Society.
[32] R. E Kalman. A new approach to linear filtering and prediction problems.
Transactions of the ASME - Journal of Basic Engineering, 82:35–45, 1960.
[33] R. Kehl, M. Bray, and L. Van Gool. Full body tracking from multiple views
using stochastic sampling. In CVPR05. IEEE Computer Society Conference
on, volume 2, pages 129–136, 2005.
[34] Kiam Choo and D. J. Fleet. People tracking using hybrid Monte Carlo filtering. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 321–328, 2001.
[35] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data.
InProc. 18th International Conf. on Machine Learning,pages282–289.Mor-
gan Kaufmann, San Francisco, CA, 2001.
[36] A. Laurentini. The visual hull concept for silhouette-based image under-
standing. PAMI, 16(2):150–162, February 1994.
[37] M.W. Lee and I. Cohen. A model-based approach for estimating human 3d
poses in static images. PAMI, 28(6):905–916, June 2006.
[38] M.W. Lee and R. Nevatia. Dynamic human pose estimation using markov
chain monte carlo approach. In Motion05, pages II: 168–175, 2005.
[39] Frederic F. Leymarie and Benjamin B. Kimia. To be published in Medial Representations: Mathematics, Algorithms and Applications, chapter 11. Kluwer, 2006.
[40] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries.
IEEE Trans. on Signal Processing, 41(12):3397–3415, 1993.
[41] Matthew Brand, Nuria Oliver, and Alex Pentland. Coupled hidden Markov models for complex action recognition, 1996.
[42] Wojciech Matusik, Chris Buehler, and Leonard McMillan. Polyhedral visual
hulls for real-time rendering. In Proceedings of the 12th Eurographics Work-
shop on Rendering Techniques, pages 115–126, London, UK, 2001. Springer-
Verlag.
[43] I. Miki’c, M.M. Trivedi, E. Hunter, and P.C. Cosman. Articulated body
posture estimation from multi-camera voxel data. In CVPR01, pages I:455–
460, 2001.
[44] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative
models for continuous gesture recognition. Computer Vision and Pattern
Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–8, June 2007.
[45] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J.
Comput., 24(2):227–234, 1995.
[46] Chan Wah Ng. Gesture recognition via pose classification. In ICPR ’00:
Proceedings of the International Conference on Pattern Recognition, page
3703, Washington, DC, USA, 2000. IEEE Computer Society.
[47] V. Parameswaran and R. Chellappa. View independent human body pose
estimation from a single perspective image. In CVPR04, pages II: 16–22,
2004.
[48] Lawrence R. Rabiner. A tutorial on hidden markov models and selected
applications in speech recognition. pages 267–296, 1990.
[49] M.M. Rahman and S Ishikawa. Appearance-based representation and recog-
nition of human motions. In Robotics and Automation, 2003. Proceedings.
ICRA ’03. IEEE International Conference on, volume 1, pages 1410–1415,
2003.
[50] D. Ramanan and D. Forsyth. Automatic annotation of everyday movements. In Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2003.
[51] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[52] S. G. Penny, J. Smith, and A. Bernhardt. Traces: Wireless full body tracking in the CAVE. In Ninth International Conference on Artificial Reality and Telexistence (ICAT99), 1999.
[53] J. Schmidt, J. Fritsch, and B. Kwolek. Kernel particle filter for real-time 3d
body tracking in monocular color images. In FGR06, pages 567–572, 2006.
[54] D. Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the ACM National Conference, pages 517–524, 1968.
[55] Charles Sutton and Andrew Mccallum. An introduction to conditional ran-
dom fields for relational learning. In Lise Getoor and Ben Taskar, editors,
Introduction to Statistical Relational Learning. MIT Press, 2006.
[56] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, April 1967.
[57] Christian Vogler and Dimitris Metaxas. A framework for recognizing the si-
multaneousaspectsofamericansignlanguage. Comput.Vis.ImageUnderst.,
81(3):358–384, 2001.
[58] P. Wang and J.M. Rehg. A modular approach to the analysis and evaluation
of particle filters for figure tracking. In CVPR06, pages 790–797, 2006.
[59] Ping Wang and J.M. Rehg. A modular approach to the analysis and eval-
uation of particle filters for figure tracking. Computer Vision and Pattern
Recognition, 2006 IEEE Computer Society Conference on, 1:790–797, 17-22
June 2006.
[60] N. Werghi and Yijun Xiao. Posture recognition and segmentation from 3d
human body scans. In 3D Data Processing Visualization and Transmission,
2002. Proceedings. First International Symposium on, pages 636–639, 2002.
[61] Andrew D. Wilson and Aaron F. Bobick. Parametric hidden markov models
forgesturerecognition. IEEETransactionsonPatternAnalysisandMachine
Intelligence, 21(9):884–900, 1999.
[62] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-
sequential images using hidden markov model. In CVPR92, pages 379–385,
June 1992.
[63] H. Yoshimoto, N. Date, and S. Yonemoto. Vision-based real-time motion
capture system using multiple cameras. In MFI, pages 247–251, 2003.
[64] M Zerroug and R Nevatia. Segmentation and 3-d recovery of curved-axis
generalized cylinders from an intensity image. In Computer Vision and Im-
age Processing., Proceedings of the 12th IAPR International Conference on,
volume 1, pages 678–681, Oct 1994.
Abstract
In this thesis we present an approach for a visual communication application in a dark, theater-like interactive virtual simulation training environment. Our system visually estimates and tracks the body position, orientation and the limb configuration of the user. This system uses a near-IR camera array to capture images of the trainee from different angles in the dimly lit theater. Image features such as silhouettes and intermediate silhouette body axis points are then segmented and extracted from the image backgrounds. 3D body shape information such as 3D body skeleton points and visual hulls can be reconstructed from these 2D features in multiple calibrated images.