UBIQUITOUS COMPUTING FOR HUMAN ACTIVITY ANALYSIS WITH
APPLICATIONS IN PERSONALIZED HEALTHCARE
by
Mi Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2013
Copyright 2013 Mi Zhang
Dedication
To my parents, my wife, and my son.
Acknowledgments
Above all, I would like to express my sincere gratitude to my adviser Professor Alexan-
der A. Sawchuk. This thesis would not have been possible without his encouragement,
guidance, and support. In addition, the good advice and friendship of Professor Sawchuk
have been invaluable on both an academic and a personal level, for which I am extremely
grateful.
I would like to thank my committee members Professor Bhaskar Krishnamachari and
Professor Yan Liu for their useful comments and suggestions. I would also like to thank
my collaborators Professor Albert “Skip” Rizzo, Chien-Yen Chang, Belinda Lange and
Sheryl Flynn Ashford at USC Institute for Creative Technologies, and Professor Shih-
Ching Yeh at National Central University, Taiwan who have contributed directly and
indirectly to this thesis.
I would also like to thank my wife Jingbo Meng for her constant support and great
patience at all times. My parents have also given me their unequivocal support throughout
my Ph.D. studies.
Finally, I would like to express my gratitude to my friends Yi Gai, Ying Chen, Pankaj
Mishra, Zheng Yang, Mingyue Ji, Teng Wu, Abe Kazemzadeh, Safar Hatami and Fate-
meh Kashfi for making my graduate stay at USC enjoyable.
Table of Contents

Dedication
Acknowledgments
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: USC-HAD: A Daily Activity Dataset for Wearable Sensor-based Human Activity Analysis
  2.1 Introduction
  2.2 Existing Datasets
    2.2.1 MIT PlaceLab Dataset
    2.2.2 UC Berkeley WARD Dataset
    2.2.3 CMU Multi-Modal Activity Database (CMU-MMAC)
    2.2.4 OPPORTUNITY Dataset
    2.2.5 Design Goals
  2.3 Sensors and Hardware Platform
    2.3.1 The Choice of Sensors
    2.3.2 MotionNode
  2.4 USC Human Activity Dataset (USC-HAD)
    2.4.1 Human Subjects
    2.4.2 Activities
    2.4.3 Data Collection Procedure
    2.4.4 Ground Truth Annotation
    2.4.5 Dataset Organization
    2.4.6 Dataset Visualization
  2.5 Discussion
Chapter 3: Feature Design and Analysis for Human Activity Recognition
  3.1 Introduction
  3.2 Feature Design
    3.2.1 Activity Model
    3.2.2 Statistical Features
    3.2.3 Physical Features
    3.2.4 Feature Normalization
  3.3 Feature Selection
    3.3.1 Feature Selection Methods
    3.3.2 Classifier
  3.4 Single-Layer Feature Selection and Classification
    3.4.1 Evaluation on Feature Selection Methods
    3.4.2 Feature Profiling on the Selected Features
  3.5 Hierarchical Feature Selection and Classification
Chapter 4: Sparse Representation for Activity Signals
  4.1 Introduction
  4.2 Sparse Representation-Based Framework
    4.2.1 Feature Extraction
    4.2.2 Feature Selection vs. Random Projection
    4.2.3 Overcomplete Dictionary Construction and Sparse Representation
    4.2.4 Sparse Recovery via ℓ1 Minimization
    4.2.5 Classification via Sparse Representation
    4.2.6 Classification Confidence Measure
  4.3 Experiments and Results
    4.3.1 Effect of the Feature Dimension and Comparison to Baseline Algorithms
    4.3.2 Effect of the Choice of Features and Random Projection
    4.3.3 SCI as a Measure of Confidence
Chapter 5: Learning Motion Primitive
  5.1 Introduction
  5.2 The Bag-Of-Features Framework
    5.2.1 Size of Window Cells
    5.2.2 Features
    5.2.3 Primitive Construction
    5.2.4 Vocabulary Size
    5.2.5 Primitive Weighting
    5.2.6 Classifier and Kernels
  5.3 Evaluation
    5.3.1 Impact of Window Cell Sizes
    5.3.2 Impact of Vocabulary Sizes
    5.3.3 Comparison of Features
    5.3.4 Comparison of Primitive Construction Algorithms
    5.3.5 Comparison of Weighting Schemes
    5.3.6 Comparison of Kernel Functions
    5.3.7 Confusion Table
    5.3.8 Comparison with String-Matching
  5.4 Extension Based on Sparse Representation
    5.4.1 Dictionary Learning
    5.4.2 Sparse Coding for Activity Modeling
    5.4.3 Classifier
    5.4.4 Experimental Results and Discussions
Chapter 6: Discovering Low Dimensional Activity Manifolds
  6.1 Introduction
  6.2 Manifold-Based Framework
    6.2.1 Feature Extraction
    6.2.2 Learning Activity Manifolds
    6.2.3 Learning Input-to-Manifold Mapping
    6.2.4 Recognizing Activity Manifolds
  6.3 Evaluation Results
    6.3.1 Estimating the Intrinsic Dimensionality
    6.3.2 Impact of the Number of Nearest Neighbors
    6.3.3 Confusion Table
Chapter 7: RehabSPOT: A Customizable Networked Body Area Sensing System for Computerized Rehabilitation
  7.1 Introduction
  7.2 Existing Networked Body Area Sensing Systems
  7.3 The Design of RehabSPOT
    7.3.1 Sensing Hardware
    7.3.2 Software Architecture
  7.4 System Evaluation
Chapter 8: Fine-Grained Motor Function Assessment for Computerized Rehabilitation
  8.1 Introduction
  8.2 Motion Trajectory-Based Method
    8.2.1 Feature Extraction
    8.2.2 Trajectory Comparison
    8.2.3 Similarity Score
  8.3 Evaluation
    8.3.1 Experimental Setup
    8.3.2 Evaluation Results
Chapter 9: Conclusion
  9.1 Future Work
Bibliography
List of Figures

2.1 An example of activity data from the x-axis of the 3-axis accelerometer
2.2 MotionNode sensing platform
2.3 MotionNode, the mobile phone pouch, and the miniature laptop
2.4 During data collection, a single MotionNode is packed firmly into a mobile phone pouch and attached to the subject's front right hip
2.5 The plot of the raw sensor data, the histogram, and the spectral analysis of each axis of the 3-axis accelerometer for activity Walking Forward
2.6 The plot of the raw sensor data, the histogram, and the spectral analysis of each axis of the 3-axis accelerometer for activity Running Forward
3.1 Data distributions of various activity classes
3.2 Correlations between different features
3.3 Testing classification error rates as a function of the number of features selected by different feature selection methods
3.4 Sanity check on feature selection methods
3.5 Testing classification error rates of feature selection methods without physical features
3.6 The structure and the performance of the hierarchical feature selection and classification framework
4.1 The block diagram of the sparse representation-based human activity recognition framework
4.2 The sparse representation solutions via ℓ1 minimization and the corresponding residuals for two test samples from walk forward and running respectively
4.3 Impact of Feature Dimension
4.4 Impact of Feature Choices
4.5 Impact of SCI Threshold Value on Classification Performance
5.1 An example of activity representation (walking forward (top) and running (bottom)) using five motion primitives (labeled A, B, C, D, E in different colors)
5.2 Block diagram of Bag-of-Features (BoF)-based framework for human activity representation and recognition
5.3 Impact of Window Cell Sizes
5.4 Impact of Vocabulary Sizes
5.5 Comparison of Features
5.6 The difference of primitive mapping between physical features (top) and statistical features (bottom)
5.7 Comparison of Primitive Construction Algorithms
5.8 Comparison of Weighting Schemes
5.9 Comparison of Kernel Functions
5.10 Performance Comparison with String-Matching-Based Approach
5.11 The block diagram of the sparse representation-based motion primitive framework
5.12 Impact of Window Cell Sizes
5.13 Impact of Sparsity (T0)
5.14 Impact of Dictionary Sizes (K)
6.1 The block diagram of the manifold-based human activity recognition framework
6.2 Manifolds of four different types of activities visualized in 3D spaces
6.3 Mapping results of the non-parametric mapping function
6.4 Impact of the number of nearest neighbors (K) on the classification performance of the manifold-based framework
6.5 Intrinsic dimensionality estimation based on residual variance
7.1 Overview of RehabSPOT architecture
7.2 Sun SPOT Sensing Platform
7.3 RehabSPOT Message Formats
7.4 RehabSPOT client architecture
7.5 RehabSPOT server architecture
7.6 Snapshots of the demonstration of RehabSPOT in a rehabilitation program
8.1 The placement of MotionNode on the upper limb of the subject
8.2 The 3D scatter plots of the traditional automatic motor function assessment method. Subject 1 is the female patient with upper limb hemiparesis, Subject 2 is the healthy female, and Subject 3 is a male patient with upper limb hemiparesis. The three features used for the plots are AI, VI, and ARE
8.3 The fine-grained trajectory representation and the warp path calculated from DTW of Pronation
8.4 The fine-grained trajectory representation and the warp path calculated from DTW of Flexor Synergy
Abstract
Ubiquitous computing envisions a world in which people can access computing resources
anywhere and any time. Over the past decade, the emergence and availability of a variety
of miniature devices embedded with powerful sensing, communication, and computa-
tional capabilities are turning this vision into reality. Powered by these sensing and
computational devices, ubiquitous computing endeavors to provide new and better solu-
tions to problems in many application domains with significant societal impact. These
include security, healthcare, education, sustainability, energy, and social informatics.
My thesis investigates how ubiquitous computing technologies bring new solutions
to transform the existing healthcare system to enable personalized healthcare and improve
health and well-being for both healthy and clinical populations. The first half of this the-
sis focuses on wearable sensor-based human activity recognition technology which acts
as the fundamental technology to support a variety of personalized healthcare appli-
cations, including personal fitness monitoring, long-term preventive care, and intelli-
gent assistance for elderly citizens. Chapter 2 presents the human activity dataset we
have built based on wearable sensors. Chapters 3 through 6 present four different
algorithms to model and recognize human daily activities based on the human activity
dataset introduced in Chapter 2. Specifically, Chapter 3 analyzes human activity sig-
nals based on feature selection algorithms and shows that the recognition performance
can be improved by carefully selecting features for each activity separately. Chapter 4
presents a new activity model based on the recently developed sparse representation and
compressed sensing theories and demonstrates that the task of looking for “optimal
features” to achieve the best activity recognition performance is less important within
this framework. Chapter 5 and Chapter 6 discuss new computational models based on
dictionary learning and nonlinear manifold learning, respectively, to solve the human
activity recognition problem from a totally different perspective.
The second half of this thesis focuses on the design of a novel on-body networked
sensing system called RehabSPOT for computerized rehabilitation for patients with
stroke. Chapter 7 presents the system design of RehabSPOT and its value in person-
alized rehabilitation delivery via real-time system reconfiguration. Chapter 8 presents
a computational model based on a wearable sensing system to analyze patients' motor
behavior and precisely track the progress patients have made during rehabilitation.
Chapter 1
Introduction
The term ubiquitous computing was first coined by Mark Weiser in 1991. In his sem-
inal article [88], Weiser envisioned that: “The most profound technologies are those
that disappear. They weave themselves into the fabric of everyday life until they are
indistinguishable from it”. Ubiquitous computing is such a profound technology that
enables people to access computing resources anywhere and any time. Nowadays, with
advances in semiconductor technologies, the emergence and availability of a variety of
miniature devices embedded with powerful sensing, communication, and computational
capabilities are turning Weiser's vision into reality. The smartphone is a prime
example. Equipped with a powerful processor and a rich set of embedded sensors, it
hosts millions of applications and is rapidly becoming the most ubiquitous sensing
and computing platform in people’s daily life. Wearable sensors are another set of ubiq-
uitous computing devices. As an example, a wearable pulse oximeter can continuously
monitor an individual’s physiological conditions by measuring the heart rate as well as
the concentration of oxygen in the blood. Ubiquitous computing can also be embedded
in our living and working environment. For example, ambient sensors deployed at vari-
ous places inside homes can sense people’s needs, provide contextual and personalized
services, and ultimately improve the quality of our daily lives.
Powered by these sensing and computational devices, ubiquitous computing endeav-
ors to provide new and better solutions to problems in many application domains with
significant societal impact. These include security, healthcare, education, sustainability,
energy, and social informatics. This thesis focuses on healthcare. As one of the Grand
Challenges for Engineering in the 21st century, the importance of healthcare to our soci-
ety today cannot be emphasized enough. Specifically, this thesis focuses on how ubiq-
uitous computing brings new solutions to transform the existing healthcare system to
enable personalized healthcare. The healthcare system today adopts a hospital-centered
model which is based on scheduled periodic evaluations at hospital visits. With the help
of the mobile, wearable, and ambient sensing devices, ubiquitous computing makes it
possible to build a patient-centered model to deliver personalized healthcare seamlessly
in our everyday lives, regardless of space and time [77]. Compared to the hospital-
centered model, the benefits of personalized healthcare powered by ubiquitous comput-
ing technologies are enormous. More importantly, it opens up new opportunities that
the existing healthcare system cannot provide. First, ubiquitous computing enables
continuous collection of people's personal health information to deliver preventive
health care in our daily lives. Second, it can be used as a disease diagnosis tool to
detect the early signs of disease before serious conditions develop. Finally, ubiquitous computing
has significant potential to increase the coverage of chronic care to reduce the need for
frequent clinic visits and thus dramatically lower healthcare costs.
Driven by these fascinating opportunities brought by personalized healthcare, this
thesis focuses on developing ubiquitous computing technologies to support a variety of
personalized healthcare applications for preventive care, disease diagnosis, and chronic
care, with the ultimate goal to provide better healthcare at lower cost. Specifically, the
first half of this thesis focuses on wearable sensor-based human activity recognition
technology, which acts as the fundamental technology to enable personalized
healthcare. Many health problems are associated with human behavior. This is why
physicians always want to know patients’ daily activities so that they can get a much
better understanding of their health conditions. Existing healthcare practice usually
asks patients to keep a diary of their daily activities for self-monitoring. However,
self-monitoring is tedious and may not be effective because people tend to forget to log
their activities. Human activity recognition technology can automatically identify and
keep a record of people’s daily activities. This provides people with important retro-
spective behavioral information so that they can better manage their health conditions.
Within this thesis, Chapter 2 presents the human activity dataset we have built based on
wearable sensors. Chapters 3 through 6 present four different algorithms to model
and recognize human daily activities based on the human activity dataset introduced
in Chapter 2. Specifically, Chapter 3 analyzes human activity signals based on feature
selection algorithms and shows that the recognition performance can be improved by
carefully selecting features for each activity separately. Chapter 4 presents a new
activity model based on the recently developed sparse representation and compressed
sensing theories and demonstrates that the task of looking for “optimal features” to
achieve the best activity recognition performance is less important within this
framework. Chapter 5 and Chapter 6 discuss new computational models based on
dictionary learning and nonlinear manifold learning, respectively, to solve the human
activity recognition problem from a totally different perspective. The second half of this thesis focuses on the
design of a novel on-body networked sensing system called RehabSPOT for comput-
erized rehabilitation for patients with stroke. Chapter 7 presents the system design of
RehabSPOT and its value in personalized rehabilitation delivery via real-time system
reconfiguration. Chapter 8 presents a computational model based on a wearable sensing
system to analyze patients' motor behavior and precisely track the progress patients have
made during rehabilitation.
Chapter 2
USC-HAD: A Daily Activity Dataset
for Wearable Sensor-based Human
Activity Analysis
2.1 Introduction
Human activity recognition is regarded as one of the most important problems in ubiqui-
tous computing since it has a wide range of applications including healthcare, security,
surveillance, human-machine interaction, sport science, etc. Camera-based computer
vision systems and inertial sensor-based systems are among several techniques used to
collect basic sensor data for human activity recognition. In computer vision, human
activities are captured by cameras and the task is to automatically recognize the activity
based on a sequence of images [80]. However, in scenarios that require continuous
monitoring of a person's activities, the camera-based method may not work due
to the lack of complete camera coverage. In addition, cameras are intrusive, and many
people do not feel comfortable being watched by cameras continuously.
With the advancement of semiconductor and MEMS technologies, inertial sensors
such as accelerometers and gyroscopes are miniaturized such that they can be attached
or worn on the human body in an unobtrusive way. The data from these wearable
sensors can be used in systems that understand and recognize human activities using
machine learning and pattern recognition techniques. Compared to cameras, an advan-
tage of wearable sensors is that they generally monitor activity on a nearly-continuous
or continuous basis, and are not confined to a limited observation space. Furthermore,
wearable sensors are unobtrusive if they are integrated into items people wear or hold
in their normal lives. Examples of such items are watches, shoes, mobile phones, and
clothing [60] [8] [33].
Since wearable sensors are suitable for continuous monitoring, they open the door
to a world of novel healthcare applications. Specific applications include physical fit-
ness monitoring, elder care support, sleep quality monitoring, long-term preventive and
chronic care, and intelligent assistance to people with cognitive disorders [22] [97]. As
an example, a sleep quality monitoring application could use activity information (body
position and movement) to infer and calculate the amount of restorative sleep (deep
sleep) and disruptive sleep (time and duration spent awake) that one gets throughout the
night. This information helps users recognize sleeping disorders as early as possible for
diagnosis and prompt treatment of the condition [41].
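As an illustrative sketch of the kind of computation such an application performs (not a method from this thesis), an actigraphy-style scorer might threshold per-epoch movement counts to split a night into restorative sleep and disruptive wake time. The one-minute epochs and the wake threshold below are hypothetical values chosen for illustration:

```python
def sleep_summary(activity_counts, epoch_min=1, wake_threshold=20):
    """Score each epoch as wake if its movement count exceeds a
    threshold; everything else counts toward restorative sleep.
    Both the threshold and the one-minute epochs are illustrative,
    not values taken from the cited sleep-monitoring work."""
    wake_epochs = sum(1 for c in activity_counts if c > wake_threshold)
    wake_min = wake_epochs * epoch_min
    sleep_min = (len(activity_counts) - wake_epochs) * epoch_min
    return sleep_min, wake_min

# A night of mostly quiet epochs with a restless stretch in the middle.
counts = [2] * 300 + [45] * 30 + [3] * 150
sleep_min, wake_min = sleep_summary(counts)  # 450 min sleep, 30 min wake
```

A real system would smooth across neighboring epochs before thresholding, but the summary step has this basic shape.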
The applications mentioned above promote the research of human activity recogni-
tion using wearable sensors. Over the past decade, researchers in embedded systems,
signal processing, biomedical engineering, and human-computer interaction have begun
to work on prototyping wearable sensor systems, building human activity datasets, and
developing machine learning techniques to model and recognize various types of human
activities. In this work, we focus on developing a dataset for human activity recognition
research. It has been widely accepted that datasets play a significant role in facilitating
research in any scientific domain. In application areas including human speech recogni-
tion, natural language processing, computer vision, and computational biology, there are
many publicly available datasets that act as standardized benchmarks for algorithm
comparison (e.g. the UC Irvine machine learning repository¹, the Caltech 101/256
dataset², and the Wall Street Journal CSR corpus [65]). Although wearable sensor-based
human activity recognition has been studied for a decade, most researchers develop and examine
the performance of their activity models and recognition algorithms based on their own
datasets. In general, these datasets are relatively small and limited by the constrained
settings within which they are constructed. Specifically, they either only contain a small
number of subjects (e.g. 2, 3, or even 1) or focus on some specific category of activities
(e.g. cooking activities). Furthermore, most of these datasets are not publicly
available. This prevents researchers in the ubiquitous computing community from
comparing their algorithms on a common basis.
The lack of large, general purpose, and publicly available human activity datasets
motivates us to build our own dataset. In this chapter, we describe how we constructed
a dataset useful for the ubiquitous computing community to conduct human activity
recognition research, and compare it to a selection of similar existing datasets. We term
our dataset the University of Southern California Human Activity Dataset (USC-HAD).
As a brief overview, USC-HAD is specifically designed to include the most basic and
common human activities in daily life from a large and diverse group of human subjects.
Our own focus is on healthcare related applications such as physical fitness monitoring
and elder care, but the activities in the dataset are applicable to many scenarios. The
activity data is captured by a high-performance inertial sensing device instead of low-
cost, low-precision sensors. Figure 2.1 shows an example of the activity data sampled by
the sensing device. At this time, we have included 12 activities and collected data from
14 subjects. The entire dataset and the basic code for visualizing the data are publicly
available on the web at http://sipi.usc.edu/HAD/. We intend to expand the
number of activities and subjects in the future, and we will provide updates on
this website.

¹ http://archive.ics.uci.edu/ml/
² http://www.vision.caltech.edu/Image Datasets/Caltech101/
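For readers who download the dataset, a minimal loading sketch might look like the following. The field name `sensor_readings`, the six-column layout (three accelerometer axes followed by three gyroscope axes), and the file path in the comment are assumptions for illustration; check them against the README that ships with USC-HAD:

```python
import numpy as np

def split_readings(trial):
    """Split one USC-HAD trial into accelerometer and gyroscope arrays.

    Assumes the trial dict (e.g. as returned by scipy.io.loadmat) holds
    an N x 6 'sensor_readings' array: columns 0-2 are acceleration (g)
    and columns 3-5 are angular velocity (deg/s). Verify this layout
    against the dataset documentation before relying on it.
    """
    readings = np.asarray(trial["sensor_readings"], dtype=float)
    return readings[:, :3], readings[:, 3:6]

# Hypothetical usage on one downloaded trial file:
#   from scipy.io import loadmat
#   acc, gyro = split_readings(loadmat("Subject1/a1t1.mat"))
```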
Figure 2.1: An example of activity data from the x-axis of the 3-axis accelerometer. [The plot shows acceleration (g) versus time (s), annotated with the activities: sit on a chair, walk, lie on a bed, run, and stand and drink water.]
2.2 Existing Datasets
The number of publicly available human activity datasets is limited. In this section,
we review some of them. Although each dataset has its own strengths, none of them
meets our goals, thus motivating us to build our own dataset. A full comparison of these
datasets and USC-HAD is in Table 2.4.
2.2.1 MIT PlaceLab Dataset
One of the first publicly available datasets is the MIT PlaceLab dataset [75]. A sin-
gle subject wearing five accelerometers (one on each limb and one on the hip) and a
wireless heart rate monitor was asked to perform a set of common household activities
during a four-hour period. The household activities include: preparing a recipe, doing
a load of dishes, cleaning the kitchen, doing laundry, making a bed, and light cleaning
around an apartment. In addition to the activities above, the subject also searches for
items, uses appliances, talks on the phone, answers email, and performs other every-
day tasks. The major issue with this dataset is that it only contains data from a single
subject. A potential problem with it is that the small number of subjects may poorly rep-
resent the activity characteristics of a large population. In addition, the definitions of the
considered activities are vague which makes the evaluation of recognition performance
difficult.
2.2.2 UC Berkeley WARD Dataset
The WARD (wearable action recognition database) dataset developed at the University
of California, Berkeley (UC Berkeley) consists of continuous sequences of human activ-
ities measured by a network of wearable sensors [92]. These wireless sensors are placed
at five body locations: two wrists, the waist, and two ankles. Each custom-built multi-
modal sensor carries a 3-axis accelerometer and a 2-axis gyroscope. WARD includes
20 human subjects (13 male and 7 female) and a rich set of 13 activities that covers
some of the most common activities in people’s daily lives such as standing, walking,
and jumping. Although the WARD dataset covers a large population and focuses on
the most common human activities, part of the sensed data is missing due to battery
failure and wireless network packet loss. In addition, the data sampled from the sensors
is raw digital data and not calibrated. This makes the data hard to interpret. Moreover,
the dataset does not include sensor locations where people typically carry their mobile
devices (e.g. mobile phone, iPod) such as pant pockets and front hips. We feel that this
makes this dataset less useful.
2.2.3 CMU Multi-Modal Activity Database (CMU-MMAC)
The Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) differs
from the datasets mentioned above in that it contains many other modalities besides
accelerometers and gyroscopes to sense and measure human activities [78].
These modalities include video, audio, RFID tags, motion capture system based on on-
body markers, and physiological sensors such as galvanic skin response (GSR) and skin
temperature. These sensors are located all over the human body, including both forearms
and upper arms, left and right calves and thighs, abdomen, and both wrists. 43 subjects
were enrolled to perform food preparation and cook five different recipes: brownies,
pizza, sandwich, salad and scrambled eggs in a kitchen environment. Although this
dataset contains a much bigger population and richer modalities and locations than any
dataset mentioned above, it only focuses on a specific category of activities (cooking).
2.2.4 OPPORTUNITY Dataset
The OPPORTUNITY dataset is collected from a European research project called
OPPORTUNITY [68]. The OPPORTUNITY dataset focuses on daily home activi-
ties in a breakfast scenario. Specifically, 12 subjects are asked to perform a sequence
of daily morning activities including grooming room, preparing and drinking coffee,
preparing and eating a sandwich, and cleaning tables in a room simulating a studio
flat with kitchen, deckchair, and outdoor access. Like CMU-MMAC, the OPPORTUNITY
dataset is recorded from many different sensing modalities including accelerometers,
gyroscopes, magnetometers, microphones, and video cameras integrated in the
environment, in objects, and on the human body. However, although the OPPORTUNITY
dataset contains a wide range of sensing modalities, it only covers
daily morning activities in a home environment.
2.2.5 Design Goals
The goal of our USC-HAD dataset is to overcome the limitations of the existing datasets
such that it can serve as a standard benchmark for researchers in the ubiquitous computing
community to compare the performance of their human activity recognition algorithms. In
order to achieve this, our dataset has been carefully constructed with the following goals:
• The dataset should enroll a large number of human subjects with diversity in
gender, age, height, and weight.
• The activities included should correspond to the most basic and common human
activities in people’s daily lives such that the dataset is useful for a wide range of
potential applications such as elder care and personal fitness monitoring.
• The wearable sensors should be calibrated and capable of capturing human activ-
ity signals accurately and robustly.
• We envision that in near future the wearable sensors will become a part of the
mobile devices (e.g. mobile phone) people carry in their daily lives. Therefore,
the locations of the wearable sensors should be selected to be consistent with
where people carry their mobile devices.
10
In the following sections we will describe in more detail the wearable sensors we
use for data collection, the activities we have chosen, and finally the data format and
organization of our USC-HAD dataset.
2.3 Sensors and Hardware Platform
The majority of wearable systems for ubiquitous computing and activity recognition
concentrates on placing a single type of sensor, typically accelerometers, in multiple
locations on the human body (single-modality multi-location). However, the use of sin-
gle sensor type has been proved to restrict the range of activities it can recognize [22].
An alternative is to use multiple sensor types, that is, a multi-modal sensor to collect
data from a single body location (multi-modality single-location). The rationale behind
this idea is to select sensors that are complementary such that a wider range of activities
can be recognized. For example, using an accelerometer and a gyroscope together can
differentiate whether the person is walking forward or walking left/right while classifica-
tion fails if accelerometers are used alone. Furthermore, the reason to place the sensors
on a single location is to remove the obtrusiveness incurred by placing sensors on mul-
tiple body locations. In terms of practicality, this multi-modality single-location design
is a promising line of investigation since it is much more comfortable for users to wear
a single device at only one location. Moreover, this multi-modal sensor could be incor-
porated into existing mobile devices such as mobile phones. Integrating sensors into
devices people already carry is likely to be more appealing to users and achieve greater
user acceptance. In terms of performance, the study carried out in [54] has shown that
the information gained from multi-modal sensors can offset the information lost when
activity data is collected from a single location. Therefore, we adopt the multi-modality
single-location design to build our sensing platform.
11
2.3.1 The Choice of Sensors
There are many types of wearable sensors used in the literature for human gesture and
activity recognition. These sensors include audio sensor (microphone), motion sensors
such as accelerometer and gyroscope; geographical sensors such as magnetometer (digi-
tal compass) and GPS; physiological sensors such as galvanic skin response (GSR) sen-
sor, pulse oximeter, and Electrocardiogram sensor (ECG); and environmental sensors
such as barometric pressure sensor, ambient light sensor, humidity and temperature sen-
sor. Intuitively, it would be optimal to include all these sensors since each provides some
useful information. However, in the perspective of activity recognition performance, it
is not necessary or may be undesirable if we incorporate all the sensing modalities. For
example, heart rate information extracted from ECG sensor has a high correlation with
the accelerometer signals. Only modest gain is achieved when these two sensors are
combined together [55]. Light sensors can be misleading since its readings are more
dependent on how users carry devices than what activities they are performing. In the
perspective of system complexity and practicality, the total number of sensors should
be as small as possible such that the size of the wearable device is small. Therefore,
only the most important sensors which provide complementary information should be
incorporated. In [56], the rotation angle produced by a gyroscope is identified to be
the key performance booster for fall detection. In [53], accelerometer and microphone
are identified as the two most important sensors to recognize activities including sitting,
walking, walking up/down stairs, riding elevator up/down, and brushing teeth. How-
ever, for privacy considerations, we argue that sensors such as microphone should not
be selected.
12
2.3.2 MotionNode
Based on the above considerations, we use an off-the-shelf sensing platform called
MotionNode to capture human activity signals and build our dataset. MotionNode is
a 6-DOF inertial measurement unit (IMU) specifically designed for human motion sens-
ing applications (see Figure 2.2)
3
. Each MotionNode itself is a multi-modal sensor that
integrates a 3-axis accelerometer, 3-axis gyroscope, and a 3-axis magnetometer. The
measurement range is ±6g and ±500dps for each axis of accelerometer and gyroscope
respectively. Although body limbs and extremities can exhibit up to ±12g in accelera-
Figure 2.2: MotionNode sensing platform
tion, points near the torso and hip experience no more than±6g range in acceleration [9].
Therefore, MotionNode is capable of capturing all the details of normal human activ-
ities. In addition, MotionNode is a wired device and transmits sampled sensor data to
a laptop computer via a USB interface. In such case, no sensor data is missed and the
fidelity of the sensor data is well preserved. A possible concern is that the wire is cum-
bersome and may distort the sampled data. However, we have proved by experiments
that as long as the wire is soft and long, it has little impact on the quality of the collected
data
4
.
3
http://www.motionnode.com/
4
The experiments were performed by placing the MotionNode on a rotation table with a soft and
relatively long wire connected to a PC. The rotation table was preset to rotate at a constant rate (30dps,
60dps, 120dps). The readings from MotionNode were almost the same as the preset values.
13
Compared to other commercially available inertial sensing platforms, MotionNode
has several advantages:
• MotionNode is extremely small in size(35mm×35mm×15mm) and lightweight
enough (14g) to wear comfortably for long period of time. This feature makes
MotionNode unobtrusive and thus perfect as a wearable device.
• Compared to the accelerometer and gyroscope embedded in the smartphones (e.g.
iPhone 4G), the integrated sensors have higher resolution (0.001g ± 10% for
accelerometer, 0.5
◦
/second for gyroscope) and wider sensing ranges. In addi-
tion, MotionNode is gyro-stablized and well calibrated such that the readings are
accurate and reliable.
• The highest sampling rate can reach 100Hz. This sampling frequency is much
higher than the one used in some of the existing datasets [92] [78].
2.4 USC Human Activity Dataset (USC-HAD)
In this section, we describe the details of our human activity dataset USC-HAD. We first
explain our criteria for selecting the subjects and activities, then we describe the data
collection procedure and how we annotate the data. Finally we present the organization
of our dataset.
2.4.1 Human Subjects
Variation across users is an important practical issue for any pattern recognition prob-
lem. In order to build a powerful recognition system, the system needs to be trained on
a large diverse group of individuals. In the context of human activities, we assume that
the diversity of the subjects enrolled includes the following four factors: (1) Gender;
14
(2) Age; (3) Height; and (4) Weight. Based on these guidelines, we have selected 14
subjects (7 male, 7 female) to participate in the data collection. The statistics of age,
height, and weight are listed in Table 2.1. We hope the diversity in each of these four
factors can cover a wider range of population.
Age Height (cm) Weight (kg)
range 21-49 160 - 185 43-80
mean 30.1 170 64.6
std 7.2 6.8 12.1
Table 2.1: Statistics of the participating human subjects
2.4.2 Activities
There are many categorization methods to classify human activities. One method cat-
egorizes activities into activities that an individual does by themselves (e.g., cooking),
and activities that involve more than one person (e.g. shaking hands) [23]. Another
popular categorization is based on time-scale. It breaks activities into: (1) short-term
activities (low-level activities), where activities are characterized by a sequence of body
motions, posture or object use (e.g., walking, going upstairs). These activities typically
last between seconds and several minutes; and (2) long-term activities (high-level activ-
ities), which are complex and usually composed of a collection of low-level activities.
These activities typically last more than several minutes and can last as long as a few
hours (e.g., cleaning the house, going shopping) [43]. In this work, we focus on building
a dataset of low-level activities. We list two reasons here: (1) Low-level activities such
as walking and running have a clear definition and description. This makes modeling
at this granularity level much easier. In comparison, high-level activities are typically
complex. Up to now, there is still no consensus on how to define these activities in
the ubiquitous computing community. (2) Normally, high-level activities consist of a
15
sequence of low-level activities. For example, going shopping can be regarded as walk-
ing to the garage, driving a car to the shopping mall, and then shopping in the mall.
Therefore, it is reasonable to assume low-level activity recognition is the basis of the
high-level activity recognition. Once we reliably recognize low-level activities, we can
then construct a temporal and location model on top of these low-level activities to char-
acterize the corresponding high-level activities.
Based on the considerations mentioned above, we have selected 12 activities (see
Table 2.2). These activities are among the most basic and common human activities in
people’s daily lives. Note that the description for each activity in Table 2.2 is generic
such that each subject could perform these activities based on one’s own style. We hope
this diversity in performance style could cover a wider range of population.
Activity Description
1 walking forward The subject walks forward in a straight line
2 walking left The subject walks counter-clockwise in a full circle
3 walking right The subject walks clockwise in a full circle
4 walking upstairs The subject goes up multiple flights
5 walking downstairs The subject goes down multiple flights
6 running forward The subject runs forward in a straight line
7 jumping The subject stays at the same position and continuously jumps up and down
8 sitting The subject sits on a chair either working or resting. Fidgeting is also
considered to belong to this class
9 standing The subject stands and talks to someone
10 sleeping The subject sleeps or lies down on a bed
11 elevator up The subject rides in an ascending elevator
12 elevator down The subject rides in a descending elevator
Table 2.2: Activities and their brief descriptions
2.4.3 Data Collection Procedure
To collect data, we pack a single MotionNode firmly into a standard-sized mobile phone
pouch (see Figure 2.3). Since MotionNode is a wired device, the MotionNode is con-
nected to a miniature laptop via a long and soft cable to record sampled data. During
16
data collection, the subject wears the pouch at one’s front right hip (with the MotionN-
ode oriented so thex axis points to the ground and is perpendicular to the plane formed
by y and z axes), holds the miniature laptop in one hand, and is asked to perform a trial
of specific activity naturally based on one’s own style (see Figure 8.1). We choose the
front right hip as the location to wear the sensor because it is one of the top 5 locations
where people carry their mobile phones when they are out and about in public spaces
based on the survey carried out by [32]. In order to capture the day-to-day activity vari-
ations, each subject was asked to perform 5 trials for each activity on different days at
various indoor and outdoor locations. Although the duration of each trial varies across
different activities, it is long enough to capture all the information of each performed
activity. On average, it took 6 hours for each subject to complete the whole data collec-
tion procedure.
Figure 2.3: MotionNode, the mobile phone pouch, and the miniature laptop
2.4.4 Ground Truth Annotation
Ground truth was annotated while the experiments were being carried out. When the
subject was asked to perform a trial of one specific activity, an observer standing nearby
17
Figure 2.4: During data collection, a single MotionNode is packed firmly into a mobile
phone pouch and attached to the subject’s front right hip
marked the starting and ending points of the period of the activity performed. In addi-
tion, the observer was also responsible for recording the details of how subjects perform
activities. Examples include how many strides the subject made during one trial of
“walking forward”; how the subject climbed the stairs (one stair at a time, or two stairs
at a time) during one trial of “walking up stairs”, etc. This on-line ground truth anno-
tation strategy eliminates the need for the subjects to annotate their data by themselves
and helps to reduce annotation errors.
2.4.5 Dataset Organization
After the sessions were recorded, the activity data of each trial was manually segmented
based on the starting and ending points annotated by the observer. These segmented
data was then stored and organized using the MATLAB computing environment. Each
segmented activity trial of one subject is stored in a separate .mat file. The naming
convention of each .mat file is defined as a”m”t”n”.mat, where a stands for activity,
m stands for activity number (see Table 2.2 for the activity numbers), t stands for trial,
and n stands for trial number. For example, the first trial of activity “walking forward”
18
is stored in the .mat file with the name a1t1.mat (“walking forward” has an activity
number 1). The stored information of each .mat file is listed and described briefly in
Table 2.3.
Field Description
title USC Human Activity Database
version The version of the dataset
date A string indicating the date of the recording session with the format: yyyy-mm-dd
subject number An integer representing the unique ID number of the subject
age An integer representing the age of the subject
height An integer representing the height of the subject in units of centimeters
weight An integer representing the weight of the subject in units of kilograms
activity name A string indicating the name of the activity
activity number An integer representing the ID number of the activity
trial number An integer representing the number of the trial
sensor location The location of the sensor worn on the human body
sensor orientation The orientations of the embedded 3-axis accelerometer and 3-axis gyroscope
sensor readings The sampled data from the 3-axis accelerometer and 3-axis gyroscope
comments Details of how subjects perform activities
Table 2.3: Dataset fields and their brief descriptions
2.4.6 Dataset Visualization
In addition to the collected activity data, we also provide sample MATLAB scripts for
visualizing the data. An example of the plot is shown in Figure 2.5 and Figure 2.6. In this
example, we show the raw sensor data, the histogram, and the spectral analysis of each
axis of the 3-axis accelerometer for activity Walking Forward (Figure 2.5) and activity
Running Forward (Figure 2.6). As illustrated in the figures, although the raw sensor
data of the three axes in time domain look similar, the histograms and the spectral plots
show different patterns between the two types of activities. Based on these observations,
researchers can extract features and develop various pattern recognition algorithms to
characterize the activity data.
19
0 10 20 30
0
0.5
1
1.5
2
2.5
Time (s)
Acceleration (g)
X-Axis Data
0 1 2 3
0
100
200
300
400
500
Acceleration (g)
Count
X-Axis Distribution
0 20 40 60
0
0.5
1
1.5
2
X-Axis Spectrum
Frequency (Hz)
|X(f)|
0 10 20 30
-1.5
-1
-0.5
0
0.5
1
Time (s)
Acceleration (g)
Y-Axis Data
-2 -1 0 1
0
200
400
600
800
1000
Acceleration (g)
Count
Y-Axis Distribution
0 20 40 60
0
0.05
0.1
0.15
0.2
0.25
Y-Axis Spectrum
Frequency (Hz)
|Y(f)|
0 10 20 30
-2
-1.5
-1
-0.5
0
0.5
Time (s)
Acceleration (g)
Z-Axis Data
-2 -1 0 1
0
200
400
600
800
1000
Acceleration (g)
Count
Z-Axis Distribution
0 20 40 60
0
0.1
0.2
0.3
0.4
Z-Axis Spectrum
Frequency (Hz)
|Z(f)|
Figure 2.5: The plot of the raw sensor data, the histogram, and the spectral analysis of
each axis of the 3-axis accelerometer for activity Walking Forward
2.5 Discussion
The intention of the development of the USC-HAD dataset is not to replace other exist-
ing datasets. Instead, USC-HAD is carefully designed to satisfy the key design goals
presented in the beginning of this chapter. Compared to other existing datasets, USC-
HAD includes a representative number of human subjects, both male and female. The
activities considered are well-defined basic daily activities. Finally, the activity data is
collected from a high-precision well-calibrated sensing hardware such that the data is
accurate, reliable, and easy to interpret. All these features make the research work using
this dataset repeatable and extendible by other researchers.
20
0 5 10 15
-2
0
2
4
6
8
Time (s)
Acceleration (g)
X-Axis Data
-5 0 5 10
0
100
200
300
400
Acceleration (g)
Count
X-Axis Distribution
0 20 40 60
0
0.5
1
1.5
2
X-Axis Spectrum
Frequency (Hz)
|X(f)|
0 5 10 15
-6
-4
-2
0
2
4
Time (s)
Acceleration (g)
Y-Axis Data
-5 0 5
0
100
200
300
400
500
600
Acceleration (g)
Count
Y-Axis Distribution
0 20 40 60
0
0.1
0.2
0.3
0.4
0.5
Y-Axis Spectrum
Frequency (Hz)
|Y(f)|
0 5 10 15
-3
-2
-1
0
1
2
Time (s)
Acceleration (g)
Z-Axis Data
-4 -2 0 2
0
100
200
300
400
500
Acceleration (g)
Count
Z-Axis Distribution
0 20 40 60
0
0.1
0.2
0.3
0.4
0.5
Z-Axis Spectrum
Frequency (Hz)
|Z(f)|
Figure 2.6: The plot of the raw sensor data, the histogram, and the spectral analysis of
each axis of the 3-axis accelerometer for activity Running Forward
We have developed several activity models and activity recognition techniques based
on part of this dataset, with the goal of better understanding human activity signals and
developing state-of-the-art human activity recognition systems. We will describe each
of these activity recognition techniques in details in the following chapters.
21
Dataset Number of Activities Sensor Sensor
Subjects Recognized Locations Types
MIT PlaceLab 2 Preparing a recipe Left arm 3-axis accelerometer(±2g)
Doing a load of dishes Right arm Heart rate
Cleaning the kitchen Left leg
Doing laundry Right leg
Making the bed Hip
Light cleaning
Searches for items
Uses appliances
Talks on the phone
UC Berkeley WARD 20 Rest at standing Left wrist 3-axis accelerometer(±2g)
(13 male, 7 female) Rest at sitting Right wrist 2-axis gyroscope(±500dps)
Rest at lying Front center of the wrist
Walk forward Left ankle
Walk left Right ankle
Walk right
Turn left
Turn right
Go upstairs
Go downstairs
Jog
Jump
Push wheelchair
CMU MMAC 43 Food preparation Left forearm Camera
Cook five recipes Right forearm Microphone
Left upper arm RFID
Right upper arm 3-axis accelerometer(±6g)
Left thigh 3-axis gyroscope(±500dps)
Right thigh 3-axis magnetometer
Left calf Ambient light
Right calf Heat flux sensor
Abdomen Galvanic skin response
Left wrist Skin temperature
Right wrist Near-body temperature
Forehead Motion capture
USC HAD 20 Walk forward Front left trousers pocket 3-axis accelerometer(±6g)
(10 male, 10 female) Walk left Shoulder bag 3-axis gyroscope(±500dps)
Walk right Front right hip Barometer
Walk up stairs Backpack
Walk down stairs Front left shirt pocket
Run forward
Jump up
Sit on a chair
Stand
Sleep
Fall forward
Fall backward
Fall left
Fall right
Ride a bicycle
Drive a car
Elevator up
Elevator down
Table 2.4: A full comparison between existing datasets and USC Human Activity
Dataset
22
Chapter 3
Feature Design and Analysis for
Human Activity Recognition
3.1 Introduction
It is well understood that high quality features are essential to improve the classifica-
tion accuracy of any pattern recognition system. In human activity recognition, features
such as mean, variance, correlation, and FFT coefficients computed from mechanical
motion measurements are commonly used [9]. To perform classification, one naive idea
is to pool all available features into one vector used as input to the classifier. The dis-
advantage here is that some features may be irrelevant or redundant, and do not provide
new information to improve the classification accuracy. Some features might even con-
fuse the classifier rather than help discriminate various activities. What is worse, due to
the “curse of dimensionality”, the performance may degrade sharply as more features
are added when there is not enough training data to learn reliably all the parameters
of the activity models [16]. Therefore, to achieve the best classification performance,
the dimensionality of the feature vector should be as small as possible, keeping only
the most salient and complementary features. In addition, keeping the dimensionality
small could reduce the computational cost such that the recognition algorithms can be
implemented and run on lightweight wearable devices such as mobile phones.
23
The two main techniques that are used to identify important features and reduce
dimensionality in pattern recognition are: (1) feature transformation - creating new fea-
tures based on transformations or combinations of the original extracted feature set; and
(2) feature selection - selecting the best subset of the original extracted feature set [46].
Both of them have been used in the wearable sensor community for recognizing vari-
ous human activities. One common strategy is to apply either feature transformation or
feature selection to get a fixed set of features for the whole set of activities to be rec-
ognized. For example, in [61], a correlation-based feature selection method was used
to select a subset of features. A 87% classification accuracy was achieved when using
the top eight features selected to classify six basic human activities. In [67], researchers
identified energy as the least significant feature among all the five available features by
using a sequential backward elimination method. In [6], the authors applied three fea-
ture selection methods: Relief-F, Simba, and mRMR to assess the relevance of features
for discriminating 15 different activities. All these three methods achieved similar per-
formance. The other strategy assumes that different activities may be characterized by
different sets of features. In [45], by performing cluster analysis, Huynh et al. showed
that the classification performance could be improved by selecting features and window
lengths for each activity separately. Lester et al. in [54] also demonstrated that a fea-
ture’s usefulness depends on the specific activity to be inferred. They applied a modified
version of AdaBoost [84] to select the top50 features and then learn an ensemble of dis-
criminative static classifiers based on the selected features for each activity. They found
that the selected features were different for different activities.
In this chapter, we focus on feature design and evaluation based on feature selec-
tion techniques. The rationale to use feature selection is that the selected features retain
their original meanings that we believe are important for better understanding of human
activities. And our goal is to identify the most important features to recognize various
24
human activities. The contributions of this work are listed here: we first design a new set
of features (called physical features) based on the physical parameters of human motion.
We expect these features to represent motion more accurately and concisely than com-
monly used statistical features such as mean and variance. Then, we use both statistical
and physical features in a single-layer feature selection and classification framework to
systematically analyze and evaluate their impact on the performance of the recognition
system for the whole set of activities to be recognized. To further improve the recog-
nition performance, we follow the ideas from [45] and [54] to extend the single-layer
framework to a multi-layer framework that selects the most important features for dif-
ferent activities in a hierarchical manner.
The rest of the chapter is organized as follows. Section 3.2 is devoted to defining the
statistical and physical features we will analyze and evaluate. Section 3.3 overviews
the feature selection problem and gives a brief introduction to the feature selection
techniques we use in this work. Based on techniques methods, a single-layer feature
selection and classification framework is built and evaluated in Section 3.4. Section 3.5
presents the multi-layer hierarchical feature selection and classification framework.
3.2 Feature Design
3.2.1 Activity Model
The different ways in which continuous sensor data can be modeled result in differ-
ent recognition paradigms. In this chapter, we model each activity based on a sliding-
window strategy. Specifically, we divide the continuous sensor streams into fixed length
windows. By choosing a proper window length, all the information of each activity can
be extracted from each single window. The information contained in each window is
then transformed into a feature vector by computing various features over the sensor
25
data within the window. We denote this activity model as the “whole-motion” model
since each single window is assumed to contain all the information of the correspond-
ing activity. Here, a window of length 2 seconds with a 50% overlap is used. In the
following section, we introduce two sets of features we incorporated in our recognition
framework.
3.2.2 Statistical Features
The first set of features are statistical features computed from each axis (channel) of
both accelerometer and gyroscope. Some of them have been intensively investigated
in previous studies and proved to be useful for activity recognition [9] [67] [45]. For
example, variance has been proved to achieve consistently high accuracy to differentiate
activities such as walking, jogging, and hopping [45]. Correlation between each pair
of sensor axes helps differentiate activities that involve translation in single dimension
such as walking and running from the ones that involve translation in multi-dimensions
such as stair climbing [67].
However, we did not blindly pool all the statistical features used in previous stud-
ies. Instead, we carefully picked up features that show distinct patterns among different
activities and discarded redundant ones which are highly correlated to other features and
no additional information is gained by adding them [38]. Figure 3.1 shows the distribu-
tions of different activities sampled from randomly selected six subjects. It is clear that
the shapes of distributions differ among different activities, and some characteristics of
the shapes can be quantitatively described by statistics. For example, The differences in
shapes between “walk-downstairs” and “walk-upstairs” can be differentiated by a kur-
tosis measure. “walk-left” in (h) shows a slight positive skewness while “walk-right”
in (i) has a slight negative skewness compared to ‘walk-forward” in (g) that shows no
skewness.
26
(a) The distribution of x-axis
acceleration of walking forward
(b) The distribution of x-axis
acceleration of walking downstairs
(c) The distribution of x-axis
acceleration of walking upstairs
(d) The distribution of x-axis
acceleration of running
(e) The distribution of x-axis
acceleration of jumping up
(f) The distribution of x-axis
acceleration of sitting
(g) The distribution of x-axis rotation
velocity of walking forward
(h) The distribution of x-axis rotation
velocity of walking left
(i) The distribution of x-axis
rotation velocity of walking right
Figure 3.1: Data distributions of various activity classes
In Figure 3.2, examples of correlations between feature pairs are presented in a series
of scatter plots in 2D feature space. The horizontal and vertical axes represent two dif-
ferent features. Points with different colors in the 2D space represent different activities.
As expected, (b) and (c) illustrate that standard deviation is highly correlated to variance
and root mean square (RMS). The relationship between mean and median is a little bit
tricky. As shown in (a), mean and median are highly correlated for sedentary activities
(sitting and standing) while behave quite differently for moderate and vigorous activities
(walking, running, and jumping).
We also consider statistical features that have been successfully applied in similar
recognition problems. Examples are zero crossing rate, mean crossing rate, and first-
order derivative. These features have been heavily used in human speech recognition
27
and handwriting recognition problems. The final list of statistical features with brief
descriptions is shown in Table 6.1.
Feature Description
Mean The DC component (average value) of the signal over the window
Median The median signal value over the window
Standard Deviation Measure of the spreadness of the signal over the window
Variance The square of standard deviation
Root Mean Square The quadratic mean value of the signal over the window
Averaged derivatives The mean value of the first order derivatives of the signal over the window
Skewness The degree of asymmetry of the sensor signal distribution
Kurtosis The degree of peakedness of the sensor signal distribution
Interquartile Range Measure of the statistical dispersion, being equal to the difference between
the75th and the25th percentiles of the signal over the window
Zero Crossing Rate The total number of times the signal changes from positive to negative or back
or vice versa normalized by the window length
Mean Crossing Rate The total number of times the signal changes from below average to above average
or vice versa normalized by the window length
Pairwise Correlation Correlation between two axes (channels) of each sensor and different sensors
Spectral Entropy Measure of the distribution of frequency components
Table 3.1: Statistical features with symbols and brief descriptions
28
(a) Mean vs. Median (b) Standard Deviation vs. RMS (c) Standard Deviation vs. Variance (d) Averaged Movement Intensity vs. SMA
(e) Variance of Movement Intensity vs.
Averaged Acceleration Energy
(f) Standard Deviation vs. Energy (g) ARATG is good to differentiate
walk-forward, walk-left and walk-right
(h) Eigenvalue along heading direction
vs. Eigenvalue along gravity direction
Figure 3.2: Correlations between different features
29
3.2.3 Physical Features
The second set of features we incorporate are called “physical features”, which are
derived based on the physical interpretation of human motions. As mentioned in Chap-
ter 2, the sensors are located at the subject’s front right hip, with the orientation as “x
axis pointing to the ground and perpendicular to the plane formed byy andz axes”. We
assume that we know the sensor location and orientation as a priori. Thus, some of our
physical features are computed and optimized based on this prior knowledge. Although
this assumption limits the generalization capability of our physical features to be applied
to other locations and orientations to some extent, it simplifies the problem and allows
us to focus on developing features with strong physical interpretations so as to better
understand human motions.
It should be noted that the way to compute physical features is different from sta-
tistical features. For statistical features, each feature is extracted from each sensor axis
(channel) individually. In comparison, most of the physical features are extracted from
multiple sensor channels. In other words, sensor fusion is performed at feature level for
physical features. In the rest of this section, we explain the physical features in great
detail. A brief summary of physical features can be found in Table 3.2.
1. Movement Intensity (MI): MI is defined as

   MI(t) = \sqrt{a_x(t)^2 + a_y(t)^2 + a_z(t)^2},   (3.1)

the Euclidean norm of the total acceleration vector after removing the static gravitational acceleration, where a_x(t), a_y(t), and a_z(t) represent the t-th acceleration sample of the x, y, and z axis in each window respectively. This feature is independent of the orientation of the sensing device, and measures the instantaneous intensity of human movements at index t. We do not use MI directly. Instead, we
Feature                                      Symbol  Description
Mean of Movement Intensity                   ai      The mean value of movement intensity over the window
Variance of Movement Intensity               vi      The variance of movement intensity over the window
Normalized signal magnitude area             sma     The sum of the acceleration magnitude over three axes of each window, normalized by the window length
Eigenvalues of dominant directions           eva     The eigenvalues of the dominant directions along which intensive human motion occurs
Correlation between acceleration along       cagh    The correlation coefficient of acceleration between the gravity direction and the heading direction
  gravity and heading direction
Averaged velocity along heading direction    avh     The averaged velocity along the heading direction
Averaged velocity along gravity direction    avg     The averaged velocity along the gravity direction
Averaged rotation angles related to          aratg   The cumulative rotation angles around the gravity direction, normalized by the window length
  gravity direction
Dominant Frequency                           df      The dominant frequency band along each axis of both accelerometer and gyroscope
Energy                                       ene     The energy along each axis of both accelerometer and gyroscope
Averaged acceleration energy                 aae     The mean value of the energy over three acceleration axes
Averaged rotation energy                     are     The mean value of the energy over three gyroscope axes

Table 3.2: Physical Features
compute the mean (AI) and variance (VI) of MI over the window and use them as two features given by

   AI = \frac{1}{T} \sum_{t=1}^{T} MI(t)   (3.2)

   VI = \frac{1}{T} \sum_{t=1}^{T} (MI(t) - AI)^2   (3.3)

where T is the window length.
2. Normalized Signal Magnitude Area (SMA): SMA is defined as

   SMA = \frac{1}{T} \left( \sum_{t=1}^{T} |a_x(t)| + \sum_{t=1}^{T} |a_y(t)| + \sum_{t=1}^{T} |a_z(t)| \right),   (3.4)

the acceleration magnitude summed over three axes within each window, normalized by the window length. This feature has been used in previous studies and is regarded as an indirect estimation of energy expenditure [49] [37] [93].
3. Eigenvalues of Dominant Directions (EVA): When a subject jumps, a large acceleration component along the vertical direction is expected. Likewise, when a subject runs forward, there should be a large acceleration component along the heading direction and a relatively large acceleration component along the vertical direction. To capture these effects, we calculate the covariance matrix of the acceleration data along the x, y, and z axes in each window. The eigenvectors of the covariance matrix correspond to the dominant directions along which intensive human motion occurs, and the eigenvalues measure the corresponding relative motion magnitude along those directions. In this work, we use the top two eigenvalues as our features, corresponding to the relative motion magnitude along the vertical direction and the heading direction respectively.
4. Correlation between Acceleration along Gravity and Heading Directions (CAGH): Given the location and orientation of the sensing device described above, the gravity direction is approximately parallel to the x axis, and the subject's heading direction when walking is a combination of the y and z axes. We first derive the Euclidean norm of the total acceleration vector along the heading direction, and then calculate the correlation coefficient between the acceleration in the gravity direction and the derived acceleration along the heading direction as our feature.
5. Averaged Velocity along Heading Direction (AVH): AVH is approximated by first calculating the averaged velocities along the y and z axes over the window, and then computing the Euclidean norm of those two velocities.

6. Averaged Velocity along Gravity Direction (AVG): AVG is computed by averaging the instantaneous velocity along the gravity direction at each time t over the window. The instantaneous velocity at each time t is calculated through numerical integration of the acceleration along the gravity direction.
7. Averaged Rotation Angles related to Gravity Direction (ARATG): ARATG calculates the cumulative rotation angles around the gravity direction, divided by the window length for normalization. Since the sensors are at the subject's front right hip, this feature captures the rotation of the human torso around the gravity direction.

8. Dominant Frequency (DF): The dominant frequency is defined as the frequency corresponding to the maximum of the squared discrete FFT component magnitudes of the signal from each sensor axis.

9. Energy (ENERGY): Energy is calculated as the sum of the squared discrete FFT component magnitudes of the signal from each sensor axis. The sum is then divided by the window length for normalization. The DC component of the FFT is excluded from this sum since it is already measured by the mean feature.

10. Averaged Acceleration Energy (AAE): AAE is defined as the mean value of the energy over three acceleration axes.

11. Averaged Rotation Energy (ARE): ARE is defined as the mean value of the energy over three gyroscope axes.
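As a concrete sketch of several of the features above, assuming each window arrives as a T x 3 NumPy array of gravity-removed samples (the function names and array layout are our own, not from the thesis; DF is returned here as an FFT bin index rather than a frequency in Hz):

```python
import numpy as np

def intensity_features(acc):
    """AI, VI, and SMA of a (T, 3) acceleration window, per Eqs. (3.1)-(3.4)."""
    T = acc.shape[0]
    mi = np.sqrt(np.sum(acc ** 2, axis=1))  # Eq. (3.1): per-sample Euclidean norm
    ai = mi.mean()                          # Eq. (3.2): mean of movement intensity
    vi = np.mean((mi - ai) ** 2)            # Eq. (3.3): variance of movement intensity
    sma = np.sum(np.abs(acc)) / T           # Eq. (3.4): normalized signal magnitude area
    return ai, vi, sma

def eva_features(acc):
    """EVA: top two eigenvalues of the window's covariance matrix, i.e.,
    relative motion magnitude along the two dominant directions."""
    evals = np.linalg.eigvalsh(np.cov(acc, rowvar=False))  # ascending order
    return evals[-1], evals[-2]

def freq_features(x):
    """DF and ENERGY for one sensor axis: peak and sum of the squared FFT
    magnitudes with the DC bin excluded, normalized by window length."""
    mags_sq = np.abs(np.fft.rfft(x)) ** 2
    df = 1 + int(np.argmax(mags_sq[1:]))   # dominant non-DC frequency bin
    energy = np.sum(mags_sq[1:]) / len(x)  # ENERGY
    return df, energy

def averaged_energy(window):
    """AAE (accelerometer) or ARE (gyroscope): mean ENERGY over three axes."""
    return np.mean([freq_features(window[:, i])[1] for i in range(3)])
```

A window containing a pure cosine at some frequency, for instance, yields that frequency's bin as DF, and a window with all its variance along one axis yields one large eigenvalue and a near-zero second one.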
Figure 3.2 illustrates some interesting observations about our physical features and the correlations between physical features and the previously defined statistical features. Although the high correlation between the standard deviation and energy features shown in (f) is expected, its similarity to (c) is not. This high correlation suggests using standard deviation instead of the energy feature, given the high computational complexity of FFT operations. Another interesting observation is shown in (d), where the straight line indicates that the SMA feature is highly correlated with the mean value of movement intensity. This means that the l1-norm and l2-norm of the acceleration signals are nearly equivalent. From the performance point of view, (g) illustrates that the ARATG feature successfully partitions the data from "walk-forward", "walk-left", and "walk-right" into three isolated clusters, each containing data from a single activity class. Finally, the scatter plot in (h) demonstrates the discrimination power of the eigenvalue features for differentiating walking, running, and jumping. As can be seen on the vertical axis, the results match our intuition that along the gravity direction, the motion intensity of jumping and running is larger than that of walking. The horizontal axis illustrates the motion intensity along the moving (heading) direction. Since the speed of running is higher than that of walking, the motion intensity of running along the moving direction is also higher. The values for jumping lie between those of walking and running. This can be explained by the fact that people normally cannot jump straight up: there is always a forward momentum exerted by the body while jumping.
3.2.4 Feature Normalization
Because the scale factors and units of the features described above are different, before we proceed to the feature selection stage, we normalize all features to zero mean and unit variance using

   f_{normalized} = \frac{f_{raw} - \mu}{\sigma}   (3.5)

where \mu and \sigma are the empirical mean and standard deviation of a particular feature across all activity classes.
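Applied column-wise to a sample-by-feature matrix, Eq. (3.5) can be sketched as (the function name is ours):

```python
import numpy as np

def zscore(features):
    """Normalize each feature (column) to zero mean and unit variance,
    per Eq. (3.5). features: (n_samples, n_features)."""
    mu = features.mean(axis=0)      # empirical mean of each feature
    sigma = features.std(axis=0)    # empirical standard deviation
    return (features - mu) / sigma
```

In a cross-validation setting, mu and sigma would be estimated on the training data only and reused on the held-out test data.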
3.3 Feature Selection
The feature computation in the last section yields a total of 110 features, including 87 statistical features and 23 physical features. To systematically assess the usefulness of these features and identify the most important ones for discriminating different activities, feature selection techniques are used. In this section, we give a brief introduction to the feature selection methods involved in this work.
3.3.1 Feature Selection Methods
Feature selection techniques can be organized into three categories: filter methods,
wrapper methods and embedded methods [70]. This categorization is based on how they
combine the feature selection search with the construction of the classification model.
Filter methods assess the relevance of features by only looking at the intrinsic prop-
erties of the data without involving any classification models. Therefore, they are com-
putationally fast, and do not inherit any bias of a specific classification model. Fil-
ter methods can be further divided into two groups: univariate (feature weighting) and
multivariate (subset search) based on whether they evaluate the merits of features indi-
vidually or through feature subsets. In univariate filter methods, a feature relevance
score is calculated for each feature individually based on a predefined evaluation metric.
Features are then ranked based on their scores, and low-scoring features are discarded.
Afterward, the remaining features are presented as input to the classification model.
However, since each feature is considered separately, univariate filter methods ignore
the dependencies and correlations between features, which may lead to worse classifi-
cation performance. In order to overcome this problem, multivariate filter methods were
introduced. They search through candidate feature subsets guided by a predefined eval-
uation metric which captures both the relevance between features and class labels and
the redundancy between different features.
While filter methods treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The
evaluation of a specific subset of features is obtained by training and testing a specific
classification model, rendering this approach tailored to a specific classification algo-
rithm. To search the space of all feature subsets, a search algorithm is then “wrapped”
around the classification model. However, as the space of feature subsets grows expo-
nentially with the number of features, it is impractical to perform exhaustive search.
Instead, heuristic search methods such as hill-climbing and best-first search, are used
to guide the search for a suboptimal subset. Advantages of wrapper approaches include
the interaction between feature subset search and model selection, and the ability to
take into account feature dependencies. A common drawback of these techniques is that
they have a higher risk of overfitting than filter methods and are very computationally
intensive.
The last class of feature selection methods is the embedded method. In embedded
methods, the search for an optimal subset of features is built into the classifier con-
struction, and can be seen as a search in the combined space of feature subsets and
hypotheses. The AdaBoost based feature selection technique in [54] can be catego-
rized as an embedded technique. Just like wrapper approaches, embedded approaches
are thus specific to a given classification algorithm. They have the advantage that they
include the interaction with the classification model, while at the same time being far
less computationally intensive than wrapper methods.
In this work, one filter method and two wrapper methods were investigated and are
summarized here. These methods are selected due to their high popularity and useful-
ness in many pattern recognition and machine learning problems.
• Relief-F: Relief-F [51] is a popular filter method that estimates the relevance of
features according to how well their values distinguish between the data points of the same and different classes that are near each other. Specifically, it computes a weight
for each feature to quantify its merit. This weight is updated for each of the data points
presented, according to the evaluation function:
   w_i = \sum_{j=1}^{N} \left[ \left( x_i^j - nearmiss(x^j)_i \right)^2 - \left( x_i^j - nearhit(x^j)_i \right)^2 \right]   (3.6)

where w_i represents the weight of the i-th feature, x_i^j represents the value of the i-th feature for data point x^j, N represents the total number of data points, and nearhit(x^j) and nearmiss(x^j) denote the nearest points to x^j from the same and a different class respectively. The higher the weight, the more important the feature. The major drawback of Relief-F is that it does not consider feature dependencies and therefore does not help remove redundant features.
• SFC (Wrapper Method based on Single Feature Classification): In SFC [38], features are ranked based on their individual classification performance. Features at the top of the ranking list are selected as the final feature subset. SFC is similar to Relief-F in the sense that it cannot capture redundant features either. However, unlike Relief-F, SFC uses the classifier's classification accuracy as the metric for feature evaluation.

• SFS (Wrapper Method based on Sequential Forward Selection): In SFS [89], features are added sequentially, one by one. Specifically, if one feature is needed, the feature with the best classification performance is selected. If more than one feature is needed, we add one feature at a time, choosing the one that, in combination with the already selected features, achieves the best classification performance. Compared to the two methods described earlier, the benefit of SFS is that it takes feature redundancy into consideration.
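The two methods most central here, the Relief weight update of Eq. (3.6) and the greedy SFS loop, can be sketched as below. This is a simplified single-nearest-neighbor Relief (the full Relief-F averages over several neighbors per class), and `score` stands in for the cross-validated accuracy of a candidate feature subset; all names are ours, not from the thesis.

```python
import numpy as np

def relief_weights(X, y):
    """Per-feature weights per Eq. (3.6): reward the per-feature squared
    difference to the nearest miss, penalize that to the nearest hit.
    X: (N, m) feature matrix; y: (N,) class labels."""
    n, m = X.shape
    w = np.zeros(m)
    for j in range(n):
        d = np.sum((X - X[j]) ** 2, axis=1)  # squared distances to x^j
        d[j] = np.inf                        # exclude the point itself
        same = y == y[j]
        hit = int(np.argmin(np.where(same, d, np.inf)))
        miss = int(np.argmin(np.where(~same, d, np.inf)))
        w += (X[j] - X[miss]) ** 2 - (X[j] - X[hit]) ** 2
    return w

def sfs(all_features, n_select, score):
    """Sequential forward selection: repeatedly add the feature that,
    together with those already chosen, maximizes `score`."""
    selected = []
    remaining = list(all_features)
    for _ in range(n_select):
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because each SFS round re-scores every remaining feature in combination with the current subset, a feature that is redundant with an already selected one adds little to `score` and is passed over, which is exactly the redundancy handling described above.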
Let M represent the total number of features in the full feature set and N_f represent the number of features to be selected. Table 3.3 summarizes the computational costs of the three feature selection methods, with numerical examples for our case of M = 110. The computational cost is expressed as either the number of calls to the feature evaluation function (for filter methods) or the number of calls to the classification algorithm (for wrapper methods). To make the comparison more meaningful, the wrapper method based on exhaustive search is also included.

                    N_f = 1   N_f = 2   N_f = 3   N_f = n
Relief-F              110       111       111      M + 1
SFC                   110       111       111      M + 1
SFS                   110       219       327      Θ(M^2)
Exhaustive Search     110      5995    215820      O(2^M)

Table 3.3: Comparison of computational cost of feature selection methods with 110 features.
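The SFS and exhaustive-search rows of Table 3.3 can be reproduced by counting calls (function names ours): SFS scores M candidates in its first round, M - 1 in its second, and so on, while exhaustive search over subsets of exactly size N_f scores C(M, N_f) subsets.

```python
from math import comb

def sfs_calls(M, n_f):
    """Classifier calls made by SFS: M + (M-1) + ... + (M - n_f + 1)."""
    return sum(M - i for i in range(n_f))

def exhaustive_calls(M, n_f):
    """Subsets of exactly size n_f scored by exhaustive search."""
    return comb(M, n_f)
```

For M = 110 these give 110, 219, 327 for SFS and 110, 5995, 215820 for exhaustive search, matching the table.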
3.3.2 Classifier
The choice of classifier is critical to feature selection. Since the nature of the feature selection problem is to select features from a high-dimensional space, in this work we choose Support Vector Machines (SVMs) with a linear kernel as our learning machine; SVMs have proved to be very effective in handling high-dimensional data [27]. In addition, we use classification accuracy as our evaluation metric to assess the effectiveness of the feature selection methods listed above.
3.4 Single-Layer Feature Selection and Classification
As our first step to approach the problem of activity recognition using feature selection,
we adopt a single-layer feature selection framework. That is, we take all activity classes
into consideration at one time. Our goal is to find the best discriminating set of features
for all activities.
Figure 3.3: Testing classification error rates as a function of the number of features selected by different feature selection methods
3.4.1 Evaluation on Feature Selection Methods
To evaluate the effectiveness of the three feature selection methods, we adopt a leave-one-subject-out cross validation strategy. Specifically, we use the data from five subjects as training examples to select features and build activity models. Data from the left-out subject is used for testing. This process iterates over every subject, and the final result is the averaged value across all subjects.
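The evaluation loop can be sketched as follows, where `train_and_test` stands in for the whole select-features/train-model/test pipeline and returns a test accuracy (all names are ours):

```python
def leave_one_subject_out(data_by_subject, train_and_test):
    """Hold each subject out in turn, train on the rest, and average
    the per-subject test accuracies."""
    subjects = sorted(data_by_subject)
    accuracies = []
    for held_out in subjects:
        train = [sample for s in subjects if s != held_out
                 for sample in data_by_subject[s]]
        accuracies.append(train_and_test(train, data_by_subject[held_out]))
    return sum(accuracies) / len(accuracies)
```

Holding out whole subjects, rather than random windows, keeps data from the same person out of both sides of the split and so measures generalization to unseen users.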
Figure 3.3 shows the average testing classification error rates as a function of the number of features selected, ranging from 5 to 110 (the full set) in increments of 5. Each line represents one feature selection method. The results show that across the three feature selection methods, the classification errors taper off once 50 features are included, with approximately a 10% misclassification rate achieved on average. Picking more features beyond the top 50 changes the performance only slightly. This matches the results in Lester's previous work [54]. If we look at each method individually, SFS
performs the best in the sense that it achieves a 12% misclassification rate using only the first five features. In comparison, Relief-F is the worst, with a 47% misclassification rate at five features. SFC ranks in the middle with a 17% misclassification rate. Besides the classification performance, we record the averaged computational time of each method in Table 3.4. As expected, SFS has the highest computational cost. SFC is computationally more expensive than Relief-F since the classifier is involved.
Algorithm   Averaged Computational Time (seconds)
Relief-F    87.5
SFC         112.4
SFS         2780

Table 3.4: Averaged computational time of different feature selection methods
One interesting question to ask is whether the features chosen by the feature selection methods are truly important for our activity recognition problem. To answer this question, we first remove the top 50 features selected in Figure 3.3. Then the same feature selection procedure is re-run on the remaining feature set. Figure 3.4 shows the results. Across the three methods, the classification errors are 13% to 23% higher than in Figure 3.3 when 50 features are selected. This indicates that the top 50 features selected in Figure 3.3 contain more important information than the remaining feature set.
3.4.2 Feature Profiling on the Selected Features
The top 50 features selected by each feature selection method have been proven useful.
To identify these features, we perform feature profiling. Since we use the leave-one-
subject-out strategy on 6 subjects, not all the features selected in each iteration are the
same. Thus we combine the top 50 features selected in each iteration, which results
in a total of 300 features with repetition for each method. The results are shown in
Figure 3.4: Sanity check on feature selection methods
Table 3.5. The second column lists the selected physical features. The third column lists the numbers and percentages of physical features selected out of the 23 × 6 = 138 physical features in 6 iterations. The corresponding results for statistical features are listed in the fourth column.

Algorithm  Physical Features Selected                    No. (%) of Physical   No. (%) of Statistical
                                                         Features Selected     Features Selected
Relief-F   AI, VI, AAE, SMA, EVA(2), CAGH,               67 (48.6%)            233 (44.6%)
           AVH, AVG, ARATG, DF
SFC        AI, VI, AAE, SMA, EVA, AVH,                   103 (74.6%)           197 (37.7%)
           ARATG, ARE, ENERGY, DF
SFS        AI, VI, AAE, EVA, CAGH, AVH,                  70 (50.7%)            230 (44.1%)
           AVG, ARATG, ARE, DF

Table 3.5: Top 50 feature profiling of different feature selection methods

For all three methods, although the number of selected physical features is smaller than that of statistical features, the corresponding percentages are higher. Among them, Relief-F selects the fewest physical features whereas
SFC selects the most. This observation indicates that SFC considers physical features more critical in terms of classification performance. Compared to SFC, the number of physical features selected by SFS drops to 70. This is because some physical features are highly correlated with other features, and SFS does not select the redundant ones. For example, SMA is not selected since it is highly correlated with AI. Likewise, the ENERGY features from all sensor channels are not selected because they are correlated with standard deviation.
Feature profiling gives the impression that the physical features play a major role
in differentiating different activities. To validate this point, we remove all the physical
features from the full feature set and perform feature selection on statistical features
only. The result is presented in Figure 3.5.

Figure 3.5: Testing classification error rates of feature selection methods without physical features

For all three methods, the misclassification rates are approximately 18% when the top 50 features are selected. This is 8% higher than the rates when physical features are included (see Figure 3.3). Based on all the
results shown in this section, we can conclude that our self-designed physical features make a strong contribution and substantially improve the classification performance.
3.5 Hierarchical Feature Selection and Classification
The major limitation of the single-layer framework studied in the previous section is
that all activities are considered simultaneously. As a result, it may not scale well if the
number of activities to be recognized is large. To achieve high scalability, we propose
here a multi-layer feature selection and classification framework. Specifically, we first
group activities into subsets based on our understanding of the problem domain and then
perform feature selection and classification in a hierarchical manner. In this scenario,
the classifier in each layer considers a smaller number of activities. In addition, we
now have the flexibility to use different features for different classifiers/activity subsets,
instead of using the same feature set for all activity classes.
Figure 3.6 illustrates the structure of the hierarchical feature selection and classifica-
tion framework. Blue boxes represent meta-classes we create to group activity subsets.
Green boxes represent the nine types of activities to recognize. Now the problem of rec-
ognizing nine activity classes is broken down to seven distinct classification problems.
The classifier at the top layer distinguishes between two meta-classes: static activity
vs. dynamic activity. The static activity meta-class includes standing and sitting while
the dynamic activity meta-class includes the remaining seven activities. Then, on the
second layer, walking-related activities (walk forward, walk left, walk right, go upstairs,
and go downstairs) are differentiated from jumping and running. Finally, the classifiers
at the third and fourth layer focus on recognizing different activities related to walking.
To determine the best feature sets for classifiers at each layer, SFS is used due to its
good performance. We follow the same leave-one-subject-out cross validation strategy
Figure 3.6: The structure and the performance of the hierarchical feature selection and classification framework. (The maximum classification accuracies and numbers of selected features reported in the figure are: 99.4% (5), 99.6% (20), 93.3% (5), 98% (10), 93.7% (10), 95.2% (5), and 99.9% (5).)
to perform feature selection and classification at each layer. For classifiers 1, 2, 3, and 6, the maximum classification accuracy achieved and the corresponding number of features selected (in parentheses) are shown in Figure 3.6 in red. For the remaining classifiers, the number of features selected at maximum accuracy is greater than 50; to lower the computational cost, we instead use the number of features at which the classification accuracy reaches its first local maximum. These parameters are shown in red and in parentheses as before. Based on this structure and the features selected at each layer, the averaged testing classification accuracy of the overall multi-layer classifier is 93.1%. This result is 3.8% higher than the accuracy achieved by the single-layer classifier when the top 50 features are used.
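At recognition time, the hierarchy acts as a decision tree of classifiers: each internal node applies its own classifier to its own feature subset and routes the sample to a child, until a leaf activity label is reached. A sketch under our own node layout (not code from the thesis):

```python
def hierarchical_predict(node, x):
    """Route feature vector x down a classifier tree like Figure 3.6.
    Internal nodes are dicts holding a feature subset, a classifier,
    and children; leaves are activity labels."""
    while isinstance(node, dict):
        subset = [x[i] for i in node["features"]]   # per-node feature subset
        node = node["children"][node["classify"](subset)]
    return node
```

Because each node only ever sees the activities below it, the per-node problems stay small, which is what gives the framework its scalability.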
Table 3.6 lists the physical features selected at each classifier that are the most important for differentiating the activity subsets. For example, as expected, features AI and VI are useful in differentiating static and dynamic activities at classifier 1 (C1). At
C2, intensity-related features such as VI, AAE, and ARE are selected. AVH and CAGH are also selected since walking, running, and jumping have different velocities and intensities along the heading direction. At C4, the eigenvalues along both gravity and heading directions are selected, matching the result in Figure 3.2. At C6, ARATG is selected, again matching Figure 3.2. Finally, at C7, both AVG and CAGH are selected since going upstairs and going downstairs exhibit different velocities and intensities along the vertical direction.
Classifier   Physical Features Selected
C1           AI, VI
C2           VI, EVA(1), CAGH, AAE, AVH, ARE
C3           AI, VI, AAE
C4           EVA, CAGH
C5           VI, AVG, ARATG, CAGH
C6           ARATG, CAGH
C7           CAGH, AVG

Table 3.6: The physical features selected at each layer
Chapter 4
Sparse Representation for Activity Signals
In the previous chapter, we focused on designing and selecting features that best describe human activities. In this chapter, we explore the possibility of building human activity models based on the recently developed sparse representation and compressed sensing theories. We show that the task of looking for "optimal features" to achieve the best activity recognition performance (the issue we aimed to solve in the previous chapter) is less important within this novel sparse representation-based framework.
4.1 Introduction
During the past few years, research on high-dimensional sparse signals has seen great breakthroughs. A sparse signal is "a signal that can be represented as a linear combination of relatively few base elements in a basis or an overcomplete dictionary" [10]. In fact, many real-world signals are high-dimensional and sparse in nature. For instance, smooth images exhibit sparse structures if represented using a Fourier basis; similarly, piecewise smooth images can be regarded as sparse signals under a wavelet basis [85]. Although the original goal of exploiting a signal's sparsity was signal compression and reconstruction, its discriminative nature has also been exploited and successfully applied to many machine learning and pattern recognition tasks. Examples
include human face recognition [90], iris identification [66], facial action unit recognition [57], human speech recognition [35], and object recognition [58]. Inspired by these successes and by the highly scalable way in which sparse representation models data, in this chapter we explore the sparse nature of human daily activity signals sampled from wearable sensors and propose a sparse representation-based framework for human daily activity modeling and recognition. An important step of our approach is the selection of a basis or the design of the overcomplete dictionary for sparse representation. One option is to use standard sparsity-inducing bases such as the Fourier basis, wavelets, and curvelets used in many image processing techniques. In this work, we follow the method proposed in [90] and use the training samples directly as the basis to construct the overcomplete dictionary.
The rationale behind this strategy is that, given sufficient training samples, a test sample can be well represented by a linear combination of training examples from the same activity class, and therefore the representation of the test sample in terms of the training samples is naturally sparse. The main goal of this work is to study the key characteristics of the sparse representation-based framework for human activity modeling and recognition. To achieve this goal, we study the robustness of the framework with respect to different feature dimensions and different selections of features. We also compare the recognition performance obtained with sparse representation to that of conventional activity recognition approaches, so that the advantages of the proposed sparse representation-based approach can be clearly illustrated.
4.2 Sparse Representation-Based Framework
The block diagram of our sparse representation-based framework for human activity modeling and recognition is illustrated in Figure 4.1. The proposed framework consists of two stages: a training stage and a recognition stage.

Figure 4.1: The block diagram of the sparse representation-based human activity recognition framework

In the training stage, a sliding window is used to segment the streaming activity signals sampled from the wearable sensors into a sequence of fixed-length windows. In this work, the window is set to be 4 seconds long with 50% overlap. Here we assume that all the important information of each activity is contained inside each window. This information is extracted by
computing various features over the sampled sensor data within each window to form
a feature vector. The feature vectors from training samples of all activity classes are
then concatenated together to construct the overcomplete dictionary. In the recognition
stage, the unknown stream of activity signal is first segmented into fixed-length wint-
dows and then transformed into a feature vector in the same manner as in the training
stage. Its sparse representation based on the overcomplete dictionary constructed in the
training stage is then extracted and imported into the classifier for classification. In the
remainder of this section, we explain every component of this framework in detail.
4.2.1 Feature Extraction
The features used in this work are the same as the ones introduced in Chapter 3. Both
statistical and physical features are extracted from the accelerometer and gyroscope. In total, the dimensionality of the input feature space is 110. Please refer to Chapter 3 for the definitions of these features.
4.2.2 Feature Selection vs. Random Projection
As stated in the introduction section, one goal of this work is to study how the choices
of features and the dimensions of feature space affect the performance of our sparse
representation-based framework. We want to understand not only the effect of using the
features themselves but also the effect of using the linear combinations of these features
based on some linear transformations. There are many types of linear transformations
that can be used. Popular examples include principal component analysis (PCA) [81]
and linear discriminant analysis (LDA) [11]. In this work, we are particularly interested in using random projection as the linear transformation, since it has proven very powerful for encoding information effectively in many applications and can be implemented much more efficiently than PCA and LDA [15]. In random projection, the linear transformation is represented as a random matrix R whose entries are independent and identically distributed (i.i.d.) random variables drawn from a zero-mean Gaussian distribution, with each row normalized to unit length [19]. As a result, the newly generated
features are linear combinations of all the original features with randomly generated
coefficients. Compared to the features generated by PCA and LDA, these randomly
projected features are less structured but encode the information from all the original
features. We will compare the effects of using the original features selected from the
feature set and the transformed features generated by random projection on the recogni-
tion performance in the next section.
4.2.3 Overcomplete Dictionary Construction and Sparse Representation
Assume that there are $k$ distinct activity classes to recognize and $n_i$ training samples from class $i$, $i \in \{1, 2, \ldots, k\}$. Recalling that each training sample is represented as an $m$-dimensional feature vector, we arrange the given $n_i$ training samples from class $i$ as columns of a data matrix $D_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}] \in \mathbb{R}^{m \times n_i}$. Here, we make a key assumption that given sufficient training samples from class $i$ (i.e., the number of columns of the data matrix $D_i$ is large enough), any new test sample $y \in \mathbb{R}^m$ that belongs to the same activity class can be approximately represented as a linear combination of the training samples in $D_i$:

$$y = \alpha_{i,1} x_{i,1} + \alpha_{i,2} x_{i,2} + \ldots + \alpha_{i,n_i} x_{i,n_i} \quad (4.1)$$

with coefficients $\alpha_{i,j} \in \mathbb{R}$, $j = 1, 2, \ldots, n_i$.
Next, we define a new matrix $A$ which concatenates the training samples from all the activity classes as

$$A = [D_1, D_2, \ldots, D_k] \in \mathbb{R}^{m \times n} = [x_{1,1}, \ldots, x_{1,n_1}, x_{2,1}, \ldots, x_{2,n_2}, \ldots, x_{k,1}, \ldots, x_{k,n_k}] \quad (4.2)$$

where $n = n_1 + n_2 + \ldots + n_k$. In this case, the matrix $A$ can be seen as an overcomplete dictionary of $n$ prototype elements. Based on the assumption just mentioned, we can express the test sample $y$ from class $i$ in terms of the overcomplete dictionary $A$ as

$$y = A\alpha \quad (4.3)$$
where

$$\alpha = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T \quad (4.4)$$

is a sparse coefficient vector whose entries are zero except those associated with class $i$. Therefore, $\alpha$ can be regarded as a sparse representation of $y$ based on the overcomplete dictionary $A$. More importantly, the entries of $\alpha$ encode the identity of $y$. In other words, we can infer the class membership of the test sample $y$ by finding the solution of the linear system of equations (4.3), which is expected to be sparse.
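The dictionary construction above can be sketched as follows. The class sizes, feature dimension, and random data are made-up toy values; the per-class slices record which columns of $A$ belong to each class, which the class-selection step below will need:

```python
import numpy as np

# Hypothetical per-class training matrices D_i, each of shape (m, n_i);
# hstack concatenates them column-wise into the overcomplete dictionary A.
rng = np.random.default_rng(0)
m = 10                                                  # toy feature dimension
D = [rng.normal(size=(m, n_i)) for n_i in (4, 6, 5)]    # k = 3 classes
A = np.hstack(D)                                        # shape (m, n), n = 15

# Record which dictionary columns (atoms) belong to each class i.
bounds = np.cumsum([0] + [Di.shape[1] for Di in D])
class_slices = [slice(bounds[i], bounds[i + 1]) for i in range(len(D))]
```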
4.2.4 Sparse Recovery via $\ell_1$ Minimization
Based on the theory of linear algebra, the solution of the linear system of equations (4.3) depends on the characteristics of the matrix $A$. If $m > n$, the system $y = A\alpha$ is overdetermined and the solution can be found uniquely. However, in most real-world applications, the number of prototypes in the overcomplete dictionary is typically much larger than the dimensionality of the feature representation (i.e., $m < n$). In this case, the linear system of equations (4.3) is underdetermined and has no unique solution. Traditionally, this difficulty is resolved by choosing the minimum $\ell_2$ solution. That is, the desired coefficients have minimum $\ell_2$ norm:

$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_2 \quad \text{subject to} \quad y = A\alpha \quad (4.5)$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. The solution of the above problem is given by

$$\hat{\alpha} = A^{\dagger} y \quad (4.6)$$

where $A^{\dagger}$ is the pseudoinverse of $A$. However, this solution yields a non-sparse coefficient vector which is not informative for our activity recognition task.
Recent research in the field of compressed sensing [20] [24] has shown that if $\alpha$ is sufficiently sparse, it can be recovered by solving the $\ell_1$ minimization problem instead:

$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad y = A\alpha \quad (4.7)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm. This optimization problem, also known as Basis Pursuit (BP), is built on a solid theoretical basis and can be solved very efficiently with traditional linear programming techniques whose computational complexities are polynomial in $n$ [21].

In practical real-world applications, signals are always noisy. As a result, it may not be possible or reasonable to model the test sample exactly as a sparse linear combination of the training samples as in (4.3). In such cases, (4.3) can be modified to explicitly account for bounded noise:

$$y = A\alpha + e \quad (4.8)$$

where $e$ is a noise term with bounded energy $\|e\|_2 < \epsilon$. With this modification, the sparse solution of (4.8) can still be efficiently computed by solving the following $\ell_1$ minimization problem via second-order cone programming [21]:

$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad \|A\alpha - y\|_2 \leq \epsilon. \quad (4.9)$$
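The thesis solves (4.9) with the ℓ1-magic MATLAB package. As an illustrative stand-in (not the thesis implementation), the noiseless problem (4.7) can be cast as a linear program and solved with SciPy by splitting $\alpha$ into its positive and negative parts, a standard reformulation:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||alpha||_1 s.t. A alpha = y (problem (4.7)) as an LP:
    write alpha = u - v with u, v >= 0 and minimize sum(u) + sum(v)."""
    m, n = A.shape
    c = np.ones(2 * n)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

# Underdetermined toy system (5 equations, 12 unknowns) generated by a
# 1-sparse coefficient vector, which l1 minimization can recover.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 12))
alpha_true = np.zeros(12)
alpha_true[3] = 2.0
alpha_hat = basis_pursuit(A, A @ alpha_true)
```

The minimum-$\ell_2$ solution of the same system, by contrast, would spread energy over all twelve coefficients.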
4.2.5 Classification via Sparse Representation
Given a new test sample $y$ in the form of an $m$-dimensional feature vector from one of the $k$ activity classes, we first compute its sparse coefficient vector $\hat{\alpha}$ by solving (4.9). To identify the class membership of $y$, we adopt a classification strategy that compares how well the various parts of the coefficient vector $\hat{\alpha}$ associated with different activity classes can reproduce $y$, where the reproduction error is measured by the residual value. Specifically, the residual of class $i$ is defined as

$$r_i(y) = \|y - A\,\delta_i(\hat{\alpha})\|_2 \quad (4.10)$$

where $\delta_i(\hat{\alpha})$ is a characteristic function that selects only the coefficients in $\hat{\alpha}$ associated with class $i$. Therefore, $r_i(y)$ measures the difference between the true solution $y$ and the approximation using only the components from class $i$. Finally, $y$ is classified as the activity class $c$ that gives rise to the smallest residual value:

$$c = \arg\min_i r_i(y) \quad (4.11)$$
As an example, Figure 4.2(a) and Figure 4.2(c) illustrate the two coefficient vectors recovered by solving (4.9) with the noise tolerance $\epsilon = 0.03$ for two test samples from two activities: walk forward and running, respectively. As illustrated, both of the recovered coefficient vectors are sparse. Moreover, the majority of the large coefficients are associated with the training samples belonging to the same activity class. Figure 4.2(b) and Figure 4.2(d) show the corresponding residual values with respect to the nine activity classes. As illustrated, both test samples are correctly classified since the smallest residual value is associated with the true activity class. To show the robustness of our residual-based classification strategy, we calculate the ratio between the two smallest residuals for each test sample. The larger the ratio, the more robust the classification result. In the examples of Figure 4.2(b) and Figure 4.2(d), the ratios between the two smallest residuals are 1 : 2.2 for walk forward and 1 : 3.8 for running. We obtain similar results for the other seven activities.
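The residual rule (4.10)-(4.11) can be sketched as follows. To keep the arithmetic easy to follow, the toy example uses a hypothetical identity dictionary with two classes of two atoms each:

```python
import numpy as np

def classify_by_residual(A, y, alpha_hat, class_slices):
    """Rule (4.10)-(4.11): for each class i, keep only its coefficients
    (the characteristic function delta_i), measure ||y - A delta_i||_2,
    and return the class with the smallest residual."""
    residuals = []
    for sl in class_slices:
        delta = np.zeros_like(alpha_hat)
        delta[sl] = alpha_hat[sl]
        residuals.append(np.linalg.norm(y - A @ delta))
    return int(np.argmin(residuals)), residuals

# Toy example: identity dictionary, classes {atoms 0,1} and {atoms 2,3}.
A = np.eye(4)
class_slices = [slice(0, 2), slice(2, 4)]
alpha_hat = np.array([0.0, 0.0, 1.0, 1.0])   # all mass on class 2's atoms
label, residuals = classify_by_residual(A, A @ alpha_hat, alpha_hat,
                                        class_slices)
```

Here the second class reproduces the test sample exactly (zero residual), so `label` is its index.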
Figure 4.2: The sparse representation solutions via $\ell_1$ minimization and the corresponding residuals for two test samples from walk forward and running, respectively. (a) The sparse coefficient solution recovered via $\ell_1$ minimization for one test sample from activity walk forward. (b) The residual values with respect to the nine activity classes; the test sample is correctly classified as walk forward (index number 1), and the ratio between the two smallest residuals is 1 : 2.2. (c) The sparse coefficient solution recovered via $\ell_1$ minimization for one test sample from activity running. (d) The residual values with respect to the nine activity classes; the test sample is correctly classified as running (index number 6), and the ratio between the two smallest residuals is 1 : 3.8.
4.2.6 Classification Confidence Measure
The residual-based classification strategy described in the last subsection only provides
a classification result whose confidence is unknown. To quantify the classification con-
fidence, we use the metric Sparsity Concentration Index (SCI) proposed in [90]. The
rationale behind the design of SCI is based on the sparse recovery results illustrated in
Figure 4.2. Specifically, a test sample classified with high confidence should have a
sparse representation whose non-zero entries concentrate mostly on one activity class,
whereas a test sample classified with low confidence should have sparse coefficients
spread widely among multiple activity classes.
Based on this observation, the SCI of a coefficient vector $\alpha$ is defined as

$$SCI(\alpha) = \frac{k \cdot \max_i \|\delta_i(\alpha)\|_1 / \|\alpha\|_1 - 1}{k - 1} \quad (4.12)$$

With this definition, SCI takes values between 0 and 1. For the coefficient vector of a test sample recovered via $\ell_1$ minimization, if the SCI value is close to 1, that is, $\max_i \|\delta_i(\alpha)\|_1 / \|\alpha\|_1 \approx 1$, it indicates that the test sample can be approximately represented using training samples from only a single activity class. On the other hand, if the SCI value is close to 0, that is, $\max_i \|\delta_i(\alpha)\|_1 / \|\alpha\|_1 \approx 1/k$, it corresponds to the situation where no single activity class is dominant and the non-zero coefficients are distributed over all activity classes. For example, the SCI values of the two test
samples from walk forward and running in Figure 4.2 are 0.44 and 0.57 respectively.
Furthermore, we can manually set a threshold $\tau \in [0, 1]$ such that only test samples with SCI values equal to or larger than $\tau$ are considered. As will be shown in the next
section, τ can be used as an input parameter of the overall human activity recognition
system such that it can be tuned by users to achieve the desired performance.
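The SCI computation in (4.12) is only a few lines; this sketch assumes the same per-class column bookkeeping (`class_slices`) used when building the dictionary:

```python
import numpy as np

def sci(alpha, class_slices):
    """Sparsity Concentration Index (4.12): 1 when all l1 mass sits in
    one class, 0 when it is spread evenly over all k classes."""
    k = len(class_slices)
    total = np.sum(np.abs(alpha))
    max_frac = max(np.sum(np.abs(alpha[sl])) for sl in class_slices) / total
    return (k * max_frac - 1.0) / (k - 1.0)

slices = [slice(0, 2), slice(2, 4)]
concentrated = sci(np.array([1.0, 1.0, 0.0, 0.0]), slices)   # all in class 1
spread = sci(np.array([1.0, 0.0, 1.0, 0.0]), slices)         # split evenly
```

The first toy vector concentrates all its $\ell_1$ mass on one class and scores 1; the second splits it evenly over both classes and scores 0.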
4.3 Experiments and Results
In this section, we evaluate the performance of our sparse representation-based frame-
work. For the evaluation procedure, we use the leave-one-subject-out cross validation
strategy. Specifically, we use the data from thirteen subjects as training examples. Data
from the left-out subject is used for testing. This process iterates for every subject. The
final result is the averaged value over all the subjects.
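The leave-one-subject-out splitting can be sketched generically; the subject IDs below are illustrative:

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs, holding out one subject per fold
    so no test subject's data ever appears in its own training set."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        yield np.where(subject_ids != s)[0], np.where(subject_ids == s)[0]

# Five samples from three subjects -> three folds.
folds = list(leave_one_subject_out([0, 0, 1, 1, 2]))
```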
The framework was implemented in the MATLAB programming environment. The $\ell_1$ minimization was performed using the $\ell_1$-magic package¹. The noise tolerance $\epsilon$ was set to 0.03. We use the classification accuracy as the single quality metric for all our experiments. The classification accuracy (ACC) is defined as

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \quad (4.13)$$

where the variables TP, TN, FP and FN respectively represent the number of True Positive, True Negative, False Positive and False Negative outcomes in a given experiment.
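For instance, an experiment with 90 true positives, 5 true negatives, 3 false positives, and 2 false negatives (made-up counts) scores 95%:

```python
def accuracy(tp, tn, fp, fn):
    """Classification accuracy as defined in (4.13)."""
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=90, tn=5, fp=3, fn=2)   # (90 + 5) / 100 = 0.95
```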
4.3.1 Effect of the Feature Dimension and Comparison to Baseline
Algorithms
As our first experiment, we examine the framework’s classification performance with
respect to different feature dimensions, and compare it to three classical classification
methods: nearest neighbor (NN), naive Bayesian classifier (NBC), and support vector
machine (SVM). We choose these three classification methods as baselines because they allow the advantages of our proposed sparse representation-based approach to be clearly illustrated. In particular, Figure 4.3 illustrates the average classification accuracy rates as a function of feature dimension, ranging from 10 to 100 with an interval of 10. Each curve represents one classification method. At each dimension, features are selected using Sequential Forward Selection (SFS), since it is reported in [98] to be a very effective feature selection method. In this way, we compare the performance of the four classification methods using the same feature set.
¹http://www-stat.stanford.edu/∼candes/l1magic/
Figure 4.3: Impact of feature dimension: classification performance comparison between NN, SVM, NBC, and SR.
As shown in the figure, NN and NBC have relatively poor performance. The maximum recognition rates for NN and NBC are 91.3% and 89.4% respectively. In comparison, SVM achieves better performance than both NN and NBC at each feature dimension equal to or larger than 30, with a best rate of 94.8%. Our sparse representation-based classification method (SR) performs worst when the feature dimension is less than or equal to 30. This observation indicates that using fewer than 30 features is not sufficient to recover the sparse signals via $\ell_1$ minimization without information loss. However, when the feature dimension is equal to or larger than 40, SR achieves consistent performance and beats the other three classification methods, achieving a maximum recognition rate of 96.1% when the feature dimension is equal to 60.
To take a closer look at the classification result, Table 4.1 shows the confusion table for feature dimension equal to 50. The overall averaged recognition accuracy across all
activities is 95.2%, with eight out of nine activities having precision and recall values
higher than 90%. If we examine the recognition performance for each activity indi-
vidually, both walk left and walk right achieve very high precision and recall values.
Furthermore, they never get confused with each other. For jump up, although it has a
near 100% precision value, it only achieves a recall value of 92.9%. This is because
some of the samples of jump up are misclassified as run forward but not vice versa. For
walk forward, it is interesting to notice that it can be misclassified as any of the other walk-related activities. The same holds for go upstairs, except that go upstairs never gets misclassified as go downstairs. Finally, sit on a chair has a relatively low recall
value because it is mostly confused with stand. This result makes sense since both sit
on a chair and stand are static activities, and we expect difficulty in differentiating dif-
ferent static activities especially when the sensing device is attached to the hip of the
participants.
4.3.2 Effect of the Choice of Features and Random Projection
In this section, we study the effect of different choices of features and random projec-
tion on the classification performance of our framework. Similar to the last subsection,
Figure 4.4 shows the average classification accuracy rates vs. feature dimension from 10
to 100. The red curve with asterisks represents the standard SR method using features
selected based on the SFS feature selection method mentioned before. The black curve
with circles represents the standard SR method using features randomly selected without
the help from any feature selection algorithm. The blue curve with triangles represents
the SR method with random projection. In this work, the entries of our random pro-
jection matrix are independently sampled from a Gaussian distribution with mean zero
and variance 1/n (recall n is the total number of training samples included in the over-
complete dictionary). As shown, it is interesting to see that the SR method with random
Figure 4.4: Impact of feature choices: classification performance comparison between SR, SR with randomly selected features, and SR with random projection.
projection achieves very similar performance to the standard SR method using features
selected from SFS. In comparison, the SR method with randomly selected features per-
forms much worse than the other two methods when using 80 or fewer features. This
observation indicates that the choice of features plays a significant role in our sparsity-
based framework if random projection is not used. On the other hand, by using random
projection, the task of searching for "optimal features" to achieve the best performance becomes less important. In other words, the randomly projected features perform as well as features selected by effective feature selection algorithms.
4.3.3 SCI as a Measure of Confidence
Finally, as our last experiment, we examine the role of the classification confidence
measure SCI as a tunable input parameter of the recognition system.
Figure 4.5: Impact of SCI threshold value on classification performance.
As explained in the last section, we set a threshold $\tau \in [0, 1]$ such that only test samples with SCI values equal to or larger than $\tau$ are considered. Figure 4.5 shows the average classification accuracy rates obtained by sweeping the threshold $\tau$ from 0 to 1 with an interval of 0.1. As expected, the accuracy rate increases monotonically as the threshold $\tau$ increases from 0 to 1. The classification accuracy reaches 100% when the threshold $\tau$ is at 0.7. Based on this curve, the user can set the threshold $\tau$ to a specific value to make the recognition system achieve the desired performance.
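The sweep behind Figure 4.5 amounts to filtering test samples by SCI and re-scoring the retained ones; a sketch with made-up values:

```python
import numpy as np

def accuracy_vs_threshold(sci_values, correct, taus):
    """For each threshold tau, keep only test samples whose SCI >= tau and
    report the accuracy over the retained samples (None if none remain)."""
    sci_values = np.asarray(sci_values)
    correct = np.asarray(correct, dtype=float)
    out = []
    for tau in taus:
        keep = sci_values >= tau
        out.append(float(correct[keep].mean()) if keep.any() else None)
    return out

# Three toy test samples: low-SCI sample misclassified, high-SCI ones correct.
accs = accuracy_vs_threshold(sci_values=[0.2, 0.5, 0.8],
                             correct=[0, 1, 1],
                             taus=[0.0, 0.4, 0.7])
```

Raising the threshold discards the low-confidence (and here, incorrect) sample, so accuracy over the retained samples rises, mirroring the monotone curve in Figure 4.5.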
Ground Truth \ Classified | Walk forward | Walk left | Walk right | Go up stairs | Go down stairs | Run forward | Jump up | Sit on a chair | Stand | Total | Recall
1 Walk forward   | 1710 |   15 |   14 |   20 |    6 |   2 |   0 |    1 |    2 | 1770 | 96.6%
2 Walk left      |   10 | 1528 |    0 |    2 |   14 |   0 |   0 |    0 |    0 | 1554 | 98.3%
3 Walk right     |   29 |    0 | 1673 |   22 |    1 |   3 |   0 |    6 |    0 | 1734 | 96.5%
4 Go up stairs   |   31 |    8 |   19 | 1372 |    0 |   5 |   0 |    0 |    7 | 1442 | 95.1%
5 Go down stairs |    8 |   16 |    0 |    5 | 1352 |   5 |   0 |    0 |    0 | 1386 | 97.5%
6 Run forward    |    0 |    0 |    0 |    0 |    0 | 925 |   1 |    0 |    0 |  926 | 99.9%
7 Jump up        |    0 |    0 |    0 |    0 |   10 |  43 | 692 |    0 |    0 |  745 | 92.9%
8 Sit on a chair |    5 |    0 |    2 |    4 |    1 |   0 |   4 | 1387 |   91 | 1494 | 92.8%
9 Stand          |   11 |    7 |    0 |    4 |    4 |   0 |   0 |  164 | 1188 | 1378 | 86.2%
Total            | 1804 | 1574 | 1708 | 1429 | 1388 | 983 | 697 | 1558 | 1288 |      |
Precision        | 94.8% | 97.1% | 98.0% | 96.0% | 97.4% | 94.1% | 99.3% | 89.0% | 92.2% | |
Table 4.1: Confusion table for our sparse representation-based human activity recognition framework when feature dimension is 50. The entry in the i-th row and j-th column is the count of activity instances from class i but classified as class j. Overall classification accuracy is 95.2%.
Chapter 5
Learning Motion Primitive
5.1 Introduction
As illustrated in Chapter 3 and 4, conventional wearable sensor-based activity recogni-
tion techniques represent activities using a “whole-motion” model in which continuous
sensor streams are divided into fixed-length windows. The window length is properly
chosen such that all the information of the activity can be extracted from each window.
Features are then extracted from the window which are used as input to the classifier
for classification. Although this “whole-motion” model has proven very effective in
existing studies, the performance is highly dependent on the window length [30]. As
a possible solution to this problem, motion primitive-based models were proposed and
have recently attracted considerable research attention.
The motion primitive-based models are inspired by the similarity of human speech
signals and human motion [28]. In human speech recognition, sentences are first divided
into isolated words, which are then divided into a sequence of phonemes. Models are
first built for the approximately 50 phonemes shared by all words (in English). These
phoneme models then act as the basic building blocks to build words and sentences in
a hierarchical manner [42]. Following the same idea, in motion primitive-based model,
each activity is represented as a sequence of motion primitives which act as the smallest
units to be modeled. Different from the “whole-motion” model that examines the global
features for human activities, motion primitives capture the invariance aspects of the
62
local features and more importantly, provide insights for better understanding of human
motion.
The key issues related to the motion primitive-based model are: (1) constructing
meaningful motion primitives that contain salient motion information; and (2) repre-
senting activities based on the extracted primitives. Most existing approaches construct
primitives either using fixed-length windows with identical temporal/spatial duration or
through clustering. Each window is then mapped to a symbol according to a specific
mapping rule. As a consequence, the continuous activity signal is transformed into a
string of symbols where each symbol represents a primitive. Figure 5.1 shows an exam-
ple on two activity classes: walking forward (top) and running (bottom). For illustration
purposes, a total of five motion primitives are used (labeled A, B, C, D, E in different
colors). In this example, walking forward contains five types of motion primitives (A, B,
C, D, E) while running contains four (B, C, D, E). For both activities, the first line shows
the original sensor signal and the second line shows the primitive mapping of the orig-
inal sensor signal. Below these are five lines showing the locations of the five motion
primitives in the signal. The last line is a sample of the symbol string. To build activity
models based on these extracted primitives, one common strategy is to adopt a string-
matching-based approach. Specifically, in the training stage, for each activity class, a
string which minimizes the sum of intra-class distances is created and acts as a template
to represent all training instances belonging to that class. Since different strings in gen-
eral do not have the same length, the distances between them are normally measured by
edit distance (Levenshtein distance) [25]. In the recognition stage, the test instance is
first transformed into the primitive string, and then classified to the activity class whose
template matches the test instance the best. Although this string-matching-based strat-
egy shows competitive performance in both vision-based and wearable sensor-based activity recognition tasks [34] [73] [31] [29], the main drawback is its high sensitivity to noise and its poor performance in the presence of high intra-class variation [46].

Figure 5.1: An example of activity representation (walking forward (top) and running (bottom)) using five motion primitives (labeled A, B, C, D, E in different colors).
Under such conditions, it is extremely difficult to extract a meaningful template for each
activity class. Therefore, to overcome this problem, we use a statistical-based approach.
Our statistical motion primitive-based framework is based on the Bag-of-Features
(BoF) model, which has been applied in many applications such as text document classi-
fication, texture and object recognition and demonstrated impressive performance [96].
Different from the string-matching-based strategy, our BoF-based framework takes
advantage of the state-of-the-art learning machines with the aim to build statistically
robust activity models. There are two goals of this work. The first goal is to explore
the feasibility of applying a BoF-based framework for human activity recognition and
examine whether BoF can achieve better performance compared to the string-matching-
based approach. Our second goal is to perform a thorough study on several factors
which could impact the performance of the framework. These factors include the size
of windows, choices of features, methods to construct motion primitives, size of motion
vocabulary, weighting schemes of motion primitive assignments, and kernel functions
of the learning machines.
The rest of this chapter is organized as follows. Section 5.2 describes the basic idea
of BoF and outlines the key components of the BoF framework. Section 8.3 presents our
experimental results on the evaluations of these factors and compares the performance
between BoF and the traditionally used string-matching-based approach.
5.2 The Bag-Of-Features Framework
Figure 5.2: Block diagram of the Bag-of-Features (BoF)-based framework for human activity representation and recognition.
Figure 5.2 gives a graphical overview of our BoF-based framework for human activity representation and recognition. The framework consists of two stages. In the training stage, the streaming sensor data of each activity is first divided into a sequence of fixed-length window cells whose length is much smaller than the duration of the activity itself. Features are extracted from each window cell to form a local feature vector. The local
feature vectors from all training activity classes are then pooled together and quantified
through an unsupervised clustering algorithm to construct the motion vocabulary, where
the center of each generated cluster is treated as a unique motion primitive in the vocab-
ulary. By mapping the window cells to the motion primitives in the vocabulary, the
activity signal is then transformed into a string of motion primitives. Here, we assume
that activity signals do not follow any grammar and thus information about the temporal
order of motion primitives is discarded. Instead, we construct a histogram representing
the distribution of motion primitives within the string, and map the distribution into a
global feature vector. Finally, this global feature vector is used as input to the classifier
to build activity models and learn the classification function. In the recognition stage,
we first transform the unknown stream of sensor data into motion primitives and con-
struct the global feature vector based on the distribution of the motion primitives. Then
we classify the unknown sensor data to the activity class that has the most similar dis-
tribution in the primitive space. In the remainder of this section, we present the details
of all the key components of this framework.
5.2.1 Size of Window Cells
As the first parameter of our BoF framework, the size of window cells is known to have
a critical impact on recognition performance [30]. A large size may fail to capture the
local properties of the activities and thus dilute the discriminative power of the motion
primitive-based model. A small size, on the other hand, is highly sensitive to noise and
thus is less reliable for generating meaningful results. This trade-off between discrimination and stability motivates our study of the size of window cells. Our survey shows that a wide range of window cell sizes has been used in previous work, leading to difficulties in interpreting and comparing their results. At one extreme, Huynh et al. in [44] and Krause et al. in [52] extracted features from a 4-second window and an 8-second window respectively. At the other extreme, Stiefmeier et al. in [73] adopted a 0.1-second window. In this work, we experiment with window sizes ranging from 0.1 to 2 seconds. The best size is the one at which the classification accuracy reaches its maximum. We did not experiment with window sizes beyond 2 seconds since the "whole-motion" model has exhibited good performance at and beyond such scales in many existing studies.
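Cutting a continuous stream into window cells is straightforward; this sketch assumes a fixed sampling rate and non-overlapping cells:

```python
import numpy as np

def window_cells(signal, fs, cell_seconds):
    """Split a 1-D sensor stream into non-overlapping fixed-length cells;
    any incomplete trailing cell is dropped."""
    n = int(round(fs * cell_seconds))
    n_cells = len(signal) // n
    return np.asarray(signal[:n_cells * n]).reshape(n_cells, n)

# A 2.3-second stream sampled at 100 Hz with 0.5-second cells
# yields 4 complete cells of 50 samples each.
cells = window_cells(np.arange(230), fs=100, cell_seconds=0.5)
```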
5.2.2 Features
The features used in this work are the same as the ones introduced in Chapter 3. Both
statistical and physical features are extracted from accelerometer and gyroscope. The
key difference is that in this work we evaluate those features at the primitive level. Please
refer to Chapter 3 for the definitions of these features.
5.2.3 Primitive Construction
Primitive construction forms the basis of BoF and thus plays an important role in our
framework. The extracted primitives are expected to contain salient human motion infor-
mation and thus could be used to interpret human motion in a more meaningful way.
Existing approaches construct motion primitives either using fixed-length windows with
identical temporal/spatial duration or through unsupervised clustering. Stiefmeier et al.
in [73] first recorded the motion trajectory and divided the trajectory into fixed-length
windows with identical spatial duration. Motion primitives were then constructed by
quantifying all the fixed-length windows based on their trajectory directions calculated
in the Cartesian space. Krause et al. in [52], Huynh et al. in [44], and Ghasemzadeh
et al. in [29] followed the same procedure as in [73], but using clustering algorithms to
group data points with consistent feature values to construct motion primitives. In [52]
and [44], authors used K-means for clustering. In [29], Gaussian Mixture Model
(GMM) was used and was argued to outperform K-means by the authors due to its
tolerance to cluster overlap and cluster shape variation. In this work, we evaluate both
K-means and GMM methods.
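A toy K-means vocabulary builder might look like the following; this is an illustrative NumPy sketch, not the thesis implementation (which also evaluates GMM):

```python
import numpy as np

def build_vocabulary(local_features, m, iters=20, seed=0):
    """Toy K-means: cluster the pooled local feature vectors into m
    clusters; each cluster center becomes one motion primitive."""
    rng = np.random.default_rng(seed)
    X = np.asarray(local_features, dtype=float)
    centers = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                 # assign to nearest center
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)  # recompute centers
    return centers, labels

# Two well-separated blobs of cells -> two primitives near the blob means.
X = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]])
centers, labels = build_vocabulary(X, m=2)
```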
5.2.4 Vocabulary Size
The result of primitive construction is a motion vocabulary where each generated cluster
is treated as a unique motion primitive in the vocabulary. As a result, the vocabulary size
is equal to the total number of clusters. Vocabulary size has a similar effect as the size
of window cells mentioned in Section 5.2.1. Specifically, a small vocabulary may lack
discriminative power since two window cells may be assigned into the same cluster even
if they are not similar to each other. On the contrary, a large vocabulary is sensitive to
noise and thus susceptible to overfitting.
In [44], Huynh et al. experimented with vocabulary sizes ranging from 10 to
800. The best vocabulary size was determined based on the classification accuracy.
Ghasemzadeh et al. in [29] selected the vocabulary size with the best Bayesian Infor-
mation Criterion (BIC) score. In [52], the best vocabulary size was determined by the
guidance of the Davies-Bouldin index. In our study, we experiment with vocabularies of
5 to 200 primitives. These vocabulary sizes cover most of the implementation choices
in the existing work. The best vocabulary size is determined empirically, similar to our
determination of the best window cell size.
5.2.5 Primitive Weighting
Given the motion vocabulary, the next step is to construct the global feature vector to
represent activities based on the distribution of the motion primitives. There are many
ways to describe the distribution. In this work, we evaluate three weighting schemes
that map the distribution of motion primitives to the global feature vectors.
• Term Weighting: Term weighting originates from text information retrieval
where the counts of occurrences of words in a given text are used as features
for text classification tasks. In our case, the local feature vector extracted from
each window cell is first mapped to its nearest motion primitive in the feature
space. This quantization process generates a primitive histogram which describes
the distribution of the motion primitives for each activity. Given the primitive his-
togram, the feature value of each dimension of the global feature vector is set to
the count of the corresponding motion primitive in the histogram.
Formally, let $x_i$ be the local feature vector associated with the $i$-th window cell of the activity signal $x$, and let $P_j$ denote the $j$-th primitive (cluster) out of $m$ primitives (clusters) in the vocabulary. The term weighting feature mapping $\varphi_{term}$ is defined as

$$\varphi_{term}(x) = [\varphi_1, \ldots, \varphi_m]^T, \quad \text{where} \quad \varphi_j = \sum_{i \in x} \varphi_j^i, \quad \text{and} \quad \varphi_j^i = \delta(x_i \in P_j). \quad (5.1)$$
• Binary Weighting: Binary weighting is similar to term weighting, but with the difference that the feature value of each dimension of the global feature vector is either 1 or 0. The value 1 indicates the presence of the corresponding motion primitive in the primitive histogram while the value 0 indicates its absence. The binary weighting feature mapping $\varphi_{binary}$ is defined as

$$\varphi_{binary}(x) = [\varphi_1, \ldots, \varphi_m]^T, \quad \text{where} \quad \varphi_j = \bigvee_{i \in x} \varphi_j^i, \quad \text{and} \quad \varphi_j^i = \delta(x_i \in P_j), \quad (5.2)$$

where $\bigvee$ is the logical OR operator.
• Soft Weighting: The two weighting schemes described above are directly migrated from the text information retrieval domain. For text, words are discrete and sampled naturally according to language context. For the human motion signals in our case, signals are continuous and motion primitives are the outcome of clustering. Because of this difference, although the hard quantization that associates each window cell with only its nearest cluster shows good performance in the tasks of text analysis and categorization, it may not be optimal for continuous, smoothly-varying human motion signals. For example, two window cells assigned to the same motion primitive are not necessarily equally similar to that primitive since their distances to the primitive may differ. Therefore, the significance of motion primitives is weighted more accurately if these distances are taken into consideration. In this work, we propose a soft weighting scheme that takes the distances (similarities) between window cells and motion primitives into account during weight assignment.
Formally, let c_j denote the j-th cluster center (primitive prototype), and let K(·,·)
represent the kernel function for similarity measure. The soft weighting feature
mapping ϕ_soft is defined as

ϕ_soft(x) = [ϕ_1, ..., ϕ_m]^T, where ϕ_j = Σ_{i∈x} ϕ_j^i; ϕ_j^i = K(x_i, c_j),   (5.3)
where K(x_i, c_j) measures the similarity between the i-th window cell x_i and cluster
center c_j. In this work, we use the Laplacian kernel

K(x_i, c_j) = exp(−‖x_i − c_j‖ / σ_j)   (5.4)

where σ_j is the standard deviation of primitive P_j. As a consequence, the feature
value of the j-th dimension of the global feature vector, ϕ_j, measures the total sim-
ilarity of all the window cells of the activity signal x to the primitive prototype c_j.
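The three weighting schemes map directly onto a few lines of NumPy. The sketch below is an illustrative reimplementation under the definitions in (5.1)-(5.4), not the thesis code; the per-primitive bandwidths `sigmas` are assumed to be the cluster standard deviations σ_j.

```python
import numpy as np

def weight_features(cells, centers, sigmas, scheme="term"):
    """Map the local feature vectors of one activity segment (cells: n x d)
    to an m-dimensional global feature vector over m motion primitives."""
    # Distance from every window cell to every cluster center: (n x m).
    dists = np.linalg.norm(cells[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)          # hard assignment to nearest primitive
    m = centers.shape[0]
    if scheme == "term":                     # Eq. (5.1): occurrence counts
        return np.bincount(nearest, minlength=m).astype(float)
    if scheme == "binary":                   # Eq. (5.2): presence / absence
        return (np.bincount(nearest, minlength=m) > 0).astype(float)
    if scheme == "soft":                     # Eqs. (5.3)-(5.4): Laplacian kernel
        return np.exp(-dists / sigmas).sum(axis=0)
    raise ValueError(scheme)

cells = np.random.default_rng(0).normal(size=(40, 30))  # 40 cells, 30-dim features
centers = cells[:5].copy()                               # toy vocabulary, m = 5
sigmas = np.ones(5)                                      # per-primitive sigma_j
print(weight_features(cells, centers, sigmas, "term").sum())  # total count = 40 cells
```

Note that the soft scheme never thresholds the distances: every cell contributes a kernel weight to every primitive, which is exactly what distinguishes it from the two hard-assignment schemes.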
5.2.6 Classifier and Kernels
The choice of classifier is critical to the recognition performance. Since the size of
the motion vocabulary can be potentially large, in this work, we choose Support Vector
Machines (SVMs) to be our learning machine. They have proved to be very effective in
handling high dimensional data in a wide range of machine learning and pattern recog-
nition applications [27].
SVM aims to maximize the margin between different classes, where the margin is
defined as the distance between the decision boundary and the nearest training instances.
These instances, called support vectors, ultimately define the classification function [83].
Mathematically, for a two-class classification scenario, given a training set of instance-
label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n represents the n-dimensional feature
vector and y_i ∈ {1, −1} represents the class label, the support vector machine requires
the solution of the following optimization problem:

min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
subject to: y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l   (5.5)

where φ is a function that maps the training instance x_i into a higher (possibly infinite)
dimensional space; ξ_i are slack variables, which measure the degree of misclassification;
and C > 0 is the soft-margin constant acting as a regularization parameter to control the
tradeoff between training error minimization and margin maximization.
To enable efficient computation in high-dimensional feature space, a kernel function
K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is defined. The choice of the kernel function K(x_i, x_j) is
critical for statistical learning. Although a number of general-purpose kernels have been
proposed, it is unclear which one is the most effective for BoF in the context of human
activity classification. In this work, we evaluate the following two kernels, both of which
are Mercer kernels [83].

• Linear kernel:

K_linear(x_i, x_j) = x_i^T x_j   (5.6)

• Gaussian RBF kernel:

K_Gaussian(x_i, x_j) = exp(−γ ‖x_i − x_j‖^2), γ > 0   (5.7)
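As a concrete illustration of how the two kernels are used in practice (with scikit-learn, not the thesis implementation), the sketch below trains soft-margin SVMs with the linear kernel of (5.6) and the RBF kernel of (5.7) on synthetic stand-in features; `C` corresponds to the regularization constant in (5.5) and `gamma` to γ in (5.7).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the global BoF feature vectors (150-dim vocabulary).
X, y = make_classification(n_samples=300, n_features=150, n_informative=20,
                           n_classes=3, random_state=0)

accs = {}
for name, clf in [("linear", SVC(kernel="linear", C=1.0)),                 # Eq. (5.6)
                  ("gaussian", SVC(kernel="rbf", C=1.0, gamma="scale"))]:  # Eq. (5.7)
    accs[name] = cross_val_score(clf, X, y, cv=3).mean()  # 3-fold CV, as in Sec. 5.3
    print(name, round(accs[name], 3))
```

Both kernels plug into the same quadratic program; only the implicit feature map φ changes.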
5.3 Evaluation
In this section, we evaluate the effectiveness of our BoF-based framework. We divide the
dataset into a training set and a test set. Since each participant performs five trials for
each activity, we use three trials from each participant as the training set to build activity
models. Three-fold cross validation is used to determine the corresponding parameters. The
vocabulary of motion primitives is learned from half of the training set. The remaining
two trials from each participant are used as the test set. A confusion table is built from the
test set to illustrate the performance of the framework.
5.3.1 Impact of Window Cell Sizes
Our first experiment aims to evaluate the effect of different window cell sizes on the
classification performance. In this experiment, we use the statistical feature set, K-
means for primitive construction, term weighting and linear kernel for SVM training.
Figure 5.3 shows the average misclassification rates as a function of window cell sizes
0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1, 1.5, and 2 seconds. Each line represents one vocabulary
size. As shown in the figure, vocabulary size 5 has the worst performance across all
Figure 5.3: Impact of Window Cell Sizes
window cell sizes. This indicates that using only 5 motion primitives is not sufficient to
differentiate nine activities. In comparison, for the other three vocabulary sizes, the perfor-
mances are 30% better on average, with the misclassification rates ranging from 12.4%
to 19.8% across all window cell sizes. If we look at each case individually, vocabulary
size 50 reaches its minimum misclassification rate at a 0.2 second window cell size, and
the rate starts rising as the size increases. For vocabulary sizes 100 and 150, the misclas-
sification rates reach their first local minimum at 0.2 second and vary only slightly when
the window cell size is less than 0.8 second. The performances start degrading when the
size exceeds 1 second. Based on these observations, we conclude that the appropriate
window cell size is around 0.2 second. Therefore, we use only the 0.2 second window cell
in the remaining experiments.
5.3.2 Impact of Vocabulary Sizes
In this experiment, we study the impact of different vocabulary sizes on the classifica-
tion performance of our BoF framework. We fix the window size to 0.2 second and
keep other factors the same as in the last experiment. Figure 5.4 shows the average
misclassification rates as a function of vocabulary sizes 5, 10, 25, 50, 75, 100, 125, 150,
175, and 200. The error bars represent the standard deviation in the cross validation. As
Figure 5.4: Impact of Vocabulary Sizes
illustrated in the figure, the misclassification rate drops significantly from vocabulary
size 5 and stabilizes starting from vocabulary size 75. The misclassification rate reaches
the minimum of 12.0% (88.0% accuracy) when 150 motion primitives are used. When
the number of motion primitives is bigger than 150, the misclassification rate increases
slightly. This indicates that a vocabulary of 150 primitives is sufficient for our activity
set. Another interesting observation when combining the results in Figure 5.3 and Fig-
ure 5.4 is that vocabulary size has a more significant impact on the performance than the
size of the window cell.
5.3.3 Comparison of Features
Next, we examine the effects of features. Specifically, we use the statistical feature set
and physical feature set described in the previous section and keep other factors the
same to construct motion primitives and build activity models respectively. The results
are shown in Figure 5.5. Similar to the statistical features, the misclassification rate
based on the physical features drops significantly from vocabulary size 5 and stabilizes
starting from vocabulary size 75. The misclassification rate reaches its minimum of
9.9% (90.1% accuracy) when 100 motion primitives are used.
In addition, it is interesting to observe that the physical features outperform the sta-
tistical features consistently across all vocabulary sizes, with an improvement of 5.7%
for the same vocabulary size on average. This indicates that primitives constructed from
physical features contain more salient and meaningful motion information compared to
the primitives constructed from statistical features. In order to validate this argument
and have a better understanding of why the physical features perform better, we map the
primitives constructed by statistical features and physical features onto the original sen-
sor signals respectively. Figure 5.6 shows the primitive mappings based on the physical
features (top) and the statistical features (bottom) on the same sensor signal (running
Figure 5.5: Comparison of Features
in this example). For illustration purposes, a total of five motion primitives are used,
with different colors representing different primitives. As illustrated, different feature
sets lead to different primitive mappings. For statistical features, it is obvious that prim-
itives are constructed based on the signal’s statistical characteristics. For example, the
primitive in red corresponds to the data points which have mid-range raw values and
a positive derivative. The primitive in blue corresponds to the data points which have
high raw values and a small derivative. In comparison, primitives constructed based on
the physical features contain useful physical meanings that help discriminate different
activities. For example, the primitive in cyan illustrated in Figure 5.6 occurs in only half
of the cycle. This primitive may be a very important primitive for describing the motion of
the subject’s left/right hip (the subject wears the sensing device at this location) during
running. Since physical features outperform statistical features consistently across all
vocabulary sizes, only physical features will be used in the remaining experiments.
Figure 5.6: The difference of primitive mapping between physical features (top) and
statistical features (bottom)
5.3.4 Comparison of Primitive Construction Algorithms
This section compares the performance of the two primitive construction algorithms:
K-means and Gaussian Mixture Model (GMM). As shown in Figure 5.7, for GMM,
the misclassification rate drops significantly from vocabulary size 5. The misclassifica-
tion rate reaches the minimum of 18.5% (81.5% accuracy) when 150 motion primitives
are used. Compared to GMM, it is interesting to see that K-means achieves better
performance consistently across all vocabulary sizes, with an improvement of 13.5%
for the same vocabulary size on average. Our result contradicts the arguments of the
authors in [29], indicating that K-means can handle the cluster overlap and shape
variations of human motion data as long as the number of clusters is sufficient.
Figure 5.7: Comparison of Primitive Construction Algorithms
5.3.5 Comparison of Weighting Schemes
Figure 5.8 illustrates the performance differences between three primitive weighting
schemes. We first examine the relationship between binary weighting and term weight-
ing. In both cases, the misclassification rates drop and then stabilize as the size of
vocabulary increases in general. The difference between these two cases is that term
weighting outperforms binary weighting by a large margin when the vocabulary size is
small and by a small margin when the vocabulary size becomes large. This is because,
with a larger vocabulary size, the counts of a large number of motion primitives are
either 0 or 1, which makes term weighting and binary weighting similar. Next, we see
that the Laplacian kernel-based soft weighting scheme outperforms both term weighting
and binary weighting across all vocabulary sizes except vocabulary size 10. In partic-
ular, soft weighting achieves the minimum misclassification rate at 7% (93% accuracy)
when 125 motion primitives are used. This result indicates that, different from words
in text information retrieval, which are discrete, motion primitives extracted from con-
tinuous human motion signals are smooth, and taking this smoothness into account
significantly improves the classification performance of the BoF framework.
Figure 5.8: Comparison of Weighting Schemes
5.3.6 Comparison of Kernel Functions
In this experiment, we examine the performance of the BoF framework when two different
kernel functions are used. The results are shown in Figure 5.9. As illustrated, neither
kernel dominates everywhere, but the linear kernel is preferred when the vocabulary size
is equal to or larger than 100. This observation can be attributed to the fact that motion
primitives are linearly separable when the dimension of the primitive space is high.
Figure 5.9: Comparison of Kernel Functions
5.3.7 Confusion Table
The experimental results presented in the previous subsections demonstrate that all six
factors are influential to the final classification performance of our BoF-based frame-
work. Here, we investigate the best possible choices of the six factors with the goal of
exploring the upper limit of the performance of the BoF framework for human activity
recognition. Based on the results presented earlier, we determine the best combination of
factors to be a 0.2 second window cell, the physical feature set, a vocabulary with 125 motion
primitives, K-means for primitive construction, Laplacian kernel-based soft weighting
for motion primitive assignment, and a linear kernel for SVM training. To evaluate the
performance of the BoF-based framework with the best combination of factors, a con-
fusion table is built from the test set and is shown in Table 5.1. The overall recognition
accuracy across all activities is 92.7%. If we examine the recognition performance of
each activity individually, jump up and run forward are the two easiest activity classes
to recognize. go upstairs and go downstairs have relatively low recall values since they
can be confused with other walking-related activities. stand has the lowest recall value
because it is often confused with sit on a chair. This result makes sense since both
stand and sit on a chair are static activities, and we expect difficulty in differentiating
static activity classes, especially when the sensing device is attached to the hip
of the subjects. Finally, walk left and walk right are the two dominant activity classes
into which walk forward is misclassified. However, walk left and walk right are never
confused with each other.
5.3.8 Comparison with String-Matching
As our last experiment, we conduct a comparative evaluation with the non-statistical
string-matching-based approach. We implement the string-matching method described
in [29]. We select this method because the authors in [29] also use a clustering algorithm
to construct motion primitives. To make a fair comparison, we use a 0.2 second window
cell with statistical features and the K-means primitive construction algorithm for both BoF
and string-matching. The results are shown in Figure 5.10. As illustrated in the figure,
the average misclassification rate of the string-matching-based approach ranges from
37% to 54% across all vocabulary sizes. In addition, there is no clear trend of the mis-
classification rate as the vocabulary size varies. This indicates that the string-matching-
based approach is unstable, making it extremely difficult to determine a meaningful
vocabulary size. Moreover, as expected, the string-matching-based approach performs
consistently worse compared to BoF by a large margin across all vocabulary sizes. As
explained in the first section, this is because extracting meaningful string templates for
the string-matching-based approach is difficult when the activity data is noisy and has a
large intra-class variation.
Figure 5.10: Performance Comparison with String-Matching-Based Approach
5.4 Extension Based on Sparse Representation
An extension of our BoF-based framework is to build the motion primitives based on
recently developed compressed sensing and sparse representation theories. Figure 5.11
shows an overview of this new framework. In the training stage, the streaming sensor
Figure 5.11: The block diagram of the sparse representation-based motion primitive
framework
data sampled from activity segments is first divided into a sequence of fixed-length tiny
window cells whose length is much smaller than the duration of the activity segment
itself. Features are extracted from each window cell and stacked together to form a local
feature vector. The local feature vectors from all training activity segments are then
pooled together to learn the overcomplete dictionary. By incorporating sparse coding,
activity models are built and represented through sparse coefficients related to the dic-
tionary elements. Finally, these coefficients are used as global features to train the classifier.
In the recognition stage, the test activity segment is first transformed into a sequence
of local feature vectors in the same manner as in the training stage. Its sparse coefficients
related to the dictionary elements are then computed and imported into the classifier for
classification. We now present the details of each component.
5.4.1 Dictionary Learning
In this extension, we employ the K-SVD algorithm proposed in [3] to learn the over-
complete dictionary from the training data. Specifically, assume that there are L distinct
activity classes to classify and n_c training window cells from class c, c ∈ [1, 2, ..., L].
Recall that each window cell is represented as an m-dimensional local feature vector (m
is equal to 30 in our case). To learn the dictionary, we first pool the local feature vectors
from all the activity classes together and arrange them as columns to construct the data
matrix:

Y = [y_1, y_2, ..., y_N] ∈ R^{m×N}   (5.8)
where N = n_1 + n_2 + ... + n_L denotes the total number of training window cell
samples. Given Y, the K-SVD algorithm intends to learn a reconstructive overcomplete
dictionary D = [d_1, d_2, ..., d_K] ∈ R^{m×K} with K elements, over which each y_i in
Y can be sparsely represented as a linear combination of no more than T_0 dictionary
elements. This can be formulated as an optimization problem which constructs the
desired dictionary by minimizing the reconstruction error while satisfying the sparsity
constraints:

argmin_{D,X} ‖Y − DX‖_2^2  s.t.  ∀i, ‖x_i‖_0 ≤ T_0   (5.9)
where X = [x_1, x_2, ..., x_N] ∈ R^{K×N} are the sparse coefficients of the data matrix Y
related to D, ‖x_i‖_0 is the ℓ_0 norm of the coefficient vector x_i, which is equivalent to
the number of non-zero components in the vector, and the term ‖Y − DX‖_2^2 represents
the reconstruction error of Y over D in terms of the ℓ_2 norm.
Here, it is worthwhile to note the connection and the difference between the K-SVD
algorithm and the K-means algorithm in the baseline motion primitive-based model
for the task of dictionary learning. By using K-means, each local feature vector y_i
is represented by the dictionary element which has the minimum ℓ_2 norm distance to
it. Moreover, the coefficient multiplying the closest dictionary element is forced to be
the integer one. In comparison, the K-SVD algorithm is designed to look for a more general
solution, in which each local feature vector y_i is represented as a linear combination of
as many as T_0 dictionary elements. In addition, the corresponding coefficients can be
any real numbers. Therefore, the K-SVD algorithm can be regarded as a generalization
of the K-means algorithm. As a consequence, the dictionary learned by K-SVD is
expected to have more power to represent the data matrix Y.
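The two alternating stages of K-SVD can be sketched compactly in NumPy. This is a toy illustration of the algorithm in [3], not the thesis implementation: sparse coding uses a simple greedy OMP, the dictionary update is the per-atom rank-1 SVD step, and refinements such as unused-atom replacement are omitted.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy orthogonal matching pursuit: code y with at most T0 atoms of D."""
    residual, support = y.astype(float).copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(T0):
        corr = np.abs(D.T @ residual)
        corr[support] = 0.0                      # do not reselect chosen atoms
        support.append(int(corr.argmax()))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def ksvd(Y, K, T0, n_iter=10, seed=0):
    """Toy K-SVD: learn an m x K dictionary D so that Y ~ D X, ||x_i||_0 <= T0."""
    rng = np.random.default_rng(seed)
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    D = D / np.linalg.norm(D, axis=0)            # unit-norm initial atoms
    X = np.zeros((K, Y.shape[1]))
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, T0) for y in Y.T])   # sparse coding stage
        for k in range(K):                                   # dictionary update stage
            users = np.nonzero(X[k])[0]          # signals that use atom k
            if users.size == 0:
                continue
            # Residual without atom k, then rank-1 refit of the atom and its coefs.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k], X[k, users] = U[:, 0], S[0] * Vt[0]
    return D, X

data_rng = np.random.default_rng(0)
Y = data_rng.normal(size=(30, 80))               # m = 30 features, N = 80 cells
D, X = ksvd(Y, K=40, T0=5, n_iter=5)
print(D.shape, X.shape)                          # (30, 40) (40, 80)
```

Setting `T0=1` and forcing the coefficient to one would recover exactly the K-means special case discussed above.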
5.4.2 Sparse Coding for Activity Modeling
Given the overcomplete dictionary D learned in the previous step, any window cell y
can be decomposed as a linear combination of the dictionary elements. Its coefficients
x can be computed by solving the following standard sparse coding problem:

argmin_x ‖x‖_0  s.t.  ‖y − Dx‖_2 ≤ ε   (5.10)

where ε is the noise level. Finding the exact solution to (5.10) proves to be an NP-hard
problem [3]. However, if the signal is sparse enough, approximate solutions can
be found by pursuit algorithms such as matching pursuit (MP) [59] and orthogonal
matching pursuit (OMP) [79]. Based on our experiments, OMP achieves better perfor-
mance than MP. Therefore the results reported here are based on the OMP method.
Since each activity segment consists of a sequence of window cells, to build the
activity model, we need to find a meaningful way to accumulate information from all
the window cells within each segment. Assume there are M window cells in each
activity segment. As mentioned above, each window cell is represented as a linear
combination of dictionary elements [d_1, d_2, ..., d_K] with the corresponding coefficients
[x_{d_1}, x_{d_2}, ..., x_{d_K}]. These coefficients can be viewed as the weights of the dictionary
elements for reconstructing the window cells. Therefore, if we aggregate the coefficients
from all window cells in each activity segment together, we obtain a class-related
distribution of the dictionary elements for each activity segment. After normalization,
the distribution is transformed into the conditional probability defined as:

P(d_i | c) = (Σ_j x_{d_i, j}) / (Σ_i Σ_j x_{d_i, j})   (5.11)

where i ∈ [1, 2, ..., K], j ∈ [1, 2, ..., M], c ∈ [1, 2, ..., L], and P(d_i | c) represents
the probability of observing the dictionary element d_i given activity class c. Since these
conditional probabilities capture the global information of the activity segments, we
use them as features by concatenating them into a K-dimensional global feature
vector f = [P(d_1 | c), P(d_2 | c), ..., P(d_K | c)]^T for classification.
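A minimal sketch of this modeling step, using scikit-learn's OMP solver in place of the thesis code. One assumption is labeled in the comments: absolute coefficient values are summed before normalization to keep the resulting distribution nonnegative, whereas (5.11) is written over the raw coefficients.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def segment_features(cells, D, T0):
    """Global feature of one activity segment: sparse-code each of its M
    window cells over the dictionary D (m x K), then normalize the
    aggregated coefficients into a distribution over the K atoms (Eq. 5.11)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=T0, fit_intercept=False)
    coefs = np.array([omp.fit(D, y).coef_ for y in cells])   # M x K
    # Assumption: aggregate absolute values so the normalized weights are
    # nonnegative; the raw OMP coefficients themselves may be negative.
    agg = np.abs(coefs).sum(axis=0)
    return agg / agg.sum()

rng = np.random.default_rng(1)
D = rng.normal(size=(30, 50))                 # m = 30 features, K = 50 atoms
D /= np.linalg.norm(D, axis=0)                # unit-norm dictionary elements
cells = rng.normal(size=(12, 30))             # M = 12 window cells in a segment
f = segment_features(cells, D, T0=5)
print(f.shape, round(float(f.sum()), 6))      # (50,) 1.0
```

The resulting K-dimensional vector f is what the linear-kernel SVM described next consumes.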
5.4.3 Classifier
The size of the overcomplete dictionary can be potentially large. Here we use the multi-
class Support Vector Machine (SVM) with linear kernel as our classifier. This classifier
has proved to be very effective in handling high dimensional data in a wide range of
pattern recognition applications.
5.4.4 Experimental Results and Discussions
To achieve reliable results, we adopt a leave-one-trial-out cross validation strategy.
Specifically, since each participant performs five trials for each activity, we use four
trials of all participants for dictionary learning and activity model training while the
left-out trial is for testing. This process iterates for every trial, and the final result is the
average value across all five trials.
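The leave-one-trial-out protocol above can be sketched as a simple index generator (illustrative, assuming a flat array of per-sample trial numbers):

```python
import numpy as np

def leave_one_trial_out(trial_ids):
    """Yield (train_idx, test_idx) index pairs, holding out one trial number
    at a time across all participants."""
    trial_ids = np.asarray(trial_ids)
    for t in np.unique(trial_ids):
        yield np.nonzero(trial_ids != t)[0], np.nonzero(trial_ids == t)[0]

# Toy example: 3 participants x 5 trials each, flattened sample-wise.
trials = np.tile(np.arange(1, 6), 3)
folds = list(leave_one_trial_out(trials))
print(len(folds))               # 5 folds, one per held-out trial
```

Because the split is by trial number rather than by random shuffling, each fold holds out exactly one full trial from every participant, matching the protocol described above.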
As our first experiment, Figure 5.12 shows the recognition accuracy of our approach
with different window cell sizes from 0.1 to 2 seconds. As shown, the accuracy
reaches the maximum when the window cell size is 0.3 second. As the window cell
size increases, the recognition accuracy declines as a general trend. This is partially
attributed to the fact that, as the window cell size increases, the local feature vector con-
structed from each window cell can no longer capture the local information of the activity
signal. Moreover, the increase of window cell size leads to the reduction of
the total number of window cells included in each activity segment. Thus the statistical
power of our activity model that is built on top of the primitive distribution is diluted.
Next, we fix the window cell size to 0.3 second and examine the impact of sparsity
(T_0) on the classification performance. As shown in Figure 5.13, when T_0 is less than 30,
the recognition accuracy rises in general as T_0 increases. This observation demonstrates
the superiority of K-SVD over K-means for the task of dictionary learning. In other
Figure 5.12: Impact of Window Cell Sizes
words, it shows significant benefits in using more than one dictionary element with real-
valued coefficients to represent the window cells within each activity segment. It should
also be noted that when T_0 is bigger than 30, the recognition accuracy varies only slightly.
This indicates that using 30 elements of the dictionary is sufficient to reconstruct any
window cell in our activity dataset.
Finally, we set T_0 to 30 and measure the performance using different dictionary sizes
(K). We also compare our approach with the baseline motion primitive-based model
under the same conditions. The baseline algorithm uses K-means for dictionary learning
and the raw primitive distribution (histogram of motion primitives) for activity model-
ing. As shown in Figure 5.14, the accuracy of our approach rises at the very
beginning and stabilizes when the dictionary size reaches 50. A maximum accuracy
of 96.47% is achieved when the dictionary size is 75. More importantly, our approach
achieves much better performance compared to the baseline algorithm across all dic-
tionary sizes, with an improvement of 10% for the same dictionary size on average. This
Figure 5.13: Impact of Sparsity (T_0)
result indicates that by leveraging sparse coding techniques, the motion primitive-based
model can achieve a significant performance improvement.
Figure 5.14: Impact of Dictionary Sizes (K)
Ground Truth \ Classified   WF   WL   WR   US   DS   RF   JU   SC   ST | Total | Recall
1 Walk forward             126    6    7    2    1    0    0    0    0 |  142  | 88.7%
2 Walk left                  7  159    0    0    0    0    0    0    0 |  166  | 95.8%
3 Walk right                 9    0  190    2    0    0    0    1    0 |  202  | 94.1%
4 Go up stairs               3    0    0   26    1    0    0    0    0 |   30  | 86.7%
5 Go down stairs             2    1    0    0   26    1    0    0    0 |   30  | 86.7%
6 Run forward                0    0    0    0    0   93    0    0    0 |   93  | 100%
7 Jump up                    0    0    0    0    0    0   54    1    0 |   55  | 98.2%
8 Sit on a chair             0    0    0    0    0    0    0  169    9 |  178  | 94.9%
9 Stand                      0    1    1    0    0    0    0   22  134 |  158  | 84.8%
Total                      147  167  198   31   28   94   54  193  143
Precision                85.7% 95.2% 96.0% 87.1% 92.9% 98.9% 100% 87.6% 93.7%

(Columns: WF = walk forward, WL = walk left, WR = walk right, US = go up stairs,
DS = go down stairs, RF = run forward, JU = jump up, SC = sit on a chair, ST = stand.)

Table 5.1: Confusion table for the best factor combination when using a 0.2 second window cell, physical feature set, vocabulary
size = 125, K-means for primitive construction, soft weighting for motion primitive assignment, and linear kernel for SVM
training. The entry in the i-th row and j-th column is the count of activity instances from class i but classified as class j. Overall
classification accuracy is 92.7%.
Chapter 6
Discovering Low Dimensional Activity
Manifolds
6.1 Introduction
The computational activity models developed in the previous chapters are built on top
of feature vectors usually having relatively high dimensionalities. In this chapter, we
explore the possibility of lowering the dimensionality while retaining decision accu-
racy. Specifically, we propose a framework based on manifold learning techniques that
embeds the high-dimensional human activity signals into a low-dimensional space for
compact representation and recognition. The idea of this manifold-based framework
stems from the observation that the sensor signals of a subject performing certain activ-
ity are constrained by the physical body kinematics and the temporal constraints posed
by the activity being performed. Given these constraints, it is expected that the sensor
signals vary smoothly and lie on a low-dimensional manifold embedded in the high-
dimensional input space. Moreover, these manifolds capture the intrinsic activity struc-
tures and act as trajectories that characterize different types of activities. These
observations motivate the analysis of human activities in the low-dimensional manifold
space rather than the high-dimensional input space.
The keys to the success of the manifold-based framework are: (1) extracting mean-
ingful activity manifolds that preserve the intrinsic structure of the human activity sig-
nals; and (2) constructing effective recognition algorithms to perform activity classifi-
cation in the low-dimensional manifold space. For the first point, the main challenge
is that activity manifolds are nonlinear in nature and often twisted. Because of such
nonlinearity, linear models such as principal component analysis (PCA) and linear dis-
criminant analysis (LDA) are not able to discover the underlying manifold structures.
For the second point, the activity manifolds may have different shapes and lengths for
different activities and even the same activity because different subjects may perform the
same activity in different styles. The classification algorithm should be robust enough
to handle these inter-class and intra-class variations.
Based on the considerations mentioned above, we focus on developing a human
activity recognition framework based on nonlinear manifold learning techniques. These
techniques such as isometric feature mapping (Isomap) [76], local linear embedding
(LLE) [69], and Laplacian Eigenmap [12] are able to capture the low-dimensional non-
linear manifolds embedded in the high-dimensional input spaces for synthetic examples
as well as real world applications, such as face recognition [94], visual speech recog-
nition [18], visual object tracking [87], vision-based body pose identification [26] and
human movement analysis [17] [86]. However, there have been relatively few studies
on practical applications of manifold learning for wearable sensor-based human activity
recognition. Thus, there are two goals for this work. The first goal is to investigate
whether there exists a compact low-dimensional manifold representation for the activity
signals sampled from the wearable motion sensors. The second goal is to explore the
feasibility of applying manifold learning techniques for human activity recognition in
the low-dimensional manifold space.
6.2 Manifold-Based Framework
Figure 6.1: The block diagram of the manifold-based human activity recognition frame-
work
Figure 6.1 shows the block diagram of our manifold-based framework. The pro-
posed framework consists of two stages. In the training stage, the streaming sensor
data of each activity is first divided into a sequence of fixed-length window cells whose
length is much smaller than the duration of the activity itself (in this work, we use a
window cell size of 0.1 second). Features are extracted from each window cell to form a
local feature vector. The supervised LLE algorithm (as described in this section later) is
then applied to map each high-dimensional local feature vector onto a low-dimensional
manifold to construct an activity manifold for each activity class. In the recognition
stage, the unknown stream of sensor data is first transformed into a sequence of local
feature vectors. These feature vectors are then mapped into the low-dimensional man-
ifold space by the manifold projection mapping function learned in the training stage
by means of the nearest-neighbor interpolation technique (as described in this section
later). To classify the unknown sensor data, its newly constructed manifold is compared
to the manifolds of the known activity classes. Finally, the unknown data is assigned to the
activity class that has the most similar manifold. In the remainder of this section, we
present the details of all the components of this framework.
6.2.1 Feature Extraction
For wearable sensor-based human activity recognition, a variety of features both in time
and frequency domains have been investigated within the framework of the “whole-
motion” model. Popular examples are mean, variance, entropy, correlation, FFT coef-
ficients etc. However, since the total number of samples within each window cell is
small, complex features such as entropy and FFT coefficients may not be reliably calcu-
lated. Therefore, we only consider features that can be reliably calculated with a small
number of samples. Table 6.1 lists the features we include in this work. These fea-
tures are extracted from each axis of both accelerometer and gyroscope. Therefore, the
dimensionality of the input feature space is 30.
Feature               Description
Mean                  The DC component (average value) of the signal over the window
Standard Deviation    Measure of the spread of the signal over the window
Root Mean Square      The quadratic mean value of the signal over the window
Averaged Derivatives  The mean value of the first-order derivatives of the signal over the window
Mean Crossing Rate    The total number of times the signal changes from below average to
                      above average or vice versa, normalized by the window length

Table 6.1: Features and their brief descriptions
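As an illustration, the five features of Table 6.1 might be computed from one window cell as follows (a NumPy sketch, not the thesis's actual implementation; the function name and the (samples, axes) array layout are assumptions):

```python
import numpy as np

def extract_features(window):
    """Compute the five per-axis features of Table 6.1 for one window cell.

    `window` is a (samples, axes) array holding one 0.1-second window cell
    of the 6 accelerometer/gyroscope axes; stacking the 5 features over
    6 axes yields the 30-dimensional local feature vector.
    """
    mean = window.mean(axis=0)                          # DC component
    std = window.std(axis=0)                            # spread of the signal
    rms = np.sqrt((window ** 2).mean(axis=0))           # quadratic mean
    avg_deriv = np.diff(window, axis=0).mean(axis=0)    # mean first-order derivative
    # Mean crossing rate: count sign changes of the mean-removed signal,
    # normalized by the window length.
    signs = np.sign(window - mean)
    mcr = (np.abs(np.diff(signs, axis=0)) > 0).sum(axis=0) / len(window)
    return np.concatenate([mean, std, rms, avg_deriv, mcr])
```

Applied to a 6-axis window cell, this yields the 30-dimensional input point used by the manifold construction below.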
6.2.2 Learning Activity Manifolds
In this work, we adapt an LLE framework [69] to capture the intrinsic structures of the
activity signals and construct the corresponding low-dimensional activity manifolds. We
choose LLE over other manifold learning techniques such as Isomap and Laplacian
Eigenmap because LLE makes fewer assumptions on the activity signals and runs much
faster [72]. Although LLE was initially proposed as an unsupervised manifold learning
algorithm, here, we utilize the class label information and construct manifolds for each
activity class separately in a supervised manner.
Let $X = \{x_i \in \mathbb{R}^D,\ i = 1, \dots, N\}$ be the input activity signal segment with length
$N$ in the $D$-dimensional input space after feature extraction, where $x_i$ represents the
local feature vector associated with the $i$-th window cell within the segment and acts as a
single point in $\mathbb{R}^D$. LLE takes $X$ as input and computes the corresponding coordinate
vectors $Y = \{y_i \in \mathbb{R}^d,\ i = 1, \dots, N\}$ in the $d$-dimensional manifold space ($d < D$).
The procedure of the LLE algorithm consists of three steps and is described as follows.
Find neighborhood

Find the $K$ nearest neighbors for each point $x_i$, $i = 1, \dots, N$, in the $D$-dimensional input
space. In this work, the Euclidean distance is used to measure the similarity between
points after each feature dimension is normalized to zero mean and unit variance. The
value of $K$ is determined empirically.
Compute reconstruction weights
Assuming that each point and its neighbors lie on a locally linear patch of the underlying
manifold, each point can be reconstructed as a linear combination of its $K$ nearest
neighbors found in the first step. The objective of this step is to compute the reconstruction
weights that minimize the global reconstruction error measured by the cost function

$\varepsilon(W) = \sum_{i=1}^{N} \big\| x_i - \sum_{j=1}^{N} W_{ij}\, x_j \big\|^2$   (6.1)
where $W_{ij}$ represents the contribution (weight) of $x_j$ to the reconstruction of $x_i$.

To compute the weights $W_{ij}$, the cost function is minimized subject to two constraints:
(1) $W_{ij} = 0$ if $x_j$ is not one of $x_i$'s $K$ nearest neighbors, and
(2) $\sum_{j=1}^{K} W_{ij} = 1$, i.e., the weights over $x_i$'s $K$ nearest neighbors sum to one.
The solution (optimal weights $W_{ij}$) of this
optimization problem can be found by solving a least-squares problem [71].
Construct $d$-dimensional embedding

The constrained weights $W_{ij}$ derived from step 2 characterize the intrinsic geometric
properties of each point and its neighbors, and by design, they are invariant to transformations
from the $D$-dimensional input space to the $d$-dimensional manifold space. Therefore,
the same weights $W_{ij}$ that reconstruct $x_i$ in the $D$-dimensional input space can also
reconstruct its embedded manifold coordinates $y_i$ in the $d$-dimensional manifold space. Based
on this characteristic, the manifold coordinates $y_i$ can be computed by minimizing the
embedding cost function

$\Phi(Y) = \sum_{i=1}^{N} \big\| y_i - \sum_{j=1}^{N} W_{ij}\, y_j \big\|^2$   (6.2)
Similar to step 2, to compute the manifold coordinates $y_i$, the embedding cost function
is minimized subject to two constraints: (1) $\sum_{i=1}^{N} y_i = 0$, and
(2) $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$.
These two constraints make the problem well-posed, and the optimization problem
is transformed into an eigenvalue problem, in which we select the $d$ non-zero
eigenvectors corresponding to the $d$ smallest eigenvalues to provide the desired $d$ manifold
coordinates [69].
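The three steps above can be sketched as follows (a simplified NumPy reimplementation, not the thesis's actual code; the regularization term and the `build_activity_manifolds` helper for the supervised per-class usage are assumptions):

```python
import numpy as np

def lle(X, d=3, K=10, reg=1e-3):
    """Minimal LLE sketch following the three steps described above."""
    N = X.shape[0]
    # Step 1: K nearest neighbors (Euclidean) of each point.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)                 # exclude the point itself
    nbrs = np.argsort(D2, axis=1)[:, :K]
    # Step 2: reconstruction weights, each row summing to one (Eq. 6.1).
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                    # shift neighbors to the origin
        C = Z @ Z.T                              # local Gram matrix
        C += reg * np.trace(C) * np.eye(K)       # regularize for stability
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()
    # Step 3: embedding from the bottom eigenvectors of M = (I-W)^T (I-W),
    # discarding the constant eigenvector with eigenvalue ~0 (Eq. 6.2).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)         # ascending eigenvalues
    return eigvecs[:, 1:d + 1]                   # (N, d) manifold coordinates

def build_activity_manifolds(segments_by_class, d=3, K=10):
    # Supervised use: one manifold constructed per labeled activity class.
    return {label: lle(np.asarray(X), d=d, K=K)
            for label, X in segments_by_class.items()}
```

Running each class's feature vectors through `lle` separately mirrors the supervised manner described above, since points from different classes never mix as neighbors.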
Figure 6.2 illustrates the resulting manifolds in 3D spaces for four different activi-
ties: walk forward, run, jump up, and stand. As illustrated, the manifolds (trajectories)
of walk forward (Figure 6.2(a)), run (Figure 6.2(b)), and jump up (Figure 6.2(c)) each evolve
along a closed nonlinear curve in the embedded space. This is because these
three activities are either periodic or semi-periodic, causing the trajectories of different
cycles to overlap each other. More importantly, these results indicate that there exists a
compact low-dimensional manifold representation for these activities. However, for the
activity stand in Figure 6.2(d), there does not exist a clear trajectory representing the
activity itself. This result is expected since stand is aperiodic and static such that it is
difficult to extract a consistent trajectory. Therefore, it is not useful to recognize stand
and other similar aperiodic and static activities using this manifold-based framework.
Based on this observation, we do not take stand and sit into consideration from now on.
Figure 6.2: Manifolds of four different types of activities visualized in 3D spaces:
(a) Walking Forward, (b) Running, (c) Jumping Up, (d) Standing
6.2.3 Learning Input-to-Manifold Mapping
As shown in the previous subsection, given the input coordinates in $D$ dimensions, LLE
provides the embedding coordinates in $d$ dimensions directly. In other words, the mapping
function $f: \mathbb{R}^D \rightarrow \mathbb{R}^d$ is not explicitly given by LLE. For the task of activity
recognition, however, we need to compute the embedding coordinates corresponding to
recognition, however, we need to compute the embedding coordinates corresponding to
new test activity segments. In principle, we could rerun the entire LLE algorithm with
the original training dataset augmented by the test activity segment. For large datasets
of high dimensionality, however, this approach is prohibitively expensive. Thus, it is
necessary to derive an explicit mapping function between the high and low dimensional
spaces.
In this work, we use the non-parametric mapping function proposed in [71]. The
mapping function is inspired by the LLE algorithm described in the previous subsection
and learned by means of the nearest-neighbor interpolation technique. Specifically, to
compute the embedding coordinates $\hat{y}$ for a new input $\hat{x}$, we perform the following
three steps: (1) identify the $K$ nearest neighbors of $\hat{x}$ among the training set (denoted
as $\hat{x}_i$, $i = 1, \dots, K$); (2) compute the linear weights $W_i$ that best reconstruct $\hat{x}$ from its
$K$ nearest neighbors, subject to the constraint $\sum_{i=1}^{K} W_i = 1$; (3) since the neighbors of
$\hat{x}$ have known corresponding embedding coordinates (denoted as $\hat{y}_i$, $i = 1, \dots, K$), $\hat{y}$
is then obtained by linearly combining these embedding coordinates with the recovered
weights $W_i$. That is, $\hat{y} = \sum_{i=1}^{K} W_i\, \hat{y}_i$.
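The three-step nearest-neighbor interpolation might be sketched as follows (a hypothetical NumPy implementation; the regularization constant is an assumption added for numerical stability):

```python
import numpy as np

def map_to_manifold(x_new, X_train, Y_train, K=10, reg=1e-3):
    """Map a new local feature vector into the learned manifold space
    via the three-step nearest-neighbor interpolation described above."""
    # Step 1: K nearest neighbors of x_new in the input space.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    idx = np.argsort(dists)[:K]
    # Step 2: linear weights that best reconstruct x_new, summing to one.
    Z = X_train[idx] - x_new                 # shift neighbors to the origin
    C = Z @ Z.T                              # local Gram matrix
    C += reg * np.trace(C) * np.eye(K)       # regularize for stability
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                             # enforce sum(w) == 1
    # Step 3: linearly combine the neighbors' embedding coordinates.
    return w @ Y_train[idx]
```

Mapping every window-cell feature vector of a test segment this way produces the segment's trajectory in the chosen class's manifold space.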
As an example, Figure 6.3 shows the resulting manifolds after mapping a test seg-
ment of activity jump up (Figure 6.3(a)) and a test segment of activity walk forward
(Figure 6.3(b)) into the same activity jump up manifold space, respectively. In both figures,
the points in blue represent the mapped test segment. It is clear that the
manifolds of activity segments belonging to the same activity class have similar shapes
and highly overlapping trajectories, while the shapes of the manifolds of different activity
segments are quite distinct. This observation indicates that activity recognition can be
performed by comparing the shapes of manifolds.
Figure 6.3: Mapping results of the non-parametric mapping function: (a) the result after
mapping a test segment of activity jump up to the activity jump up manifold space;
(b) the result after mapping a test segment of activity walk forward to the activity
jump up manifold space
6.2.4 Recognizing Activity Manifolds
Based on the observations in Figure 6.2 and 6.3, activity recognition is performed by
comparing trajectories of manifolds in the low-dimensional space. One issue of trajec-
tory comparison is that trajectories of manifolds from different activity classes or the
same activity class but from different segments may be misaligned and have different
lengths. Therefore, a distance measure that can handle misalignment and variations in
trajectory lengths is desired. In this work, we use a variant of the Hausdorff metric,
that is, the “mean value of the minimums”, to measure the distance between different
manifolds:
$\mathrm{Dist}(M_1, M_2) = \frac{1}{T_{M_1}} \sum_{i=1}^{T_{M_1}} \min_{1 \le j \le T_{M_2}} \| M_1(i) - M_2(j) \|$   (6.3)
where $M_1$ and $M_2$ are the two manifolds under comparison, $T_{M_1}$ and $T_{M_2}$ are the lengths of
$M_1$ and $M_2$, respectively, and $M_1(i)$ is the $i$-th point on the manifold $M_1$ [86]. Since the
Hausdorff metric is asymmetric, the distance measure is thus modified in the form
$D(M_1, M_2) = \mathrm{Dist}(M_1, M_2) + \mathrm{Dist}(M_2, M_1)$   (6.4)
to ensure symmetry.
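A sketch of the symmetric distance of Eqs. (6.3) and (6.4), together with the resulting nearest-manifold classification rule (hypothetical Python; the function names are assumptions):

```python
import numpy as np

def manifold_distance(M1, M2):
    """Symmetric 'mean value of the minimums' distance (Eqs. 6.3-6.4).

    M1 and M2 are (T, d) arrays of manifold points; their lengths may differ.
    """
    # Pairwise Euclidean distances between every point of M1 and M2.
    pairwise = np.linalg.norm(M1[:, None, :] - M2[None, :, :], axis=-1)
    dist_12 = pairwise.min(axis=1).mean()   # Dist(M1, M2)
    dist_21 = pairwise.min(axis=0).mean()   # Dist(M2, M1)
    return dist_12 + dist_21                # symmetric form D(M1, M2)

def classify_segment(test_manifolds, class_manifolds):
    """Assign the test segment to the class with the most similar manifold.

    `test_manifolds[c]` is the test segment mapped into class c's manifold
    space; `class_manifolds[c]` is the training manifold of class c.
    """
    return min(class_manifolds,
               key=lambda c: manifold_distance(test_manifolds[c],
                                               class_manifolds[c]))
```

Because both directed terms are averaged over all points, the measure tolerates misalignment and differing trajectory lengths, as required above.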
Based on this distance measure, the recognition procedure is as follows. The test
activity segment is first mapped into the manifold space of each known activity class
to construct its manifold. This newly constructed manifold is then compared to the
manifolds of each known activity class. The test activity segment is classified as the
activity class that has the most similar manifold.
6.3 Evaluation Results
In this section, we evaluate the effectiveness of our manifold-based framework. We
divide the dataset into a training set and a test set. Both sets cover segments from all activity
trials performed by all participants. Activity manifolds and the corresponding param-
eters are learned from the training set. A confusion table is built from the test set to
illustrate the performance of the framework.
6.3.1 Estimating the Intrinsic Dimensionality
As our first experiment, given that compact low-dimensional manifold structures exist
for human activity signals, it is important to estimate the manifolds' intrinsic
dimensionality. In this work, we use the residual variance proposed in [76] for the estimation.
The residual variance is defined as

residual variance $= 1 - R^2(D_I, D_M)$   (6.5)

where $D_I$ and $D_M$ are the Euclidean distance matrices in the input space and the
low-dimensional embedding space, respectively, and $R$ is the standard linear correlation
coefficient taken over all entries of $D_I$ and $D_M$. The lower the residual variance is, the
better the high-dimensional input data are represented in the low-dimensional manifold
space.
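The residual variance of Eq. (6.5) can be sketched as follows (hypothetical Python; the `estimate_intrinsic_dim` elbow heuristic and its tolerance are assumptions, since the thesis picks the elbow by inspection of the curves):

```python
import numpy as np

def pairwise_distances(X):
    """Condensed (upper-triangular) vector of Euclidean pairwise distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return D[iu]

def residual_variance(X_input, Y_embedded):
    """Residual variance 1 - R^2(D_I, D_M) of Eq. (6.5)."""
    d_i = pairwise_distances(X_input)       # distances in the input space
    d_m = pairwise_distances(Y_embedded)    # distances in the manifold space
    r = np.corrcoef(d_i, d_m)[0, 1]         # standard linear correlation
    return 1.0 - r ** 2

def estimate_intrinsic_dim(variances, tol=0.01):
    """Pick the 'elbow': the first dimensionality after which adding one
    more dimension reduces the residual variance by less than `tol`."""
    drops = -np.diff(variances)
    for d, drop in enumerate(drops, start=1):
        if drop < tol:
            return d
    return len(variances)
```

Evaluating `residual_variance` at d = 1, 2, ... and applying the elbow rule reproduces the per-activity estimates reported below.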
Figure 6.5 illustrates the values of residual variance as a function of the dimensional-
ity of the manifold space for different activities. To avoid overfitting, the intrinsic dimen-
sionality of the manifold d is estimated by looking for the “elbow” at which the curve
ceases to decrease significantly with added dimensions [76]. As expected, the intrinsic
dimensionality differs across activity manifolds. Specifically, for activity
walk forward (Figure 6.5(a)), walk left (Figure 6.5(b)), walk right (Figure 6.5(c)), and
run (Figure 6.5(f)), the estimated intrinsic dimensionality is 3. For activity go upstairs
(Figure 6.5(d)) and go downstairs (Figure 6.5(e)), the estimated intrinsic dimensionality
is 4. For activity jump up (Figure 6.5(g)), the estimated intrinsic dimensionality is 2.
It should be noted that a higher intrinsic dimensionality implies that the activity has more
dimensions of variation and a more complicated structure. Therefore, our result indicates
that go upstairs and go downstairs contain the most complicated structures while jump
up has the simplest structure among all the activities, which to some extent matches our
intuition. Finally, since the intrinsic dimensionalities are different for different activities,
activity manifolds are constructed and classified in their own intrinsic-dimensionality
spaces.
6.3.2 Impact of the Number of Nearest Neighbors
The other key parameter of the framework is the number of nearest neighbors (K)
defined in the first step of LLE. It is obvious that a small K may falsely divide a con-
tinuous manifold into disjoint sub-manifolds. In the extreme, the LLE algorithm can
only recover embeddings whose intrinsic dimensionality is strictly less than K [71]. In
contrast, a large K may violate the basic assumption of local linearity. Furthermore, if
K is larger than the dimensionality of the input space (in our case, D = 30), the local
reconstruction weights in the second step of LLE are no longer uniquely defined [71].
Given these constraints, in this study, we experiment with K ranging from 5 to 25. Five-fold
cross validation is used to evaluate the performance. The best K is determined as
the one at which the classification accuracy reaches the maximum.
Figure 6.4 shows the average misclassification rates as a function of K at 5, 10,
15, 20, and 25. The error bars represent the standard deviation across five folds in
cross validation. As illustrated in the figure, the misclassification rate drops significantly
from K = 5 and reaches the minimum at K = 10. When K is larger than 10, the
misclassification rate increases. This observation demonstrates that using 10 nearest
neighbors is the best to construct activity manifolds for our dataset.
6.3.3 Confusion Table
Finally, the confusion table for the test set with K = 10 is shown in Table 6.2. The
overall recognition accuracy across all activities is 80.3%. If we examine the recogni-
tion performance for each activity individually, jump up (with a 94.1% precision and
Figure 6.4: Impact of the number of nearest neighbors (K) on the classification
performance of the manifold-based framework (mis-classification error versus K,
for K = 5, 10, 15, 20, 25)
90.9% recall) and run (with a 98.2% precision and 88.9% recall) are the two easiest
activities to recognize. Compared to other activities, go upstairs and go downstairs
have relatively low precision values (75.5% and 74.5% respectively). This is because
these two activities can be confused with each other and other walking-related activi-
ties. Finally, walk forward and walk right perform the worst in the sense that they have
the lowest recall value (71.9%) and precision value (74.0%) respectively. As illustrated
in the table, walk forward, walk left and walk right are more likely to be misclassified
among each other. This indicates that the manifolds of these three activities have
similar shapes and trajectories, making it difficult for our manifold-based framework to
differentiate them from each other.
Figure 6.5: Intrinsic dimensionality estimation based on residual variance:
(a) Walking Forward, with d = 3; (b) Walking Left, with d = 3; (c) Walking Right,
with d = 3; (d) Walking Upstairs, with d = 4; (e) Walking Downstairs, with d = 4;
(f) Running, with d = 3; (g) Jumping, with d = 2
                                  Classified Activity
Ground Truth          1     2     3     4     5     6     7   Total  Recall
1 Walk forward      146    23    27     2     2     1     2     203   71.9%
2 Walk left          12   178    28     3     1     0     0     222   80.2%
3 Walk right         19    27   182     1     3     0     0     232   78.4%
4 Go up stairs        2     1     2    40     3     0     1      49   81.6%
5 Go down stairs      1     1     3     2    35     1     0      43   81.4%
6 Run forward         3     2     3     3     1   112     2     126   88.9%
7 Jump up             1     2     1     2     2     0    80      88   90.9%
Total               184   234   246    53    47   114    85
Precision         79.3% 76.1% 74.0% 75.5% 74.5% 98.2% 94.1%

Table 6.2: Confusion table when using 10 nearest neighbors. Columns 1-7 denote the
classified activities in the same order as the rows. The entry in the i-th row and j-th
column is the count of activity instances from class i but classified as class j. The
overall recognition accuracy is 80.3%.
Chapter 7
RehabSPOT: A Customizable
Networked Body Area Sensing System
for Computerized Rehabilitation
As stated in the first chapter, wearable sensor-based human activity recognition tech-
nology can be applied to a variety of healthcare applications. In the previous chapters,
we have described several fundamental human activity recognition techniques. In the
following two chapters, we shift the focus to applications built on top of the wearable
sensing systems and human activity analysis techniques. Specifically, the following two
chapters focus on the development of a body-area networked sensing system for computerized
medical rehabilitation for stroke patients. Chapter 7 focuses on the system design
and Chapter 8 focuses on the signal processing algorithm development for motion anal-
ysis of stroke patients.
7.1 Introduction
In the US, more than 700,000 people annually suffer a stroke, a disease that is a lead-
ing cause of long-term disability [1]. This disability can manifest itself as difficulty in
performing activities of daily living such as dressing, eating meals, bathing, and work
related tasks. Fortunately, during the early post-stroke period, the impaired limb is not
completely paralyzed but has limited movement capability. Studies show that the loss
of function can be improved with physical rehabilitation through some type of task-oriented
motor training [62]. However, motor-training tasks used in conventional rehabilitation
are carried out by physical therapists based on many years of experience.
This methodology is limited in its capability to systematically control stimulus
presentation and precisely capture motor performance for diagnosis and evaluation in real
time [95]. As a result, the status and the progress the patient achieves during rehabilitation
cannot be reliably monitored and precisely evaluated.
The emergence of body area sensing systems attempts to provide a solution to this
problem. These systems capture physiological and physical motion data of the patients
and then transmit these data to medical personnel for monitoring and processing. By
using these systems, numerous assets are provided beyond what is currently available
with traditional technology. They (1) address the weaknesses of traditional data collec-
tion methods such as imprecision (qualitative observation) and undersampling (infre-
quent assessment) [39] [74]; and (2) enable telehealth based applications to shorten the
hospital stay for patients under treatment.
In order to fully exploit these benefits, three important criteria must be followed
when designing a practically useful body area sensing system for rehabilitation. First,
the sensing system must be non-invasive so that patients feel comfortable when
equipped with it. In addition, the system must not limit the movement of the patients.
Otherwise, the fidelity of the collected data cannot be guaranteed. Second, as the
focus of healthcare shifts from being “hospital-centered” to “patient-centered”, body
area sensing systems must evolve to facilitate highly personalized care. Furthermore,
considering that the users of wearable systems are patients and medical personnel, who
normally have limited engineering background, the system must be designed
for ease of use and quick configuration. Third, if the systems are to control or
help assess life-critical physiological events, they must be reliable.
To achieve the goals listed above, in this chapter, we describe the design, implemen-
tation, and experimental evaluation of RehabSPOT, a highly customizable wireless net-
worked body area sensing system for rehabilitation. RehabSPOT is built on top of Sun
SPOT technology from Sun Microsystems (now part of Oracle).¹ The purpose of the
development of RehabSPOT is to provide a new sensing technology to benefit patients
under treatment and facilitate the daily work of physical therapists. RehabSPOT consists
of a number of “free-range” sensing nodes which can be attached to various locations of
human body and a basestation connected to a workstation / PC. Each free-range sens-
ing node is able to sense, perform computations, and communicate via wireless radio.
The basestation acts as a “bridge” to interconnect the free-range sensing nodes and the
program running on the workstation.
What distinguishes RehabSPOT from other existing sensing systems is its novel soft-
ware architecture that enforces a high degree of system configurability and reliability.
Some features of RehabSPOT’s software architecture are listed below:
• The software architecture supports dynamic sensor management including sensor
addition/removal and sampling rate adjustment during runtime;
• The system utilizes a device discovery manager module for dynamic body area
sensor network construction during runtime;
• The system uses an exception handler to detect sensor failure inside each free-
range sensing node;
• In cases when free-range nodes temporarily lose connections to the basestation,
RehabSPOT supports both multi-hop routing and on-board storage.
¹ http://www.sunspotworld.com/
The rest of this chapter is organized as follows. Section 7.2 gives a brief review of
some existing networked body area sensing systems and highlights the contributions of
this work. Section 7.3 describes the design and implementation details of our Rehab-
SPOT platform. The experiment design and evaluation results are presented in Sec-
tion 7.4.
7.2 Existing Networked Body Area Sensing Systems
In recent years, numerous networked body area sensing systems have been proto-
typed for many healthcare related applications. Examples include physical rehabili-
tation [74] [47], ambulant patient monitoring [2] [82] [48] [91] [4], and kinetics stud-
ies [7] [14]. In this section, we review some of the existing networked body area sensing
systems from a system design perspective with a special focus on system configurability.
The LiveNet project in [74] is one of the pioneering works that focus on designing
a wearable sensing system intended for long-term ambulatory health monitoring. The
system adopts a distributed architecture which uses a central sensor hub to wire hetero-
geneous sensors placed on multiple locations of the human body. A Personal Digital
Assistant (PDA) is used as a personal server and data sink to receive data from the cen-
tral hub for display and online processing. The major concern with the system is its
wired interface, which may limit the users' movement. As a result, the fidelity of the
collected data may be compromised.
Jovanov et al. in [47] present a multi-tier telemedicine body area sensing system
used for ambulatory monitoring and rehabilitation. Compared to [74], the system adopts
wireless technologies to acquire data from on-body sensors. At its top tier, the system
is capable of connecting to high-level communication infrastructure, such as cellular
network and Wi-Fi, to transmit sensed data to remote sites. Although this multi-tier
system architecture greatly extends the usefulness of body area sensing systems, it does
not take system configurability into consideration, which we claim is the key to making
body area sensing systems practically useful in real-world applications.
The MEDIC system proposed in [91] shares a similar multi-tier architecture as
in [47] and takes a step toward enabling system configuration. The authors design a
centralized software architecture installed in a PDA. The software not only permits users
to enable/disable particular on-body sensors, but also supports system configuration and
sensor management by receiving commands sent from medical professionals at remote
sites. Although the system can be customized to provide personalized healthcare, its
configurability is quite limited. The software only supports static configuration in that
sensor devices can only be paired with the PDA before runtime. As a result, the body
area sensor network cannot be reconstructed if necessary during runtime.
Andre et al. in [4] propose a prototype system for health monitoring applications
that takes a further step towards system configurability. The system consists of a
personal server (a Pocket PC) and a multitude of “Mednodes”. A Mednode is a sensing
device equipped with a processing unit, a sensor board, and a radio to support wire-
less communication. Customization at Mednode is realized by downloading different
software programs onto Mednodes based on medical conditions and requirements of
different patients.
Our RehabSPOT platform resembles [4] in spirit, in terms of both using sensing
devices equipped with computing power and wireless interface to form a networked sen-
sor system. However, RehabSPOT differs from the existing prototype systems and
complements them in the following three aspects: (1) RehabSPOT adopts a lightweight
protocol for device discovery to support dynamic body area sensor network construction;
(2) Instead of downloading different programs into different sensing devices, Rehab-
SPOT runs a uniform program on all the sensing devices, where they can be configured to
perform different functions during runtime; and (3) Compared to [91] and [4], Rehab-
SPOT can be reconfigured after it has been deployed and initialized, which makes the
system more useful in real-world settings.
7.3 The Design of RehabSPOT
Figure 7.1 illustrates the three-tier architecture of our RehabSPOT platform. The first
Figure 7.1: Overview of the RehabSPOT architecture (Tier-1: free-range nodes connected
via 802.15.4; Tier-2: basestation linked to the workstation via a TCP/IP socket;
Tier-3: the Internet)
tier consists of all the free-range nodes that themselves are organized as a mesh network.
In this tier, each free-range node can communicate with any other node based on its
unique MAC address. The second tier includes the basestation and all the free-range
nodes the basestation can reach. The basestation and free-range nodes altogether form
a star network where the basestation acts as the master node. In this tier, data is first
collected by the free-range nodes and transmitted to the basestation via a wireless channel.
Then, the basestation passes the data to the client program running on the workstation for
real-time display and online processing. Meanwhile, physical therapists can configure
the system via a graphic user interface (GUI) provided as a part of the client program.
Our system relies on the established internet infrastructure as its third tier. In this tier,
data stored inside the workstation can be transmitted to a remote site such as servers in
hospitals and clinics for backup or further processing.
7.3.1 Sensing Hardware
The hardware of our RehabSPOT platform is based on Sun SPOT technology from Sun
Microsystems. The system is composed of a number of Sun SPOT free-range nodes,
a Sun SPOT basestation, and a multitude of sensors that we deploy for rehabilitation
applications. Both the Sun SPOT free-range node and the basestation are shown in
Figure 7.2(a).
Figure 7.2: Sun SPOT sensing platform: (a) Sun SPOT free-range node (left) and Sun
SPOT basestation (right); (b) the signal conditioning accessory board with voltage
amplifiers and voltage divider circuitry (left), and the free-range Sun SPOT coupled
with the signal conditioning accessory board (right)
Sensors
In health monitoring systems, sensors play key roles in sensing vital physiological sig-
nals as well as monitoring physical behaviors of users. For physical rehabilitation in
particular, various types of kinetic sensors are employed in our RehabSPOT system.
These sensors continuously measure body movements so as to identify and quantify a
patient’s physical dysfunctions. Table 7.1 summarizes the kinetic sensors we can use
and applications where these sensors can be utilized.
Sensor          Applications
Accelerometers  Gait analysis, activity recognition, upper extremity dysfunction identification
Gyroscopes      Gait analysis, upper extremity dysfunction identification
Bend            Single-finger dysfunction identification
Pressure        Gait analysis, finger / foot pressure measurement
IR Ranger       Upper extremity dysfunction identification
Stretch         Multi-finger dysfunction identification

Table 7.1: Various kinetic sensors that can be used in physical rehabilitation training
tasks
Sun SPOT Free-Range Node
The Sun SPOT free-range node is a small standalone device designed for low-power
RF applications. Like general embedded systems, it is programmable and can be
customized for various applications. The device consists of a 180 MHz ARM
microprocessor with 4 MB of on-board flash memory, an integrated IEEE 802.15.4 radio for wireless
communication, a sensor board, and a battery board. The powerful 32-bit microprocessor
and the 4 MB of on-board flash memory make real-time processing and local data storage
possible. The sensor board not only includes a range of built-in sensors, such as a 2g/6g
tri-axis accelerometer, but also provides a standard interface to outfit external sensors,
such as the ones listed in Table 7.1. Furthermore, in order to drive sensors whose outputs
are in the millivolt range, we have implemented a signal conditioning accessory board
to amplify the sensor outputs such that the on-board 10-bit A/D converter can process
them correctly. The board itself and a Sun SPOT free-range node coupled with the board are
shown in Figure 7.2(b).
Sun SPOT Basestation
Compared to the Sun SPOT free-range node, the Sun SPOT basestation does not have
the sensor board and battery. It is a device wired to a development machine (workstation
/ PC) and allows users to write programs that can run on the PC and use the basestation’s
radio to communicate with any remote Sun SPOT free-range node.
7.3.2 Software Architecture
The system software is based on a client-server architecture. The server program is
installed and runs on the PC, while the client program is installed on the free-range
Sun SPOT nodes. Both the client and server programs are written in Java.
The communication between client and server programs follows the message-passing
distributed computing paradigm. Each message contains a source address, a message
type code, and data payload. The size of payload varies among different message types.
The detailed message format is illustrated in Figure 7.3. The communication security is
enforced by utilizing a highly efficient pure Java cryptographic library which supports
key exchange and digital signatures based on Elliptic Curve Cryptography (ECC). In
addition, it is worthwhile to note that the design philosophy that our RehabSPOT plat-
form follows is to build a reliable and highly customizable system for personal use. This
philosophy is enforced in both our server and client programs. The following subsec-
tions describe the design and implementation details of these two parts.
Each message consists of a header (srcAddr, packetType) followed by a type-specific data payload: Heartbeat_ACK carries a portNumber; AssignPort_ACK carries a null payload; AddSensor carries a sensorType and a samplePeriod; FinishConfiguration carries nbrSensors followed by sensorType1 ... sensorTypeN; StartSampling carries a sensorType; DataReply carries a pktNumber, nbrData, and Data1 ... DataN.
Figure 7.3: RehabSPOT Message Formats
Figure 7.4: RehabSPOT client architecture. The free-range Sun SPOT node hosts a device discovery manager, a command listener, a sensor controller, and a data aggregator on top of the network interface (IEEE 802.15.4); the sensor controller connects to the built-in accelerometer and the external sensors (Sensor 1 through Sensor 4).
Client Program
Figure 7.4 presents the software architecture of the client program. It consists of four main components: the device discovery manager, the command listener, the sensor controller, and the data aggregator. The device discovery manager is responsible for establishing and maintaining wireless connections with the basestation and other free-range nodes in the neighborhood. Considering that the microprocessor embedded in the free-range node has relatively constrained capability, a lightweight device discovery protocol with a small memory footprint is designed and implemented. To initiate the connection, a Heartbeat message is broadcast via a well-known radio port. Between two consecutive broadcasts, the free-range node goes into deep sleep mode to save power. If there are any other nodes nearby, a Heartbeat_ACK acknowledgment message with the sender's unique MAC address is received. This information is stored and maintained in a local routing table. If the connection with the basestation is temporarily lost, the device discovery manager opportunistically routes the data to one of the available nodes. In the worst case, if no free-range node is alive, it pushes the data into flash memory and goes back to the initial state, broadcasting the Heartbeat message with a much longer period until the basestation is recovered.
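For illustration, the core of this discovery logic can be sketched as follows. This is a simplified sketch, not the deployed implementation: the class name, the broadcast periods, the staleness window, and the "basestation" sentinel are all illustrative assumptions, and the sketch does not use the actual Sun SPOT API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the heartbeat-based discovery loop described above.
public class DiscoveryManager {
    static final long NORMAL_PERIOD_MS = 1000;    // assumed broadcast period
    static final long DEGRADED_PERIOD_MS = 10000; // longer period when isolated

    // Local routing table: neighbor MAC address -> time the last Heartbeat_ACK was seen
    private final Map<String, Long> routingTable = new HashMap<>();
    private boolean basestationReachable = false;

    // Record an incoming Heartbeat_ACK from a neighbor.
    public void onHeartbeatAck(String senderMac, boolean isBasestation, long now) {
        routingTable.put(senderMac, now);
        if (isBasestation) basestationReachable = true;
    }

    // Choose where to send data: the basestation if reachable, otherwise
    // opportunistically pick any live neighbor; null means buffer to flash.
    public String nextHop(long now, long staleMs) {
        routingTable.values().removeIf(seen -> now - seen > staleMs);
        if (basestationReachable) return "basestation";
        return routingTable.keySet().stream().findFirst().orElse(null);
    }

    // The broadcast period backs off when the node is isolated, saving power.
    public long broadcastPeriod() {
        return (basestationReachable || !routingTable.isEmpty())
                ? NORMAL_PERIOD_MS : DEGRADED_PERIOD_MS;
    }
}
```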
The command listener, running as a background thread, listens for commands sent from the basestation and dispatches them to either the sensor controller or the data aggregator. The sensor controller is a software layer abstracting the common aspects of connectivity into a generic controller. As a result, any sensor powered by either 3VDC or 5VDC can be connected to the free-range node externally. Based on the received configuration messages (Add_Sensor, Remove_Sensor), the sensor controller turns any connected sensor on or off and manipulates some of its operating parameters, such as the sampling rate. Furthermore, by leveraging the powerful 32-bit on-board microprocessor, an FIR low-pass filter is implemented to suppress oscillations from unwanted impulse data. In addition, this low-pass digital filter helps detect sensor failure: if the output of the filter falls beyond a threshold several times in a row, an exception handler is activated to turn off the broken sensor.
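A minimal sketch of this filtering-and-failure-detection path is shown below. The filter taps (a simple moving average), the threshold, and the consecutive-failure limit are illustrative assumptions; the actual coefficients used on the node are not specified here.

```java
// Sketch of the client-side low-pass filtering and sensor-failure detection.
// Taps, threshold, and failure limit are illustrative, not the node's values.
public class SensorFilter {
    private final double[] taps = {0.25, 0.25, 0.25, 0.25}; // moving-average FIR
    private final double[] history = new double[4];
    private int index = 0;

    private final double threshold;  // plausible-output bound for this sensor
    private final int failureLimit;  // consecutive out-of-range outputs allowed
    private int failures = 0;
    private boolean broken = false;

    public SensorFilter(double threshold, int failureLimit) {
        this.threshold = threshold;
        this.failureLimit = failureLimit;
    }

    // Filter one raw sample; flag the sensor broken if the filtered output
    // falls beyond the threshold several times in a row.
    public double step(double sample) {
        history[index] = sample;
        index = (index + 1) % history.length;
        double out = 0;
        for (int i = 0; i < taps.length; i++) out += taps[i] * history[i];
        if (Math.abs(out) > threshold) {
            if (++failures >= failureLimit) broken = true; // handler turns sensor off
        } else {
            failures = 0;
        }
        return out;
    }

    public boolean isBroken() { return broken; }
}
```

Because the taps are equal, the ring-buffer ordering does not matter; an impulse is attenuated to one quarter of its amplitude while a steady signal passes through unchanged.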
When configuration is complete, a format file recording the types and sampling rates of the chosen sensors is generated and passed to the data aggregator. The data aggregator then composes the Data_Reply messages based on it and pushes them into a queue from which they are sent out in order. In addition, it should be noted that during our experiments, we found that the overhead of sending a message can be significant (2-4 ms depending on message size). As a result, this radio overhead becomes the bottleneck for real-time performance even when the sensors' sampling rates are high enough. To amortize this overhead, multiple samples are bundled together into a single message. Another benefit of bundling is that it reduces radio usage and hence radio power consumption, making the system more energy-efficient.
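The amortization can be illustrated with a small sketch: with a fixed per-message overhead, bundling n samples per message divides the radio cost by n. The class and method names below are illustrative, not the actual aggregator code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bundling multiple samples into one message to amortize the fixed
// per-message radio overhead (2-4 ms in our measurements).
public class DataAggregator {
    private final int bundleSize;                           // samples per message
    private final List<double[]> queue = new ArrayList<>(); // outgoing messages
    private double[] current;
    private int filled = 0;

    public DataAggregator(int bundleSize) {
        this.bundleSize = bundleSize;
        this.current = new double[bundleSize];
    }

    // Append one sample; emit a message only when the bundle is full.
    public void addSample(double sample) {
        current[filled++] = sample;
        if (filled == bundleSize) {
            queue.add(current);            // ready to transmit in order
            current = new double[bundleSize];
            filled = 0;
        }
    }

    public int pendingMessages() { return queue.size(); }

    // With a fixed per-message overhead, bundling turns the total radio cost
    // from overheadMs * nSamples into overheadMs * ceil(nSamples / bundleSize).
    public static double radioOverheadMs(int nSamples, int bundleSize, double overheadMs) {
        return overheadMs * Math.ceil((double) nSamples / bundleSize);
    }
}
```

For example, at 100 samples and 3 ms per message, sending one sample per message costs 300 ms of radio time, while bundling ten samples per message costs 30 ms.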
Server Program
Figure 7.5: RehabSPOT server architecture. Running at the Sun SPOT basestation, the server comprises a device discovery manager, a command controller, per-node free-range controllers (FR-1 ... FR-N), a GUI, a database, and a desktop application engine connected via the Internet.
The software architecture of the server program is shown in Figure 7.5. Similar to its counterpart in the free-range node, the device discovery manager in the basestation plays a significant role in constructing the body area sensor network during runtime. Whenever the device discovery manager detects a new free-range node that wants to join the network, it dynamically assigns a unique port number for that free-range node to establish a point-to-point communication channel. A bitmap is used to keep track of which port numbers are still available. When a free-range node quits the network, its occupied port number is released for reuse. This connection maintenance work is performed in the background and is totally transparent to users. Compared to many static configuration methodologies, such as TDMA-based polling, which has been adopted in many health monitoring systems [5] [7] [91], our dynamic network construction methodology is much more flexible and fault-tolerant.
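The bitmap-based port assignment can be sketched as follows; the base port and pool size are illustrative assumptions.

```java
// Sketch of the bitmap-based dynamic port assignment used by the basestation's
// device discovery manager. The port range is illustrative.
public class PortAllocator {
    private final int basePort;
    private final boolean[] inUse;   // bitmap: true = port occupied

    public PortAllocator(int basePort, int count) {
        this.basePort = basePort;
        this.inUse = new boolean[count];
    }

    // Assign the first free port to a joining free-range node; -1 if full.
    public int assign() {
        for (int i = 0; i < inUse.length; i++) {
            if (!inUse[i]) {
                inUse[i] = true;
                return basePort + i;
            }
        }
        return -1;
    }

    // Release the port when a node quits the network, so it can be reused.
    public void release(int port) {
        int i = port - basePort;
        if (i >= 0 && i < inUse.length) inUse[i] = false;
    }
}
```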
Once the free-range node accepts the port number offered by the basestation, a free-range (FR) controller is created as a newly spawned thread to maintain the connection between the basestation and that specific free-range node. The FR controller sends the commands issued by the command controller and receives data from the free-range node. When data streams in, the FR controller first parses it according to the format file generated and sent by the free-range node. It then forwards the parsed data to both the GUI and the back-end database. The GUI itself is implemented as a multi-threaded program as well. Each thread is responsible for displaying the streaming data from one single sensor in an independent sub-window. The number of sub-windows is equal to the total number of sensors employed in the body area sensor network. Furthermore, the scale of the sensed data values is calibrated adaptively for different types of sensors during runtime.
Our system is also extensible in that the streaming sensor data can be exported to any other desktop program. This is realized by establishing a TCP/IP socket between the two programs. Benefiting from the language independence offered by the TCP/IP interface, the desktop programs can be written in a variety of programming languages. As an example, we have connected a machine learning engine to our RehabSPOT system and relayed the streaming data into the engine for real-time activity recognition.
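A minimal sketch of such a language-independent export path: parsed samples are encoded as newline-delimited text and relayed over a plain TCP socket, so the receiving program can be written in any language. The line format, class names, and port handling are illustrative assumptions, not the actual RehabSPOT wire format.

```java
import java.io.PrintWriter;
import java.net.Socket;

// Sketch of exporting streaming sensor data to an external desktop program
// over TCP. Any language that can read lines from a socket can consume it.
public class StreamExporter {
    // Encode one parsed sample as a line of text: sensorType,timestamp,value.
    public static String encode(String sensorType, long t, double value) {
        return sensorType + "," + t + "," + value;
    }

    // Relay encoded samples to a connected consumer (e.g., the ML engine).
    public static void relay(Socket socket, String[] lines) throws Exception {
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        for (String line : lines) out.println(line);
    }
}
```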
7.4 System Evaluation
To demonstrate the power and functionality of the RehabSPOT system, we deploy it in
a real-world physical rehabilitation program for upper extremity dysfunction identifica-
tion.
The experiment starts with the physical therapist selecting the appropriate set of sensors for the rehabilitation session and plugging them onto the Sun SPOT free-range nodes. In our experiment, we use two free-range nodes, each equipped with one external gyroscope. The therapist attaches one node onto the patient's left upper arm and the other onto his left wrist. The basestation node connected to the therapist's laptop listens on the broadcast channel and waits for connection setup request messages to arrive. After connections are established, the GUI displays the list of available free-range nodes in its Sun SPOT list panel, as illustrated in Figure 7.6(a). The therapist then configures the two free-range nodes one by one by selecting the needed sensors from a sensor pool and setting the sampling rate for each sensor. In our experiment, we select the built-in tri-axis accelerometer and the external gyroscope from the pool, and set all the sensors' sampling rates to 100Hz.
After the therapist finishes configuring the hardware by clicking the Finish Con-
figuration button, the configuration message is forwarded to the corresponding free-
range node. The sensor controller module in the free-range node interprets the message,
checks the availability of the chosen sensors, and passes the format file (unwrapped
from the message payload) to the data aggregator. The data aggregator module starts
sampling from sensors after the therapist clicks the Start icon. The sensed data is then
transmitted to the basestation, which in turn relays the data to the GUI for display. Fig-
ure 7.6(b) shows a snapshot of the data captured from the two free-range nodes where
the patient was slowly moving his left arm up and down. When the session is complete,
the therapist clicks on the Stop icon to turn off all the sensors and close the connections
with all the free-range nodes.
The data sampled during the rehabilitation program session is stored in a backend
database. Besides the data, the database also allows the therapist to build a personal
profile for each patient and save the sensor configuration file from each session for future access. In this manner, a complete healthcare information system is assembled.
In our experiment, it takes no more than ten minutes for the therapist to plug sensors
onto the two free-range nodes and attach those free-range nodes onto the human body.
In addition, the sensor network construction and sensor configuration takes approxi-
mately five minutes. Therefore, it takes in total no more than fifteen minutes for the
overall setup process. When the battery is fully charged, RehabSPOT can support up to
six hours of operation, which is sufficient for a standard rehabilitation training session.
Although the usability of our RehabSPOT system is very difficult to quantify, the initial feedback from both the physical therapists and the patient is promising. This indicates that our RehabSPOT platform holds great potential to benefit physical therapists' work.
(a) The device discovery manager at the basestation detects two available free-range nodes in its neighborhood. The therapist selects both of them, labels the first node as LeftWristNode and the second one as LeftUpperArmNode, and is ready to configure these two nodes (left). The therapist selects sensors to be configured for the first free-range node from a sensor pool (right). (b) A snapshot of the data captured from the two free-range nodes while the patient was slowly moving his left arm up and down.
Figure 7.6: Snapshots of the demonstration of RehabSPOT in a rehabilitation program
Chapter 8
Fine-Grained Motor Function
Assessment for Computerized
Rehabilitation
The last chapter focused on the system design of a networked body area sensing system for computerized rehabilitation applications. In this chapter, I focus on the development of signal processing algorithms for motion analysis of stroke patients based on the motion signals collected from the on-body sensors.
8.1 Introduction
As explained in the last chapter, wearable sensing technologies play a significant role
in quantitatively evaluating patients’ motor functions. Therefore, the use of wearable
sensing technology for assessment of motor function has grown significantly in recent
years. For example, Hester et al. [40] used accelerometers and linear regression models
based on their measurements to predict functional ability scores for stroke rehabilitation.
Patel et al. [63] used clustering techniques to correlate accelerometer signals with the
severity of dyskinesia in patients with Parkinson’s disease. They demonstrated that
patients with different severities can be represented by well-separated clusters. Other
studies cast the motor function assessment as a classification problem. For example, the
authors in [13] used accelerometer, gyroscope, and magnetometer data to sense patient
upper limb movements after neurological injury. A decision tree classifier inferred the
functional ability score of the Wolf Motor Function Test (WMFT). Similarly, the authors
in [64] apply the support vector machine (SVM) as the classifier to learn the mapping
function between the functional ability scores and the severity of Parkinsonian motor
fluctuations.
Previous techniques use linear regression, clustering, and classification to build mapping functions that correlate sensor signals with standard clinical rating scales. We feel that the clinical rating scales cannot record all details of motor behavior, and thus fail to evaluate precisely the patients' performance during rehabilitation interventions. To bridge this gap, we describe a fine-grained motor function assessment approach that captures detailed patterns of the patients' motor behavior which standard clinical scores fail to acquire. We do not regard our approach as a replacement for the existing clinical score system. Instead, our approach should act as a significant complement to the standard clinical rating scales, in the sense that combining the scores with the detailed patterns detected by our approach could produce a more accurate assessment of patients' motor behavior.
8.2 Motion Trajectory-Based Method
Different from the existing motor function assessment methods in which the whole seg-
ment of the motor task is mapped to a single point in the feature space, the first step
of our fine-grained approach is to divide the streaming sensor data sampled from each
motor task segment into a sequence of fixed-length tiny windows whose length is much
smaller than the duration of the motor task itself (In this study, the duration of motor task
ranges from 2 seconds to 10 seconds. The length of the tiny window we use is 0.2 sec-
ond). Then we extract a number of features which capture the intrinsic characteristics of
123
the motor behavior from each tiny window and stack them together to form a local fea-
ture vector. As a consequence, each motor task segment is transformed into a sequence
of local feature vectors which forms a motion trajectory in the feature space. Compared
to the “single-point” representation used in the existing methods, this trajectory-based
representation provides more information about the patients’ motor behavior in the sense
that it captures the local details of the motor tasks in a fine-grained manner. Moreover,
we have developed a trajectory comparison algorithm on top of the fine-grained rep-
resentation, which helps clinicians to keep track quantitatively of the patients’ progress
during rehabilitation. In the remainder of this section, we explain the features we extract
and the details of the trajectory comparison algorithm.
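The windowing step can be sketched as follows. Assuming, for example, a 100 Hz sampling rate, a 0.2-second tiny window holds 20 samples; the class name and the choice to drop a trailing partial window are illustrative assumptions.

```java
// Sketch of splitting a motor task segment into fixed-length tiny windows,
// from which per-window local feature vectors are then extracted.
public class Windower {
    public static double[][] split(double[] signal, int windowLen) {
        int n = signal.length / windowLen;  // drop the trailing partial window
        double[][] windows = new double[n][windowLen];
        for (int w = 0; w < n; w++)
            System.arraycopy(signal, w * windowLen, windows[w], 0, windowLen);
        return windows;
    }
}
```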
8.2.1 Feature Extraction
The raw signals sampled from wearable inertial sensors are not only noisy but also
difficult for clinicians who have little engineering background to understand and inter-
pret. Therefore, we need to extract features which contain useful information about
the patients’ motor behavior and more importantly, are meaningful and interpretable to
clinicians.
Below is the list of features we use in this study. These features are selected since
they have clear meanings related to the physical movements and have been demon-
strated to be able to represent the important characteristics of the movements in existing
studies [98] [63].
• Mean Value of Movement Intensity (MI): Movement Intensity (MI) is defined as

MI(t) = \sqrt{a_x(t)^2 + a_y(t)^2 + a_z(t)^2},    (8.1)

the Euclidean norm of the total acceleration vector, where a_x(t), a_y(t), and a_z(t) represent the t-th acceleration sample of the x, y, and z axis in each window respectively. The value of MI(t) can be seen as an indirect measure of the instantaneous intensity (strength) of the performed movement at sample t. We calculate the mean value of MI within the window and use it as one of our features.
• Movement Intensity Variation (VI): VI is computed as the variation of MI
defined above. It is intended to measure the strength variation (range) of the
movements.
• Smoothness of Movement Intensity (SI): SI is computed as the derivative values
of MI. It is used in our study to measure the smoothness of the movements.
• Averaged Acceleration Energy (AAE): AAE calculates the mean value of
energy as the sum of the squared discrete FFT component magnitudes of the
sensor signals over three accelerometer axes. It measures the total movement
acceleration energy.
• Averaged Rotation Energy (ARE): ARE calculates the mean value of energy
over three gyroscope axes. It measures the total movement rotation energy.
• Time Taken to Complete the Task (TIME): The time duration is also used as a
metric to indirectly measure the degree of difficulty in completing the movement.
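For concreteness, the per-window computations can be sketched as follows. Where the text leaves aggregation details open, the sketch makes illustrative assumptions: VI is taken as the variance of MI, SI is summarized as the mean absolute first difference of MI, and the acceleration/rotation energies are computed in the time domain (which agrees with the sum of squared FFT magnitudes up to a constant factor, by Parseval's theorem).

```java
// Sketch of the per-window feature extraction described above. Aggregation
// choices marked below are our assumptions, not definitive definitions.
public class WindowFeatures {
    // MI(t) = sqrt(ax^2 + ay^2 + az^2) for each sample in the window (Eq. 8.1)
    public static double[] mi(double[] ax, double[] ay, double[] az) {
        double[] out = new double[ax.length];
        for (int t = 0; t < ax.length; t++)
            out[t] = Math.sqrt(ax[t]*ax[t] + ay[t]*ay[t] + az[t]*az[t]);
        return out;
    }

    public static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // VI: variation of MI, taken here as its variance over the window
    public static double variance(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / x.length;
    }

    // SI: smoothness, summarized here as the mean absolute first difference of MI
    public static double smoothness(double[] miSeries) {
        double s = 0;
        for (int t = 1; t < miSeries.length; t++)
            s += Math.abs(miSeries[t] - miSeries[t - 1]);
        return s / (miSeries.length - 1);
    }

    // AAE / ARE: mean signal energy averaged over the three axes
    // (time-domain form; equivalent to the FFT definition up to a constant)
    public static double axisEnergy(double[][] axes) {
        double s = 0;
        for (double[] axis : axes)
            for (double v : axis) s += v * v;
        return s / (axes.length * axes[0].length);
    }
}
```

TIME needs no computation beyond the segment boundaries: it is the number of samples in the motor task segment divided by the sampling rate.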
8.2.2 Trajectory Comparison
The constructed motion trajectory is expected to capture the intrinsic characteristics of the patients' motor behavior. Although clinicians can find critical patterns and compare pattern differences between trajectories to track patients' progress by visual observation alone, it is helpful to compare the trajectories and measure the differences in a quantitative and objective manner. However, one of the biggest challenges for the comparison is that trajectories from any two motor task segments may have different lengths. In this work, we develop a trajectory comparison algorithm based on the dynamic time warping (DTW) technique to resolve this issue. DTW is a nonlinear alignment technique for measuring the similarity/difference between two signals (normally time series) which may have different lengths or durations [50]. One classical application of DTW is to accommodate different speaking speeds in automatic speech recognition. In our case, DTW is used to cope with different movement speeds when patients perform
motor tasks. Specifically, let X and Y denote two trajectories constructed from two
motor task segments of length M and N respectively:

X = x_1, x_2, ..., x_i, ..., x_M    (8.2)
Y = y_1, y_2, ..., y_j, ..., y_N    (8.3)

where x_i and y_j represent the i-th and j-th local feature vectors in X and Y respectively. DTW compensates for the length differences and finds the optimal alignment between X and Y by solving the following dynamic programming (DP) problem:

D(i,j) = \min\{D(i-1,j-1), D(i-1,j), D(i,j-1)\} + d(i,j)    (8.4)

where d(i,j) represents the distance function which measures the local difference between the local feature vectors x_i and y_j in the feature space, and D(i,j) represents the cumulative (global) distance between the sub-trajectories {x_1, x_2, ..., x_i} and {y_1, y_2, ..., y_j}. The solution of this DP problem is the cumulative distance between the two trajectories X and Y, which sits in D(M,N), and a warp path W of length K

W = w_1, w_2, ..., w_k, ..., w_K    (8.5)

which traces the mapping between X and Y. Finally, since the cumulative distance D(M,N) is dependent on the length of the warp path W, we normalize D(M,N) by dividing it by the warp path length K and use this averaged cumulative distance as the metric to measure the distance between the trajectories X and Y:

Dist(X,Y) = D(M,N) / K    (8.6)
8.2.3 Similarity Score
It should be noted that many forms of the distance function d(i,j) (e.g., Euclidean distance and Mahalanobis distance) can be used to calculate the local difference. In this work, we use the cosine distance as the local distance function, defined as

d(i,j) = 1 - \frac{x_i^T y_j}{\|x_i\| \|y_j\|}    (8.7)

Compared to other distance functions, the benefit of using the cosine distance is that d(i,j) is by definition in the range [0,1]. As a result, the averaged cumulative distance Dist(X,Y) defined in Eq. (8.6) is also in the range [0,1], and thus can be interpreted as the dissimilarity between X and Y in terms of percentile. Therefore, we can also define the corresponding similarity in percentile between X and Y as:

Sim(X,Y) = 1 - Dist(X,Y)    (8.8)
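Putting Eqs. (8.4)-(8.8) together, the trajectory comparison can be sketched as follows; the warp path length K is recovered by backtracking through the cumulative distance matrix. This is an illustrative implementation, not the exact code used in our system.

```java
// Sketch of the DTW-based trajectory comparison with cosine local distance.
// Trajectories are arrays of local feature vectors; names mirror the text.
public class TrajectoryDtw {
    // Cosine distance between two local feature vectors (Eq. 8.7)
    static double cosineDistance(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int k = 0; k < x.length; k++) {
            dot += x[k] * y[k];
            nx += x[k] * x[k];
            ny += y[k] * y[k];
        }
        return 1.0 - dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    // Averaged cumulative distance Dist(X, Y) = D(M, N) / K (Eqs. 8.4, 8.6)
    public static double dist(double[][] X, double[][] Y) {
        int M = X.length, N = Y.length;
        double[][] D = new double[M + 1][N + 1];
        for (int r = 0; r <= M; r++)
            java.util.Arrays.fill(D[r], Double.POSITIVE_INFINITY);
        D[0][0] = 0;
        for (int i = 1; i <= M; i++)
            for (int j = 1; j <= N; j++)
                D[i][j] = Math.min(D[i - 1][j - 1], Math.min(D[i - 1][j], D[i][j - 1]))
                        + cosineDistance(X[i - 1], Y[j - 1]);
        // Backtrack to recover the warp path length K (Eq. 8.5)
        int i = M, j = N, K = 1;
        while (i > 1 || j > 1) {
            double diag = D[i - 1][j - 1], up = D[i - 1][j], left = D[i][j - 1];
            if (diag <= up && diag <= left) { i--; j--; }
            else if (up <= left) i--;
            else j--;
            K++;
        }
        return D[M][N] / K;
    }

    // Similarity score Sim(X, Y) = 1 - Dist(X, Y) (Eq. 8.8)
    public static double sim(double[][] X, double[][] Y) {
        return 1.0 - dist(X, Y);
    }
}
```

Two identical trajectories yield Sim = 1, and because DTW aligns samples nonlinearly, a trajectory replayed at a different speed still scores near 1.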
8.3 Evaluation
8.3.1 Experimental Setup
To evaluate the effectiveness of our approach, three subjects, including one healthy subject (female) and two subjects (one male and one female) with different levels of upper limb hemiparesis from stroke, were recruited at the Precision Rehabilitation Clinic and the Rancho Los Amigos National Rehabilitation Center, both located in Los Angeles. The wearable inertial sensor we use for this study is called MotionNode. MotionNode is a high-performance inertial measurement unit (IMU) that can sense ±6g acceleration and ±500dps rotation rate. This is high enough to capture all the details of the patients' movements. In addition, the size of MotionNode is extremely small, such that it can be attached to the patient's body nonintrusively.
During data collection, one MotionNode was attached to the forearm of the subjects
(see Figure 8.1). Each subject followed the instructions from a physical therapist and
performed a subset of five upper limb motor tasks from the Fugl-Meyer Assessment
(FMA). We choose FMA since it is well-known for its comprehensiveness as a measure
of motor impairment after stroke and it is widely recommended for motor rehabilitation
for post-stroke patients [36]. These five motor tasks include: (1) Flexor Synergy; (2)
Hand to Lumbar Spine; (3) Shoulder Flexion; (4) Pronation; and (5) Supination (See
Table 8.1 for detailed explanations). Each motor task was repeated five times by each subject with both the affected and unaffected limbs and was assigned a functional ability score based on the FMA scale (0 = cannot perform, 1 = performs partially, 2 = performs fully) by the therapist.
1 http://www.precisionrehabilitation.com
2 http://www.rancho.org/
3 http://www.motionnode.com/
Figure 8.1: The placement of MotionNode on the upper limb of the subject
Motor Task             Description
Flexor Synergy         Fully supinate the forearm, flex the elbow, and bring the hand to the ear of the affected side
Hand to Lumbar Spine   Move the hand behind the back
Shoulder Flexion       Flex the shoulder to 90°, keeping the elbow extended
Pronation              Flex the elbow to 90° and pronate the forearm through the full available range of motion
Supination             Flex the elbow to 90° and supinate the forearm through the full available range of motion

Table 8.1: Motor tasks considered in our study
8.3.2 Evaluation Results
Fine-Grained Trajectory-based Representation
To better demonstrate the benefits of our fine-grained trajectory-based approach, we first implement the traditional motor function assessment method, where each motor task segment is represented as a single point in a high-dimensional feature space. Figure 8.2(a) and Figure 8.2(b) show the scatter plots in the 3D feature space for the motor tasks Pronation and Flexor Synergy respectively. The three features used for the plots are MI, VI, and ARE. Here subject 1 is the female patient with upper limb hemiparesis, subject 2 is the healthy female, and subject 3 is the male patient with upper limb
hemiparesis.

Figure 8.2: The 3D scatter plots of the traditional automatic motor function assessment method. (a) Scatter plot of the motor task Pronation in the 3D feature space; (b) scatter plot of the motor task Flexor Synergy in the 3D feature space. Axes: MI (strength), VI (range of movement), ARE (rotation energy); legend: Subject 1 unaffected/affected limb, Subject 2 unaffected limb, Subject 3 unaffected/affected limb. Subject 1 is the female patient with upper limb hemiparesis, Subject 2 is the healthy female, and Subject 3 is the male patient with upper limb hemiparesis.

As shown in Figure 8.2(a), data from different subjects and limbs forms
compact clusters. Each cluster is well separated from others except one case where the
data from the unaffected limbs of subject 1 and subject 2 overlaps. This observation
can be explained by the fact that subject 1 and subject 2 are both female and the motor tasks are all performed by their unaffected limbs. In Figure 8.2(b), data is again aggregated into compact clusters as in the previous example. However, the cluster formed by the data from subject 1's affected limb is very close in distance to the other two clusters formed by the data from subject 1's and subject 2's unaffected limbs. Although we can learn a classifier (e.g., nearest neighbor, SVM) to find a boundary that partitions the different clusters and then map the clusters to different clinical rating scores, as in many existing research works, we argue that the distances between the data points in the feature space do not reflect their true differences.
In comparison, Figure 8.3 and Figure 8.4 illustrate our fine-grained trajectory-based
solution in time-feature space to tackle the problem observed in the traditional method
mentioned above. As illustrated, our fine-grained approach is capable of capturing
the detailed patterns of the patients' motor behavior which traditional methods fail to acquire. As an example, Figure 8.3(a) and Figure 8.3(b) show the fine-grained trajectory representation of one segment of the motor task Pronation in terms of the features MI and ARE respectively. The red curve represents the task performed by the unaffected limb and the blue one the affected limb. In Figure 8.3(a), our approach is able to capture the two troughs as patterns for the unaffected limb which do not appear for the affected limb. Similarly, two high peaks are captured for the unaffected limb in Figure 8.3(b), which represent the two key points during the movement of Pronation where significant rotation energy is exerted. As another example, Figure 8.4 shows the trajectories of the motor task Flexor Synergy. In Figure 8.4(a), the curve of the unaffected limb is smooth, with two peaks and two troughs detected, while the curve of the affected limb shows less motion intensity but more fluctuations during the movement. This detailed difference is by no means reflected by the 1-point difference in the FMA scores assigned by the clinician, where the segment of the unaffected limb is given 2 points and the segment of the affected limb is given 1 point. Finally, a similar conclusion can be drawn for Figure 8.4(b) as for Figure 8.3(b).
Figure 8.3: The fine-grained trajectory representation and the warp path calculated from DTW for Pronation. (a) The fine-grained trajectory representation of Pronation using feature MI; (b) the fine-grained trajectory representation of Pronation using feature ARE; (c) the warp path (red curve) of Pronation between the affected and unaffected limbs overlaid on the grayscale similarity matrix.
Figure 8.4: The fine-grained trajectory representation and the warp path calculated from DTW for Flexor Synergy. (a) The fine-grained trajectory representation of Flexor Synergy using feature MI; (b) the fine-grained trajectory representation of Flexor Synergy using feature ARE; (c) the warp path (red curve) of Flexor Synergy between the affected and unaffected limbs overlaid on the grayscale similarity matrix.
Trajectory Comparison Performance using DTW
As shown in the previous subsection, our trajectory-based approach captures the detailed
pattern differences of the patients’ motor behaviors. Here, we apply the DTW tech-
nique and the dissimilarity/similarity metric defined in Eq.(8.6)/Eq.(8.8) in percentile
to quantitatively measure the differences/similarities between the fine-grained trajecto-
ries of the affected and unaffected limbs. Specifically, we first use DTW to extract the
warp path of the two trajectories to be compared. Figure 8.3(c) and Figure 8.4(c) show
two extracted warp path examples (the red curves overlaid on the grayscale similarity
matrix, where dark entries indicate less similarity) for the motor tasks Pronation and Flexor Synergy respectively. Based on the extracted warp paths, we then compute the simi-
larity metric values in percentile between each motor task segment performed by the
affected limb and the unaffected limb for each subject and all five motor tasks. The
value of similarity indicates the recovery status of the affected limb. In other words,
the higher the similarity value is, the better the performance of the affected limb. We
also compare the trajectory similarity between segments from the same motor task of
the unaffected limb. This is useful to validate whether our similarity metric is robust to noise. The accumulated results are listed in Table 8.2. Since subject 2 is healthy, we only show the results for subject 1 and subject 3. For each subject, the first row in the table lists the averaged similarity between segments of the affected limb and the unaffected limb, with the standard deviation in brackets. The second row shows the corresponding averaged FMA scores assigned by the professional physical therapist. The third row lists the similarity between segments from the same motor task of the unaffected limb, with the corresponding FMA scores in the fourth row.
For all five motor tasks, the trajectory similarity between segments from the same
motor task of the unaffected limb is very high. This observation indicates that our sim-
ilarity metric is robust to noise incurred from the patients’ movements. The more inter-
esting cases lie in the comparison between unaffected limb segment and affected limb
segment. As shown in the table, their similarity values vary significantly across dif-
ferent motor tasks. More importantly, the calculated similarity values show a positive
correlation to the FMA scores assigned by the therapist. For example, the two highest
similarity values of subject 1 come from the motor task Flexor Synergy (70.55%) and
Hand to Lumbar Spine (72.10%). These two motor tasks also have the highest FMA
scores. Similarly, the motor task which gets the highest similarity value from subject 3
also has the highest FMA score. On the other hand, for all the motor tasks which get
FMA score zero, the similarity values range from 33.30% to 49.39%. This indicates
that our similarity metric can express more intermediate levels, which would be extremely valuable for clinicians to track patients' gradual progress that is not reflected by the standard clinical scores.
                         Flexor Synergy    Hand to Lumbar Spine  Shoulder Flexion   Pronation         Supination
Subject 1  Similarity    70.55% (±8.20%)   72.10% (±6.72%)       62.10% (±8.06%)    33.30% (±7.42%)   41.17% (±4.27%)
           FMA           1.6               1.6                   1                  0                 0
           Similarity    90.18% (±5.29%)   85.17% (±8.71%)       85.53% (±7.46%)    91.27% (±3.32%)   87.65% (±5.06%)
           FMA           2                 2                     2                  2                 2
Subject 3  Similarity    51.46% (±5.32%)   46.28% (±6.90%)       49.39% (±10.53%)   41.54% (±9.28%)   43.68% (±7.74%)
           FMA           0.6               0                     0                  0                 0
           Similarity    83.37% (±7.38%)   79.04% (±10.51%)      85.13% (±9.27%)    91.89% (±5.64%)   90.66% (±6.93%)
           FMA           2                 2                     2                  2                 2
Table 8.2: Trajectory comparison results using DTW and the corresponding FMA scores
for subject 1 and subject 3
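The segment comparison above relies on dynamic time warping (DTW). As an illustration of the idea, the sketch below computes the classic DTW alignment cost between two 1-D trajectories and maps it to a bounded similarity score. The length normalization and the exponential distance-to-similarity mapping are illustrative assumptions, not the exact conversion used to produce the percentages in Table 8.2.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping: minimal cumulative cost of aligning
    two 1-D trajectories, allowing local stretching and compression."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local matching cost
            cost[i, j] = d + min(cost[i - 1, j],      # a[i-1] repeated
                                 cost[i, j - 1],      # b[j-1] repeated
                                 cost[i - 1, j - 1])  # one-to-one match
    return cost[n, m]

def trajectory_similarity(a, b):
    """Map a length-normalized DTW distance into a (0, 1] similarity score.
    NOTE: the exponential mapping is an illustrative assumption; the thesis
    does not fix this exact distance-to-percentage conversion."""
    d = dtw_distance(a, b) / max(len(a), len(b))
    return float(np.exp(-d))
```

Identical trajectories score 1.0, and the score decays toward 0 as the warped distance grows, matching in spirit the percentage-style scores reported in Table 8.2.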
Chapter 9
Conclusion
In this thesis we focus on developing ubiquitous computing technologies to support
personalized healthcare, a new model that aims to deliver healthcare seamlessly in our
everyday lives, anywhere and at any time. The first half of this thesis focuses on
wearable sensor-based human activity recognition technology, which acts as the
fundamental technology to support a variety of personalized healthcare applications.
Specifically, we have developed four different computational techniques for human
activity recognition based on wearable sensing systems.
(1) In Chapter 3, we propose a framework based on feature selection techniques
to study the effects of different features on recognition performance. Specifically, we
examined three feature selection methods and found that the sequential forward
selection (SFS) method achieved the best performance compared to the Relief-F and
single feature classification (SFC) methods. In addition, we have demonstrated that our
self-designed physical features make significant contributions to the recognition system.
Finally, we have shown that the feature selection framework with a hierarchical structure
improves the recognition performance compared to the single-layer framework.
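For readers unfamiliar with SFS, the sketch below shows the greedy wrapper idea in its simplest form: starting from an empty set, repeatedly add the feature whose inclusion most improves cross-validated accuracy. The k-nearest-neighbor scorer and 3-fold cross-validation are illustrative choices, not the classifier or protocol used in Chapter 3.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(X, y, k, estimator=None):
    """Greedy SFS wrapper: start from an empty feature set and repeatedly
    add the feature whose inclusion yields the highest cross-validated
    accuracy, until k features have been selected."""
    if estimator is None:
        estimator = KNeighborsClassifier(n_neighbors=3)  # illustrative scorer
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)  # feature with highest CV accuracy
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the classifier is re-trained inside the loop, SFS is costlier than filter methods such as Relief-F, but it evaluates features jointly rather than one at a time, which is consistent with its stronger performance in Chapter 3.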
(2) In Chapter 4, we propose a framework based on sparse representation theory to
represent human activity signals as a sparse linear combination of activity signals from
all activity classes in the training set. The class membership of the activity signal is
determined by solving an ℓ1-minimization problem. Our approach achieves a maximum
recognition rate of 96.1%, which beats conventional methods based on nearest neighbor,
naive Bayes, and support vector machines by as much as 6.7%. Furthermore, we
demonstrate that by using random projection, the task of looking for “optimal features”
to achieve the best activity recognition performance is less important within this
framework.
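The classification rule of Chapter 4 can be sketched as follows: a test sample is expressed as a sparse linear combination of all training samples, and it is assigned to the class whose coefficients reconstruct it with the smallest residual. In this sketch, scikit-learn's Lasso stands in for a generic ℓ1-minimization solver; the solver choice and the alpha value are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(train_X, train_y, test_x, alpha=0.01):
    """Sparse-representation classification: express test_x as a sparse
    linear combination of ALL training samples (columns of A), then pick
    the class whose coefficients alone give the smallest reconstruction
    residual. Lasso is a stand-in for an l1-minimization solver, and
    alpha=0.01 is an illustrative choice."""
    A = train_X.T  # dictionary: one column per training sample
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(A, test_x).coef_
    residuals = {}
    for c in np.unique(train_y):
        coef_c = np.where(train_y == c, coef, 0.0)   # zero out other classes
        residuals[c] = np.linalg.norm(test_x - A @ coef_c)
    return min(residuals, key=residuals.get)
```

Because the residual is computed per class, a sample from an unrepresented class yields large residuals everywhere, which is one reason this family of methods degrades gracefully.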
(3) In Chapter 5, we have investigated the feasibility of using a Bag-of-Features
(BoF)-based framework for human activity representation and recognition. The benefit
of this framework is that it is able to identify general motion primitives which act as the
basic building blocks for modeling different human activities. We have studied six key
factors which govern the performance of the BoF-based framework. The factors include
window size, choices of features, methods to construct motion primitives, motion
vocabulary size, weighting schemes of motion primitive assignments, and learning
machine kernel functions. Our experimental results validate the effectiveness of this
framework and show that all six factors are influential to the classification performance
of the BoF framework. In addition, our BoF framework achieves a 92.7% overall
classification accuracy with a 0.2-second window cell and a vocabulary of 125 motion
primitives constructed based on physical features using K-means clustering and soft
weighting. This result is 32.3% higher than the corresponding non-statistical
string-matching-based approach.
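Two of the factors above, vocabulary construction by K-means and soft weighting of primitive assignments, can be sketched as follows. The Gaussian-kernel soft assignment and its width sigma are illustrative assumptions; the thesis's exact weighting scheme may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(frames, vocab_size, seed=0):
    """Cluster per-window feature vectors; each cluster center is one
    'motion primitive' of the vocabulary."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=seed).fit(frames)

def soft_weighted_histogram(kmeans, frames, sigma=1.0):
    """Soft weighting: each frame spreads a Gaussian-kernel weight over
    all primitives instead of voting only for its single nearest one;
    the pooled histogram is the activity representation. sigma is an
    illustrative kernel width, not a value taken from the thesis."""
    centers = kmeans.cluster_centers_
    hist = np.zeros(len(centers))
    for f in frames:
        d2 = ((centers - f) ** 2).sum(axis=1)  # squared distance to each primitive
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        hist += w / w.sum()                    # each frame contributes total mass 1
    return hist / hist.sum()                   # normalize for window count
```

The resulting normalized histogram can then be fed to a kernel classifier such as an SVM, tying the vocabulary-size, weighting-scheme, and kernel-function factors together.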
(4) In Chapter 6, we have investigated the feasibility of applying manifold learning
techniques to extract activity manifolds. We use locally linear embedding (LLE) to
capture the intrinsic low-dimensional manifold structures for activities that are either
periodic or semi-periodic. A nearest-neighbor interpolation technique is then applied
to learn the mapping function from the input space to the manifold space. Activity
recognition is performed by comparing trajectories of different activity manifolds in the
manifold space. Finally, we demonstrate that activity recognition can be performed on
top of this compact representation and achieves promising results.
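A minimal sketch of this pipeline, assuming scikit-learn's LLE implementation and a simple k-nearest-neighbor mean as the interpolation step (the thesis's interpolation may differ in detail):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

def learn_manifold(X, n_components=2, n_neighbors=10):
    """Embed high-dimensional activity frames into a low-dimensional
    manifold with locally linear embedding (LLE)."""
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=n_components)
    return lle.fit_transform(X)

def map_new_points(X_train, Y_train, X_new, k=5):
    """Nearest-neighbor interpolation from input space to manifold space:
    a new frame maps to the mean embedding of its k nearest training
    frames (a simple stand-in for the interpolation used in Chapter 6)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_new)
    return Y_train[idx].mean(axis=1)
```

Recognition can then proceed by comparing the trajectory an unseen activity traces in the embedding against the trajectories of the training activities.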
The second half of this thesis focuses on the development of body-area networked
sensing systems for computerized medical rehabilitation for stroke patients. In Chapter
7, we present the system design of a networked body area sensing system called
RehabSPOT and its value in personalized rehabilitation delivery via real-time system
reconfiguration. In Chapter 8, we present a computational model based on fine-grained
motion trajectories to analyze patients' motor behavior. Our experimental results
validate our approach and demonstrate that the captured patterns can be used to
complement the standard clinical scores to provide fine-grained motor function
assessment and help clinicians track patients' gradual progress during rehabilitation.
9.1 Future Work
As an emerging field, research on developing ubiquitous computing technologies to
improve human health and well-being has attracted significant attention in recent years.
This thesis covers some of the most fundamental technologies in this field and can be
further extended in the following directions.
1. Expand the Dataset:
Our USC-HAD dataset currently consists of 12 daily activities performed by 14
subjects. We want to expand it in the near future to make this dataset more
beneficial to researchers in this field. Future extensions could include: the addition
of new activities; data taken over a larger time duration; additional subjects; a
broader distribution of subject age; and subjects with disabilities. Another
extension is to include data sets with high-level activities (i.e., combinations of
low-level activities).
2. Ubiquitous and Mobile Computing for Preventive Care:
Behavior and lifestyle factors such as obesity, inactivity, and smoking contribute
to the increased prevalence of preventable chronic diseases and premature deaths.
In particular, the worldwide obesity phenomenon and associated diabetes are
becoming the main epidemic of the 21st century. Largely a result of changes in
the modern lifestyle, this growing phenomenon has placed a heavy burden on today's
healthcare system. Ubiquitous and mobile computing technologies have great
potential to promote behavior change to prevent chronic diseases and encourage
people to take on the responsibility of self-care. In the face of the current obesity
epidemic, research findings suggest that encouraging physical activity in the
general population is effective in reducing the risk of obesity. By utilizing the human
activity recognition technology I have developed, it is possible to automatically
record an individual's mobility patterns and make people aware of their decreased
physical activity, in order to prevent obesity and associated chronic diseases. I am
very interested in applying my expertise in wearable system design and
human activity recognition technology to pursue research in this direction in the
future.
3. Virtual Reality for Home-Based Rehabilitation:
Limited healthcare resources have created an urgent demand for more efficient
delivery of care for patients. In the domain of rehabilitation, home-based
rehabilitation has recently come to be regarded as offering potential benefits over
inpatient rehabilitation in terms of reducing healthcare costs, delivering care in the
comfort of familiar home surroundings, and improving patient outcomes. This
opens up new opportunities for developing computerized rehabilitation
technologies to facilitate rehabilitation at home. As an emerging area of great
impact and significance, the use of Virtual Reality technology for home-based
rehabilitation is still in its infancy. I would like to continue collaborating with
physicians and rehabilitation scientists to advance the technology as my future
research.
4. Assisted Living Technologies for Elder Care:
One of the most challenging problems facing our society today is the rapidly
aging population. By 2030, 20% of the U.S. population will be over the age of
65. This means that one in every five Americans, or 70 million Americans, will be
aged 65 or older. The aging of the population is bringing new challenges to
our current healthcare system. Some of the key challenges include the increase in
age-related diseases such as Parkinson's disease, the rise of healthcare costs, and
the shortage of professionals working with the aging population. Assisted living
technologies provide promising solutions that meet those challenges by building
smart homes to support independent living, developing intelligent medication
management systems to regulate medication intake, and designing activity
recognition systems to detect abnormal behavior. Although a number of such
assisted living technologies have been developed and studied, there are still many
open questions to be addressed. For example, how can we integrate personal health
information collected from assisted living technologies with information contained
in Personal Health Records (PHRs) in meaningful ways to deliver context-aware
personalized elder care? How can we develop predictive models from the collected
personal health data such that early detection of adverse events such as heart
attacks and diabetes complications becomes possible? Exploring the answers to
these questions is a key research opportunity that I am excited to pursue in the
future.