USC Computer Science Technical Reports, no. 766 (2002)
Multi-Layer Gesture Recognition: An Experimental Evaluation
Jacob Eisenstein, Shahram Ghandeharizadeh, Leana Golubchik,
Cyrus Shahabi, Donghui Yan, Roger Zimmermann
Department of Computer Science
University of Southern California
Los Angeles, CA 90089, USA
This research is supported in part by NSF grants EEC-9529152 (IMSC ERC) and IIS-0091843 (SGER).
Abstract
Gesture recognition techniques often suffer from being highly device-dependent and hard to extend.
If a system is trained using data from a specific glove input device, that system is typically unusable with
any other input device. The set of gestures that a system is trained to recognize is typically not extensible
without retraining the entire system. We propose a novel gesture recognition framework to address these
problems. This framework is based on a multi-layered view of gesture recognition. Only the lowest layer
is device dependent; it converts raw sensor values produced by the glove to a glove-independent semantic
description of the hand. The higher layers of our framework can be reused across gloves, and are easily
extensible to include new gestures. We have experimentally evaluated our framework and found that it
yields performance at least as good as that of conventional techniques, while substantiating our claims of device
independence and extensibility.
1 Introduction
Gesture recognition offers a new medium for human-computer interaction that can be both efficient and
highly intuitive. However, gesture recognition software is still in its infancy. While many researchers
have documented methods for recognizing complex gestures from instrumented gloves at high levels of
accuracy [6, 9, 12, 14], these systems suffer from two notable limitations: device dependence and lack of
extensibility.
Conventional approaches to gesture recognition typically involve training a machine learning system
to classify gestures based on sensor data. A variety of machine learning techniques have been applied,
including hidden Markov models [6, 9, 15], feedforward neural networks [14], and recurrent neural net-
works [6, 11]. These different approaches have a common feature: they all treat gesture recognition as
a one-step, single-layer process, moving directly from the sensor values to the detected gesture. Conse-
quently, the properties of the specific input device used in training are built into the system. For example, a
system that was trained using a 22-sensor CyberGlove would almost certainly be of no use with a 10-sensor
DataGlove. The system expects 22 inputs, and would be unable to produce meaningful results with the 10
inputs offered by the DataGlove.
Ideally, a gesture recognition system should be able to work with a variety of input devices. As the
number of sensors is reduced, the performance might degrade, but it should degrade gracefully. We call this
property device independence, and it is the first significant advantage of our approach.
In order to achieve device independence, we have reconceptualized gesture recognition in terms of a
multi-layer framework. This involves generating a high-level, device-independent description of the sensed
object – in this case, the human hand. Gesture recognition then proceeds from this description, independent
of the characteristics of any given input device.
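As a purely illustrative sketch (ours, not from the report) of what such a device boundary could look like in code, the Python interface below assumes a hypothetical GloveAdapter that converts raw sensor readings into the device-independent hand description; everything above that interface can then ignore which glove produced the data.

from abc import ABC, abstractmethod

class GloveAdapter(ABC):
    """Hypothetical device-dependent layer: one subclass per glove."""

    @abstractmethod
    def read_raw(self):
        """Return the current raw sensor values (device-specific length)."""

    @abstractmethod
    def hand_description(self):
        """Return the device-independent semantic description of the hand,
        e.g. the vector of postural predicates introduced in Section 2."""

class CyberGloveAdapter(GloveAdapter):
    ...   # would read 22 raw sensors

class DataGloveAdapter(GloveAdapter):
    ...   # would read 10 raw sensors, but expose the same hand_description()

# A recognizer written against GloveAdapter.hand_description() works with
# either glove, which is the device-independence property described above.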
Because our approach allows the gesture recognition system to use a clean, semantic description of the
hand, rather than noisy sensor values, much simpler techniques can be employed. It is not necessary to
use anything as complicated as a neural network; rather, simple template matching is sufficient. Template
matching provides a key advantage over more complex approaches: it is easily extensible, simply by adding
to the list of templates. To recognize a new gesture with a conventional system, the entire set of gestures must
be relearned. But we will show experimentally that with our approach, it is possible to add new gestures
without relearning old ones. These new gestures are recognized with nearly the same accuracy as those in
the original training set. Thus, extensibility is the second main advantage of our approach.
In Section 2, we describe our multi-layer framework in more detail. Section 3 presents our imple-
mentation for the task of ASL fingerspelling recognition, which we believe will generalize to other virtual
reality applications. The results of an experimental evaluation of this implementation are given in Section 4.
Section 5 surveys related work, and Section 6 presents brief conclusions and future research directions.
2 A Multi-Layer Framework
Our proposed framework is based on a multi-level representation of sensor data. It consists of the following
four levels:
1. Raw data: This is the lowest layer and contains continuous streams of data emanating from a set
of sensors. We have addressed the analysis of this general class of data in [4]. In this paper, we
are specifically concerned with data generated by the sensors on a glove input device. A detailed
description of the sensors included with the CyberGlove haptic device is provided in Table 1. The
data at this level is highly device-dependent; each device may have a unique number of sensors,
and the sensors may range over a unique set of values. However, raw sensor data is application
independent; no assumptions are made about how the data will be used, or even what it describes.
Conventional approaches to dealing with streaming sensor data typically operate at exactly this level.
Consequently, these approaches are usually very difficult to adapt to new devices, and they fail to take
advantage of human knowledge about the problem domain.
2. Postural Predicates: This level contains a set of predicates that describe the posture of the hand.
Sensor number Sensor description
1 thumb roll sensor
2 thumb inner joint
3 thumb outer joint
4 thumb-index abduction
5 index inner joint
6 index middle joint
7 index outer joint
8 middle inner joint
9 middle middle joint
10 middle outer joint
11 middle-index abduction
12 ring inner joint
13 ring middle joint
14 ring outer joint
15 ring-middle abduction
16 pinky inner joint
17 pinky middle joint
18 pinky outer joint
19 pinky-ring abduction
20 palm arch
21 wrist flexion
22 wrist abduction
Table 1: CyberGrasp Sensors
Table 2 provides a list of the predicates to represent the hand postures for ASL fingerspelling. A vector
of these predicates consisting of 37 boolean values describes a general hand posture. For example, a
pointing posture is described by noting that the index finger is open, Open (index finger), while every
other finger is closed, Closed (thumb), Closed (middle finger), etc. Each predicate describes a single
feature of the overall posture – e.g., Closed (index finger). In our pointing posture, the five vector values
corresponding to the aforementioned predicates evaluate as true, while the remaining ones evaluate
as false. (An illustrative sketch of such a predicate vector appears after this list.)
While we do not claim that Table 2 presents a comprehensive list that captures all possible postures,
we do show that these predicates can describe the ASL alphabet. Preliminary research on other
fingerspelling systems suggests that this set of predicates is general, and we plan to investigate its
generality in support of non-fingerspelling applications. To this end, we plan to apply this set of
predicates to a virtual reality application in future research.
The postural predicates are derived directly from the lower level of raw sensor data. This process is
described in more detail in Section 3.2. The derivation of postural predicates from sensor data
is, by necessity, device-dependent. However, it is application-independent, if our postural predicates
are indeed general over a range of applications.
Once postural predicates are extracted from the sensor data, device-independent applications can be
implemented using this higher level of representation. Thus, our multi-layered approach provides
for the sound software engineering practice of modular reuse. Some modules can be reused across
multiple devices, while others can be reused across multiple applications.
3. Temporal Predicates: Our work thus far has mainly focused on postural predicates. Here, we assume
the ASL alphabet as our working data set where most signs are static and do not require any hand
motion (and hence have no temporal aspect). The extension of the framework to temporal signs is part
of our future work.
4. Gestural Templates: This layer contains a set of templates, each of which corresponds to a whole
hand gesture. Postures contain no temporal information; a gesture may contain temporal information,
Name | Definition | Applicability | Number of predicates
Open (X) | Indicates that finger X is extended parallel to the palm. | Any finger, and the thumb | 5
Closed (X) | Indicates that finger X is closed against the palm. Note that the Open and Closed predicates are mutually exclusive, but they are not complete: a finger may be neither entirely open nor entirely closed. | Any finger, and the thumb | 5
Touching-thumb (X) | Indicates that finger X is touching the thumb. | Any finger other than the thumb | 4
Grasping (X) | Indicates that finger X is grasping something with the thumb. | Any finger other than the thumb | 4
Split (X, Y) | Indicates that adjacent fingers X and Y are spread apart from each other. | Any adjacent pair of fingers, and the index finger and the thumb | 4
Crossing (X, Y) | Indicates that finger X is crossing over finger Y, with Y closer to the palm than X. | Any two fingers, though because of the limited flexibility of the human hand it is assumed that the index finger cannot cross the ring finger or the pinky, and that the middle finger cannot cross the pinky. | 14
Palm-facing-in () | Indicates that the palm is facing the signer, rather than the recipient of the signs. | The whole hand | 1
Table 2: Postural Predicates
Alphabet Set of predicates (corresponding to a template)
A Closed (F1, F2, F3, F4)
B Open (F1, F2, F3, F4), Closed (T)
C Grasping (F1, F2, F3, F4) (1 in, 0 degrees)
D Open (F1), Touching-thumb (F2, F3)
E Closed (T), Crossing (F1, T), Crossing (F2, T), Crossing (F3, T)
F Open (F2, F3, F4), Touching-thumb (F1), Split (F2, F3), Split (F3, F4)
G Open (F1), Closed (F2, F3, F4), Crossing (T, F2), Palm-facing-in()
H Open (F1, F2), Closed (F3, F4), Crossing (T, F3), Crossing (T, F4), Palm-facing-in()
I Closed (F1, F2, F3), Open (F4), Crossing (T, F1), Crossing (T, F2), Crossing (T, F3)
K Open (F1, F2), Closed (F3, F4)
L Closed (F2, F3, F4), Open (F1, T), Split (F1, T)
M Closed (T, F4), Crossing (F1, T), Crossing (F2, T), Crossing (F3, T), Crossing (T, F4)
N Closed (T, F3, F4), Crossing (F1, T), Crossing (F2, T), Crossing (T, F3)
O Touching-thumb (F1, F2, F3, F4)
P Closed (T, F3, F4), Open (F1, F2)
Q Open (T, F1), Closed (F2, F3, F4)
R Closed (F3, F4), Open (F1, F2), Crossing (T, F3), Crossing (T, F4), Crossing (F2, F1)
S Closed (F1, F2, F3, F4), Crossing (T, F1), Crossing (T, F2)
T Closed (F2, F3, F4), Crossing (T, F2), Crossing (F1, T)
U Closed (F3, F4), Open (F1, F2), Crossing (T, F3), Crossing (T, F4)
V Closed (F3, F4), Open (F1, F2), Crossing (T, F3), Crossing (T, F4), Split (F1, F2)
W Closed (F4), Open (F1, F2, F3), Crossing (T, F4), Split (F1, F2), Split (F3, F4)
X Closed (F2, F3, F4), Crossing (T, F2), Crossing (T, F3), Crossing (T, F4)
Y Closed (F1, F2, F3), Open (T, F4)
Table 3: ASL Fingerspelling Templates
although this is not required. A gesture is a description of the changing posture and position of the
hand over time. An example of a gesture is a hand with a pointing index finger moving in a circular
trajectory with a repetitive cycle.
Gestures are described as templates because they are represented as a vector of postural and temporal
predicates. In order to classify an observed hand motion as a given gesture, the postural and temporal
predicates should match the gestural template. This might be an approximate match; in our ASL
application, we simply choose the gestural template that is the closest match to the observed data (see
Section 3.1.2 for details).
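To make this representation concrete, the following Python sketch (ours, not code from the report) encodes a posture as a 37-element boolean predicate vector and a gestural template as a vector of the same form. The predicate ordering, the naming scheme, and the posture_vector helper are assumptions introduced purely for illustration; only the predicate inventory follows Table 2, and the example template follows the entry for the letter L in Table 3.

# Illustrative sketch only: a hand posture as a 37-element boolean predicate
# vector; a gestural template has exactly the same form.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]

# Crossings assumed impossible by Table 2 (limited flexibility of the hand).
IMPOSSIBLE_CROSSINGS = ({"index", "ring"}, {"index", "pinky"}, {"middle", "pinky"})

PREDICATES = (
    [f"Open({f})" for f in FINGERS]                                    # 5
    + [f"Closed({f})" for f in FINGERS]                                # 5
    + [f"Touching-thumb({f})" for f in FINGERS[1:]]                    # 4
    + [f"Grasping({f})" for f in FINGERS[1:]]                          # 4
    + ["Split(thumb,index)", "Split(index,middle)",
       "Split(middle,ring)", "Split(ring,pinky)"]                      # 4
    + [f"Crossing({x},{y})" for x in FINGERS for y in FINGERS
       if x != y and {x, y} not in IMPOSSIBLE_CROSSINGS]               # 14
    + ["Palm-facing-in()"]                                             # 1
)
assert len(PREDICATES) == 37

def posture_vector(true_predicates):
    """Return the 37-element vector: 1 where a predicate holds, 0 elsewhere."""
    true_set = set(true_predicates)
    return [1 if p in true_set else 0 for p in PREDICATES]

# The pointing posture from the text: index finger open, everything else closed.
pointing = posture_vector(["Open(index)", "Closed(thumb)", "Closed(middle)",
                           "Closed(ring)", "Closed(pinky)"])

# A gestural template, e.g. the ASL letter L (Table 3): index and thumb open
# and split apart, the remaining fingers closed.
template_L = posture_vector(["Open(index)", "Open(thumb)", "Split(thumb,index)",
                             "Closed(middle)", "Closed(ring)", "Closed(pinky)"])

Under this encoding, the template matching of Section 3.1.2 reduces to comparing such vectors.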
3 Implementation
Figure 1 shows the two key modules of our implementation: 1) a set of predicate recognizers, and 2)
a template matcher. The predicate recognizers convert raw sensor data to a vector of predicates. The
template matcher then identifies the nearest gestural template. The template matcher is assisted by two other
components: a confidence vector, and a probabilistic context. We will first describe how these components
work together to detect ASL signs. Next, Section 3.2 describes how the system is trained.
3.1 ASL Sign Detection
This section explains how the system moves from sensor data to the vector of predicates. We then describe
the basic template matching technique, and show how it can be augmented with context and the confidence
vector.
3.1.1 Predicate Recognizers
The predicate recognizers use traditional gesture recognition methods to evaluate each postural predicate
from a subset of the sensor data. In this case, we implemented the predicate recognizers as feedforward
neural networks; we have explored other approaches in the past [3]. Each predicate recognizer need not
consider data from the entire set of sensors. Rather, the sensors are mapped as input to the predicate rec-
ognizers manually, using common knowledge of which sensors are likely to be relevant to each predicate.
For example, the predicate Crossing (T, F1) receives input only from the sensors on the thumb and index
finger. By mapping only those sensors that are relevant to each predicate recognizer, human knowledge can
be brought to bear to dramatically improve both the efficiency and accuracy of training. Table 2 shows the
seven predicate types and the thirty-seven predicates required to describe a single handshape.
To perform recognition of these thirty-seven postural predicates, we employ thirty-seven individual neu-
ral networks (see Figure 1). Each neural net consumes between 4 and 10 sensor values, includes ten hidden
nodes, and produces either a zero or a one as output, denoting the logical valence of its predicate. The
output predicate values are then collated into a vector, which is fed as input to the template matcher.
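As a rough illustration of this layer (our sketch; the report gives no code, and we do not know the authors' exact network or training setup), the Python below models one predicate recognizer as a small feedforward network: a manually chosen subset of sensor indices feeds ten hidden units, and a sigmoid output is thresholded to a boolean predicate value. The numpy implementation, the random initialisation, and the example sensor mapping for Crossing (T, F1) are assumptions; training of the weights is omitted.

import numpy as np

class PredicateRecognizer:
    """Sketch of one per-predicate feedforward net (one of thirty-seven).

    Only the sensors judged relevant to the predicate are used as input,
    e.g. Crossing(T, F1) might read just the thumb and index sensors.
    """

    def __init__(self, sensor_indices, hidden_units=10, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.sensor_indices = list(sensor_indices)   # 4 to 10 sensors in the paper
        n_in = len(self.sensor_indices)
        # Randomly initialised weights; per-recognizer training (e.g. by
        # backpropagation on labelled postures) is omitted from this sketch.
        self.W1 = rng.normal(0.0, 0.1, (hidden_units, n_in))
        self.b1 = np.zeros(hidden_units)
        self.W2 = rng.normal(0.0, 0.1, hidden_units)
        self.b2 = 0.0

    def __call__(self, raw_sensors):
        x = np.asarray(raw_sensors, dtype=float)[self.sensor_indices]
        h = np.tanh(self.W1 @ x + self.b1)                    # hidden layer
        y = 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))    # sigmoid output
        return 1 if y >= 0.5 else 0                           # boolean predicate

# Hypothetical mapping: Crossing(T, F1) reads the thumb and index sensors
# (sensors 1-7 of Table 1, written here as zero-based indices).
crossing_t_f1 = PredicateRecognizer(sensor_indices=[0, 1, 2, 3, 4, 5, 6])
# predicate_vector = [rec(raw_sensors) for rec in all_37_recognizers]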
3.1.2 Template Matching
The gesture recognizers for a specific application, such as ASL, are realized using these postural predicates.
Since these gesture recognizers manipulate high-level semantic data rather than low-level sensor values, it
becomes possible to employ simpler and more extensible approaches. Our system performs gesture recog-
nition by simple template matching on the detected vector of postural predicates. Template matching can
be extended by simply adding to the set of pre-existing templates. In addition, this template matching
component can be used across many different gloves (see Section 4).
The template matcher works by computing the Euclidean distance between the observed predicate vector
and every known template. The template that is found to be the shortest distance from the observed predicate
vector is selected. Mathematically, for a perceived predicate vector v, we want to find the gesture template
i that minimizes d_{i,v}, which is the Euclidean distance between the two vectors:

d_{i,v} = \sqrt{\sum_{p=1}^{37} (t_{i,p} - v_p)^2}

where t_{i,p} and v_p denote the p-th entries of template i and of the observed predicate vector v.
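A minimal Python sketch of this nearest-template rule follows (ours, not the authors' code); the templates dictionary and the reuse of posture_vector from the earlier sketch are assumptions for illustration. Note that extending the system to a new gesture amounts to adding one more entry to the dictionary, with no retraining of the existing recognizers.

import math

def euclidean(u, v):
    """Euclidean distance between two equal-length predicate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_template(observed, templates):
    """Return the name of the gestural template closest to the observed
    37-element predicate vector, i.e. the arg-min of d_{i,v}."""
    return min(templates, key=lambda name: euclidean(templates[name], observed))

# Hypothetical usage: templates maps each letter to its predicate vector,
# e.g. built with posture_vector() from the earlier sketch.
# templates = {"A": posture_vector([...]), "L": template_L, ...}
# letter = match_template(observed_predicate_vector, templates)

# Extensibility: recognizing a new gesture only requires adding a template.
# templates["new-gesture"] = posture_vector([...])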