ROBUST REAL-TIME VISION MODULES FOR A PERSONAL
SERVICE ROBOT IN A HOME VISUAL SENSOR NETWORK
by
Kwangsu Kim
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2007
Copyright 2007 Kwangsu Kim
Acknowledgements
I am deeply grateful to my advisor Prof. Gérard G. Medioni for his guidance
and support in this work. He has always supported me with thoughtful in-
sight and priceless knowledge to address the problem in this work and in
computer vision in general. He has been a mentor of my life, as well as an
academic adviser. I would also like to thank Prof. Ramesh Govindan, Prof.
Gaurav Sukhatme, Prof. Wlodek Proskurowski, and Prof. Alexandre R. J.
François for their guidance, interaction, and participation in my qualifying
examination and dissertation defense. I also appreciate the support of and cooperation with Dr. Jaeyeon Lee at the Electronics and Telecommunications Research Institute and Dr. Sangrok Oh at the Ministry of Information and Communication in Korea. Thanks to the Korean Government for giving me a chance to study abroad.
I am thankful to members of the Institute for Robotics and Intelligent
Systems (IRIS) for their assistance and support, especially Chanki Min and Matheen Siddiqui, for the enjoyable years in the institute. I also appreciate
Joanne for her kind consideration and support during my study.
Finally, special thanks to my family members, Eunhee, Alissa, Sally,
and my mother for their invaluable help and support in every way and for
giving me peace of mind.
Table of Contents
Acknowledgements ........................................................................................ ii
Table of Contents ......................................................................................... iii
List of Tables ................................................................................................. v
List of Figures ............................................................................................... vi
Abbreviations ................................................................................................ ix
Abstract .......................................................................................................... x
Chapter 1 Introduction ................................................................................ 1
1.1 Background ................................................................................... 1
1.2 Problem and Goals ........................................................................ 2
1.3 Proposed Approach ....................................................................... 4
1.4 Impact of the research ................................................................. 10
1.5 Thesis overview ........................................................................... 10
Chapter 2 Related Work ........................................................................... 12
2.1 Cooperation Model for Distributed Sensor Nodes ...................... 12
2.2 Multiple Camera-based Visual Surveillance System .................. 15
2.3 Advantages of Our Approach ...................................................... 16
Chapter 3 Cooperation Model for Multi-camera Nodes ........................... 19
3.1 Cooperation Model ...................................................................... 20
3.2 Interface Framework between Nodes .......................................... 21
3.3 Synchronization in multiple camera system ................................ 29
3.3.1 Automatic Synchronization .......................................... 30
3.3.2 Resolution of redundant information ........................... 32
Chapter 4 Fixed Sensor Node Processing ................................................. 35
4.1 People Detection and Tracking ................................................... 36
4.1.1 Related Work ............................................................... 36
4.1.2 People Detection .......................................................... 37
4.1.3 Tracking of Detected People ........................................ 42
4.2 Identification of Detected Person ................................................ 44
4.2.1 Measuring Height and Location ................................... 44
4.2.2 Identification using clothing color histogram distance 47
4.3 Head Detection ............................................................................ 51
4.3.1 Related Work ............................................................... 51
4.3.2 Combined Head Detection Method .............................. 51
4.3.3 Experimental Results ................................................... 54
4.4 Behavior Understanding .............................................................. 57
4.4.1 Related Work ............................................................... 57
4.4.2 2D Model Based Behavior Understanding .................. 58
4.4.3 Real-time Action Recognition Using Nondeterministic
Finite Automata model ................................................................ 60
Chapter 5 Mobile Robot Node Processing ............................................... 66
5.1 Communication Framework between the Robot and the Main
Node 66
5.2 Self-localization with Omni-directional Camera ........................ 68
5.2.1 Related work ................................................................ 69
5.2.2 The sensor .................................................................... 70
5.2.3 Approach ...................................................................... 72
5.2.4 Experimental Results ................................................... 75
5.3 Gesture Recognition at Short Range ........................................... 77
Chapter 6 Current Results of Integrated System ...................................... 80
6.1 Integration of Multiple Camera Nodes ........................................ 80
6.2 Validation Test ............................................................................ 83
6.3 Fault Tolerance ............................................................................ 84
Chapter 7 Conclusion ............................................................................... 86
References .................................................................................................... 89
List of Tables
Table 1: Functionalities of the Main Node .................................................. 27
Table 2: Broadcasting-Event scheme between the main node and each
camera node ......................................................................................... 31
Table 3: Experimental results of head detection method ............................. 56
Table 4: Experimental results of action matching ....................................... 65
Table 5: Experimental results of robot navigation ....................................... 76
Table 6: Validation test results .................................................................... 83
List of Figures
Figure 1: Various types of home service robots (images from internet) (a)
Samsung: Chorongi (b) ZMP: Nubo (c) Mitsubishi: Wakamaru (d)
DasaTech: Zenibo (e) Samsung: Mir (f) Hanwool Robotics: Netoro (g)
Yujin Robotics: Irobi Q (h) Axlon: Mir (i) Ezrobotics: Cubo ............... 1
Figure 2: System organization of sensor/network architecture ...................... 5
Figure 3: Required capabilities for the robot ................................................. 6
Figure 4: Distribution of functionalities ........................................................ 7
Figure 5: Robot Platform ............................................................................... 9
Figure 6: Comparison of related approaches ............................................... 18
Figure 7: Cooperation model between nodes ............................................... 20
Figure 8: Ubiquitous Camera Interface Framework (UCIF) ....................... 23
Figure 9: Interaction between application nodes and the main node ........... 24
Figure 10: Data management between a fixed camera node and the main
node ...................................................................................................... 25
Figure 11: Overlapping cameras .................................................................. 30
Figure 12: Resolving duplication with distance ........................................... 32
Figure 13: HSV Color Histogram Matching to resolve duplications .......... 33
Figure 14: People Detection Flow ............................................................... 39
Figure 15: People detection results .............................................................. 41
Figure 16: Association Matrix ..................................................................... 42
Figure 17: Experimental results of people tracking ..................................... 43
Figure 18: Measuring Height & Location .................................................... 45
Figure 19: Experimental results of measuring location ............................... 46
Figure 20: Masked Color Histogram in HSV space .................................... 47
Figure 21: Experimental results of identification using clothing color
histogram distance: Assign same ID number in between different
sequences which have thirty minutes time lag. .................................... 50
Figure 22: Head detection flow using contour analysis ............................... 52
Figure 23: Head detection flow using Haar classifier and Mean shift tracker
.............................................................................................................. 54
Figure 24: Head detection results................................................................. 55
Figure 25: Feature detection for state transition .......................................... 61
Figure 26: Action Transition using Nondeterministic Finite Automata
Model ................................................................................................... 62
Figure 27: Experimental Results of action recognition ............................... 63
Figure 28: Communication framework between the main node and the robot
node ...................................................................................................... 68
Figure 29: Omni-directional camera structure (top), raw input image
(middle left), cube box image (middle right) and panoramic image
(bottom). ............................................................................................... 71
Figure 30: Getting position from angles (upper), Samples of edges (lower)
.............................................................................................................. 73
Figure 31: Experimental results of the robot navigation using omni
directional camera in real application .................................................. 75
Figure 32: Trajectory of the robot navigation in the experiment to measure
the error (black point: ground truth) .................................................... 77
Figure 33: Experimental results of the upper body detection method ......... 79
Figure 34: Experimental results of the integrated system. We installed 3
cameras in 2 different rooms. ............................................................... 82
Abbreviations
IH Intelligent Home
VSAM Video Surveillance and Monitoring
FOV Field of View
FSM Finite State Machine
NFA Nondeterministic Finite Automata
UCIF Ubiquitous Camera Interface Framework
CRIF Common Robot Interface Framework
HSV Hue Saturation Value
BS Background Subtraction
FCN Fixed Camera Node
MCN Mobile Camera Node
RANSAC Random sample consensus
OCN Omni-Directional Camera Node
Abstract
The Intelligent Home, which integrates information, communication and
sensing technologies with/for everyday objects, is emerging as a viable en-
vironment in industrialized countries. It offers the promise to provide secu-
rity for the population at large, and possibly to assist members of an aging
population. In the intelligent home context, personal service robots are ex-
pected to play an important role as interactive assistants, due to their mobili-
ty and action ability which complement other sensing nodes in the home
network. As an interactive assistant, a personal service robot must be en-
dowed with visual perception abilities, such as detection and identification
of people in its vicinity, recognition of people's actions or intentions.
We propose to frame the problems in terms of a distributed sensor net-
work architecture with fixed visual sensing nodes (wall-mounted cameras)
and mobile sensing/actuating nodes such as one (or more) personal service
robot. Each fixed node processes its video input stream to detect and track
people, and to perform some level of behavior analysis, given the limited
resolution. It may also communicate with the robot, directing it to move to a
specific area. The robot, in addition to navigation, must process visual input
from the on-board camera(s) to also perform person detection and tracking,
but at a finer level, since it is closer to the person. In particular, it should
locate a person’s face, possibly identify the person, in order to interact with
humans in a social setting. Each sensor node is connected to the intelligent
home network, and performs its task independently, according to the range
of interaction and the object of perception. Each fixed camera node on the
wall detects and tracks people in its field of view, and analyzes their beha-
vior. It may then trigger other sensor nodes, or communicate with the robot
for further sensing and closer analysis, by integrating multiple sensing
nodes in various levels, according to the range of interaction, mobility, or
required resolution. We also extend this strategy to the fusion of different
kinds of sensing, such as sound and vision, as human robot interaction is
multi-modal.
This fusion strategy can provide robustness and efficiency, compared to
the traditional image level analysis from a single camera, through a certain
level of redundancy, as well as the cooperation among the sensor nodes. We
have obtained encouraging results with this framework on a real robot in a
realistic environment, such as multiple, different types of sensor nodes in
several different places, changing illumination, and uninterrupted
processing for long periods of time.
Chapter 1 Introduction
1.1 Background
The population in industrialized countries is aging, making the concept of Intelligent
Home relevant. It integrates information, communication and sensing technologies
with/for everyday objects [50], to help seniors control their home easily and provide
for their security. In the intelligent home, personal service robots are expected to play an
important role as interactive assistants, due to their mobility and intelligent perception ability.

Figure 1: Various types of home service robots (images from internet): (a) Samsung: Chorongi, (b) ZMP: Nubo, (c) Mitsubishi: Wakamaru, (d) DasaTech: Zenibo, (e) Samsung: Mir, (f) Hanwool Robotics: Netoro, (g) Yujin Robotics: Irobi Q, (h) Axlon: Mir, (i) Ezrobotics: Cubo

In addition, a personal service robot can extend its functionalities by
joining and tapping into the home network. We propose a framework to organize and
coordinate multiple camera nodes in the intelligent home network, as they carry per-
ception tasks that support the robot in its activities, resulting in a synergy between
the robot and the intelligent home.
1.2 Problem and Goals
As an interactive assistant, a personal service robot must have essential visual per-
ception abilities, such as detection and identification of people in its vicinity, recog-
nition of people's actions or intentions. These tasks require an extremely wide field
of sensing to cover the entire domain of the home, which cannot be accomplished
with a single camera sensor on the robot. The narrow field of view of a single camera restricts these abilities to the area directly in front of the user. In addition, these tasks require significant processing power inside the robot, as they must be carried out simultaneously and in real time.
These difficult requirements can be overcome by tapping into resources of the
intelligent home network. Multiple vision sensors in the home network extend the
perceptual range of the robot. For example, its master does not need to appear in the
sight of the robot to issue commands. With information from other sensors, the robot
can monitor the entire home, or check the behavior of people inside the home to ex-
ecute its role autonomously and non-intrusively. In addition, the burden of
processing power can be distributed to several processing elements in the home net-
work, such as processors for the camera nodes on the wall, and even to a separate
server, resulting in a fast and reliable system.
Challenges associated with this framework are:
First, how to ensure the reliability of each functional module in a camera node in
the home network, and
Second, how to organize the individual nodes into a system to maximize the ef-
ficiency and robustness of the entire system.
Each functional module in a camera node, such as people detection and tracking,
or gesture recognition, should run concurrently and in real-time with a level of per-
formance that allows the robot to interact with its environment consistently and with
minimum latency. Certainly, it is not a trivial problem, and many new algorithms for
each task should be developed to meet the requirements. However, a systematic ap-
proach can boost the overall performance of the system, even if individual modules
are weak. For instance, multiple vision sensors in the intelligent home network can
share the appearance information of detected people to remove possible false alarms,
and can predict the appearance of people using the moving trajectory information of
the detected people.
The task of designing an efficient framework to integrate the individual modules can be itemized as follows:
(1) Distributing required tasks to each sensor node efficiently
(2) Constructing a well defined and reliable conceptual model to fuse informa-
tion from each node, and to share information between nodes.
Our research is motivated both by the improvement of efficiency and robustness through the integration of multiple vision sensors in the intelligent home network, and by the necessity of reliable perception modules for natural human-robot interac-
tion.
1.3 Proposed Approach
We intend to treat these challenges from a fusion point of view, which integrates
multiple sensing nodes at various levels, according to the range of interaction, mobility, and required resolution. We can also extend this strategy to the fusion of different
kinds of sensing, such as sound and vision, for multi-modal human-robot interaction. We propose a complete integrated system composed of an actual personal ser-
vice robot and a set of fixed sensing nodes, as shown in Figure 2. We refer to this
figure throughout the thesis.
In the home, typical activities of a personal service robot might involve the fol-
lowing scenarios: The robot actively and quietly monitors its environment until it
recognizes that a potential user summons it or might be in need of assistance. In re-
sponse to a signal from a person, the robot should take some relevant actions, such as
moving closer to the potential user, and become engaged in a focused interaction
with this potential user, whom it might identify as its "master" or privileged user.
Figure 2: System organization of sensor/network architecture (fixed camera nodes, a mobile robot node carrying an omni-directional camera and a stereo camera, and an external processing node; the labeled functions are 3D localization, people detection & tracking, feature extraction (height, color, posture, head), action recognition, broadcasting, planning and navigation, face detection and tracking, limb detection and tracking, 2.5D fusion, 3D robot-centric description, gesture recognition, and identification)
When it responds to a signal from a person, it should communicate with this person
with an explicit acknowledgement.
According to this scenario, the robot is required to have the capabilities shown
in Figure 3. To implement these capabilities, the vision system for a home service
robot involves:
(1) Continuous detection and tracking of non static objects in the environment
(principally people)
(2) Recognition of the user as somebody who may issue commands to it
(3) Some level of scene understanding and situation awareness, including under-
standing of gestures or commands
(4) Self localization in the environment
Figure 3: Required capabilities for the robot
These key functionalities can be distributed to several sensor nodes, according
to the range of interaction and the type of image data, as shown in Figure 4. When
we distribute these key functionalities, we adopt the following three strategies which
can improve the efficiency and robustness of the system.
(1) Division of labor: To make the system interact with the user efficiently, the
level of interaction should be different according to the type and specification of
camera. For example, the stereo camera node on the robot can recognize finer-level gestures of the user at short range. On the other hand, the fixed camera nodes on the wall can detect coarse-level behaviors of the user with inexpensive processing.
(2) Cooperation with a certain level of redundancy: The information from each
sensor node should contain a certain level of redundancy to overcome limitations of
individual sensors, such as narrow field of view and limited resolution, and should be shared to increase the efficiency of the entire system.
Figure 4: Distribution of functionalities (fixed camera node: detects & tracks people, identifies people, recognizes the posture for basic scene understanding; stereo camera node: activated near people, detects and tracks limbs, gesture recognition; omni camera node: computes 3D position and orientation of the robot)
(3) Loosely coupled, independent nodes: Each sensor node should run indepen-
dently and in parallel, and share only semantic information through the network to
make the system scalable, reliable, and fault tolerant.
According to these strategies, we propose to use three different types of visual
sensors. As shown in Figure 4, first, a fixed camera node, which is usually on a wall, detects and tracks people, identifies people as privileged users, and analyzes the behavior of detected people to trigger other sensor nodes on the robot for further interaction, or to make the robot react properly. The behaviors which can be recognized
include sitting, lying, falling down, waving, and walking. All these tasks are per-
formed in real-time.
Second, the stereo camera in the mobile robot node is activated when the robot is close to people, and it detects and tracks the limbs using the head position informa-
tion from an individual camera node. If needed, we classify the gestures such as
pointing, negation and waving in order to generate data for the planning and re-
sponse actions.
Third, the omni-directional camera in the mobile robot node computes the 3D
position and orientation of the robot in the world. This information can be used to
issue motion commands to the robot, and to register the view centric 3D model of the
observed world with the fixed environment.
Proposed methods to address the functional modules are further described in
Chapter 4 and Chapter 5.
In our system, each sensor node is connected to the intelligent home network us-
ing wireless or wired communication channel, and performs its task independently.
To make all the sensor nodes work seamlessly and cooperatively, we propose an
efficient cooperation framework UCIF (Ubiquitous Camera Interface Framework),
which fuses the information from each node at the symbolic level, instead of at the
image level, and makes decisions about the robot's reactions through analysis of the gathered information.
We validate the efficiency and robustness of our framework using a real home
service robot platform (shown in Figure 5) in a realistic environment, deploying dif-
ferent types of sensor nodes in several different places, changing illumination, and
uninterrupted processing for long periods of time.
Figure 5: Robot Platform
(a) Prototype of robot platform (b) Commercial robot platform
1.4 Impact of the research
Recently, the focus of the robotics industry is shifting from industrial robots to per-
sonal service robots for health care, entertainment, education, and home cleaning. They are being deployed on a commercial scale, and the market is expected to
grow rapidly, reaching over $17 billion in 2010 [27].
One of the major technical obstacles to the development of a personal service
robot is the high perception ability required for natural human-robot interaction. The
integration of intelligent home network and personal service robot will extend the
perception range of the robot to the entire domain of home, enabling non-intrusive,
autonomous services to the user. For example, the robot can respond to a command
from its master, deliver objects, or handle an emergency, wherever he or she is inside the
home. Our research can provide a foundation to integrate not only camera sensors,
but also different types of sensors in the intelligent home network for multi-modality.
Once a robust and efficient framework to integrate various sensor nodes in the home
has been obtained for robust and wide-range perception, major technological
breakthroughs can be expected in all applications where a personal service robot is
used.
1.5 Thesis overview
Chapter 2 reviews previous research efforts mainly related to the framework for integrating multiple sensor nodes in an intelligent home
environment. The literature relevant to each functional module is reviewed in the
chapter describing the specific module. Chapter 3 describes the details of the cooper-
ation framework for multi-sensor nodes. Chapter 4 and Chapter 5 explain technical
details of each sensor node in the proposed framework. In Chapter 6, we show current experimental results of the integrated system. In Chapter 7, concluding remarks are
given and some future research directions are listed.
Chapter 2 Related Work
Both the intelligent home and personal service robots are part of a new, very active
research field. It is being pioneered by several research teams worldwide, and some
new conferences, such as the International conference on Autonomous Robots and
Agents (ICARA) [69], the International Conference on Ubiquitous Robots and Am-
bient Intelligence (URAI) [70], and new journals, such as the Journal of Intelligent
Service Robotics [71], are devoted to this field. We review below some of the
larger efforts in this area; the related work for the individual visual perception tasks in a single camera node is reviewed separately in Chapter 4 and Chapter 5.
Chapter 2 is organized as follows: In Section 2.1, previous studies on cooperation models for distributed sensor nodes are reviewed. Section 2.2 reviews related work on multiple camera-based visual surveillance systems, because the multi-camera control frameworks in such surveillance systems can serve as a reference model for our approach.
2.1 Cooperation Model for Distributed Sensor Nodes
Conventionally, a major trend of the research field of robot perception has been to
increase the intelligence of the robot itself in a limited space. However, it is difficult
to overcome the limitations of field of sensing and of processing power inside a ro-
bot. As a new approach to overcome these limitations, a concept of distributed sen-
sors outside a robot has been proposed and has been an active research field recently.
Lee and Hashimoto propose a concept of Intelligent Space, where Distributed
Intelligent Networked Devices (DINDs) monitor the space, and share acquired data
through the network [33]. They validate this concept with a human following system,
where a robot is controlled to follow a human through the interaction between the
mobile robot and the Intelligent Space [1]. In their approach, surrounding space has
sensors and intelligence instead of the robot. As a result, the burden of processing
power can be distributed to several processors in the intelligent space, and distri-
buted sensors can provide wider sensing area and more accurate information about
the robot and the human in the space. However, this approach mainly focuses on
robot control, such as localization and navigation, and the interface between a user
and a robot in the same space. Also, in some cases, fixed sensors cannot provide
enough resolution which is needed for detailed interaction, such as facial gesture
recognition and hand gesture recognition.
Similar to the Intelligent Space concept, Lee et al. propose a concept and a
structure of a Ubiquitous Robotic Space (URS), where sensing ability or intelligence
is distributed into the space or a remote server, and robotic tasks are executed by
cooperation within the intelligent space [33]. In their approach, URS comprises three
major components. Physical space, which consists of a robot and sensor network,
provides environment data including current status of the robot. Semantic space
processes data to extract contextual information about the physical space. Virtual
space transforms environment data to graphic data for user interface. They enhance
the perception and motion ability of a robot through this distributed robotic space
approach, and simplify the robot itself by distributing processing power to a remote
server.
Saffiotti and Broxvall [51] propose a concept of Ecology of Physically Embed-
ded Intelligent Systems, or PEIS-Ecology, which offers a new paradigm to develop
pervasive robotic applications by combining pervasively distributed robotic devices
in the environment, which are in the form of sensors, active tagged objects, smart
appliances. In their approach, each PEIS has linking functional components, and as a
result, it can use functionalities from other PEIS in order to compensate or to com-
plement its own task. Also, a human can interact with all the PEIS through a single
common interface point. This conceptual approach is targeting a ubiquitous robotic
system in a surrounding space instead of a standalone robot.
Kröse [28] at the University of Amsterdam defines functionalities a cognitive
robot should have, such as detection and understanding of human activities, spatial
cognition, skills and tasks learning. Based on these functionalities, he proposes to
use multiple cameras to track humans and a spatio-temporal appearance-based method to deal with the association problem across multiple cameras.
While the approaches above assume an indoor environment, such as a home or a small office, Hasegawa and Murakami at Kyushu University expand the scope of application to larger, ordinary environments such as a shopping mall [20]. Their
approach integrates dynamically changing data from distributed vision sensors,
RFID, and GIS data for a robot working within an ordinary environment. As an ex-
ample, they use multiple distributed cameras to track pedestrians in world coordinate
system.
2.2 Multiple Camera-based Visual Surveillance System
Similar to the framework of distributed sensor nodes for a home service robot is that
of the multiple camera-based visual surveillance system. By using multiple cameras in a visual surveillance system, the surveillance area can be expanded and the ambiguity of a single camera can be resolved using multiple-view information.
Collins et al. proposed a multi sensor control framework for Video Surveillance
and Monitoring (VSAM) test bed [10]. In this framework, each sensor processing
unit transmits symbolic information to the operator control unit. Basically, the pur-
pose of this framework is surveillance. Consequently, it is a centralized model for
the center unit to control multiple sensor processing units correctly, and each sensor
unit does not share information with the other units. However, in the intelligent home
environment, each sensor node should share its processing result and, if needed,
should update it to improve the reliability and performance of the system.
In a multiple camera-based visual surveillance system, if an object is moving out
of the field of view of an active camera, the system should switch to another camera
which can give a better view. For this purpose, Cai and Aggarwal propose a concept
of tracking confidence for automatic camera switching. In their approach, if the
tracking confidence is below a threshold, a global search begins and selects the cam-
era with the highest confidence [5].
One of the advantages of a multiple camera system is that data from different
cameras can be fused to resolve occlusion problems and to track objects continuously. Dockstader and Tekalp use a Bayesian network to fuse 2-D state vectors
acquired from various image sequences to obtain a 3-D state vector [13]. Collins et
al. [10] propose a method to represent an entire scene by fusing information from
every camera into a 3-D geometric coordinate system. Kettnaker et al. synthesize the
tracking results of different cameras to obtain an integrated trajectory [28].
2.3 Advantages of Our Approach
The approaches above use a cooperation model for distributed sensor nodes: they use distributed sensors in the surrounding space to expand the perception range of a robot beyond its limited on-board sensing, and distribute the burden of processing power to multiple distributed processing elements, resulting in increased intelligence of the robot.
However, from the viewpoint of a vision system for a home service robot, one more aspect to consider is awareness of the whole domain of the home. A home service robot should know where its masters are, what they are doing, and whether they need any kind of help from the robot, even though they may not be in the sight of the robot. In this context, each visual sensor in a different location should always maintain awareness of its sensing area, and should recognize predefined actions of people in that area which may require a particular response from the robot.
Also, detailed human-robot interactions, such as facial gesture recognition and hand gesture interaction, require a division of labor among sensors. For example, if someone calls the robot, one of the fixed sensors recognizes it and asks the robot to move to the caller. After the robot approaches, the caller may give a command using a hand gesture, such as pointing. However, it is difficult for a fixed camera at some distance from the caller to recognize this gesture. In that case, a sensor
node on the robot can be used cooperatively.
To make each sensor node work with maximum efficiency and robustness, indi-
vidual nodes should be organized into a system, according to a well defined and reli-
able conceptual model to fuse information from each node and to share information
among nodes. Based on the approaches above, we propose a novel cooperation mod-
el for distributed sensor nodes, which can improve the perception ability of a home
service robot and the diversity of applications provided by the robot.
The frameworks for multiple camera-based visual surveillance systems mainly focus on the occlusion problem (e.g. [8]) or on the fusion of data from cameras in different places but at an identical level (e.g. [6]). The methods used in these approaches to fuse the data from each camera and to control the cameras for switching inspire our design of a distributed sensor node system. Figure 6 compares the approaches introduced above.
Figure 6: Comparison of related approaches
Chapter 3 Cooperation Model for Multi-camera Nodes
A vision system for a personal service robot should have a number of visual functio-
nalities, such as people detection, gesture or behavior recognition, to maintain
awareness of the entire home. Considering a cooperation of multiple camera nodes in
the intelligent home network, these functionalities should be well defined and distri-
buted to each sensor node in the network. These distributed functionalities should
exhibit a certain level of redundancy, as well as the cooperation in the model, to
overcome the common limitations of most vision sensors, such as narrow field of
view, and limited resolution. Also, social aspects of robot behavior should be consi-
dered, during a decision process for proper reaction of the robot. For example, when
a robot approaches a user, it should approach the user from the front, not too fast and
not too close, since coming too close is generally perceived as threatening behavior. Also, the
robot should acknowledge commands from its user, and express understanding.
We propose an efficient and robust cooperation framework, and we validate it in
a challenging environment, to make it applicable to a real commercial robot. We also extend the framework to include other types of sensors.
This chapter is organized as follows: Section 3.1 addresses the cooperation
model among multiple camera nodes. Section 3.2 explains the interface framework
between a main node and each sensor node, as well as the interface between each
node. Section 3.3 deals with the synchronization problem which is caused by mul-
tiple cameras in the same space.
3.1 Cooperation Model
In the proposed system shown in Figure 2, each camera node runs independently and
in parallel, but shares its information through the network, as shown in Figure 7.
Note that the transmitted information is symbolic (not images) and therefore requires
little bandwidth.
A fixed camera node shares its user identification information, such as height and clothing color, with other nodes for continuous tracking of people at the moment of hand-over. Each fixed camera node uses this information to check whether a person it detects is a new person or has moved from another camera's field of view. Also,
a fixed camera node provides the location information of the user which the robot should approach. Similarly, it provides the head position and posture of the user to the gesture recognition module in the mobile robot node.

Figure 7: Cooperation model between nodes (fixed camera node: people detection & tracking, identification; mobile robot node, stereo: gesture recognition; mobile robot node, omni: self-localization; exchanged data: location of user; 3D body and head position and posture of person; direction and distance to user; biometric info (height, color))
The mobile robot node is equipped with a stereo camera and an omni-directional
camera. The stereo camera provides direction and distance information in a world
coordinate reference frame, for the robot to move correctly. The omni-directional
camera provides the angle by which the robot should turn so that the stereo camera node can see the user in its field of view.
The mobile robot node sends input images to the remote processing node, and
receives the processing result from this node, because the processing power inside
the robot is limited.
Through this distributed architecture, the proposed system can maximize per-
formance, and prevent a possible bottleneck resulting from low processing power of
the robot.
3.2 Interface Framework between Nodes
Each sensor node should send the inferred symbolic data to the main node which
fuses this data and plans a proper action for the robot to do. In designing an interface
framework between camera nodes, the type of each sensor node and the scalability of
node are the main factors to consider.
If a camera node is a passive sensor node, which gathers and transmits raw data
without any processing, the burden of processing is transferred to the main node,
resulting in a bottleneck. Also, the network can be congested with huge amount of
data heading to the main node.
We propose the Ubiquitous Camera Interface Framework (UCIF), which is a
novel interface framework based on an active camera node. Each camera node in our
framework performs its visual tasks independently and locally. After processing an
input image, it transmits the results to the main node in symbolic form, instead of
sending the image itself. The main node in our framework fuses the information
from each node to execute real actions, such as robot control and display of fused information. As a result, the communication between nodes requires very low bandwidth, the burden of processing power is distributed across the nodes, and the system scales well.
Note that the UCIF application nodes and the main node are virtual nodes.
These are software processes which can reside on any computer in the system, even
on a single computer.
Figure 8: Ubiquitous Camera Interface Framework (UCIF)

As shown in Figure 8, each camera node has only a UCIF application instance, which is an abstraction layer of the main node and the robot. The main node has an API presentation layer for the robot and the display, which embodies the request
from each camera node. This way, we can remove the dependency of interface
framework on specific sensor type and specific robot hardware, which facilitates
porting and extends availability.
UCIF application nodes execute their tasks and manage their processing data in-
dependently. Only the data which should be shared with other nodes and the request
to operate the robot are sent to the main node through the API functions in symbolic
data form.
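As an illustration of this symbolic exchange, the sketch below shows one way a camera node could frame a report for the main node as a small, length-prefixed JSON message over TCP. The message fields and the send_report helper are assumptions made for illustration; they are not the actual UCIF API.

```python
import json
import socket
import time

def send_report(sock: socket.socket, camera_id: int, people: list) -> None:
    """Send a symbolic detection report to the main node over an existing TCP socket.

    Only inferred features (location, height, clothing color histogram, action label)
    are transmitted; no image data crosses the network.
    """
    message = {
        "type": "REPORT",
        "camera_id": camera_id,
        "timestamp": time.time(),
        "people": people,
    }
    payload = json.dumps(message).encode("utf-8")
    # Length-prefixed framing so the main node can split messages on the byte stream.
    sock.sendall(len(payload).to_bytes(4, "big") + payload)

# Example usage (host name and port are placeholders):
# sock = socket.create_connection(("main-node.local", 9000))
# send_report(sock, camera_id=1,
#             people=[{"id": 3, "x": 1.2, "y": 0.4, "height": 1.71, "action": "waving"}])
```

A broadcast from the main node can reuse the same framing, so each per-frame exchange stays within a few hundred bytes of symbolic data.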
Figure 9: Interaction between application nodes and the main node (application nodes: registration, detect & track people, recognize behavior, update local data, report; main node: update global data, camera info / person DB, global map display, broadcast, CRIF instance; decision points: New?, Robot?)

Figure 10: Data management between a fixed camera node and the main node
As shown in Figure 9 and Figure 10, a fixed camera node detects and tracks
people in its field of view independently, and sends the feature data of the detected
people, such as location, trajectory, height, and color feature, to the main node. Us-
ing this feature data and the updated person DB in the main node, it identifies the
detected people. If a detected person is a new person, it updates the local person DB
it has and labels the new person with the ID from the global data in the main node.
Also, it recognizes the predefined actions of the detected people and reports them to the main node.
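A minimal sketch of this identification step is given below, assuming the person DB maps global IDs to a stored height and a normalized clothing color histogram (NumPy arrays). The tolerance and threshold values, like the field names, are illustrative and not taken from the thesis.

```python
import numpy as np

def identify_person(features: dict, person_db: dict,
                    height_tol: float = 0.10, hist_thresh: float = 0.3) -> int:
    """Match a detected person against the global person DB, or register a new entry.

    `features` holds the measured height in meters and a normalized clothing color
    histogram; `person_db` maps global IDs to entries with the same fields.
    """
    best_id, best_dist = None, float("inf")
    for person_id, entry in person_db.items():
        if abs(entry["height"] - features["height"]) > height_tol:
            continue  # height alone rules out this candidate
        dist = float(np.linalg.norm(entry["color_hist"] - features["color_hist"]))
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    if best_id is not None and best_dist < hist_thresh:
        return best_id  # person already known, e.g. handed over from another camera
    new_id = max(person_db, default=0) + 1  # the main node assigns the next global ID
    person_db[new_id] = features
    return new_id
```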
The main node is the headquarters, keeping awareness of the entire domain of the home,
fusing and analyzing data from each node to provide the intelligence needed for a
proper reaction of the robot. Table 1 shows the functionalities the main node should
have.
The main node decodes the received data from each sensor node, and processes
this data in global coordinates. For example, the main node updates the location data
of each person inside the home in the global map using the reported data from each
sensor node. If a particular data should be shared among the sensor nodes, it broad-
casts this data. Also, if an operation of the robot is required, it issues a control command to the robot through a CRIF instance. For the interface between the main node and a
robot, we use the Common Robot Interface Framework (CRIF), which has been de-
veloped by ETRI. CRIF is robot API software that provides a convenient interface between the robot and the external processing node [14].
Table 1: Functionalities of the Main Node

Functionality | Description | Proposed Approach
Communication | Network API; one-to-one and one-to-all two-way communication | TCP/IP; report and acknowledgement; broadcasting
User Interface | Show the status of the entire domain | 3D display of global map
Data management | Fuse the data from each node and update the repository; global vs. local data | Exclusive global data update logic in the main node; broadcasting of the data which affect the local data
Robot Interface | Issue control commands to the robot | Common Robot Interface Framework (CRIF)
Intelligence | Logic of the situation; decision of the proper reaction of the robot | Planning based on the responding scenario
For efficient communication between nodes, we use TCP/IP connection, and in
our experiments, we use a wireless LAN as a communication channel.
A user should be able to monitor the entire domain of the home when interacting with the robot. For this purpose, the main node manages the locations of all the people inside the home and of the robot. In our approach, this information is displayed on the
3D global map of the home, as shown in Figure 9.
When a node is turned on, it is registered to the main node with its characteristic data, such as its location in the global map, calibration data, and a time stamp for synchro-
nization. Receiving this registration data, the main node assigns a sensor number
exclusively and invokes a data management module corresponding to the type of
sensing data.
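A rough sketch of this registration step on the main node is shown below. The class, the message fields, and the manager names are hypothetical; the point is only that an exclusive sensor number is assigned and a data-management module matching the sensor type is invoked.

```python
import itertools

class MainNodeRegistry:
    """Hypothetical registration handler on the main node (illustrative sketch)."""

    def __init__(self):
        self._numbers = itertools.count(1)
        self.nodes = {}

    def register(self, msg: dict) -> int:
        """Handle a registration message carrying location, calibration data and a time stamp."""
        sensor_no = next(self._numbers)                 # exclusive sensor number
        self.nodes[sensor_no] = {
            "type": msg["sensor_type"],                 # e.g. "fixed", "stereo", "omni"
            "location": msg["location"],                # position in the global map
            "calibration": msg["calibration"],
            "time_offset": msg["timestamp"],            # kept for synchronization
            "manager": self._make_manager(msg["sensor_type"]),
        }
        return sensor_no

    def _make_manager(self, sensor_type: str) -> str:
        # Select the data management module corresponding to the type of sensing data.
        return {"fixed": "FixedCameraDataManager",
                "stereo": "StereoDataManager",
                "omni": "OmniDataManager"}.get(sensor_type, "GenericDataManager")
```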
Note that each node and the main node are independent software modules which can run together on the same computer or on different computers. The system does not require a dedicated computer for each node, resulting in good scalability.
We validate our framework in a challenging setup, with multiple, different types of sensor nodes in several different places and changing illumination, and we measure the durability and stability of the framework in a test bed simulating a real home over a long period of time. Details of our experiments are described in Chapter 6.
The UCIF can be extended to fuse information from sensors other than cameras, such as sound recognition or ultraviolet sensors. Extending the kinds of sensors can enrich the interaction between the user and the robot. For example, the system can respond to a specific sound, such as a handclap, to call the robot. The sound signal is
captured as an event, and is reported to the main node to make the robot respond to
the signal.
As for the social aspect of robot behavior, we constrain the speed of the robot
movement and plan its moving trajectory to approach the user from the front, and
control the speed of approach.
3.3 Synchronization in multiple camera system
When we use multiple cameras whose fields of view overlap, we have several advan-
tages. As shown in Figure 11, the overall field of view can be extended and occlu-
sion between multiple people can be mitigated to a certain extent.
However, it requires fine synchronization of these cameras to prevent an error
caused by image capturing time lag between cameras. For the scalability and stabili-
ty of the system, this synchronization should begin or end automatically. For exam-
ple, when multiple cameras are working at the same time, each camera should be
synchronized. If one of these cameras is turned off for some reason, such as a malfunction, the other cameras should continue to work without synchronization.
Another problem when using the multiple camera system is the duplication of
information. If a person is detected by multiple cameras, each camera node reports a
processing result to the main node. The main node should resolve these duplicated
data.
3.3.1 Automatic Synchronization
The UCIF maintains a blackboard system to check the status of each camera node.
Using the blackboard system, the main node keeps track of information such as the number of cameras running in each room and the latest processing result reported by each camera node. Using this information, the main node controls the synchronization of multiple camera nodes.

Figure 11: Overlapping cameras. (a) Extended field of view. (b) In camera B, occlusion is detected; camera B can make a distinction between the two people.
We use a broadcasting-event based method, as shown in Table 2, for synchronization. When a camera node is turned on, it registers itself to the main node and starts its role independently, while waiting for a start-of-synchronization event which may occur in a background process. If the main node decides that multiple camera nodes are working at the same time, it broadcasts the information of these camera nodes. This information triggers a start-of-synchronization event in the appropriate camera nodes, and each of these camera nodes then waits for a synchronization signal from the main node using a sync-signal event. The main node broadcasts a synchronization signal only when every camera node that needs to be synchronized has reported its result. If one camera is turned off, resulting in an end-of-synchronization case, the main node broadcasts an end-of-synchronization signal to tell the other camera nodes not to wait for a synchronization signal.
Table 2: Broadcasting-Event scheme between the main node and each camera node

Broadcast (main node) | Event (camera node)
Start-of-synchronization (multiple cameras in a room) | Stop processing; wait for synchronization signal
Synchronization signal (every node in the sync group has reported its result) | Start image capture and processing; report processing result
End-of-synchronization (single camera node in a room) | Start image capture and processing without synchronization signal
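The scheme in Table 2 amounts to a small per-room state machine on the main node. The sketch below is a simplified rendering of that logic, assuming the blackboard tells us which cameras are active in a room; the message names follow Table 2, but the class and its methods are illustrative.

```python
class SyncController:
    """Per-room synchronization logic on the main node (simplified sketch)."""

    def __init__(self, broadcast):
        self.broadcast = broadcast   # callable that sends a message to every camera node
        self.cameras = set()         # cameras currently active in this room
        self.reported = set()        # cameras that reported a result in the current round

    def on_camera_registered(self, cam_id):
        self.cameras.add(cam_id)
        if len(self.cameras) >= 2:   # room now holds multiple cameras: start synchronizing
            self.broadcast({"event": "START_OF_SYNC", "cameras": sorted(self.cameras)})

    def on_camera_lost(self, cam_id):
        self.cameras.discard(cam_id)
        self.reported.discard(cam_id)
        if len(self.cameras) == 1:   # the remaining camera runs without synchronization
            self.broadcast({"event": "END_OF_SYNC"})

    def on_report(self, cam_id):
        self.reported.add(cam_id)
        if len(self.cameras) > 1 and self.reported >= self.cameras:
            self.reported.clear()
            # Every synchronized node has reported; trigger the next capture round.
            self.broadcast({"event": "SYNC_SIGNAL"})
```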
3.3.2 Resolution of redundant information
As shown in Figure 11, each camera node reports the people detection results to the
main node, and this information has a redundancy in the overlapped region.
When a camera node detects a person in its field of view, it extracts feature data
of this person, such as height, clothing color, and location from the image to identify
the person. The main node can utilize this information to resolve the redundancy
caused by overlapping fields of view. When more than one person is detected in the overlapping region (region e in Figure 11 (a)) of each camera node, the main node matches a corresponding pair which has the minimum distance, if that distance is less than a threshold τ, as shown in Figure 12. In our experiments, we set τ to 30 cm. If the distance is larger than the threshold for all the pairs, due to an error caused by incorrect blob selection or the like, we match a pair according to the matching score of the color histograms from each person blob, as shown in Figure 13.
Figure 12: Resolving duplication with distance
Figure 13: HSV Color Histogram Matching to resolve duplications. (a) Input image from camera 1; (b) input image from camera 2 (with H, S, and V histograms for each person blob)
$$P_{match} = \arg\min_{P}\left\{\,\left\|\vec{F}(x_i) - \vec{F}(y_i)\right\|^{2}\right\}$$

where F is an HSV color histogram vector from a person blob image.
In case of occlusion, the main node chooses the data from the camera node
which does not show occlusion, if at least one camera node reports non-occlusion.
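A minimal sketch of this duplicate-resolution step is given below: detections from two overlapping cameras are first paired by ground-plane distance with the threshold τ = 30 cm, and leftovers fall back to HSV histogram distance as in the criterion above. The function and field names are illustrative assumptions, and both detection lists are assumed to contain only people inside the overlap region.

```python
import numpy as np

TAU = 0.30  # meters; distance threshold for pairing duplicate detections

def resolve_duplicates(dets_a: list, dets_b: list) -> list:
    """Pair detections of the same person reported by two overlapping camera nodes.

    Each detection is a dict with a ground-plane position `pos` (x, y in meters)
    and a normalized HSV color histogram `hist` (NumPy array).
    Returns (index_in_a, index_in_b) pairs judged to describe the same person.
    """
    pairs = []
    unmatched_a = set(range(len(dets_a)))
    unmatched_b = set(range(len(dets_b)))
    # First criterion: pair the closest detections on the ground plane if within tau.
    for i in sorted(unmatched_a):
        best_j, best_d = None, TAU
        for j in unmatched_b:
            d = float(np.hypot(*np.subtract(dets_a[i]["pos"], dets_b[j]["pos"])))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
            unmatched_a.discard(i)
            unmatched_b.discard(best_j)
    # Fallback (e.g. incorrect blob selection): pair leftovers by HSV histogram distance.
    for i in sorted(unmatched_a):
        if not unmatched_b:
            break
        j = min(unmatched_b,
                key=lambda k: float(np.linalg.norm(dets_a[i]["hist"] - dets_b[k]["hist"])))
        pairs.append((i, j))
        unmatched_b.discard(j)
    return pairs
```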
Chapter 4 Fixed Sensor Node Processing
A fixed sensor node, for example one single camera on the wall, can provide remote
awareness of its environment to the robot. It is critical for the robot to know whether,
when, and where people are present, who the people are, and what they are doing.
We implement each functional module in this fixed camera node, based on the sce-
nario that multiple people are moving around in different rooms at home, sometimes
occluded by each other, sometimes exiting the field of view. When one of these
people shows a predefined behavior, such as raising a hand to call the robot, the ro-
bot should react properly to this behavior.
To implement this scenario, we propose a fast method to detect and track people
and their head, and to identify the people who have been detected. Also we propose a
2D based behavior understanding methodology. Preliminary results of our method in
an experimental scheme are shown in Chapter 6. In the experiment, we setup three
fixed camera nodes in different rooms. These nodes are connected to the main node,
which fuses the information from each fixed camera node and makes the robot re-
spond properly to the request from a detected person. As for the behavior under-
standing, we tested six kinds of actions, including a raising hand gesture to call the
robot. Our method in a fixed camera node runs at approximately 7-9 frames per
second on a Pentium IV 3.2GHz desktop.
4.1 People Detection and Tracking
In the home, a personal service robot should continuously detect and track people
who might be its master. A fixed camera node on the wall has an advantage for
this task, compared to a moving sensor on the robot, due to its stationary image.
4.1.1 Related Work
Approaches to detect people in images can be classified into motion blob segmenta-
tion by background modeling (e.g. see [64]) and direct detection of human forms (e.g.
[41][60]). Direct human detection has an advantage with a moving camera and under illumination change, but it is limited to restricted viewpoints and its computational complexity can be a disadvantage for a real-time system.
In the direct human detection category, many methods represent the human body
as an integral whole. For example, Papageorgiou et al.'s SVM detectors [47], Felzenszwalb's shape models [15], and Gavrila et al.'s edge templates [17][18] locate humans by recognizing the full body pattern. These methods degrade quickly in case of partial occlusion of the full body. To cope with the partial occlusion problem, part-based approaches, such as [41][60][39], have been proposed.
The background modeling approach has the advantage of computational sim-
plicity. Pfinder [59] is a real-time system for tracking a person which uses a multi-
class statistical model of color and shape to segment a person from a background
scene. It finds and tracks people's heads and hands under a wide range of viewing conditions. Haritaoglu et al. propose W4, a real-time visual surveillance system for
people detection and tracking [20], which employs a combination of shape analysis
and tracking to locate people and their parts and to create models of people’s appear-
ance.
One of the most common methods to segment the foreground in real-time applications is background subtraction, which thresholds the difference between the current image and the background image. The background can be modeled as a Gaussian dis-
tribution, and this model can be adapted to gradual light change by recursively up-
dating the model using an adaptive filter [37]. After foreground segmentation, a per-
son can be recognized using neural network based methods (e.g. [62]) and model
based methods (e.g. [2]).
In our approach, we use a fixed camera on the wall to detect people in an indoor environment. A moving object in the image is most often a human. Accordingly, we use an adaptive background subtraction method [10], without any complex human modeling.
4.1.2 People Detection
During the foreground segmentation process, we make use of an adaptive background subtraction method. To cope with problems caused by sudden illumination changes and temporarily moving objects that are not human, we combine frame differencing with background subtraction.
In our experiments in challenging environments, such as turning the light off and on or suddenly moving objects, the system shows good adaptation over a long period of time (8 hours of continuous operation).
As shown in the flowchart in Figure 14, a pixel x is a candidate for foreground if the difference between I_n(x), the intensity value at pixel x in the n-th image, and I_{n-1}(x) is greater than the threshold th_frameDiff. We set th_frameDiff to 20.

$$I_n(x) - I_{n-1}(x) > th_{frameDiff}$$
The regions defined as foreground by the frame differencing operation and by the background subtraction operation are recognized as a human if the size of the blob is within a threshold range Γ. Experimentally, we set the range Γ to 30~200 pixels in height and 10~200 pixels in width. Even a blob recognized as a human can be removed later if the initialization of the head fails repeatedly. After that, we update the background model as follows. B_n(x) is the background model, T_n(x) is the threshold, and α, β are coefficients controlling the speed of change of the background model. We set α to 0.9 and β to 5. For background updating, we use a median filter.

$$B_{n+1}(x) = \begin{cases} \alpha B_n(x) + (1-\alpha)\,I_n(x), & x\ \text{non-moving pixel} \\ B_n(x), & x\ \text{moving pixel} \end{cases}$$

$$T_{n+1}(x) = \begin{cases} \alpha T_n(x) + (1-\alpha)\,\beta\,|I_n(x) - B_n(x)|, & x\ \text{non-moving pixel} \\ T_n(x), & x\ \text{moving pixel} \end{cases}$$
In indoor environments, shadow edges are weak compared to outdoors. Using this property, we can minimize the effect of shadows: when we define a human blob region, we use the result of edge background subtraction instead of intensity background subtraction. As shown in Figure 15, weak shadows can be removed with this method.
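As an illustration of the detection step described above, the following is a minimal Python/OpenCV sketch combining frame differencing, background subtraction, blob-size filtering with the range Γ, and the adaptive update rule; the median filtering and edge-based shadow suppression of the full method are omitted, and the function and variable names are ours, not the thesis implementation.

```python
import cv2
import numpy as np

TH_FRAME_DIFF = 20                      # frame-differencing threshold quoted in the text
ALPHA, BETA = 0.9, 5                    # background / threshold update coefficients
BLOB_H, BLOB_W = (30, 200), (10, 200)   # blob size range Gamma (pixels)

def detect_people(prev_gray, cur_gray, bg, th_map):
    """One detection step. bg and th_map are float32 arrays of the frame size."""
    # Frame differencing: pixels that changed since the previous image
    moving = cv2.absdiff(cur_gray, prev_gray) > TH_FRAME_DIFF
    # Background subtraction with a per-pixel adaptive threshold
    foreground = cv2.absdiff(cur_gray, bg.astype(np.uint8)) > th_map

    # Candidate foreground = agreement of both cues
    candidate = (moving & foreground).astype(np.uint8) * 255
    contours, _ = cv2.findContours(candidate, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    people = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if BLOB_H[0] <= h <= BLOB_H[1] and BLOB_W[0] <= w <= BLOB_W[1]:
            people.append((x, y, w, h))

    # Adaptive update: only non-moving pixels adapt the models
    nonmoving = ~moving
    bg[nonmoving] = ALPHA * bg[nonmoving] + (1 - ALPHA) * cur_gray[nonmoving]
    diff = np.abs(cur_gray.astype(np.float32) - bg)
    th_map[nonmoving] = (ALPHA * th_map[nonmoving]
                         + (1 - ALPHA) * BETA * diff[nonmoving])
    return people, bg, th_map
```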
Figure 14: People Detection Flow. (R_FD: person blobs from the frame differencing operation; R_BS: person blobs from the background subtraction operation; R_C: person candidates; R_P: person blobs in the previous image.)
Figure 15: People detection results. (a) Input image; (b) gray image; (c) background model; (d) binary image of the input image; (e) binary image of the background model; (f) intensity background subtraction image; (g) edge background subtraction image; (h) result image; (i) background model update.
4.1.3 Tracking of Detected People
In the home, detected person blobs in two consecutive images should overlap if they result from observations of the same person. Using this property, tracking of person blobs and detecting occlusion can be easily achieved. For this purpose, we construct an association matrix between the human blob candidates {R_C} in image I_n and the person blobs {R_P} in image I_{n-1}. From this matrix, we can recognize the relationships shown in Figure 16.
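The overlap-based association can be sketched as follows; the event labels and the binary overlap test are our own simplification of the association matrix analysis, not the thesis code.

```python
import numpy as np

def overlap(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def associate(prev_blobs, cur_blobs):
    """Binary association matrix between blobs of consecutive frames."""
    m = np.zeros((len(prev_blobs), len(cur_blobs)), dtype=int)
    for i, p in enumerate(prev_blobs):
        for j, c in enumerate(cur_blobs):
            m[i, j] = 1 if overlap(p, c) > 0 else 0
    return m

def interpret(m):
    """Row/column sums of the matrix give the tracking events for each blob."""
    events = []
    for i, row in enumerate(m):
        s = row.sum()
        if s == 0:
            events.append(("exit_or_lost", i))      # previous blob disappeared
        elif s == 1:
            events.append(("tracked", i))           # one-to-one match
        else:
            events.append(("split", i))             # one previous blob -> many
    for j, col in enumerate(m.T):
        if col.sum() > 1:
            events.append(("merge_occlusion", j))   # several previous blobs -> one
        elif col.sum() == 0:
            events.append(("new_person", j))
    return events
```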
A disadvantage of our approach is the difficulty of detecting a person who has not moved since the beginning of sensing, because we use a background subtraction method for detection. In the current system, we assume that all people move from the beginning. However, as a future research direction, we could enhance our people detection module by combining it with direct detection of human appearance (e.g. [41][60]).
Figure 16: Association Matrix
Figure 17: Experimental results of people tracking. (a), (b) Tracking using the association matrix; (c) occlusion detection results.
4.2 Identification of Detected Person
In long range interaction, it is not easy to identify each detected person with exact biometric information such as the face or iris. However, even though not perfect, semi-biometric information, such as height and clothing color, is useful to discriminate people entering and leaving the scene. This discrimination is also necessary to share information between multiple camera nodes.
4.2.1 Measuring Height and Location
Because the camera is fixed and its position is known, we can calculate the height of a person from the angle between the center of the image and the upper end of the detected person blob, assuming that people stand on the same planar surface and that the whole body of the person is visible [28].
As shown in Figure 18, if we assume a pin-hole camera, we can calculate the angles (θ_3 - θ_2) and θ_1 from the image as follows.
$$\theta_1 = \tan^{-1}\!\left(\frac{2(x - x_{max})}{h_{image}}\cdot\tan\theta_h\right), \qquad \theta_3 - \theta_2 = \tan^{-1}\!\left(\frac{2(x - x_{min})}{h_{image}}\cdot\tan\theta_h\right)$$
h_image is the image height and θ_h is the view angle of the camera. If we know the camera tilt angle θ_2, we can obtain θ_3, and we can calculate the height of the person as follows.
$$d = \frac{h}{\tan\theta_3}, \qquad H = h - d\cdot\tan(\theta_2 - \theta_1)$$
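A small numerical sketch of this geometry is given below. It assumes a pin-hole camera and uses the standard pixel-to-angle relation with the full vertical view angle, which differs slightly in notation from the formulas above; all names and the example numbers are ours.

```python
import math

def person_height_and_distance(row_top, row_bottom, img_h, theta_h, cam_h, tilt):
    """
    row_top, row_bottom : image rows of the blob's top and bottom (pixels)
    img_h               : image height in pixels
    theta_h             : full vertical view angle of the camera (radians)
    cam_h               : camera height above the floor
    tilt                : camera tilt angle theta_2 (radians, downward positive)
    """
    def elevation(row):
        # Angle above the optical axis for a pixel row (pin-hole model)
        return math.atan(2.0 * (img_h / 2.0 - row) / img_h * math.tan(theta_h / 2.0))

    theta_1 = elevation(row_top)              # elevation of the head above the axis
    theta_3 = tilt - elevation(row_bottom)    # downward angle from horizontal to the feet

    d = cam_h / math.tan(theta_3)             # ground distance to the person
    H = cam_h - d * math.tan(tilt - theta_1)  # person height
    return H, d

# Example with made-up numbers: 240-row image, 45 deg view angle,
# camera 2.2 m high, tilted 15 deg down.
print(person_height_and_distance(85, 220, 240, math.radians(45),
                                 2.2, math.radians(15)))
```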
In our experiments, the average error distance between the measured location and the ground truth location was 295 mm, and the error range was 100 mm to 586 mm. The experiment was conducted with a setup of 9 lattice points with 1 meter spacing. The distance between the lattice points and the camera ranged from 6000 mm to 8000 mm.
Figure 19 shows the experimental results.
Figure 18: Measuring Height & Location. (Diagram with camera height h, tilt angle θ_2, view angle θ_h, person height H, ground distance d, image center C(x,y), and image coordinates x_max, x_min.)
Figure 19: Experimental results of measuring location (axes in mm, origin near the camera; dots: ground truth points, x: measured points).
4.2.2 Identification using clothing color histogram distance
To make use of the clothing color information, we calculate the probability of identity and the probability of difference to resolve identity, using the color histogram distance d between two object blobs B_1 and B_2. As shown in Figure 20, we compute the color histogram distance in HSV space as follows:
Figure 20: Masked Color Histogram in HSV space (H, S, and V histograms computed after ANDing the blob mask with the image).
Let H_{B_1} and H_{B_2} be the color histogram vectors of blobs B_1 and B_2. The color histogram distance is

$$d = \sum_i \left(H_{B_1}(i) - H_{B_2}(i)\right)^2$$
If the probability of identity p_i(d) is larger than the probability of difference p_d(d), then the two blobs are considered identical. To obtain the probability distributions of the color histogram distance, p_i(d) and p_d(d), we extract the color distance distribution from an image set containing 3,540 images of 15 people in the same or different clothes, assuming a Gaussian distribution.
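A minimal sketch of this identification test follows, assuming OpenCV HSV histograms; the Gaussian parameters standing in for the distributions learned from the 3,540-image set are placeholders, not the thesis's fitted values.

```python
import cv2
import numpy as np

def blob_histogram(image_bgr, mask, bins=(8, 8, 8)):
    """Normalized HSV histogram of a person blob, masked by its silhouette."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], mask, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def histogram_distance(h1, h2):
    """Sum of squared bin differences, as in the formula above."""
    return float(np.sum((h1 - h2) ** 2))

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Placeholder parameters for the identity / difference distributions (NOT the thesis values).
MU_SAME, SIGMA_SAME = 0.05, 0.03
MU_DIFF, SIGMA_DIFF = 0.40, 0.15

def same_person(h1, h2):
    d = histogram_distance(h1, h2)
    p_identity = gaussian_pdf(d, MU_SAME, SIGMA_SAME)
    p_different = gaussian_pdf(d, MU_DIFF, SIGMA_DIFF)
    return p_identity > p_different
```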
In resolving identity, a slight illumination change in the background or a variation in the person's pose can cause an abrupt change of probability, resulting in misidentification. To regulate abrupt changes, we apply a one-dimensional Kalman filter to each measurement of probability.
The Kalman filter model is

$$\text{Estimate:}\quad x_{k+1} = A x_k + B u_k + w_k, \qquad \text{Measurement:}\quad z_k = H x_k + v_k$$

x_k is the state at time k and w_k represents the white process noise with normal distribution N(0,Q). u_k is the control input. A is the n×n matrix which relates the state at time step k to the state at time step k+1, and the n×l matrix B relates the control input to the state x. z_k is the measurement at time k and v_k represents the white measurement noise with normal distribution N(0,R). The m×n matrix H relates the state to the measurement z_k.
In our model, the state does not change from step to step, so A = 1. There is no control input, so u = 0. Also, our measurement is of the state directly, so H = 1. Therefore, the time update equations are

$$\hat{x}^-_{k+1} = \hat{x}_k, \qquad P^-_{k+1} = P_k + Q$$
The measurement update equations are

$$K_k = \frac{P^-_k}{P^-_k + R}, \qquad \hat{x}_k = \hat{x}^-_k + K_k\,(z_k - \hat{x}^-_k), \qquad P_k = (1 - K_k)\,P^-_k$$
where K is the Kalman gain and P_k is the estimate error covariance; ^ denotes an estimate and the superscript minus denotes an a priori value.
We set a small process variance of 1e-5 as Q and set the initial value of P_k, i.e. P^-_k = 1. Even when the estimated probability of difference is greater than the threshold, the decision to issue a new ID is delayed until this state has been maintained for 3 consecutive frames.
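A minimal sketch of this scalar Kalman filter is shown below; Q = 1e-5 and the initial P follow the text, while the measurement variance R is an assumed placeholder.

```python
class Kalman1D:
    """Scalar Kalman filter with A = 1, H = 1, u = 0, as described above."""
    def __init__(self, q=1e-5, r=0.01, p0=1.0, x0=0.0):
        self.q, self.r = q, r      # process / measurement noise variances
        self.p, self.x = p0, x0    # error covariance and state estimate

    def update(self, z):
        # Time update: the state is assumed constant, covariance grows by Q
        x_prior = self.x
        p_prior = self.p + self.q
        # Measurement update
        k = p_prior / (p_prior + self.r)        # Kalman gain
        self.x = x_prior + k * (z - x_prior)
        self.p = (1.0 - k) * p_prior
        return self.x

# Smoothing a noisy sequence of "probability of difference" measurements:
kf = Kalman1D()
for z in [0.2, 0.8, 0.25, 0.3, 0.22]:
    print(round(kf.update(z), 3))
```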
Figure 21: Experimental results of identification using the clothing color histogram distance: the same ID number is assigned across different sequences separated by a thirty-minute time lag.
4.3 Head Detection
In understanding the gesture or behavior of people, the position of the head plays an important role. For instance, if the head position is lower than the top of the person blob in a standing posture, we can infer that the person is possibly raising a hand. Furthermore, the head position is useful as an anchor for the detection of body parts such as the limbs and torso [35].
4.3.1 Related Work
For frontal face detection, numerous methods using statistical classifiers, such as neural networks [48] and support vector machines [46], or learning algorithms, such as AdaBoost [58], have been suggested. Schneiderman et al. [52] proposed a wavelet-based method that relaxes the restriction to frontal faces, but it cannot handle rear head detection.
To detect a head from every viewpoint, a model fitting approach using an ellipsoid has been proposed [19]; however, many false alarms can be a disadvantage of this method. Zhao and Nevatia [64] try to detect the Ω shape formed by the head and shoulders. This approach shows good results for upright postures.
4.3.2 Combined Head Detection Method
Generally, in the home, people are not attentive to the camera, so every view of the head, including profile and rear views, should be detected. In this context, we suggest a combined method of a 2D ellipse (head) fitting module and an Ω shape (head and shoulder) detector. The ellipse fitting method can detect heads in various postures, but it is prone to false alarms in cluttered backgrounds or in other body regions. On the other hand, the Ω shape detector is robust to false alarms, but works well only for upright postures.
In the ellipse fitting module, we first convert the input image into a binary edge image using the Roberts Cross edge detector. To remove small contours caused by noisy edges, we apply a thinning algorithm [62]. Then, we analyze the edge image to find closed contours. Finally, we fit 2D ellipses to the contours found, which become head candidates, as shown in Figure 22.
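The ellipse-fitting stage might look roughly like the following OpenCV sketch, where Canny edges and a morphological cleanup stand in for the Roberts Cross detector and the thinning algorithm [62], and the size filters are illustrative only.

```python
import cv2
import numpy as np

def head_candidates(person_roi_gray):
    """Ellipse candidates from closed contours in the edge image of a person blob."""
    # Binary edge image (Canny stands in for the Roberts Cross detector of the text)
    edges = cv2.Canny(person_roi_gray, 50, 150)
    # Cleanup step approximated here by a morphological closing
    edges = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))

    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    candidates = []
    for c in contours:
        if len(c) < 5:                      # fitEllipse needs at least 5 points
            continue
        (cx, cy), (w, h), angle = cv2.fitEllipse(c)
        # Keep roughly head-sized, roughly round ellipses (illustrative bounds)
        if 8 < w < 80 and 8 < h < 80 and 0.5 < w / h < 2.0:
            candidates.append(((cx, cy), (w, h), angle))
    return candidates
```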
Figure 22: Head detection flow using contour analysis
When we calculate the probability of each candidate, we assume that the position and the size of the head follow Gaussian distributions. Using ground truth data, we define the relationship between the size of the human blob and the size and position of the head as follows. Assuming a Gaussian distribution, the probability of a head is

$$P(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sqrt{2}\,\sigma}\right)\right]$$
To obtain the μ and σ of this distribution, we assume that the mean is related to the ratio between the blob's width and height. To justify this assumption, we apply a regression analysis to experimental data. We sampled 200 images and obtained the following relationships.
$$\mu_{Ratio\text{-}H} = \alpha_h + \beta_h\cdot(Width/Height), \qquad \sigma_h = 0.5\,\mu_h$$
$$\mu_{Ratio\text{-}W} = \alpha_w + \beta_w\cdot(Width/Height), \qquad \sigma_w = 0.5\,\mu_w$$

H and W are the height and width of the human blob.
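For reference, the erf-based probability and the regression relationship above can be evaluated as in the sketch below; the regression coefficients are placeholders, not the values fitted from the 200-image sample.

```python
import math

def head_probability(x, mu, sigma):
    """P(x) = 0.5 * [1 + erf((x - mu) / (sqrt(2) * sigma))], as in the formula above."""
    return 0.5 * (1.0 + math.erf((x - mu) / (math.sqrt(2.0) * sigma)))

def expected_head_ratio(blob_w, blob_h, alpha, beta):
    """mu = alpha + beta * (width / height); sigma = 0.5 * mu, as in the regression."""
    mu = alpha + beta * (blob_w / blob_h)
    return mu, 0.5 * mu

# Illustrative coefficients only; the real alpha/beta come from the regression data.
mu_h, sigma_h = expected_head_ratio(60.0, 150.0, 0.10, 0.15)
print(head_probability(0.18, mu_h, sigma_h))
```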
For the Ω shape detector, we use a Haar classifier [58], trained with about 3,000 images of Ω-shaped head-and-shoulder regions, to obtain head candidates. From the candidate pool, the one with the maximum likelihood in position and size is chosen as the head, assuming that the position and size of the head follow a Gaussian distribution. After initial detection, the head position is tracked using a color-based mean-shift method. The tracked position is verified against the detection result and the model is updated.
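A hedged sketch of the detect-then-track scheme is given below, using OpenCV's Haar cascade machinery and CamShift as a stand-in for the color-based mean-shift tracker; the trained Ω-shape cascade file name is hypothetical.

```python
import cv2

# Hypothetical custom-trained omega-shape (head & shoulder) cascade file;
# OpenCV ships no such model, so this XML would have to be supplied.
cascade = cv2.CascadeClassifier("omega_head_shoulder.xml")

def detect_then_track(frames):
    """Detect a head once with the cascade, then track it with CamShift."""
    track_win, roi_hist = None, None
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        if track_win is None:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            heads = cascade.detectMultiScale(gray, 1.1, 3)
            if len(heads) == 0:
                continue
            x, y, w, h = heads[0]
            track_win = (x, y, w, h)
            roi = hsv[y:y + h, x:x + w]
            roi_hist = cv2.calcHist([roi], [0], None, [16], [0, 180])
            cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
        else:
            back = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
            _, track_win = cv2.CamShift(back, track_win, term)
        yield track_win
```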
4.3.3 Experimental Results
We tested our head detection module on various sequences, such as one person walking, sitting down, standing up, and jumping, and two people walking. The image resolution is 320×240. Regardless of viewpoint, our method shows a good overall detection rate of 91.33%. In a complex sequence, such as lying down, the detection rate for heads seen from the side is lower because the Ω shape detector does not work well. However, the ellipse fitting module compensates for it, and the overall result shows that our method is applicable to real applications.
Figure 23: Head detection flow using Haar classifier and Mean shift tracker
One remarkable fact is that the detection rate for heads in rear view is very high (98.68%), because both detection modules show good performance for that view. Figure 24 and Table 3 show the experimental results.
Figure 24: Head detection results

Sequence                  View    # of images (people)   Detected   Rate      Overall
Walk                      Front   117                    116        99.15%    98.83%
                          Side    100                    99         99.00%
                          Back    41                     40         97.56%
Walk, Sit, Stand          Front   154                    141        91.56%    91.63%
                          Side    63                     57         90.48%
                          Back    22                     21         95.45%
Walk, Jump, Sit, Stand    Front   212                    192        90.56%    85.6%
                          Side    119                    90         82.57%
                          Back    18                     18         100.00%
2 people Walk             Front   221                    190        85.97%    91.10%
                          Side    282                    262        92.9%
                          Back    70                     70         100.00%
Total                     Front   704                    639        90.76%    91.33%
                          Side    564                    508        90.07%
                          Back    151                    149        98.68%

Table 3: Experimental results of head detection method
4.4 Behavior Understanding
Understanding the behaviors of people makes it possible for the robot to react autonomously and non-intrusively. For example, if a person falls down and needs help, even outside the FOV of the robot, the robot should recognize it and take proper action. A fixed camera node in our system detects the feature vector sequence of a predefined behavior of a person and reports it to the main node so that the robot can take a proper action.
4.4.1 Related Work
To understand the behavior of a human, the human pose in the image should be estimated first. There has been substantial work on estimating 2D human pose [48][49][61]. Estimating 3D pose is more challenging, as some degrees of motion freedom are not observed and it is difficult to find a direct mapping from observations to state parameters. Several learning-based techniques have been proposed [1][53], but these rely on accurate body silhouette extraction and on having a large number of training images. Model-based approaches [11][54][22][54] are popular because it is easy to evaluate a state candidate by synthesizing the human appearance. Recently, local part detection has been used as a data-driven mechanism for pose estimation [42][49].
After the pose estimation process, a sequence of time-varying feature data resulting from pose estimation should be matched against reference sequences representing typical behaviors. For this purpose, several methods, such as dynamic time warping (DTW) [57][3], HMMs [56][4], syntactic techniques [25], and self-organizing neural networks [26][22], have been proposed to analyze and interpret time-varying data. However, real-time performance is still challenging.
4.4.2 2D Model Based Behavior Understanding
In the robot vision system, the goal of behavior understanding is not to fit an exact human model to the detected human blob, but to understand some predefined behaviors. In this context, we do not need to detect every body part, such as joints, limbs, and torso. In understanding a behavior, the pattern of some components carries even more meaning than the exact detection of each feature. Accordingly, even if we lose some features in some images of a sequence, an optimal matching process between an input feature sequence and a feature sequence from training data can still provide the exact type of behavior. The need for real-time operation also constrains not only the dimensionality of the features but also the computational resources that can be used to detect exact features in the image.
From the detected person blob image, we detect features which are defined to represent the pose of the detected person, and track them with a probabilistic human body model. The result of this process is a sequence of feature vectors, which is then matched against the training data set to recognize the behavior comprising this feature vector sequence.
Generally, several behaviors can occur continuously without a pause, so we should spot each behavior in the image stream before matching. When a particular behavior starts or ends, a distinctive feature, such as the height or the position of the head, shows an inflective change. These inflection points can be used to discriminate an input candidate of a particular behavior from the image stream.
For the predefined behaviors to be recognized, each synthesized human action datum is defined as A = (a, k, θ, φ), where 'a' is an action category, 'k' is a sequence number within action 'a', and 'θ' and 'φ' represent the viewing angle of the camera. Each action datum also has a feature vector representing its body shape.
If a sequence of feature vectors F_1, F_2, …, F_N is given from an input image sequence, the behavior matching process then calculates P(A_k | F_i) for all A in the training database, and finds the optimal sequence of k which maximizes the sum of conditional probabilities.
One of the difficult problems in this approach is how to match this feature vector sequence with a group of labeled reference sequences representing typical behaviors, even though the time scales of the two sequences are different.
When we consider the need for real-time operation, dynamic time warping [57][3] is one possible method. Dynamic time warping is a method for measuring similarity between two sequences which may vary in time or speed; the optimization is performed using dynamic programming. Bobick et al. [3] compute a prototype gesture from a given set of example gestures, which preserves the temporal ordering of the samples but lies in a measurement space without time. A gesture is then defined as an ordered sequence of states along the prototype, and dynamic programming is used to compute a match score for new examples of the gesture. This method can establish a match as long as the time ordering constraints hold, even when the time scale of the input sequence and that of the predefined sequence differ.
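For illustration, a minimal dynamic-time-warping sketch is shown below; it is only meant to make the matching idea concrete and is not the method finally adopted in this thesis (see the automaton model in the next section). The feature sequences are made-up height profiles.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping distance between two feature-vector sequences."""
    a, b = np.asarray(seq_a, float), np.asarray(seq_b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two reference height-profile sequences of different lengths (made-up numbers):
walk     = [[1.0], [0.98], [0.97], [0.98], [1.0]]
sit_down = [[1.0], [0.8], [0.6], [0.5], [0.5], [0.5]]
observed = [[1.0], [0.99], [0.97], [0.99]]
print(min([("walk", dtw_distance(observed, walk)),
           ("sit",  dtw_distance(observed, sit_down))], key=lambda t: t[1]))
```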
In our system, each fixed camera node must provide several functionalities in parallel, such as people detection, tracking, and identification. This requires a very simple and fast method to recognize behaviors. We also adopt a division-of-labor strategy, in which fixed camera nodes recognize only very simple actions, such as raising a hand, walking, or falling down, and the mobile robot node recognizes gestures in detail in front of the user.
For these reasons, we propose a simple and fast method using a nondeterministic finite automata model.
4.4.3 Real-time Action Recognition Using Nondeterministic Finite Automata
model
An automaton is a mathematical model of a finite state machine (FSM). In this model, a transition function tells the automaton which state to go to next, given a current state and a current symbol or observation. In a deterministic finite automata model, each state of the automaton has a transition for every input symbol. On the other hand, a nondeterministic finite automata model may not have a transition for every observation, or may even have multiple transitions for an observation.
In our system, a sequence of observations tells which state to go to next given a current state. Some observations may imply multiple state changes from the current state. For example, an increase in height may imply a standing action or a raising-hand action.
The actions to be recognized in our system are
Q = {Walk, Sit, Raised Hand, Lie, Stand, Fall}, including the starting state q_0.
A finite set of observations which enable a transition of state is
Σ = {Variation of Height, Location, Head Position, Hand Position, Human Blob Ratio}.
δ is the transition function, that is, δ: Q × Σ → Q.
In our method, δ is defined as a stack which is increased or decreased by tokens from the observations. For each input image, each detected observation issues a positive or negative token to the transition function δ, as shown in Figure 25. According to the information in the stack, the transition function changes the state of the system or keeps the current state.
Figure 25: Feature detection for state transition
To prevent errors resulting from false interpretations, we define constraints on state changes, such as forbidding a change from 'lie' to 'walk' without standing up.
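A minimal sketch of such a token-driven nondeterministic transition function is given below; the observation tokens, the transition table, and the three-token commitment rule are illustrative assumptions, not the exact rules of the thesis.

```python
# Illustrative transition table: (current state, observation token) -> possible next states.
TRANSITIONS = {
    ("Stand", "height_drop_small"): {"Sit"},
    ("Stand", "height_drop_large"): {"Lie", "Fall"},    # nondeterministic choice
    ("Stand", "hand_above_head"):   {"Raised Hand"},
    ("Sit",   "height_drop_small"): {"Lie"},
    ("Sit",   "height_rise"):       {"Stand"},
    ("Lie",   "height_rise"):       {"Stand"},          # no direct Lie -> Walk shortcut
    ("Stand", "location_change"):   {"Walk"},
    ("Walk",  "location_stable"):   {"Stand"},
}

def step(states, token, stack, min_tokens=3):
    """Accumulate tokens on a stack; commit a transition only after enough consistent evidence."""
    stack.append(token)
    if len(stack) < min_tokens or len(set(stack[-min_tokens:])) != 1:
        return states, stack                 # not enough consistent evidence yet
    nxt = set()
    for s in states:
        nxt |= TRANSITIONS.get((s, token), {s})   # keep current state if no rule applies
    return nxt, []                           # transition taken, stack cleared

states, stack = {"Stand"}, []
for tok in ["height_drop_large"] * 3:
    states, stack = step(states, tok, stack)
print(states)   # both 'Lie' and 'Fall' hypotheses are kept until more evidence arrives
```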
Figure 27 shows the experimental results for recognizing six kinds of actions
and transitions between these actions, according to the transition model in Figure 26.
Figure 26: Action Transition using Nondeterministic Finite Automata Model
Figure 27: Experimental results of action recognition. (a) Stand; (b) stand to sit; (c) sit; (d) sit to lie; (e) lie; (f) sit to stand.
We have tested our method on sequences of 544 images showing 5 kinds of actions and 4 kinds of transitions. The action 'stand' is not discriminated from the 'no action' state. In our experiments, the correct matching rate was 85.4% when we include the 'no action' state; without the 'no action' state, the matching rate was 82.5%.
As shown in Table 4, the correct matching rates for 'falling' and 'fall' are quite low because this action state and transition are quite similar to the action 'lie' and the transition 'lying'. Also, the lying transition is hard to discriminate from the action state 'sit', resulting in a low matching rate.
In this experiment, a fixed camera node runs at approximately 7-9 frames per second with all the modules, such as people detection, identification, head detection, and action recognition, running together. It shows a promising result with only 6 kinds of observations.
Action & Transition    Correct Matching Rate (%)
Walk                   86.8
Sit                    82.0
Sitting                85.2
Lie                    94.9
Lying                  54.2
Fall                   75.0
Falling                50.0
Raising Hand           72.9
Standing               86.7
No action state        95.3

Table 4: Experimental results of action matching
Chapter 5 Mobile Robot Node Processing
The tasks associated with the mobile robot node are, first, localization, so that the robot can move properly and be prepared to respond to commands from its master, and second, gesture recognition, so that the robot can recognize potential users and respond to gestures or commands.
We handle the localization problem by vertical edge matching in consecutive panoramic images captured by the omni-directional camera. We will keep searching for and testing efficient and robust localization methods to be integrated into our framework. We handle the gesture recognition problem by detecting and tracking the head position and localizing potential limb locations.
5.1 Communication Framework between the Robot and the Main
Node
In our distributed sensor node framework, the mobile robot node communicates with other sensor nodes through the main node. To maximize the performance and robustness of the entire system, an efficient communication framework between the robot node and the main node should be defined. We design this framework considering the following three aspects.
A. Division of work
- The camera node on the robot has better resolution than the other fixed camera nodes. Also, the mobility of the robot enables short range interaction with a human, resulting in finer-level gesture recognition, such as limb and facial gestures. Therefore, a division of work between the mobile robot node and the other fixed camera nodes can provide more user-friendly and delicate interaction through a coarse-to-fine approach.
B. Minimization of computation
- According to the division-of-work scheme, the information acquired from each sensor node should be shared to eliminate unnecessary processing, while keeping a certain level of redundancy for the robustness of the system.
C. Efficient robot control
- Fixed camera nodes can detect the robot and provide its location information to the robot. This information can be used to improve the accuracy of the robot's localization.
According to this design scheme, we propose the communication framework shown in Figure 28.
5.2 Self-localization with Omni-directional Camera
An assistive robot should always be able to answer the question "Where am I?" in order to move properly and be prepared to respond to commands from its master. That is, fast and accurate localization is one of the essential capabilities the robot should have.
An omni-directional camera has a strong advantage for localization due to its extremely wide field of view. We propose a simple and fast localization method using an omni-directional camera.
Figure 28: Communication framework between the main node and the robot node (fixed camera nodes: people detection, behavior recognition, and robot detection; robot node: localization via the omni camera node and gesture recognition via the stereo camera node; exchanged messages: user request, target position, robot position, and operational command).
5.2.1 Related work
Many solutions have been developed to solve the localization problem. The simplest method uses information from the encoders of the robot's wheels. However, this method suffers from the accumulation of errors; in particular, when the floor is slippery or there are obstacles such as door sills, the problem can be severe. To cope with this problem, vision sensors can be used to capture images of the environment and to recognize characteristic patterns or objects from which the location can be calculated. This approach still has problems such as ambiguity resulting from the low resolution of the sensor, slow processing time, and a lack of characteristic objects in the image. There are combined methods which try to take advantage of each approach [9][37]. However, combining two different kinds of sensing information requires additional computing power and cost.
An omni-directional camera has the strong advantage of capturing the surrounding scene in one snapshot, and therefore easily provides enough information to localize the robot. For example, visual landmarks are easily found since they remain in the field of view longer than with a standard camera. Some drawbacks of omni-directional cameras are the low resolution resulting from the extremely wide field of view and the additional cost of the delicate mirror system. These days, the cost of high resolution cameras is declining, which makes the prospects of omni-directional cameras bright.
We propose a simple and fast localization method using an omni-directional camera. Considering a usual home environment, where visual landmarks may be changed or occluded easily, we use only 3 vertical edges in each image captured from the omni-directional camera to get the relative position of the robot [35]. Our approach shows good performance even in the poor computational environment of a robot with an embedded CPU, because the landmark finding and matching process is minimized.
5.2.2 The sensor
An omni-directional camera, which was described in a patent in 1970 [24], uses a mirror in combination with a conventional camera, and it has become one of the popular vision sensors due to its extremely wide field of view, not only in robot navigation but also in general vision applications such as video surveillance, human tracking, and stereo systems. In recent years, a very popular type of omni-directional camera is a single camera with a single mirror, which may be conical, elliptical, parabolic, hyperbolic, or spherical. For our robot system, we use a camera (Point Grey Scorpion camera with 1600×1200 resolution) with 2 mirrors (one convex and the other concave parabolic), which is known as a Folded Catadioptric Camera [16]. While the single camera with a single mirror gives a simple structure, it requires a physically large size because the camera lens and the mirror should be adequately separated from each other for a wide field of view. As shown in Figure 29, if we use 2 mirrors, we can fold the light path and make the camera compact enough to be installed easily on our experimental robot platform. From the possible combinations of 2 mirrors, we choose a concave-convex combination to compensate for the side effect of field curvature, in which the image is best focused not on a plane but rather on a curved surface behind the lens [44].
In our approach, we use vertical edges in the omni-directional image as landmarks for localization; therefore, a panoramic image is more convenient to process. We convert the raw input image (1600×1200) to a panoramic image (1152×240) by assigning interpolated color values of the nearest 4 pixels in the raw image to each panoramic image pixel. Usually the center of the raw spherical image differs from the center of the image, and this error may cause severe distortion of vertical edges unless it is carefully compensated. To get the exact error value, we use the cubic calibration box shown in Figure 29. The intersection of the lines in the spherical image marks the center of the hole resulting from the physical hole in the convex parabolic mirror.
Figure 29 shows the omni-directional camera (NetVision 360 Type B) we use, a sample raw image, and a panoramic image.
Figure 29: Omni-directional camera structure (top), raw input image (middle left), cube box image (middle right) and panoramic image (bottom).
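The unwarping step can be sketched as below; the annulus parameters (center and inner/outer radii) are assumed to come from the cube-box calibration, and the values in the example are placeholders.

```python
import cv2
import numpy as np

def unwarp_to_panorama(raw, center, r_in, r_out, out_w=1152, out_h=240):
    """Map a circular omni-directional image to a panoramic strip.

    center, r_in, r_out describe the useful annulus of the mirror image in the
    raw frame; these must come from calibration (e.g. the cube-box procedure).
    """
    cx, cy = center
    xs = np.arange(out_w)
    ys = np.arange(out_h)
    theta = 2.0 * np.pi * xs / out_w                      # column -> azimuth angle
    radius = r_in + (r_out - r_in) * ys / float(out_h)    # row -> radius in the annulus
    map_x = (cx + np.outer(radius, np.cos(theta))).astype(np.float32)
    map_y = (cy + np.outer(radius, np.sin(theta))).astype(np.float32)
    # Bilinear interpolation over the 4 nearest raw pixels
    return cv2.remap(raw, map_x, map_y, interpolation=cv2.INTER_LINEAR)

raw = np.zeros((1200, 1600, 3), np.uint8)   # stand-in for a 1600x1200 capture
pano = unwarp_to_panorama(raw, (800, 600), 150, 540)
```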
5.2.3 Approach
Due to the 360 degree FOV of the omni-directional image, we can get the absolute angle between the lines connecting the camera center and 2 landmarks at the same height. These angles are directly related to the position of the robot, and if we know the model of the space in which the robot operates, we can get the exact displacement of the robot from its original position using simple trigonometric equations of the angular variation between the original image and the input image. The minimum number of angles needed is just 2, which means that just 3 vertical edges are enough as landmarks.
Vertical edges to be matched in each panoramic image are defined as sets of pixels in the same x position whose average RGB variation between neighboring pixels is greater than a threshold th_c and whose length is greater than a threshold th_l. Because only 3 vertical edges are enough, we set th_c and th_l to 60 and 120 respectively, which are quite tight.
$$\text{Edge Strength} = \frac{1}{3n}\sum_{n} I, \qquad \Delta r + \Delta g + \Delta b > th_c,\ \ n > th_l$$

where n is the number of pixels for which $\Delta r + \Delta g + \Delta b > th_c$.
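A sketch of the vertical-edge extraction is given below; it interprets the summed quantity as the per-pixel color variation, which is our assumption, and uses the 60/120 thresholds quoted in the text.

```python
import numpy as np

TH_C, TH_L = 60, 120   # colour-variation and length thresholds quoted above

def vertical_edges(pano_rgb):
    """Columns of the panoramic image that qualify as vertical-edge landmarks."""
    img = pano_rgb.astype(np.int32)
    # Per-pixel RGB variation between horizontally neighbouring pixels: |dR|+|dG|+|dB|
    diff = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum(axis=2)
    strong = diff > TH_C                       # qualifying pixels, H x (W-1)
    n = strong.sum(axis=0)                     # number of qualifying pixels per column
    # Edge strength: mean contribution of the qualifying pixels, averaged over 3 channels
    strength = np.where(n > 0,
                        (diff * strong).sum(axis=0) / (3.0 * np.maximum(n, 1)),
                        0.0)
    # A column becomes a landmark candidate only if its run of strong pixels is long enough
    return [(x, strength[x]) for x in range(diff.shape[1]) if n[x] > TH_L]
```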
To match the vertical edges in each image to those in the original image, we first compare the intensity of each vertical edge and then compare the average RGB values of the neighboring regions of five pixels in width. In each similarity comparison, we use RGB values normalized by the average over the entire image to make the comparison robust to illumination changes.
Figure 30: Getting position from angles (upper), Samples of edges (lower)
$$\Delta I_{ave} < th_I, \qquad \Delta r_{L,ave} < th_{rgb}, \qquad r_{ave} = \frac{\sum_{i,j} R_{i,j}}{\sum_{i,j}\left(R_{i,j} + G_{i,j} + B_{i,j}\right)},\ (i,j)\in \text{region}$$

which is the same for each of R, G, B and for the left and right regions.
For the final candidates, an order constraint on position is applied to verify correctness. In the panoramic image, as shown in Figure 30, the pixel distance between two vertical edges A and B is directly proportional to the angle between XB and XA.
From the angles α and β, we can derive the following simultaneous equations:

$$\Psi_\alpha \cdot \overrightarrow{XA} = k\cdot\overrightarrow{XB}, \qquad \Psi_\beta \cdot \overrightarrow{XB} = k\cdot\overrightarrow{XC}, \qquad \Psi_\theta = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}$$

In these equations, we know the positions of the edges from the model of the space, so we can finally calculate the exact position of the robot. The equation has 2 solutions, and these solutions correspond to the intersections of the circles in Figure 30. One of the solutions coincides with an edge position (A, B, or C), and the other solution indicates the final position of the robot.
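To make the localization step concrete, the sketch below recovers the robot position from the two view angles by a brute-force numerical search instead of the closed-form circle-intersection solution used in the text; the landmark coordinates and search range are made-up.

```python
import numpy as np

def angle_at(x, p, q):
    """Angle subtended at point x by landmarks p and q."""
    v1, v2 = np.asarray(p) - x, np.asarray(q) - x
    c = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(c, -1.0, 1.0))

def locate(A, B, C, alpha, beta, xlim=(0, 500), ylim=(0, 500), step=2.0):
    """Search for the position whose view angles toward A-B and B-C match alpha, beta."""
    best, best_err = None, np.inf
    for x in np.arange(*xlim, step):
        for y in np.arange(*ylim, step):
            p = np.array([x, y], float)
            err = abs(angle_at(p, A, B) - alpha) + abs(angle_at(p, B, C) - beta)
            if err < best_err:
                best, best_err = p, err
    return best

# Synthetic check: place the robot, compute the angles it would see, recover it.
A, B, C = (100.0, 400.0), (250.0, 420.0), (400.0, 380.0)   # known vertical-edge positions
true_pos = np.array([230.0, 150.0])
alpha = angle_at(true_pos, A, B)
beta = angle_at(true_pos, B, C)
print(locate(A, B, C, alpha, beta))   # should be close to (230, 150)
```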
5.2.4 Experimental Results
In our experiment, we mounted the omni-directional camera on the robot platform (Weber-N). For accurate measurement, we captured images at predefined positions and compared the results from our method with ground truth data. Table 5 and Figure 31 show the results of the experiment. The error range was 0.11~3.64 inches over a moving range of 54 inches, and the average error distance was 1.2 inches for 13 positions, which is accurate enough to be applicable to a real robot navigation system. Figure 32 shows a trajectory of the robot on the map of the real environment.
Figure 31: Experimental results of the robot navigation using the omni-directional camera in a real application
Table 5: Experimental results of robot navigation
Figure 32: Trajectory of the robot navigation in the experiment to measure the error (black points: ground truth)
5.3 Gesture Recognition at Short Range
When a user calls the robot and the robot approaches to reduce the distance between the caller and the robot, the stereo camera node on the robot is triggered and starts short range interaction with the caller.
A fixed camera node provides the location of the head, which is used as an anchor for the upper body, to the stereo camera node. It also provides the direction in which the user is facing. Accordingly, the robot approaches the user from the front, and the stereo camera node can get a frontal image of the user. This cooperation framework can decrease the processing time for detecting the head location and aligning a 2D model to the body shape.
Referring to the information from a fixed camera node, the stereo camera node detects and tracks hands and arms, and estimates pose from this information. Head and limb information can then be interpreted by gesture recognition sub-systems.
We use the above method on each camera stream independently and in parallel. The results can then be fused at a higher level. This design offers robustness to single-camera failures and loss of calibration, in that information can still be provided.
We have obtained satisfactory preliminary results in detecting and tracking limbs. Figure 33 shows the experimental results. However, this gesture recognition method is not the focus of this thesis; the details of this method are in [28].
Figure 33: Experimental results of the upper body detection method
Chapter 6 Current Results of Integrated System
We have tested the proposed framework with a real robot and camera systems in a challenging environment. In this chapter, we describe our experimental results and explain a method to improve the system's fault tolerance.
6.1 Integration of Multiple Camera Nodes
We integrated three fixed camera nodes (two nodes have overlapping fields of view in one room, and one camera node is in another room) and a real robot for our experiments. In this setup, we used 2 desktops (one with a Pentium IV 3.2GHz CPU, one with a 3GHz CPU) and 1 laptop with a Pentium IV 3.0GHz CPU. The main node and 1 fixed camera node were installed on the desktop with the 3.2GHz CPU, and the other 2 fixed camera nodes were installed on the other desktop and the laptop.
Each camera node detects and tracks multiple people in real time and transmits the location and description of each detected person to the main node. It identifies detected people using the global information in the main node and detects the head position in each human blob. It also recognizes one gesture to call the robot and transmits this request to the main node.
The main node displays all the detected people in the world coordinate system. In describing detected people on the map, the main node uses the position, height, and facing-angle information of the detected people. The main node has the information of each registered camera and uses it to convert the location information from each camera node to the world coordinate system. Whenever a node reports the status of the region it manages, the main node updates the global state and broadcasts the data which should be maintained as local data in each node. If this information is duplicated, the main node resolves the duplication using the color histogram matching method described in section 3.3.2. It also integrates and analyzes the information from each sensor node to make the robot move. If one of the sensor nodes transmits a signal that one of the predefined behaviors has been detected, the main node requests a proper operation of the robot through the CRIF API module. During the execution of this operation, the main node buffers any further requests for a reaction of the robot until the operation of the robot finishes.
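The coordinate conversion mentioned above can be sketched as a 2D rigid transform from each camera node's ground coordinates into the world frame; the registered camera poses below are illustrative assumptions, not the actual calibration.

```python
import math

# Registered pose of each fixed camera node in the world frame (illustrative values):
# ground-plane position (tx, ty) in mm and yaw angle in radians.
CAMERA_POSES = {
    "cam1": (1200.0, 3400.0, math.radians(-90.0)),
    "cam2": (5200.0, 3400.0, math.radians(180.0)),
}

def to_world(node_id, local_xy):
    """Convert a camera node's local ground position to world coordinates."""
    tx, ty, yaw = CAMERA_POSES[node_id]
    x, y = local_xy
    wx = tx + x * math.cos(yaw) - y * math.sin(yaw)
    wy = ty + x * math.sin(yaw) + y * math.cos(yaw)
    return wx, wy

# A person reported 2.5 m ahead and 0.4 m to the side of cam1:
print(to_world("cam1", (400.0, 2500.0)))
```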
Each fixed camera node functions at a processing speed of 7-9 frames per second. Figure 34 illustrates the process, where a raised arm causes the robot to come from a different room and approach the caller.
Figure 34: Experimental results of the integrated system (3 cameras installed in 2 different rooms). (a) Camera node 1; (b) camera node 2; (c) main node displaying the detected people on the 3D map in real time; (d) gesture to call the robot; (e) the main node makes the robot approach the caller from the front.
Each fixed camera node maintains the ground-plane trajectory of detected people. Using this trajectory, it can infer the facing direction of a person, assuming all detected persons move only forward. The main node uses this information to make the robot approach the caller from the front. In a commercial system, if a robot approaches the user from the rear or side, the user may feel threatened or surprised.
6.2 Validation Test
In our experiments, we have validated the efficiency and robustness of our framework using a real home service robot platform in a realistic environment: deploying different types of sensor nodes in several different places, changing the illumination, and processing without interruption for a long period of time. Table 6 shows the results of the experiment.
Aspect                          Test condition                                        Results
System running time             8 hours of continuous operation                       System works without any interruption
Illumination adaptation         Light in the room turned off for several seconds      After 6 frames, the system adapts
                                and turned on again
                                Subtle illumination change due to sunlight            No effect
Moving object discrimination    Non-human objects such as a chair or table            After 30 frames, the non-human object
                                suddenly move in the field of view                    regions were absorbed into the background

Table 6: Validation test results
In our experiments, if a detected person is occluded by non-human objects, such as a chair or table, the fixed camera node may fail to detect this person continuously. In that case, the height and location of the detected person sometimes change abruptly, and the association matrix cannot establish a proper association between consecutive images. Using this property, the system can recognize occlusion by non-human objects.
6.3 Fault Tolerance
In our system, if a camera node fails to work properly, the main node detects it automatically. If the main node does not receive a report from a camera node for more than 10 seconds, it decides that this camera node is not working properly. The main node unregisters this node from the camera registration list, and if this node was working in synchronization mode, it broadcasts an end-of-synchronization signal so that the other camera nodes work without synchronization.
In case of a failure of the main node, the system can make another camera node take over the role of the main node, because the main node and all the camera nodes are software modules. If a camera node does not receive a confirmation message from the main node for a specified time, the node decides that the main node is not working properly. In this case, this camera node can issue a command to run a new main node module on its own hardware and change the network information, such as the IP address of the main node, so that a proper communication channel can be newly established.
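A minimal sketch of this timeout-based liveness logic follows; the 10-second camera timeout is from the text, while the main-node confirmation timeout and all names are assumptions.

```python
import time

TIMEOUT_CAMERA = 10.0   # seconds of silence before a camera node is unregistered (from the text)
TIMEOUT_MAIN = 15.0     # assumed confirmation timeout before a node takes over as main

class NodeMonitor:
    """Timeout-based liveness bookkeeping, following the report/broadcast scheme above."""
    def __init__(self):
        self.last_report = {}            # node id -> time of last report

    def report(self, node_id):
        self.last_report[node_id] = time.time()

    def dead_nodes(self, timeout=TIMEOUT_CAMERA):
        now = time.time()
        return [n for n, t in self.last_report.items() if now - t > timeout]

    def unregister(self, node_id):
        self.last_report.pop(node_id, None)
        # In the real system, an end-of-synchronization signal would be broadcast here.

# Camera-node side: decide to take over the main-node role after a long silence.
def should_become_main(last_confirmation_time, timeout=TIMEOUT_MAIN):
    return time.time() - last_confirmation_time > timeout
```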
These mechanisms are possible due to our message handling system based on a report-and-broadcast scheme. When our framework is applied to a commercial system, these mechanisms can greatly improve its fault tolerance.
Chapter 7 Conclusion
In this work, we have addressed visual perception for a personal service robot in the intelligent home environment. We have identified and described the key functionalities which the vision system of a personal service robot should have, considering the role model of the robot at home. These functionalities include continuous detection and tracking of non-static objects in the environment (principally people), recognition of the user as somebody who may issue commands to it, some level of scene understanding and situation awareness, including understanding of gestures or commands, and self-localization in the environment.
To implement these functionalities efficiently and robustly, we have proposed an efficient and reliable framework to organize the sensor nodes in the intelligent home network and to distribute each vision task to the proper camera node, connected by a novel camera interface framework (UCIF).
Each camera node in our framework performs its visual tasks independently and locally. After processing an input image, it transmits the results to the main node in symbolic form, instead of the image itself. The main node in our framework fuses the information from each node to execute real actions, such as robot control and display of the fused information. As a result, the communication between nodes requires very low bandwidth, and the processing burden is distributed to each node, leading to good scalability. UCIF can also handle multiple camera nodes whose fields of view overlap. For this purpose, UCIF manages the synchronization scheme and resolves duplicated information from the overlapping region.
We also proposed fast and reliable methods for each functional module in the fixed camera nodes and the mobile camera node. A fixed sensor node, for example a single camera on the wall, can provide remote awareness of its environment to the robot. We proposed an adaptive background subtraction method using intensity and edge images together to detect people, and each detected person is tracked using an association matrix between consecutive images. We use color histogram based matching to identify detected people. Each fixed camera node should also recognize some actions which may require a reaction of the robot. For this purpose, we proposed a novel action recognition method based on a nondeterministic finite automata model. In this method, we detect six kinds of observations, such as the variation of height, location, head position, hand position, and human blob ratio, which enable a transition of state. Each detected observation issues a positive or negative token to the transition function, resulting in a change of state.
In understanding the gestures or behavior of people, the position of the head plays an important role. Furthermore, the head position is useful as an anchor for the detection of body parts, such as the limbs and torso. To detect the head of a detected person, we proposed a combined method of 2D ellipse (head) fitting and an Ω shape (head and shoulder) detector. This method can detect heads in various postures and is robust to false alarms.
The tasks associated with the mobile robot node are, first, localization, so that the robot can move properly and be prepared to respond to commands from its master, and second, gesture recognition, so that the robot can recognize potential users and respond to gestures or commands.
We handle the localization problem by vertical edge matching in consecutive panoramic images captured by the omni-directional camera. We will keep searching for and testing efficient and robust localization methods to be integrated into our framework. We handle the gesture recognition problem by detecting and tracking the head.
As experimental results, we showed that vision modules based on our model can run in real time and reliably for a personal service robot in the intelligent home network. Our system is running in the lab, and the current implementation has three cameras in two rooms, plus the robot.
Future work should focus on, first, integration of every sensor node into the real robot vision system: currently, only the fixed camera nodes are integrated into the UCIF framework, and the mobile robot node works in a stand-alone environment. Second, the people detection method should be extended to detect people occluded by non-human objects; this may require a direct human detection method which detects body parts. Third, the system needs to identify users: currently, the identification module can recognize only whether detected people are new or existing persons, for data management purposes, whereas the robot should recognize its master for more personalized interaction. Finally, we believe further work is required for more complex environments and scenarios.
References
[1] A. Agarwal, B. Triggs: “Recovering 3D human pose from Monocular Images”,
IEEE Trans. on PAMI, vol. 28, No. 1, pp. 44-58, 2006
[2] C. BenAbdelkader and L.Davis, “Detection of People Carrying Objects : a Mo-
tion-based Recognition Approach”, 5
th
IEEE International Conference on Au-
tomatic Face and Gesture Recognition, May, 2002
[3] A. F. Bobick and A. D. Wilson, “A state-based technique to the representation
and recognition of gesture”, IEEE Trans. on PAMI, vol 19. pp. 1325-1337, Dec.
1997
[4] M. Brand and V. Kettnaker, “Discovery and segmentation of activities in vid-
eo,” IEEE Trans. on PAMI, vol. 22, pp. 844–851, Aug. 2000
[5] Q. Cai and J. K. Aggarwal, “Tracking human motion in structured environ-
ments using a distributed-camera system,” IEEE Trans. Pattern Anal. Machine
Intell., vol. 21, no. 11, pp. 1241–1247, 1999
[6] Y.Caspi and M.Irani, “Spatio-temporal alignment of sequences”, IEEE Trans.
on PAMI., vol 24, pp. 1409-1424, Nov. 2002
[7] J. Castellanos, J. Montiel, J. Neira, and J. Tardos. The spmap, “A probabilistic
framework for simultaneous localization and map building”, IEEE Transac-
tions on Robotics and Automation, vol 15. pp. 948-952, 1999
[8] Chang, T.-H. Gong, S. “Tracking multiple people with a multi-camera system”,
In Proceedings of IEEE Workshop on multi-object tracking pp.19-26, 2001
[9] F. Chenavier and J. L. Crowley, "Position estimation for a mobile robot using
vision and odometry", IEEE Int. Conf. Robot. Automat, Nice, pp. 2588-2593,
France,1992
[10] Collins, R.T.; Lipton, A.J.; Fujiyoshi, H.; Kanade, T., “Algorithms for coopera-
tive multisensor surveillance” Proceedings of the IEEE, vol 89, issue 10, pp.
1456-1477, Oct. 2001
[11] J. Deutscher, A. Davison, I. Reid, “Automatic partitioning of high dimensional
search spaces associated with articulated body motion capture,” CVPR vol II.
pp. 669-676, 2001
[12] M. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte, and M. Csobra. “A
solution to the simultaneous localization and map building (SLAM) problem”,
IEEE Transactions on Robotics and Automation, vol 17. pp.229-241, 2001
[13] S. L. Dockstader and A. M. Tekalp, “Multiple camera tracking of interacting
and occluded human motion,” Proc. IEEE, vol. 89, pp.1441–1455, Oct. 2001.
[14] Intelligent Robot Research Division, ETRI, Common Robot Interface Frame-
work (CRIF) Manual
[15] P. Felzenszwalb, “Learning Models for Object Recognition”, CVPR, Vol I:
1056-1062, 2001
[16] Jose Gaspar, Niall Winters, Etienne Grossmann, Jose Santos-Victor, "Toward
Robot Perception using Omni-directional Vision", S. Patnaik, L.C. Jain, G.
Tzafestas and V. Bannore (Eds), Springer-Verlag, in press, 2004
[17] D. M. Gavrila and U. Franke, S. Görzig and C. Wöhler, “Real-Time Vision for
Intelligent Vehicles”, IEEE Instrumentation and Measurement Magazine, vol. 4,
No. 2, pp. 22-27, 2001
[18] D. M. Gavrila, “Sensor-based Pedestrian Protection”, IEEE Intelligent Systems,
vol. 16, No. 6, pp. 77-81, 2001
[19] Grammalidis, N., Strintzis, M.G., “Head detection and tracking by 2-D and 3-D
ellipsoid fitting”, Proceedings of Computer Graphics International, pp. 221-226,
2000
[20] I. Haritaoglu, D, Harwood, and L. Davis, “W4: real-time surveillance of people
and their activities,” IEEE Trans. on PAMI, vol 22. pp. 809-830, Aug. 2000
[21] Tsutomu Hasegawa, Kouji Murakami, “Robot Town Project: Supporting Ro-
bots in an Environment with Its Structured Information”. International Confe-
rence on Ubiquitous Robots and Ambient Intelligence, pp. 119-123, Seoul Ko-
rea, Oct. 2006
[22] W. M. Hu, D. Xie, and T. N. Tan, “A hierarchical self-organizing approach for
learning the patterns of motion trajectories”, Chin. J. Comput., vol. 26, no. 4,
pp. 417–426, 2003
[23] G. Hua, M. Yang, Y. Wu, “Learning to estimate human pose with data driven
belief propagation” CVPR vol II. pp. 747-754, 2005
[24] H. Ishiguro, "Development of Low-Cost and Compact Omni-directional Vision
Sensors and Their Applications", Proc. Int. Conf. Information systems, analysis
and synthesis, pp. 433-439, 1998
[25] Y. A. Ivanov and A. F. Bobick, “Recognition of visual activities and interac-
tions by stochastic parsing”, IEEE Trans. on PAMI, vol. 22, pp. 852–872, Aug.
2000
[26] N. Johnson and D. Hogg, “Learning the distribution of object trajectories for
event recognition”, Image Vis. Comput., vol. 14, no. 8, pp. 609–615, 1996
[27] Dan Kara. “Sizing and Seizing the Robotics Opportunity” In COMDEX, 2003
[28] V. Kettnaker and R. Zabih, “Bayesian multi-camera surveillance” in Proc.
IEEE Conf. Computer Vision and Pattern Recognition, pp. 253–259. 1999
[29] D.H. Kim, J. Lee, H.S. Yoon, H.J. Kim, Y. Cho, E.Y. Cha. “A vision-based
user authentication system in robot environments by using semi-biometrics and
tracking”, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 1812-
1817, 2005
[30] K. Kim, M. Siddiqui, A. François, G. Medioni, Y. Cho. “Robust Real-Time
Vision Modules for a Personal Service Robot” International Conference on
Ubiquitous Robots and Ambient Intelligence, pp. 133-138, Seoul Korea , 2006
[31] Kwangsu Kim and Gérard Medioni, “Robust Real-Time Vision for a Personal
Service Robot in a Home Visual Sensor Network”, 16
th
IEEE International
Symposium on Robot & Human Interactive Communication, Aug, 2007
[32] Ben Kröse, “Cognitive robot companions for smart environments”, Interna-
tional Conference on Ubiquitous Robots and Ambient Intelligence, pp. 5-26,
Seoul Korea, Oct. 2006
[33] Jaeyeong Lee, Heesung Chae, Hyo-Sung Ahn, Wonpil Yu, and Young-Jo Cho.
“Development of Ubiquitous Robotic Space for Networked Robot”. Interna-
tional Conference on Ubiquitous Robots and Ambient Intelligence, pp. 172-176,
Seoul Korea, Oct. 2006
[34] J.-H. Lee and H. Hashimoto, “Intelligent Space – Its concept and contents”,
Advanced Robotics Journal, Vol. 16, pp. 265-280, 2002
[35] M. W. Lee, I. Cohen, "A Model-Based Approach for Estimating Human 3D
Poses in Static Images", IEEE Trans. on PAMI, vol. 28, No. 6, pp. 905-916,
2004
[36] Wang Liang, Zhu Qidan, Liu Zhou, “Location research of mobile robot with an
omni-directional camera", Proceedings of the 2004 International Conference on
Intelligent Mechatronics and Automation", pp. 662-666, 2004
[37] S.J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler, “Tracking
Groups of People”, CVIU, vol. 80, pp. 42-56, 2000
[38] G. Medioni, A. Francois, M. Siddiqui, K. Kim, H. Yoon, Robust Real-Time
Vision for a Personal Service Robot, CVIU 2007
[39] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human Detection Based on a
Probabilistic Assembly of Robust Part Detector”, ECCV, Vol I: 69-82, 2004
[40] Moballegh, H. R, Amini, P, Pakzad, Y, Hashemi, M, Nanniani, M, "An im-
provement of self-localization for omnidirectional mobile robots using a new
odometry sensor and omnidirectional vision", Canadian Conference on Elec-
trical and Computer Engineering", pp. 2337-2340, 2004
[41] A. Mohan, C. Papageorgiou, and T. Poggio, “Example-based Object Detection
in Image by Components”, IEEE Trans. on PAMI, vol.23, no. 4, April 2001
[42] G. Mori, X. Ren, A. Efros, J. Malik, “Recovering Human Body Configurations:
Combining Segmentation and Recognition”, CVPR vol II. pp. 326-333, 2004
[43] Kazuyuki Morioka, Joo-Ho Lee, Hideki Hashimoto, “Human Centered Robot-
ics in Intelligent Space”, IEEE Int. Conf. Robot. Automat, pp. 2010-2015, 2002
[44] S. K. Nayar and V. Peri, "Folded catadioptric cameras", Panoramic Vision:
Sensors, Theory, Applications, pp. 103-119, 2001
[45] Ram Nevatia, Jerry Hobbs, Bob Bolles. “An Ontology for Video Event Repre-
sentation” IEEE Workshop on Event Detection and Recognition, pp. 119-119,
June 2004
[46] E. Osuna, R. Freund and F. Girosi, "Training Support Vector Machines: an
Application to Face Detection", CVPR, pp. 130-136, San Juan, Puerto Rico,
1997
[47] C. Papageorgiou, T. Evgeniou, and T. Poggio, “A Trainable Pedestrian Detec-
tion System”, Proc. Of Intelligent Vehicles, pp. 241-246, 1998
[48] D. Ramanan, D. A. Forsyth: “Tracking people by Learning Their Appearance”,
IEEE Trans. on PAMI, vol. 29, pp. 65-81, 2007
[49] T. J. Roberts, S. J. McKenna, I. W. Ricketts, “Human Pose Estimation Using
Learnt Probabilistic Region Similarities and Partial Configurations,” ECCV pp.
291-303, Prague, Czech, 2004
[50] H. Rowly, S. Baluja and T. Kanade, "Neural network based face detection",
IEEE Trans. on PAMI, vol 20. pp.23-38, 1998
[51] Alessandro Saffiotti, Mathias Broxvall. “PEIS Ecologies: Ambient Intelligence
meets Autonomous Robotics”, Proc. of the sOc-EUSAI conference on Smart
Objects and Ambient Intelligence, pp. 277-281, Grenoble, FR, October 2005
[52] H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detec-
tion Applied to Faces and Cars", CVPR, vol. 1, pp. 1746-1753, 2000
[53] G. Shakhnarovich, P. Viola, T. Darrell, “Face pose estimation with parameter
sensitive hashing”, ICCV pp.750-777, 2003
[54] L. Sigal, S. Bhatia, S. Roth, M. J. Black, M. Isard, “Tracking Loose-limbed
People”, CVPR vol I. pp. 421-428, 2004
[55] C. Sminchisescu, B. Triggs, “Estimating Articulated Human Motion with Co-
variance Scaled Sampling”, International Journal of Robotics Research, vol. 22,
pp. 371-391, No. 6, 2003
[56] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language
recognition using desk and wearable computer-based video,” IEEE Trans. on
PAMI, vol. 20, pp. 1371–1375, Dec. 1998
[57] K. Takahasi, S. Seki, H. Kojima, and R. Oka, “Recognition of dexterous mani-
pulation from time-varying images”, in Proc. IEEE Workshop on Motion of
Non-Rigid and Articulated Objects, Austin TX, pp. 23-28, 1994
[58] P. Viola and M. J. Jones, “Robust real-time face detection”, IJCV, vol 57(2),
pp. 137-154, 2004
[59] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time
tracking of the human body’ IEEE Trans. on PAMI, vol 19. pp. 780-785, July.
1997
[60] Bo Wu, Ram Nevatia, “Detection of Multiple, Partially Occluded Humans in a
Single Image by Bayesian Combination of Edgelet Part Detectors”, ICCV, Vo-
lume I, pp. 90-97. Beijing, China, October 2005
[61] J. Zhang, R. Collins, Y. Liu, “Representation and Matching of Articulated
Shapes”, CVPR vol II. pp.342-349, 2004
[62] T. Y. Zhang and C. Y. Suen, “A fast parallel algorithm for thinning digital pat-
terns”, Communication of the ACM, vol. 27, pp. 236-239, March 1984
[63] L. Zhao and C.E. Torpe, “Stereo- and Neural Network Based Pedestrian Detec-
tion”, IEEE Trans. Intelligent Transportation System, vol. 1, No.3, pp. 148-154,
Sept. 2000
[64] Tao Zhao, Ram Nevatia, “Tracking Multiple Humans in Complex Situations”,
IEEE Trans. on PAMI, vol. 26. No. 9, pp. 1208-1221, 2004
[65] http://oxygen.csail.mit.edu/
[66] http://www.poserworld.com
[67] http://www.wikipedia.com
[68] http://www.cs.nott.ac.uk/~txa/
[69] http://icara.massey.ac.nz/default.asp
[70] http://www.robotweek.or.kr/m1/m1s1.asp
[71] http://www.springer.com/west/home/engineering?SGWID=4-175-70-
71454808-0
[72] OpenCV Reference Manual