LEARNING TO DETECT AND ADAPT TO
UNPREDICTED CHANGES
by
Nadeesha Oliver Ranasinghe
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2012
Copyright 2012 Nadeesha Oliver Ranasinghe
Acknowledgements
First, I would like to express my deepest gratitude to my advisor Prof. Wei-Min Shen
without whom this work would not even be conceived. In particular, Prof. Shen provided
me with the necessary motivation, knowledge and most importantly permission to build
on his own dissertation research. I have learned from him that consideration,
determination, hard work and an endless striving for perfection are qualities of a good
researcher and a great human being.
I would also like to thank Prof Ramakant Nevatia, Prof Michael Safonov, Prof Laurent
Itti and Prof Yu-Han Chang for giving me their precious time by being on my committees,
providing me with invaluable feedback and an abundance of encouragement.
I am eternally grateful to my wife, parents, sister and in-laws for their unwavering
support throughout this process. My father, mother, wife and daughter (1+ year) have
helped me focus on this research by addressing all other challenges around me. They are
my light against the darkness of uncertainty.
Finally, thank you to all the members of the Polymorphic Robotics Laboratory for their
help over the years. Special thanks to Harris Chiu, Dinesh Hiripitiyage, Kary Lau, Lizsl
DeLeon, Feili Hou, Jacob Everist, Mike Rubenstein, Behnam Salemi, Akiya Kamimura,
Prof Peter Will, Luenin Barrios, Teawon Han, Joseph Chen, TJ Collins and all my friends.
Table of Contents
Acknowledgements ........................................................................................................... ii
List of Tables ................................................................................................................... vii
List of Figures ................................................................................................................. viii
Abstract ............................................................................................................................. xi
Chapter 1: Introduction ................................................................................................... 1
1.1 Problem Statement .................................................................................................. 1
1.2 Motivation of Our Approach ................................................................................... 2
1.3 Scope and Assumptions ........................................................................................... 3
1.4 Scientific Contributions .......................................................................................... 5
1.5 Dissertation Organization ...................................................................................... 6
Chapter 2: Overview of the State-of-the-Art .................................................................. 7
2.1 Inspirations from Developmental Psychology ........................................................ 7
2.2 Artificial Intelligence .............................................................................................. 8
2.3 Adapting to Unpredicted Changes (including Fault & Failure Tolerance) ......... 13
2.4 Reasoning with Unpredicted Interference ............................................................ 16
2.5 Summary ............................................................................................................... 17
Chapter 3: Surprise Based Learning ............................................................................ 21
3.1 Terminology .......................................................................................................... 21
3.2 Illustrative Example - Constrained Box Environment .......................................... 26
3.3 Learning a Prediction Model ................................................................................ 29
3.4 Life Cycle of Prediction Rules .............................................................................. 32
3.4.1 Rule Creation ................................................................................................ 34
3.4.2 Surprise Detection and Analysis ................................................................... 36
3.4.3 Rule Splitting ................................................................................................ 40
3.4.4 Rule Refinement ........................................................................................... 43
3.5 Using a Prediction Model ..................................................................................... 46
3.6 Goal Management (Knowledge Transfer) ............................................................ 48
3.7 Probabilistic Rules and Forgetting ....................................................................... 50
3.8 Entity & Attribute Relevance ................................................................................ 53
3.9 SBL Pseudocode.................................................................................................... 56
3.10 Summary ........................................................................................................... 58
Chapter 4: Evaluation Strategy ..................................................................................... 59
4.1 Experimental Environments .................................................................................. 59
4.1.1 Game Environments...................................................................................... 60
4.1.2 Real-world Office Environment ................................................................... 62
4.2 Evaluation Methods .............................................................................................. 63
4.2.1 Comparison via Simulation........................................................................... 63
4.2.2 Feasibility in Real-World .............................................................................. 66
4.3 Summary ............................................................................................................... 67
Chapter 5: Structure Learning ...................................................................................... 68
5.1 Scalability by Exploiting Structure ....................................................................... 68
5.1.1 Approach ....................................................................................................... 69
5.1.2 Results & Discussion .................................................................................... 70
5.2 Constructing Useful Models with Comparison Operators ................................... 76
5.2.1 Approach ....................................................................................................... 77
5.2.2 Results & Discussion .................................................................................... 78
5.3 Impact of Training ................................................................................................ 78
5.3.1 Approach ....................................................................................................... 79
5.3.2 Results & Discussion .................................................................................... 80
5.4 Summary ............................................................................................................... 81
Chapter 6: Learning from Uninterpreted Sensors and Actions ................................. 82
6.1 Discretizing a Continuous Sensor ......................................................................... 82
6.1.1 Approach ....................................................................................................... 82
6.1.2 Results & Discussion .................................................................................... 83
6.2 Combining Multiple Uninterpreted Sensors ......................................................... 84
6.2.1 Approach ....................................................................................................... 85
6.2.2 Results & Discussion .................................................................................... 86
6.3 Scalability in the Number of Sensors and Actions ................................................ 89
6.3.1 Approach ....................................................................................................... 89
6.3.2 Results & Discussion .................................................................................... 90
6.4 Summary ............................................................................................................... 94
Chapter 7: Detecting and Adapting to Unpredicted Changes .................................... 95
7.1 Unpredicted Directly Observable Goal Changes ................................................. 95
7.1.1 Approach ....................................................................................................... 95
7.1.2 Results & Discussion .................................................................................... 97
7.2 Unpredicted Indirectly Observable Goal Changes............................................... 99
7.2.1 Approach ..................................................................................................... 100
7.2.2 Results & Discussion .................................................................................. 101
7.3 Unpredicted Configuration Changes in the Environment .................................. 105
7.3.1 Approach ..................................................................................................... 105
7.3.2 Results & Discussion .................................................................................. 106
7.4 Unpredicted Sensor Changes .............................................................................. 108
7.4.1 Approach ..................................................................................................... 108
7.4.2 Results & Discussion .................................................................................. 109
7.5 Unpredicted Action Changes .............................................................................. 110
7.5.1 Approach ..................................................................................................... 110
7.5.2 Results & Discussion .................................................................................. 111
7.6 Repairing vs. Rebuilding the Learned Model from Scratch ............................... 113
7.6.1 Approach ..................................................................................................... 113
7.6.2 Results & Discussion .................................................................................. 114
7.7 Relevance and Unpredicted Sensor & Action Changes ...................................... 115
7.7.1 Approach ..................................................................................................... 115
7.7.2 Results & Discussion .................................................................................. 117
7.8 Simultaneous Unpredicted Changes in Sensors, Action & Goals ...................... 119
7.8.1 Approach ..................................................................................................... 120
7.8.2 Results & Discussion .................................................................................. 121
7.9 Simultaneous Unpredicted Changes in Sensors, Action, Goals & the
Environment’s Configuration ..................................................................................... 123
7.9.1 Approach ..................................................................................................... 123
7.9.2 Results & Discussion .................................................................................. 124
7.10 Summary ......................................................................................................... 126
Chapter 8: Detecting and Reasoning with Unpredicted Interference ...................... 128
8.1 Noise & Missing Data ......................................................................................... 128
8.2 Experimental Setup ............................................................................................. 129
8.3 Approach ............................................................................................................. 132
8.3.1 Model Learning Phase ................................................................................ 133
8.3.2 Similarity Metric & Similarity Bounds Learning Phase ............................. 136
8.3.3 Testing Phase .............................................................................................. 137
8.3.4 Noisy and Gapped Recognition .................................................................. 138
8.4 Results & Discussion .......................................................................................... 140
8.4.1 Recognition ................................................................................................. 140
8.4.2 Gap Filling .................................................................................................. 144
8.5 Summary ............................................................................................................. 146
Chapter 9: Conclusion .................................................................................................. 147
9.1 Summary and Contributions ............................................................................... 147
9.2 Future Research Directions ................................................................................ 150
References ...................................................................................................................... 152
List of Tables
Table 1: Comparison of some competitive learning algorithms ....................................... 17
Table 2: Example of rule creation ..................................................................................... 35
Table 3: Example of surprise analysis .............................................................................. 38
Table 4: Example of rule splitting using causes from Table 3 ......................................... 41
Table 5: Example of rule refinement ................................................................................ 44
Table 6: Prediction model discretizing a continuous sensor ............................................. 83
Table 7: Adapting to action & sensor changes in the constrained box environment ...... 117
Table 8: Summary of contributions ................................................................................ 147
List of Figures
Figure 1: a) Overhead view b) SuperBot & sensors c) Vision & range sensor ................ 26
Figure 2: SBL Process .................................................................................................. 29
Figure 3: SBL architecture ................................................................................................ 29
Figure 4: Life cycle of a prediction rule ........................................................................... 32
Figure 5: a) Robot’s location b) Base-condition c) Base-result ........................................ 36
Figure 6: a) base-condition b) base-result/surprised-condition c) surprised result ........... 38
Figure 7: a) Base-consequence b) Surprised-condition c) Surprised-consequence .......... 45
Figure 8: a) After 1st/before 2nd b) After 2nd/before 3rd action c) after 3rd action ......... 52
Figure 9: a) Hunter-goal layout b) Hunter-prey layout ..................................................... 60
Figure 10: Layout of my office room ............................................................................... 62
Figure 11: Actions to learn each map w/o obstacles from every starting location ........... 70
Figure 12: Data from Figure 11 excluding the “Random exploration” series .................. 70
Figure 13: Learning time for each map w/o obstacles from every starting location ........ 71
Figure 14: Actions to learn each map w obstacles from every starting location .............. 72
Figure 15: Learning time for each map w obstacles from every starting location ........... 73
Figure 16: Actions executed to learn each chaotic map ................................................... 74
Figure 17: Data from Figure 16 excluding the “Random exploration” series .................. 74
Figure 18: Learning time for each chaotic map ................................................................ 75
Figure 19: Actions to reach the goal after the specified amount of training runs ............. 80
Figure 20: Proximity sensor response ............................................................................... 83
Figure 21: Actions executed for hunter to catch prey in each map .................................. 86
Figure 22: Execution time for the hunter-prey solution for each map .............................. 87
Figure 23: Impact of increasing the available actions ...................................................... 90
Figure 24: Impact of increasing the number of dummy constant valued sensors ............. 91
Figure 25: Impact of increasing the number of dummy random valued sensors .............. 92
Figure 26: Scalability in the number of irrelevant sensors ............................................... 93
Figure 27: Actions to reach dynamic goal locations in each map w/o obstacles .............. 97
Figure 28: Actions to reach dynamic goal locations in each map with obstacles ............. 98
Figure 29: Response of fixed platform with new starting locations at run 1, 4 & 7 ....... 101
Figure 30: a) Initial b) Run 1 c) Run 2 d) Run 7 ............................................................. 102
Figure 31: Response of fixed start location with platform moving at run 1, 4 & 7 ........ 103
Figure 32: a) Initial b) Run 1 c) Run 2 d) Run 7 ............................................................. 103
Figure 33: The hunter cannot traverse the gray obstacles............................................... 105
Figure 34: Actions executed to reach the goal in each run w environmental changes ... 106
Figure 35: No of surprises encountered during each run w environmental changes ...... 107
Figure 36: Actions executed to reach the goal in each run with sensor changes ............ 109
Figure 37: Number of surprises encountered during each run with sensor changes ...... 109
Figure 38: Actions executed to reach the goal in each run with action changes ............ 111
Figure 39: Number of surprises encountered during each run with action changes ....... 112
Figure 40: Actions executed when the model is repaired vs. rebuilt .............................. 114
Figure 41: Surprises encountered with repair vs. rebuild ............................................... 114
Figure 42: Goal locations in hunter layout ..................................................................... 120
Figure 43: Actions executed to reach the goal in each run with several changes ........... 121
Figure 44: Number of surprises encountered during each run with several changes ..... 121
Figure 45: Actions executed to reach the goal in each run with several changes ........... 124
Figure 46: Number of surprises encountered during each run with several changes ..... 125
Figure 47: Action recognition system dataflow .............................................................. 130
Figure 48: a) Frame 1 b) Frame 5 ................................................................................... 131
Figure 49: a) Frame 10 b) Frame 30 ............................................................................... 131
Figure 50: SBL action recognition process ..................................................................... 132
Figure 51: Relationship between examples, models and an action ................................ 134
Figure 52: Graphical depiction, sensor data and prediction model for “replace” ........... 135
Figure 53: Segmented data with fired rules for “replace” .............................................. 135
Figure 54: Markov chain for “replace” ........................................................................... 135
Figure 55: a) Missing start data b) Expected results for postdiction .............................. 138
Figure 56: a) Stream with missing data b) Expected results for interpolation................ 139
Figure 57: a) Missing end data b) Expected results for prediction ................................. 139
Figure 58: Action recognition results with positive examples and positive models....... 142
Figure 59: Action recognition results with different combinations of models ............... 143
Figure 60: Action recognition results with gap filling .................................................... 145
Abstract
To survive in the real world, a robot must be able to intelligently react to unpredicted and
possibly simultaneous changes to itself (such as its sensors, actions, and goals) and
dynamic situations/configurations in the environment. Typically, a great deal of human knowledge is required to transfer essential control details to the robot, details which precisely describe how to operate its actuators based on environmental conditions
detected by sensors. Despite the best preventative efforts, unpredicted changes such as
hardware failure are unavoidable. Hence, an autonomous robot must detect and adapt to
unpredicted changes in an unsupervised manner.
This dissertation presents an integrated technique called Surprise-Based Learning (SBL)
to address this challenge. The main idea is to have a robot perform both learning and
representation in parallel by constructing and maintaining a predictive model which
explains the interactions between the robot and the environment. A robot using SBL
engages in a life-long cyclic learning process consisting of “prediction, action,
observation, analysis (of surprise) and adaptation”. In particular, the robot always
predicts the consequences of its actions, detects surprises whenever there is a significant
discrepancy between the prediction and observed reality, analyzes the surprises for their causes (correlations) and uses critical knowledge extracted from the analysis to adapt
itself to unpredicted situations.
SBL provides four new contributions to robotic learning. The first contribution is a novel
method for structure learning capable of learning accurate enough models of interactions
in an environment in an unsupervised manner. The second contribution is learning
directly from uninterpreted sensors and actions with the aid of a few comparison
operators. The third contribution is detecting and adapting to simultaneous unpredicted
changes in sensors, actions, goals and the environment. The fourth contribution is
detecting and reasoning with unpredicted interference over a short period of time.
Experiments on both simulation and real robots have shown that SBL can learn accurate
models of interactions and successfully adapt to unpredicted changes in the robot’s
actions, sensors, goals and the environment’s configuration while navigating in different
environments. Experiments on surveillance videos have shown that SBL can detect
interference, and recover some information that was hidden from sensors, in the presence
of noise and gaps in the data stream.
Chapter 1
Introduction
1.1 Problem Statement
No matter how carefully a robot is engineered, the initial knowledge of the robot is bound
to be incomplete or incorrect with respect to the richness of the real world. Thus, an
autonomous robot must deal with dynamic situations caused by changes in its sensors,
actions and goals, changes in the environment and interference.
Specifically, we define “unpredicted changes” as the addition and deletion of sensors, actions and goals, alterations in the environment’s configuration, and changes in the definition of sensors and actions. A definition-change in a sensor means that its typical response to
stimuli has changed, e.g. a camera being accidentally “twisted”. A definition-change in
an action means that its actuator response has changed, e.g. a crossed wire has swapped left-turn and right-turn. These unpredicted changes occur instantaneously and
could last over an indefinite period of time. Therefore, the robot must adapt to such
changes as soon as possible. In contrast, we define “interference”, such as noise and
missing data or gaps in sensor data streams, as unpredicted changes that occur over a
finite duration. Typically, the robot must be able to detect such situations and recover any
missing information. So our objective in this dissertation is to develop a solution to
address all of these unpredicted changes.
There are several major challenges for this problem. Unpredicted changes may occur
simultaneously in sensors, actions, goals or other related aspects. Thus, a robot cannot
assume any particular component to be fault-free, or guaranteed redundancy. Due to the
lack of permanently correct models for sensors, actions and environments, the robot must
constantly check and refine its models. The detection of change may not occur under external supervision or guidance. A fast response time is required despite resource
limitations on a robot, and in most cases the robot must cope with continuous (non-
discrete), uncertain and vast information/action space.
1.2 Motivation of Our Approach
Events such as the Spirit Mars rover getting stuck in soft soil [Coul09] validated the fact
that no matter how carefully a robot is engineered, its initial knowledge is bound to be
incomplete or incorrect with respect to the richness of the real world. Similarly, the same
robot suffered damage to a control circuit that left it unable to turn its right-front wheel, forcing it to drive backwards while dragging the dead wheel [WB09]. This
and many other examples justify the fact that an autonomous robot must deal with
dynamic situations caused by unpredicted changes in its sensors, actions, goals and
environmental configurations.
Inspired by children’s developmental psychology, we set out to develop a life-long cyclic learning process for an autonomous robot. We believe that a child is born with very limited knowledge or expectations of its actions and interactions with the environment, but as children interact with the world they extract valuable knowledge from these experiences and develop the ability to predict the future outcomes of their actions even in an ever-changing world. As humans progress through life, their physical
characteristics or sensing and actuating capabilities change, such as changes in eyesight,
hearing, physical strength etc. Clearly, we are able to adjust and continue with our goals,
so a truly autonomous robot must also be able to handle changes in its sensors, actuators,
environments and goals as upgrades, failures and re-tasking are inevitable during its life-
time. Thus, our primary motivation was to create a learning algorithm for robots inspired
by a human’s process of “prediction, action, observation, analysis (of surprise), and
adaptation”. Eventually we hope that this will not only advance the fields of robotics and
artificial intelligence, but also be a vehicle that will contribute towards the field of
developmental learning.
1.3 Scope and Assumptions
Surprise-Based Learning was developed to facilitate life-long learning for an
unsupervised robot or autonomous agent by systematically addressing the challenges
outlined earlier. Throughout this dissertation we assume that the robot has sufficient
redundancy in hardware to compensate for any “surprise” or unpredicted change. In other
words, we assume that a cause for a surprise (or a reason for an unpredicted change) is observable, such that learning can converge to a solution that exists. This
assumption is practical in sensor rich applications like robotic navigation. Yet, there are
situations where the cause may not be directly observable due to hidden states. Although
there are strategies such as “local distinguishing experiments” [Shen93a] that could be
performed to disambiguate hidden states, they will not be addressed in the scope of this
dissertation.
We refer to a structured response to a stimulus from the environment as an “entity”. This
research assumes that a sensor maps to one or more entities, each entity has one or more
attributes, and an attribute has a value obtained during an observation. For example, a
camera sensor could return a set of entities such as uniquely identified objects or colored
blobs, which could have the attributes size and location, while a proximity sensor would
return an entity corresponding to distance with the attribute size. The mapping from a
sensor to entities and their attributes must be provided by a user as preprocessing is
required for complex sensors (i.e. multi-dimensional data), while raw or uninterpreted
data from simple sensors (i.e. one-dimensional data) can be fed directly to the learner.
It is important to define the sensors, entities and attributes adequately as the learner can
only converge on goals defined in terms of them. Any inadequate definitions may result
in random action execution dominating over goal directed behavior as the learned model
is unable to plan towards the goals. In order to determine adequate mappings a user
should identify a set of goals that the robot is expected to achieve, then work backwards
by identifying the desired attributes, corresponding entities and which sensors would
most likely satisfy these goals. Note that it is not necessary to identify the most
appropriate mapping. Providing one or more adequate mappings would ensure learning.
This research does not assume that the robot can be reset to its initial configuration prior
to each experiment in the real world. However, it does assume the continuity of entities in
consecutive observations. For example, in an environment consisting of uniquely colored
walls, if the robot perceives a blob of a certain color at a particular size in the first
observation, and a blob of the same color but a different size in the next observation, it
assumes that these blobs represent the same entity. To ensure the validity of this
assumption the duration of an action between consecutive observations is defined to be
sufficiently small.
Finally, when learning to detect and adapt to interference this research assumes that there
is either no interference, or a very tolerable amount of interference is present in the data
during the learning phase, and no further adaptation of the learned model is allowed
during the testing phase.
1.4 Scientific Contributions
This dissertation investigated the problem of autonomous detection and adaptation to
unpredicted changes of a robot. There are four contributions to robotic learning.
The first contribution is a new machine learning technique called Surprise-Based
Learning for structure learning in robotics. SBL captures the structure, i.e. the patterns in the sensed data associated with each action, in a set of prediction rules by detecting and
eliminating surprises.
The second contribution is learning directly from uninterpreted sensors and actions. More
precisely, SBL can learn from both interpreted and uninterpreted sensors by
discretizing continuous sensor data through the application of comparison operators.
The third contribution is detecting and adapting to simultaneous unpredicted changes in
sensors, actions, goals and the environment. SBL achieves this by combining its abilities
of detecting and identifying the independent unpredicted changes.
The fourth contribution is detecting and reasoning with unpredicted interference over a
short period of time. The interference considered here includes temporary noise and gaps
in data.
1.5 Dissertation Organization
This dissertation is organized into 9 chapters. Following the introduction in this chapter,
Chapter 2 outlines the background and related work in developmental learning, artificial
intelligence, and autonomous robot adaptation. Chapter 3 details the Surprise-Based
Learning approach. Chapter 4 describes experimental environments and a strategy for
evaluating the contributions. Chapters 5 to 7 present experiments and results on the
contributions of structure learning, learning from uninterpreted sensors & actions, and
detecting and adapting to unpredicted changes. Chapter 8 describes the contribution of
detecting and reasoning with unpredicted interference, with its results. Finally, Chapter 9
provides a summary and some future research directions.
Chapter 2
Overview of the State-of-the-Art
For autonomous detection and adaptation to unpredicted changes a number of challenges
must be addressed. This chapter presents inspirations drawn from human developmental
psychology and reviews some current artificial intelligence algorithms capable of
addressing these challenges with specific attention to related research on adaptation.
2.1 Inspirations from Developmental Psychology
The basic concept of surprise-based learning was first proposed by Shen and Simon in
1989 [SS89] and later formalized as Complementary Discrimination Learning [Shen90,
Shen93a, SS93]. This learning paradigm stems from Piaget’s theory of Developmental
Psychology [Piag52], Herbert Simon’s theory on dual-space search for knowledge and
problem solving [SL74], and C.S. Peirce’s method for science that “our idea of anything
is our idea of its sensible effects” [Peir1878].
Over the years, researchers have attempted to formalize this intuitively simple but
powerful idea into an effective and general learning technique. A number of experiments
in discrete or symbolic environments have been carried out with successes, including the
developmental psychology experiments for children to learn how to use novel tools
[Shen94], scientific discovery of hidden features (genes) [Shen89, SS93, Shen95], game
playing [Shen93b], and learning from large knowledge bases [Shen92]. Here, we
generalize these previous results for real robots to learn autonomously from continuous
and uncertain environments.
Another important inspiration drawn from developmental psychology is a behavioral
procedure called the water maze [Morr84], which was designed by Richard G. Morris to
test spatial memory. Although it was originally developed to test learning in animals, it
has since been modified for evaluating the same in autonomous robots.
2.2 Artificial Intelligence
In computer science, SBL is related to several solutions for the inverse problem, such as
Gold’s algorithm for system identification in the limit, Angluin’s L* algorithm [Angl87]
for learning finite state machines with hidden states using queries and resets, the L*-
extended algorithm by Rivest and Schapire [RS93] using homing sequences, and the D*
algorithm based on local distinguishing experiments [Shen93a].
For learning from stochastic environments, SBL is related to learning hidden Markov
models (HMM) [Murp04], partially observable Markov decision processes [Cass99], and
most recently, predictive state representations [WJS05] and temporal difference (TD)
algorithms [ST05].
Some systems also incorporate novelty [HW02], with a flavor of surprise, into the value
function of states, although these novelty measures are not used to modify the learned models. The notion of
prediction is common in both developmental psychology and AI. Piaget’s constructivism
[Piag52] and Gibson’s affordance [Gibs79] are two famous examples. In AI systems, the
concepts of schemas [Dres91] and fluents [CAOB97] both resemble the prediction rules we
use here, although their primary use is not for detecting and analyzing surprises as we do
here. Different from these results, we present a new technique that capitalizes on
predictions and surprises to facilitate simultaneous learning and representation.
At present most learning algorithms can be classified as supervised, unsupervised or
reinforcement learning [KB72]. Supervised learning (SL) requires the use of an external
supervisor that may not be present here. In contrast, unsupervised learning (UL) opts to
learn without any feedback from the environment by attempting to remap its inputs to
outputs, using techniques such as clustering. Hence, it may overlook the fact that
feedback from the environment may provide critical information for learning.
Reinforcement learning (RL) receives feedback from the environment. Some of the more
successful RL algorithms used in related robotic problems include Evolutionary Robotics
[NF00] and Intrinsically Motivated Reinforcement Learning [SKB05]. However, most
RL algorithms focus on learning a policy from a given discrete state model and the
reward is typically associated with a single goal. This makes transferring the learned
knowledge to other problems more difficult. Doya et al. addressed this shortcoming in
Multiple Model-based Reinforcement Learning (MMRL) [DSKM02] by applying the
concepts of using multiple paired forward inverse models for motor control [WK98] into
the RL paradigm. MMRL maintains several parallel models and switches between them
as the goal changes, yet it needs a substantial amount of a priori knowledge about the
environment and sensors. It is interesting to note that such techniques may have been
inspired by unfalsified control theory [ST97], whereas SBL has a subtle difference in that
it maintains a single model that adapts with experience.
The following are a few noteworthy extensions and applications of RL. Krichmar et al.
[KNGE05] tested a brain-based device called “Darwin X” on a dry version of the water
maze. This robot utilized visual cues and odometry as input and its behavior was guided
by a simulated nervous system modeled on the anatomy and physiology of the vertebrate
nervous system. Busch et al. [BSKS07] built on this idea by simulating a water maze
environment to compare an attributed probabilistic graph search approach and a temporal
difference reinforcement learning approach based solely on visual cues encoded via a
self-organizing map which discretized the perceptual space. Stone et al. [SSK08] tested
the simulated algorithm in a physical robot and extended it to facilitate adaptation of the
reinforcement learning approach to the relocation of the hidden platform.
RL discretizes the perceptual space with a predefined approximation function. Instead,
SBL learns a model of the environment as a set of prediction rules. RL requires a large
number of training episodes to learn a path to a single goal. SBL learns via surprises with
each action it takes, so it does not need separate training episodes. Most RL algorithms
focus on learning a policy from a given discrete state model and the reward is typically
associated with a single goal, making it difficult to transfer learned knowledge to other
goals or to other problems which are similar. This fact was noted by Stone et al.
prompting an extension of the original algorithm, but even with the short term and long
term rewards the trajectories of the robot were still slightly biased when the goal was
relocated. SBL accommodates multiple goals and can be transferred to other problems
with minimal changes as the algorithm does not require any a priori information
regarding the environment, the sensors or the actuators. In addition, SBL’s ability to
accommodate sensor and actuator failure during runtime without any external
intervention makes it an ideal candidate for a physical robot operating in a real
environment. Compared to the paradigm of reinforcement learning, SBL is model-based
and offers quicker adaptation for large-scale and continuous problems.
Complementary Discrimination Learning (CDL) [Shen94] attempts to learn a model from
a continuous state space and is capable of predicting future states based on the current
states and actions. This facilitates knowledge transfer between goals and discovering new
terms [Shen89]. However, CDL is a symbolic learning technique that has not been applied to physical
robots to learn directly from the physical world. In particular, CDL prescribes how the
conditions in rules are to be altered to perform specialization and generalization, yet it
does not specify how the prediction side should be utilized to capture relations between
sensed entities. There are also some other activities that are necessary for robotic learning
that have not been addressed in CDL, such as discretization of continuous sensors, rule
forgetting and adapting to unpredicted changes.
Evolutionary Robotics (ER) is a powerful learning framework which facilitates the
generation of robot controllers automatically using neural networks (NN) and genetic
programming/algorithms (GA). The emphasis in ER is to learn a controller given some
special parameters such as the number of neurons and layers in a NN or the fitness, cross-
over and mutation functions in a GA. These parameters are not easily discoverable for
most robots in different environments. However, advances in ER such as the Exploration-
Estimation Algorithm [LB04] proved that an internal model such as an action or sensor
model of a robot can be learned without a priori knowledge, given that the environment
and the robot can be reset to its initial configuration prior to each experiment, which may
not be realistic in the real world.
Another promising approach is Intrinsically Motivated Reinforcement Learning (IMRL)
where the robot uses a self-generated reward to learn a useful set of skills. IMRL has
successfully demonstrated learning useful behaviors [OKH07], and a merger with ER as
in [SMB07] was able to demonstrate navigation in a simulated environment. The authors
have mentioned that IMRL can cope with situated learning, but the high dimensional
continuous space problem that embodiment produces is beyond its current capability.
There has been a large amount of research in model learning as in [PK97] and [SS06], yet
the majority focus on action, sensor or internal models of the robot and not the external
world. For the autonomous robotic learning problem we are interested in, the learner
must accommodate learning models of interaction in an environment with limited
processing, noisy actuation and noisy sensing available on a physical robot. Furthermore,
the ability to predict and be “surprised” by any ill effects is also critical. The powerful
paradigm of learning from surprises has been analyzed, theorized, explored and used in a
few other applications such as traffic control [HJSL05] and computer vision [IB04].
Alternatively, for problems such as robot navigation, Simultaneous Localization and
Mapping (SLAM) [Thru02] could be used to learn a map. Goals can be achieved by
applying AI searching and planning techniques on this map. Accurate models for the
robot’s actions and sensors must be given to facilitate this type of learning, yet such
models are not readily available for most autonomous robots. Most importantly, SLAM
techniques need carefully designed contingency plans to deal with unpredicted changes,
whereas SBL can overcome such situations by autonomously adapting the learned model.
2.3 Adapting to Unpredicted Changes (including Fault & Failure Tolerance)
“Fault & failure tolerance” used in the context of a robot’s ability to detect and recover
from sensor and actuation failure is related to adapting to unpredicted changes. It is
important to note that unpredicted changes include interference in addition to fault &
failure tolerance. Typically there are three attitudes towards dealing with unpredicted changes in robots, as identified by Saffiotti [Saff97]. The 1st is to get rid of them through precise engineering. The 2nd is to tolerate them by utilizing redundancy with carefully designed contingency routines. The 3rd is to reason about them by using techniques for the representation and manipulation of uncertain information. There is a rich body of research under all three attitudes, but we are interested in the 2nd and 3rd as they focus on
imbuing intelligence in robots.
In order to tolerate unpredicted changes in sensors and actions the robot must be able to
detect the change and then handle it. For this purpose one feasible approach is to provide
models of the sensors, actions and environments, such that when an exception occurs it
will be trapped and handled by separate contingency strategies. An early example of
developing contingency strategies using sensor redundancy was given by Visinsky
[Visi91]. Ferrell [Ferr94] demonstrated failure tolerance on a multi-legged robot that
exploits redundancy in the number of legs and sensors using a predefined strategy to
deactivate and reactivate each leg. Similarly, Kececi [KTT09] demonstrated a method for
redundant manipulators. From Murphy’s [Murp96] early work to Stronger’s [SS06]
recent work there has been research on learning sensor models under controlled
conditions first, then using them in uncontrolled situations to detect anomalies. When
goals change unpredictably methods such as model switching as shown by Doya
[DSKM02] can be used. Unfortunately, in most practical applications it is not always
possible to acquire accurate models of the sensors, actions or environments especially
because hardware degrades over time and environments change.
The attitude towards reasoning and manipulation of uncertainty, known as data-driven
adaptation, aims to overcome the shortcomings of the previous attitude. Sensor fusion
techniques demonstrated by Pierce [PK97] and Soika [Soik97] are good examples that
show how probabilistic sensing from multiple modalities could be used to tolerate faults.
Bongard [BZL06] further demonstrated how simulating and reasoning about uncertainties
can be used to recover from unpredicted actuation errors in a legged robot. When the
goals change unpredictably as in the Morris water maze, Stone [SSK08] demonstrated the
feasibility of updating the model and re-planning. However, most of these techniques rely
on either the use of controllable environments i.e. the ability to reset or alter accordingly,
or the availability of one or more models to verify the other i.e. use the sensor model to
establish that an actuator has failed etc. A very competitive solution for this problem was
highlighted by Pierce [PK97] where they demonstrated learning a map of the
environment directly from uninterpreted sensors and actuators. They specify that the
action model is learned with the aid of the previously learned sensor model, and the environment model is built using both, yet they do not detail how to accommodate
any unpredicted changes propagating from a previous model.
Despite numerous successes, none of these techniques attempt to deal with simultaneous
unpredicted changes in all aspects, including changes in sensors, actions, goals and the
environment’s configuration, in an unsupervised manner. This is a major motivation for
the development of SBL [RS08a].
2.4 Reasoning with Unpredicted Interference
Hidden Markov models as demonstrated in [Diet02], conditional random fields as
demonstrated in [BOP97] and neural networks have shown some promise towards
detecting and reasoning with unpredicted interference. These techniques require human
guidance to design approximation functions and determine the number of states or
neurons. Although an exhaustive search for the number of states can be performed, this
costs a large amount of time and data for training. In contrast, SBL is designed to learn
the number of states from a few examples and reason with unpredicted interference.
2.5 Summary
Table 1: Comparison of some competitive learning algorithms
(Techniques compared: RL, NN, GA, CDL, TD, MMRL, IMRL, HMM and SBL. Capabilities compared: Solve Task, No Over Fitting, No Prior Models, Raw Data, Probabilistic Data, Knowledge Transfer, No Explicit Training, Fault Tolerance, Memory Forgetting, Runtime Change, Detect Anomaly and Recover Gap.)
Table 1 shows a comparison of the capabilities of various learning algorithms that are
suitable for autonomous robotic learning. The columns are defined as follows:
“Solve Task” means that the algorithm can learn to satisfy some goals.
“No Over Fitting” marks scalability as it implies generalization, resulting in
compactness.
“No Prior Models” means that the algorithm does not need an approximation
function or a sensor, actuator or environment model in advance.
“Raw Data” is the ability to handle data directly from the sensors rather than via a
human engineered approximation function or a preprocessor.
“Probabilistic Data” means that non-deterministic environments can be dealt with.
“Knowledge Transfer” facilitates changing goals without having to re-learn.
“No Explicit Training” indicates that the learning is incremental such that the
training and testing are interleaved.
“Fault Tolerance” is the ability to compensate for faults and failures provided
there is sufficient redundancy.
“Memory Forgetting” identifies the ability to reject irrelevant data over a period
of time.
“Runtime Change” permits changes to hardware (sensors, actuators) or software
(operators) dynamically during runtime.
“Detect Anomaly” is the ability to recognize an unpredicted or anomalous
situation and identify possible causes.
“Recover Gap” indicates that the algorithm can recover data hidden in a gap in a
data stream over a period of time.
Note that these algorithms may be improved in certain ways, such as combining them into hybrid algorithms, but we only consider their pure forms here. These
capabilities are important as they relate to the original challenges in the following ways:
1. Coping with a vast amount of non-discrete (continuous) information.
a. No over fitting
b. Raw data
2. Coping with uncertain information.
a. Raw data
b. Probabilistic data
c. Fault tolerance
d. Forgetting
3. Working without sufficient information such as action, sensor & environment
models.
a. No prior models
b. Knowledge transfer
4. Achieving fast response times despite resource limitations.
a. No over fitting
b. No explicit training
c. Forgetting
5. Dealing with unpredicted changes in sensors, actions, goals or the environment.
a. Fault tolerance
b. Runtime change
c. Forgetting
d. Probabilistic data
6. Learning with minimal or no human intervention.
a. No prior models
This chapter described inspirations drawn from developmental psychology, and a survey
of the state of the art in the areas of artificial intelligence and robotic learning. Although
there are several competitive approaches for detecting and adapting to unpredicted
changes on a robot, we identified that none of them focus on simultaneous unpredicted
changes in sensors, actions, goals and the environment. The rest of this dissertation will
present a new approach to address this problem.
Chapter 3
Surprise Based Learning
This chapter provides an introduction to the core concepts of Surprise-Based Learning.
For this purpose the terminology used throughout this dissertation is defined first,
followed by an illustrative example of a robot navigating inside a constrained box
environment. The box environment is used in subsequent chapters to illustrate finer
details of SBL. We describe the process of creating and maintaining a prediction model,
followed by strategies for goal management, rule forgetting and entity & attribute
relevance, which are needed for detecting and adapting to changes in goals, the
environment, actions and sensors respectively.
3.1 Terminology
Action – An action is a physical change that occurs inside the learner, which in the case
of a physical robot is a sequence of predefined actuator commands over a period of time.
The set of actions given to the learner is A = {a_1, a_2, … , a_n}
Value – A value can be a numeric or categorical representation of data. Values may be
ordered (i.e. numeric ordering) or unordered (i.e. categorical bins).
A set of values is V = {v_1, v_2, … , v_n}
Attribute – An attribute is a grouping of related values. An attribute may have a finite set
of values or an infinite set of values bound by a range.
A set of attributes is B = {b_1 = {v_1, v_2}, … , b_k = {v_1, … , v_n}, b_l = [v_i, v_j]}
Entity – An entity is a grouping of related attributes. It is a structured response to a
stimulus from the environment.
A set of entities is E = {e_1 = {b_1}, … , e_n = {b_1, … , b_k}}
Sensor – A sensor is the only channel that allows data from the environment to flow into
the robot. This raw input data is fed either directly to the learner or via some function that
performs preprocessing. So in the case of raw data, SBL maps it to a single entity that has
a single attribute (null preprocessing). For example a range sensor is mapped to a single
entity which has a single attribute called size, while a camera sensor maps to several
color blob entities which have the attributes size and relative location.
The set of sensors given to the learner is S = {s_1, s_2, … , s_n}
A set of sensor to entity mappings is F = {s_1 = {e_1}, … , s_n = {e_1, … , e_k}}
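As an illustration of such a sensor-to-entity mapping F, the short sketch below encodes the camera and range-sensor examples from the previous paragraph in Python; the sensor, entity and attribute names are hypothetical placeholders, not identifiers from this dissertation.

    # A minimal sketch of a sensor-to-entity mapping F, using made-up names.
    # The camera yields several color-blob entities, each with "size" and
    # "location" attributes, while a raw range sensor maps to one entity
    # with the single attribute "size" (null preprocessing).
    SENSOR_MAP = {
        "camera": {
            "red_blob": ["size", "location"],
            "blue_blob": ["size", "location"],
        },
        "range_sensor": {
            "distance": ["size"],
        },
    }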
Operator – An operator or comparison operator ⨀ is a mechanism that enables the
learner to reason about an entity, or an attribute’s values. These operators aid in the
creation and evaluation of logic literals, which serve as the basic building block for our
prediction model. ⨀ may be presence(%), absence(~), increase(↑), decrease(↓), greater-
than(>), less-than(<), equal(=), etc.
The set of comparison operators given to the learner is OP = {⨀_1, ⨀_2, … , ⨀_n}
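To make the operator set concrete, the following sketch expresses a few comparison operators as Python predicates. This is an assumed rendering (the operator names, argument conventions and the previous-value handling for increase/decrease are my own), not code from the dissertation.

    # Hypothetical rendering of some comparison operators as predicates.
    # "increase"/"decrease" compare the current value against the previous
    # observation; "presence"/"absence" test whether anything was sensed.
    OPERATORS = {
        "presence": lambda cur, ref=None: cur is not None,
        "absence": lambda cur, ref=None: cur is None,
        "increase": lambda cur, prev: cur is not None and prev is not None and cur > prev,
        "decrease": lambda cur, prev: cur is not None and prev is not None and cur < prev,
        ">": lambda cur, ref: cur is not None and cur > ref,
        "<": lambda cur, ref: cur is not None and cur < ref,
        "=": lambda cur, ref: cur == ref,
    }

    # Example: evaluate the condition "range reading > 40" on a current value of 55.
    print(OPERATORS[">"](55, 40))   # True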
Condition – A condition is a logic literal that uses a comparison operator to define a
range of values that an entity and its attribute can take. A logic sentence can be formed by
grouping several conditions with the logic ∧ (AND) and ¬ (NOT) operators.
A condition is c = (s_1e_1b_1 ⨀ v_1)
A set of conditions is C = {c_1 ∧ ¬c_2 ∧ … ∧ c_k}
Observation – An observation at time t is the set of sensor readings obtained via the
hierarchy mentioned above.
At any time t, an observation is O_t = {(s_1e_1b_1 = v_1) … (s_ne_ib_j = v_k)}
Goal – A goal is an observation or a part of an observation that the robot wants to receive
from the environment. So a goal is accomplished when the observation at time t satisfies
every clause defined in the goal. The symbol * represents any value, while the symbol ∅
means empty.
A goal is G = {(s_1e_1b_1 = v_1) (s_1e_1b_2 = *) (s_2e_1b_x ≠ ∅) … (s_ie_jb_k = v_l)}
An observation is O_t = {(s_1e_1b_1 = v_1) (s_1e_1b_2 = v_2) (s_2e_1b_x = v_3) … (s_ie_jb_k = v_l)}
This goal is accomplished when O_t ⊨ G
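As a concrete reading of the O_t ⊨ G test, the sketch below stores an observation as a dictionary keyed by (sensor, entity, attribute) and checks a goal whose clauses require either a specific value, any value (*), or a non-empty value (≠ ∅). The representation and the handling of the wildcard are assumptions made for illustration only.

    # Hypothetical sketch of the goal-satisfaction test O_t |= G.
    WILDCARD = "*"           # clause (s e b = *): any observed value is acceptable
    NON_EMPTY = "NON_EMPTY"  # clause (s e b != empty): some value must be observed

    def goal_accomplished(observation, goal):
        """True when the observation satisfies every clause defined in the goal."""
        for key, required in goal.items():
            observed = observation.get(key)   # None when nothing was sensed for this key
            if required in (WILDCARD, NON_EMPTY):
                if observed is None:
                    return False
            elif observed != required:
                return False
        return True

    # Example with made-up sensor/entity/attribute names.
    obs = {("s1", "e1", "b1"): "red", ("s1", "e1", "b2"): 42, ("s2", "e1", "bx"): 0.3}
    goal = {("s1", "e1", "b1"): "red", ("s1", "e1", "b2"): WILDCARD, ("s2", "e1", "bx"): NON_EMPTY}
    print(goal_accomplished(obs, goal))       # True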
Prediction – A prediction for time t is a forecasted observation made before time t. A
clause in a prediction may be negated with the logic ¬ (NOT) operator. A prediction is
true when the observation at time t satisfies every clause in the prediction.
A prediction is P_t = {(s_1e_1b_1 ⨀ v_1) … ¬(s_2e_1b_1 ⨀ v_k)}
An observation is O_t = {(s_1e_1b_1 ⨀ v_1) … ¬(s_2e_1b_1 ⨀ v_k) (s_ie_jb_k ⨀ v_l)}
The prediction is true at time t when O_t ⊨ P_t
Surprise – A surprise occurs when a prediction is falsified by the observation at time t.
We define that a surprise is true when none of the predicted literals in a clause are
contained within the observation. A surprise is synonymous with a falsification or refutation.
A prediction is P_t = {(s_1e_1b_1 ⨀ v_1) (s_2e_1b_1 ⨀ v_k) … (s_ie_jb_k ⨀ v_l)}
An observation is O_t = {(s_2e_1b_1 ⨀ v_k) (s_ie_jb_k ⨀ v_l)}
A surprise occurs when O_t ⊭ P_t
Prediction Rule – A prediction rule describes the forecasted observation of executing a
particular action, provided the current observation satisfies a set of conditions.
A prediction rule is R = C → P
Prediction Model – The prediction model is comprised of a set of prediction rules. This is
a description of the environment learned by the robot through exploration.
The set of prediction rules forming a prediction model is M = {R_1, R_2, … , R_n}
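The following sketch gathers the definitions above into a small data structure: a prediction rule pairs an action with condition literals and predicted literals, a model is a list of rules, and a surprise is raised when a fired rule's prediction is falsified by the next observation. This is an illustrative Python rendering under my own simplifying assumptions (e.g. a prediction is treated as falsified as soon as any predicted literal fails); it is not the dissertation's implementation.

    # Hypothetical sketch of prediction rules R = C -> P and a model M = {R_1, ..., R_n}.
    from dataclasses import dataclass
    from typing import Callable, List

    Observation = dict  # (sensor, entity, attribute) -> value

    @dataclass
    class Literal:
        """One condition or predicted clause, optionally negated."""
        key: tuple                       # (sensor, entity, attribute)
        test: Callable[[object], bool]   # comparison operator applied to the observed value
        negated: bool = False

        def holds(self, obs: Observation) -> bool:
            result = self.test(obs.get(self.key))
            return (not result) if self.negated else result

    @dataclass
    class PredictionRule:
        """R = C -> P for one action."""
        action: str
        conditions: List[Literal]
        predictions: List[Literal]

        def matches(self, obs: Observation) -> bool:
            return all(c.holds(obs) for c in self.conditions)

        def surprised_by(self, next_obs: Observation) -> bool:
            # Surprise: the post-action observation falsifies the prediction (O_t |/= P_t).
            return not all(p.holds(next_obs) for p in self.predictions)

    # A prediction model is simply the set of learned rules.
    model: List[PredictionRule] = []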
Base-condition – During the creation of a prediction rule, the observation made at time t-
1 before executing the action is recorded as the “base-condition”.
Base-result – During the creation of a prediction rule, the observation made at time t after
executing the action is recorded as the “base-result”.
Surprised-condition – When a surprise occurs, the observation made at time t-1 before
executing the action is recorded as the “surprised-condition”.
Surprised-result – When a surprise occurs, the observation made at time t after executing
the action is recorded as the “surprised-result”.
Unpredicted sensor/action changes – When a sensor has an unpredicted change every
rule containing its entities and attributes will be surprised. When an action has an
unpredicted change every rule containing it will be surprised.
Sensor failure: ∀R, ∃sf ∈ S, ∀(Ot ⊭ Pt) ⇒ changed(sf)
Action failure: ∃R, ∀ai ∈ A, ∀(Ot ⊭ Pt) ⇒ changed(ai)
A Priori Knowledge – A priori knowledge refers to what is given to the learner at the
onset. A crucial step towards true autonomy is to reduce the amount of priors, thereby
reducing the amount of human engineering required to transfer this knowledge. In
particular, a priori knowledge can consist of sensor models, action models, environment
models, approximation (reduction) functions, preprocessing functions, comparison
operators, etc.
The minimal a priori knowledge in SBL when using raw data is A, S, OP.
Unpredicted Changes – At runtime unpredicted changes could occur to the given set of
actions, sensors and goals without the robot being informed. Sensors could be added,
removed or their definitions may be changed. Similarly, actions may be added, removed
or their definitions changed, while goals can be added or removed. A definition change
could be caused by rotations, translations or by a change in scale of the associated sensor
or actuator. Furthermore, unpredicted changes could occur in the configuration of the
environment where the detected entities have physically been relocated. Another
unpredicted change is interference. This may include noise and gaps in data.
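To make these definitions concrete, the following minimal Python sketch (purely illustrative; the dissertation's implementation is not reproduced here, and names such as Condition and PredictionRule are assumptions) shows one way conditions, predictions and prediction rules could be represented and evaluated against an observation.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
import operator

# An observation maps (sensor, entity, attribute) keys to values,
# e.g. {("camera", "red", "size"): 120.0, ("range", "r1", "proximity"): 35.0}.
Observation = Dict[Tuple[str, str, str], float]

@dataclass
class Condition:
    sensor: str
    entity: str
    attribute: str
    op: Callable[[float, float], bool]   # a comparison operator from OP
    value: float
    negated: bool = False                # the logical NOT (¬)

    def holds(self, obs: Observation) -> bool:
        key = (self.sensor, self.entity, self.attribute)
        result = key in obs and self.op(obs[key], self.value)
        return (not result) if self.negated else result

@dataclass
class PredictionRule:
    conditions: List[Condition]          # conjunction of literals (C)
    action: str                          # the action label
    predictions: List[Condition]         # forecasted literals (P)

    def applicable(self, obs: Observation) -> bool:
        return all(c.holds(obs) for c in self.conditions)

    def surprised(self, obs: Observation) -> bool:
        # Surprise: the new observation does not entail the prediction (O does not satisfy P).
        return not all(p.holds(obs) for p in self.predictions)

# Example: a rule that predicts proximity will exceed 45 after moving forward.
obs = {("range", "r1", "proximity"): 42.0}
rule = PredictionRule(
    conditions=[Condition("range", "r1", "proximity", operator.lt, 50.0)],
    action="forward",
    predictions=[Condition("range", "r1", "proximity", operator.gt, 45.0)])
print(rule.applicable(obs))                                   # True, since 42.0 < 50.0
print(rule.surprised({("range", "r1", "proximity"): 44.0}))   # True, since 44.0 is not > 45.0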
3.2 Illustrative Example - Constrained Box Environment
Figure 1: a) Overhead view b) SuperBot & sensors c) Vision & range sensor
The constrained box environment is a SuperBot module [SMS06] placed inside a box.
The robot is equipped with bi-directional Wi-Fi communications, an onboard camera, a
short distance range sensor, an external PC capable of wirelessly interfacing with the
robot and an overhead camera overlooking the entire environment for human observation
and recording. The environment is enclosed with four uniquely colored walls and a
discernible floor as seen in Figure 1. The onboard color camera is mounted in a way that
the robot loses sight of the ground plane when it is approximately 6” from the wall it’s
facing. The range sensor is an IR proximity detector, which typically reports a value that
increases as the robot gets closer to a wall. The range sensor has been attached such that
it faces the same direction as the camera, and its maximum range response is roughly 10”,
meaning that it returns a constant value when it is further than 10” away from the wall in
sight. These sensors are generally noisy. For example, the range sensor works differently
for different walls as different colors reflect IR slightly differently. The sensor data is
relayed to the PC where the learning system can radio action commands back to the robot.
The robot is preloaded with four actions corresponding to the motions of forward,
backward, left and right. Each action is defined as an execution pattern of the motors for
a pre-specified period of time (duration). In the current implementation, the duration for
all four actions is set to be 5 seconds, which results in moving an estimated distance of
3.5~4.2 cm for the forward action, 2.5~4.5 cm for the backward action and turning by
13~15 degrees for both the left and right actions. As expected, these numbers are noisy
and uncertain, partially due to the slippage of the wheels on the uneven and non-uniform
surface. The robot has no a priori information about the environment, expected outcome
of each action or reversibility of actions. The names of the actions are labels that have no
meaning. In fact, due to the noise these actions are not perfectly reversible, making this
undoubtedly an uncertain environment.
The onboard camera and range sensor allow the robot to make an observation or capture
an observation before and another after the execution of an action. Data from the range
sensor is mapped through a null processor to a single entity that has a single attribute
called proximity. Data from the camera is mapped through a mean-shift segmentation
[CM02] processor to several entities that represent unique blobs of color. These entities
are labeled dynamically and matched at runtime. Each blob entity has the attributes size,
x-location of the center of the blob and y-location of the center of the blob. The center
locations are calculated with respect to the frame of reference. Note that the coordinate
system used in this system for image analysis originates from the top left corner of each
image (i.e. the vertical displacement increases from top to bottom, while the horizontal
displacement increases from left to right).
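As a rough sketch of this sensing hierarchy (the function names and blob dictionary format below are assumptions, not the actual vision pipeline), the mapping from raw readings to entities and attributes might look as follows in Python:

from typing import Dict, List, Tuple

def map_range_sensor(reading: float) -> Dict[Tuple[str, str], float]:
    # Null preprocessing: one entity ("range") with one attribute ("proximity").
    return {("range", "proximity"): reading}

def map_camera(blobs: List[dict]) -> Dict[Tuple[str, str], float]:
    # Each color blob (e.g. from mean-shift segmentation) becomes an entity
    # with size, x-location and y-location attributes.
    obs = {}
    for blob in blobs:
        name = blob["color"]                 # e.g. "red", "white"
        obs[(name, "size")] = blob["size"]
        obs[(name, "x")] = blob["cx"]        # image origin: top-left corner
        obs[(name, "y")] = blob["cy"]
    return obs

# Example: one red wall blob, the white floor, and a proximity reading.
observation = {}
observation.update(map_range_sensor(37.0))
observation.update(map_camera([
    {"color": "red", "size": 5200, "cx": 160, "cy": 90},
    {"color": "white", "size": 9100, "cx": 160, "cy": 200},
]))
print(observation[("red", "size")])          # 5200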
In two consecutive observations before and after an action, we assume that the same
color represents the continuation of a blob. However, the robot has no a priori knowledge
about the number of entities that would be encountered in the environment, nor does it
know how these entities and attributes are related to the actions. In addition, all sensors
are inherently noisy and uncertain.
When learning, values are reasoned about and compared using a set of comparison operators that
are given to the robot at the outset. Here the operators %, ~ are used to evaluate the
presence or absence of an entity respectively, while the operators <, <=, =, !=, >=, > are
used to evaluate the values of an attribute. These are especially important because the
environment is a continuous space with no predefined regions or discretized grids.
3.3 Learning a Prediction Model
Figure 2: SBL process
Figure 3: SBL architecture
The learning process highlighted in Figure 2 is as follows:
i) Actions are randomly selected or planned based on the current model.
ii) The predictor returns all prediction rules in the model whose action and
conditions match the current observation as well as the selected action.
iii) The action is executed.
iv) A new observation is made.
v) If no prediction was made or new learning opportunities exist, then new rules are
created (see Rule Creation).
vi) If surprises were detected (see Surprise Detection & Analysis) the model is
revised (see Rule Splitting and Rule Refinement) to reflect the new observation.
The corresponding SBL architecture is shown in Figure 3. Learned knowledge is
represented in a prediction model, which consists of prediction rules in the format of (1)
below.
Rule ≡ Conditions → Action+ → Predictions (1)
Condition ≡ (Entity, Attribute, Comparison Operator, Value) (2)
Prediction ≡ (Entity, Attribute, Expected Change, Value) (3)
Each prediction rule is a triplet as in (1) of “condition”, “action(s)”, and “prediction”.
Conditions are logical statements describing the state of observed entities and attributes
prior to the execution of a specific action one or more times, as indicated by the superscript +. A
condition can be represented as a 4-tuple as in (2). Using the comparison operators given
to the robot, each condition describes a relationship formed of an entity, its attribute and
value. For example, the expression Condition1 ≡ (entity1, attribute1, >, value1) means
that Condition1 is true if attribute1 of entity1 is greater than value1. Several logically
related conditions can be grouped together to form a sentence using ‘And’ and ‘Not’
logical operators. Predictions are sentences that describe the expected change in the state
of observed entities and attributes as a result of performing a specific action, possibly a
number of times. As seen in (3), a prediction can be represented using a 4-tuple, e.g.
Prediction1 ≡ (entity1, attribute1, ↑, value1) means that if the rule is successful the value
of entity1’s attribute1 will increase by the amount indicated in value1.
An important aspect of prediction rules is that they can be sequenced to form a
prediction sequence [o0, a1, p1, …, an, pn], where o0 is the current observation at the
current state, ai, 1≤i≤n, are actions, and pi, 1≤i≤n, are predictions. As the actions in this
sequence are executed, the environmental states are perceived in a sequence o1, o2, …, on.
A surprise occurs as soon as an environmental state does not match its corresponding
prediction. Notice that a surprise can be “good” if the unpredicted result is desirable or
“bad” otherwise. This notation is similar to the concept of “predictive state
representations” recently proposed in [LSS02], but a prediction sequence here can be
used to represent many other concepts such as “plan”, “exploration”, “experiment”,
“example” and “advice” in a unified fashion as follows:
A plan is a prediction sequence where the accumulated predictions in the
sequence are expected to satisfy a goal.
Exploration is a prediction sequence where some predictions are deliberately
chosen to generate surprises, so the learner can revise its knowledge.
An experiment is a prediction sequence where the last prediction is designed to
generate a surprise with a high probability.
An example is a prediction sequence provided by an external source that the
learner should go through and learn from the surprises generated during the
execution of the sequence.
Advice is a prediction sequence that should be put into the model as it is. The
learner simply translates a piece of advice into a set of prediction rules. For
example, “never run off a cliff” can be translated into a prediction rule such as
cliff_facing –forward→ destruction. The learner can weigh the advice, ranging
from “never” to “sometimes”, depending on the seriousness of the consequence
stated in the rule, and use it when planning to reach a goal.
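The following hedged sketch (the execute and matches callables stand in for the robot's actuation and observation-matching machinery, and are assumptions) illustrates how a prediction sequence [o0, a1, p1, …, an, pn] could be stepped through while watching for the first surprise:

def run_prediction_sequence(o0, steps, execute, matches):
    """Execute a prediction sequence and report the first surprise.

    o0      -- the current observation
    steps   -- list of (action, prediction) pairs [(a1, p1), ..., (an, pn)]
    execute -- callable(action) -> new observation (robot + sensors)
    matches -- callable(observation, prediction) -> bool, i.e. O satisfies P
    """
    obs = o0
    for i, (action, prediction) in enumerate(steps, start=1):
        obs = execute(action)
        if not matches(obs, prediction):
            return {"surprised_at": i, "observation": obs}
    return {"surprised_at": None, "observation": obs}

# Toy usage with dictionaries as observations and predictions:
world = {"x": 0}
def execute(action):
    world["x"] += 1 if action == "east" else -1
    return dict(world)
def matches(obs, pred):
    return all(obs.get(k) == v for k, v in pred.items())

result = run_prediction_sequence({"x": 0},
                                 [("east", {"x": 1}), ("east", {"x": 3})],
                                 execute, matches)
print(result["surprised_at"])   # 2 -- the second prediction fails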
3.4 Life Cycle of Prediction Rules
Figure 4: Life cycle of a prediction rule
The illustration in Figure 4 depicts the life cycle of a prediction rule. The solid arrows
indicate the evolution of a prediction rule, while the dotted lines express the input and
possible output at each stage.
During learning and adaptation, prediction rules are created, split, and refined according
to a set of templates described in equations (4)-(10) below. These form the core of the
model modifier in the SBL architecture. They are in essence enhanced versions of rule
creation, splitting and refinement as described in CDL. We first present the format of
these templates and then give detailed examples and explanations in subsequent chapters.
Rule Creation – Let C0 represent an entity or an attribute. If C0 represents an attribute
before an action, then P0 indicates its change after the action. If C0 represents an entity
before an action, then P0 may indicate its change or another entity responsible for its
change after the action. A new rule is created as follows:
Rule0 = C0 → Action+ → P0 (4)
Rule Splitting – If a surprise is caused by a single rule (e.g. Rule0 above), then for each
possible cause CX identified by the analysis of the surprise, the rule is split into two
complementary sibling rules as follows, where PX is a newly observed consequence of
the action:
RuleA = C0 ∧ CX → Action+ → P0 ∨ ¬PX (5)
RuleB = C0 ∧ ¬CX → Action+ → ¬P0 ∧ PX (6)
Rule Refinement – If a surprise is caused by a RuleA that has a sibling RuleB, where
C represents the rule’s current condition minus C0, and P represents the prediction of the
rule, as follows:
RuleA = C0 ∧ C → Action+ → P (7)
RuleB = C0 ∧ ¬C → Action+ → ¬P (8)
Then for each possible cause CX identified by the analysis of the surprise, the rules will
be refined as follows:
RuleA = C0 ∧ C ∧ CX → Action+ → P (9)
RuleB = C0 ∧ ¬(C ∧ CX) → Action+ → ¬P (10)
Notice that equations (7)-(10) can be applied to any pair of complementary rules. In
general, a pair of complementary rules can be refined multiple times so that as many CX
as necessary can be inserted into their conditions according to this procedure. Whenever a rule is
discriminated, its complementary rule will be generalized, hence the name
complementary discrimination learning.
3.4.1 Rule Creation
New rules are created according to equation (4). We call the observation made at time t-1
before executing an action the base-condition ‘BC’, and the observation made at time t
after executing the action the base-result ‘BR’. Therefore, new rules are created and
added to the model by comparing the base-condition and base-result of executing an
action ‘a’ one or more times with the set of operators {%, ~, ↑, ↓}. The following
functions return new prediction rules:
For each entity e in BC but not in BR, create (%e –a+→ ~e); (11)
For each entity e not in BC but in BR, create (~e –a+→ %e); (12)
For each entity e1 in BC not in BR and each e2 not in BC but in BR, create (%e1 –a+→ %e2); (13)
For each entity e in BC and BR, do
    for value v increased, create (e.b –a+→ e.b↑v); (14)
    for value v decreased, create (e.b –a+→ e.b↓v); (15)
For each newly created rule, the robot records observations corresponding to its base-
condition and base-result. Rule creation is invoked when the robot, i) has no predictions
for the selected action, or ii) none of the predicted rules forecasted a change in the current
observation, or iii) all predicted rules have been forgotten after rule analysis as detailed
later.
Table 2: Example of rule creation
Base-condition G.size = 10, H.size = 1, I.length = 2
Base-result G.size = 15, H.size = 1, J.size = 6
Prediction Model R1: %I –Action1→ ~I
R2: ~J –Action1→ %J
R3: %I –Action1→ %J
R4: G.size –Action1→ G.size↑5
Figure 5: a) Robot’s location b) Base-condition c) Base-result
Two examples of rule creation are shown in Table 2 and Figure 5 respectively. Consider
the situation in Figure 5, where the robot first explores the action “forward” and the
observations before and after the action are in Figure 5b and 5c respectively. There are
three entities in these observations, namely red (wall), white (floor), and proximity. The
rule creation mechanism detects four changes in the attributes before and after the action,
but no entities have changed. So 4 new rules Rule1, Rule2, Rule3, and Rule4, are created
as listed below:
Rule1: (White, All, %, 0) → FORWARD → (White, S, <, Value1) // size of white decreased
Rule2: (White, All, %, 0) → FORWARD → (White, Y, >, Value2) // y-location of white increased
Rule3: (Red, All, %, 0) → FORWARD → (Red, S, >, Value3) // size of red increased
Rule4: (Red, All, %, 0) → FORWARD → (Red, Y, >, Value4) // y-location of red increased
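A compact Python sketch of the creation functions (11)-(15) is given below; it is an illustrative approximation in which observations are plain dictionaries of {entity: {attribute: value}} and the rule format is a simplified stand-in for the actual data structures.

def create_rules(base_condition, base_result, action):
    # Create new prediction rules in the spirit of (11)-(15).
    rules = []
    before, after = set(base_condition), set(base_result)

    for e in before - after:                       # (11) entity disappeared
        rules.append((f"%{e}", action, f"~{e}"))
    for e in after - before:                       # (12) entity appeared
        rules.append((f"~{e}", action, f"%{e}"))
    for e1 in before - after:                      # (13) one entity replaced by another
        for e2 in after - before:
            rules.append((f"%{e1}", action, f"%{e2}"))
    for e in before & after:                       # (14)/(15) attribute value changed
        for b, v_before in base_condition[e].items():
            v_after = base_result[e].get(b)
            if v_after is None or v_after == v_before:
                continue
            delta = v_after - v_before
            arrow = "UP" if delta > 0 else "DOWN"  # stands in for the up/down operators
            rules.append((f"{e}.{b}", action, f"{e}.{b} {arrow} {abs(delta)}"))
    return rules

# Example mirroring Table 2 (entities G, H, I before; G, H, J after):
bc = {"G": {"size": 10}, "H": {"size": 1}, "I": {"length": 2}}
br = {"G": {"size": 15}, "H": {"size": 1}, "J": {"size": 6}}
for rule in create_rules(bc, br, "Action1"):
    print(rule)
# Produces analogues of R1-R4: %I -> ~I, ~J -> %J, %I -> %J, G.size -> G.size UP 5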
3.4.2 Surprise Detection and Analysis
As prediction rules are incrementally learned, the robot uses them to make forecasts or
predictions whenever it can. When the conditions of a prediction rule ‘R’ are satisfied
by the current observation, the rule’s predictions are evaluated after the action is executed.
A surprise is detected if a prediction fails to be realized, i.e. the forecasted value was not
observed. When a surprise occurs, we call the observation made at time t-1 before
executing the action the surprised-condition ‘SC’.
The objective of surprise analysis is to identify the possible cause(s) of the surprise by
comparing entities and attributes in the base-condition with those in the surprised-
condition using the given set of comparison operators (typical set {%,~,>,<,=,!=}). The
following functions perform surprise analysis, which return a set of possible causes
[c1, …, cz], where each cause is an expression of an entity or an attribute that was true in
the base-condition but false in the surprised-condition:
For each entity e in BC but not in SC, cause (%e); (16)
For each entity e in SC but not in BC, cause (~e); (17)
If values in attribute b are ordered, then for each entity e in BC and SC, do
    for value e.b.vBC < e.b.vSC, cause (e.b < e.b.vSC); (18)
    for value e.b.vBC > e.b.vSC, cause (e.b > e.b.vSC); (19)
If values in attribute b are unordered, then for each entity e in BC and SC, do
    for value e.b.vBC != e.b.vSC, cause (e.b != e.b.vSC); (20)
Table 3: Example of surprise analysis
Surprised Rule R4: G.size –Action1→ G.size↑5
Base-condition G.size = 10, H.size = 1, I.length = 2
Base-result G.size = 15, H.size = 1, J.size = 6
Surprised-condition G.size = 20, H.size = 2, J.size = 6
Surprised-result G.size = 20, H.size = 2, J.size = 3
Causes of Surprise [(%I), (~J), (G.size<20), (H.size<2)]
Figure 6: a) Base-condition b) Base-result/surprised-condition c) Surprised-result
Two examples of surprise analysis are shown in Table 3 and Figure 6 respectively. If the
robot is about to execute a forward action in a situation shown in Figure 6 where white
(floor) is perceived before the action, then by Rule1, the robot predicts that the size of
white will decrease in the observation after the action. However, the white disappeared
after the action, so the robot detects a surprise.
In CDL, the comparison operators were assigned a fixed order of evaluation such that the
analysis terminated as soon as the first difference was identified. This ordering reduced
the processing and memory requirements of the algorithm, yet it came at the cost of
needing a carefully defined order at the outset and would still run the risk of missing
crucial relations between sensors, entities and attributes.
In SBL we initially worked with a fixed ordering of comparison operators and attempted
to mitigate the risk of missing crucial information by establishing the novelty of each
cause. A cause was “novel” if the corresponding entity or attribute was not mentioned in
the existing rule or “ordinary” otherwise. The analysis would return ordinary causes only
if there were no novel causes. For example consider the very rare situation in which the
robot is diagonally located in a corner between two walls (i.e. heading to a wall
diagonally while the corner is on one side). As the robot executes the backward action,
the proximity value decreases while the size of the wall remains equal, but at a certain
point its rear hits another wall and the proximity value becomes equal. Without any
sensors at the rear to tell the differences, the analysis of surprise would return a cause that
is already mentioned in the existing condition of its rule (e.g. the change in the proximity
value). If not even an ordinary cause was detected, then no further action was taken (This
is the case in a translucent environment, which has been considered by CDL in a separate
paper [SS93]).
Although the concept of novelty mitigated the loss of crucial information, having an
incorrect ordering of comparison operators would result in the inability to correctly
model some environments despite having the necessary sensors and actions. For example
if the operators <, ~ were defined in that order, then the absence operator may not be used
because the concept of reducing to zero remains true even when an entity becomes
absent.
Therefore, the ordering of comparison operators and the concept of novelty were
eliminated from SBL. Hence, after all the differences are detected, the analysis will return
a set of CX as the possible “causes” for the current surprise. In the current example,
comparing Figure 6a (base-condition) with Figure 6b (surprised-condition) will result in
4 possible causes: (Red, S, <, Value5), (Red, Y, <, Value6), (White, S, >, Value7),
(White, Y, <, Value8), where the values are extracted from Figure 6b. No other causes
will be returned in this case because the proximity value and the x-locations of both blob
entities don’t change.
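A hedged Python sketch of the analysis functions (16)-(20) follows; it compares a base-condition with a surprised-condition and returns candidate causes, using plain dictionaries in place of the real observation structures (an assumption for illustration only).

def analyze_surprise(base_condition, surprised_condition, ordered_attrs=()):
    # Return possible causes in the spirit of (16)-(20).
    causes = []
    bc, sc = base_condition, surprised_condition

    for e in set(bc) - set(sc):                    # (16) entity present only in BC
        causes.append(f"%{e}")
    for e in set(sc) - set(bc):                    # (17) entity present only in SC
        causes.append(f"~{e}")
    for e in set(bc) & set(sc):
        for b in set(bc[e]) & set(sc[e]):
            v_bc, v_sc = bc[e][b], sc[e][b]
            if v_bc == v_sc:
                continue
            if b in ordered_attrs:                 # (18)/(19) ordered attribute
                op = "<" if v_bc < v_sc else ">"
                causes.append(f"{e}.{b} {op} {v_sc}")
            else:                                  # (20) unordered attribute
                causes.append(f"{e}.{b} != {v_sc}")
    return causes

# Example mirroring Table 3:
bc = {"G": {"size": 10}, "H": {"size": 1}, "I": {"length": 2}}
sc = {"G": {"size": 20}, "H": {"size": 2}, "J": {"size": 6}}
print(analyze_surprise(bc, sc, ordered_attrs=("size", "length")))
# Yields analogues of [(%I), (~J), (G.size<20), (H.size<2)]; the ordering may vary.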
3.4.3 Rule Splitting
Rule maintenance deals with updating existing rules when they are surprised by new
observations. When a surprise occurs, we call the observation made at time t after
executing the action the surprised-result ‘SR’. When a single rule R0: c0 –a+→ p0 is
surprised for the first time, a set of possible causes [c1, …, cz] is acquired from surprise
analysis and used to perform rule splitting adhering to equations (5)-(6). The details are
as follows:
For each cause cx identify the expected change px as follows,
For each entity e in SC not in SR, (px = ~e); (21)
For each entity e in SR not in SC, (px = %e); (22)
If values in attribute b are ordered, for each entity e in SC and SR, do
    for value v increased, create (px = e.b↑v); (23)
    for value v decreased, create (px = e.b↓v); (24)
If values in attribute b are unordered, for each entity e in SC and SR, do
    for value v changed, create (px = e.b.vSR); (25)
For each cause cx create a complementary pair of rules as follows,
Specialization R00: c0 ^ cx –a+→ p0 ˅ ~px (26)
Generalization R01: c0 ^ ~cx –a+→ ~p0 ^ px (27)
Hence, for each possible cause, each rule splitting results in a new pair of complementary
prediction rules that reflect both the original result p0 and the surprised result px. For each
new complementary rule, the robot records observations corresponding to the single
rule’s surprised-condition and surprised-result as its new base-condition and base-result
respectively.
Table 4: Example of rule splitting using causes from Table 3
Surprised Rule R4: G.size –Action1→ G.size↑5
Causes [(%I), (~J), (G.size<20), (H.size<2)]
Prediction Model R400: G.size ^ %I –Action1→ G.size↑5 ˅ ~J.size↓3
R401: G.size ^ ~I –Action1→ ~G.size↑5 ^ J.size↓3
R410: G.size ^ ~J –Action1→ G.size↑5 ˅ ~J.size↓3
R411: G.size ^ %J –Action1→ ~G.size↑5 ^ J.size↓3
R420: G.size ^ (G.size<20) –Action1→ G.size↑5 ˅ ~J.size↓3
R421: G.size ^ ~(G.size<20) –Action1→ ~G.size↑5 ^ J.size↓3
R430: G.size ^ (H.size<2) –Action1→ G.size↑5 ˅ ~J.size↓3
R431: G.size ^ ~(H.size<2) –Action1→ ~G.size↑5 ^ J.size↓3
A simple example is shown in Table 4. To further illustrate the rule splitting procedure,
consider the scenario in Figure 6, where the robot’s forward action in Figure 6b caused a
surprise for Rule1 because white did not decrease but disappeared in Figure 6c. In this
case, the analysis of surprise returns 4 possible causes and 3 new consequences. So Rule1
will be split into 12 pairs of new complementary rules (Similarly, Rule2 will be split into
12 new pairs of complementary rules as well):
Rule1.1.1: (White,All,%,0) (Red,S,<,Value5)→FORWARD→(White,S,<,Value1) ¬(Red,S,>,Value9)
Rule1.1.2: (White,All,%,0) ¬(Red,S,<,Value5)→FORWARD→¬(White,S,<,Value1) (Red,S,>,Value9)
Rule1.2.1: (White,All,%,0) (Red,S,<,Value5)→FORWARD→(White,S,<,Value1) ¬(Red,Y,>,Value10)
Rule1.2.2: (White,All,%,0) ¬(Red,S,<,Value5)→FORWARD→¬(White,S,<,Value1) (Red,Y,>,Value10)
Rule1.3.1: (White,All,%,0) (Red,S,<,Value5)→FORWARD→(White,S,<,Value1) ¬(White,All,~,0)
Rule1.3.2: (White,All,%,0) ¬(Red,S,<,Value5)→FORWARD→¬(White,S,<,Value1) (White,All,~,0)
…
Rule1.12.1: (White,All,%,0) (White,Y,<,Value8)→FORWARD→(White,S,<,Value1) ¬(White,All,~,0)
Rule1.12.2:(White,All,%,0) ¬(White,Y,<,Value8)→FORWARD→¬(White,S,<,Value1) (White,All,~,0)
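The splitting templates (5)-(6), or equivalently (26)-(27), can be sketched roughly as follows; the Rule tuple and the cause/consequence strings are simplified assumptions rather than the structures used in the implementation.

from collections import namedtuple

Rule = namedtuple("Rule", "conditions action predictions")

def split_rule(rule, causes, new_consequences):
    # For every (cause cx, consequence px) pair, produce a complementary pair:
    #   specialization: c0 AND cx     -> p0 OR NOT px
    #   generalization: c0 AND NOT cx -> NOT p0 AND px
    pairs = []
    p0 = " AND ".join(rule.predictions)
    for cx in causes:
        for px in new_consequences:
            spec = Rule(rule.conditions + [cx], rule.action,
                        [f"({p0}) OR NOT ({px})"])
            gen = Rule(rule.conditions + [f"NOT ({cx})"], rule.action,
                       [f"NOT ({p0}) AND ({px})"])
            pairs.append((spec, gen))
    return pairs

# Example in the spirit of Table 4: R4 split with one cause and one consequence.
r4 = Rule(["G.size"], "Action1", ["G.size UP 5"])
for spec, gen in split_rule(r4, ["%I"], ["J.size DOWN 3"]):
    print(spec)
    print(gen)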
3.4.4 Rule Refinement
For any subsequent surprise that occurs in a pair of complementary rules RA: c0 ^ cx –a+→ pz
and RB: c0 ^ ~cx –a+→ ~pz, rule refinement is performed adhering to equations
(7)-(10). The details are as follows:
For each cause returned by surprise analysis, if it has never been added to the
surprised rule then it can be added safely, as it forms a new condition. Otherwise,
it should not be added, to avoid unnecessary repetition.
When RA is surprised, for each new cause cy that is not specified in c0 or cx, do
RA: c0 ^ cx ^ cy –a+→ pz (28)
RB: c0 ^ ~(cx ^ cy) –a+→ ~pz (29)
When RB is surprised, for each new cause cy that is not specified in c0 or cx, do
RB: c0 ^ ~cx ^ cy –a+→ ~pz (30)
RA: c0 ^ ~(~cx ^ cy) –a+→ pz (31)
When RA is surprised, if a new cause cannot be identified then add the entire
observation (e1, …, e.b.vn),
RA: c0 ^ cx ^ ~(e1 ^ … ^ e.b.vn) –a+→ pz (32)
RB: c0 ^ ~(cx ^ ~(e1 ^ … ^ e.b.vn)) –a+→ ~pz (33)
When RB is surprised, if a new cause cannot be identified then add the entire
observation (e1, …, e.b.vn),
RB: c0 ^ ~cx ^ ~(e1 ^ … ^ e.b.vn) –a+→ ~pz (34)
RA: c0 ^ ~(~cx ^ ~(e1 ^ … ^ e.b.vn)) –a+→ pz (35)
Therefore, rule maintenance keeps the prediction model accurate by performing rule
splitting first, followed by rule refinement, every time a surprise occurs. For each refined
rule, the robot records observations corresponding to the surprised-condition and
surprised-result as its new base-condition and base-result respectively. One of the
important advantages of learning complementary rules is that even though one rule may
be specialized and limited for application, its complementary rule is always general
enough for further surprises and learning.
Table 5: Example of rule refinement
Surprised Rule R400: G.size^%I –Action1→ G.size↑5˅~J.size↓3
Base-condition G.size = 10, H.size = 1, I.length = 2
Base-result G.size = 15, H.size = 1, J.size = 6
Surprised-condition G.size = 20, H.size = 2, J.size = 3, I.length = 2
Surprised-result G.size = 20, H.size = 2, J.size = 2, I.length = 2
Causes [(~J), (G.size<20), (H.size<2)]
Prediction Model …
R40000: G.size^%I^~J –Action1→ G.size↑5˅~J.size↓3
R40001: G.size^~(%I ^ ~J) –Action1→ ~G.size↑5^J.size↓3
R40010: G.size^%I^(G.size<20) –Action1→ G.size↑5˅~J.size↓3
R40011: G.size^~(%I ^ (G.size<20)) –Action1→ ~G.size↑5^J.size↓3
R40020: G.size^%I^(H.size<2) –Action1→ G.size↑5˅~J.size↓3
R40021: G.size^~(%I ^ (H.size<2)) –Action1→ ~G.size↑5^J.size↓3
…
Figure 7: a) Base-consequence b) Surprised-condition c) Surprised-consequence
A simple example of rule refinement is shown in Table 5. To illustrate this dual
procedure for specialization and generalization, consider the scenario in Figure 7. Given
that Rule2 was split earlier as mentioned in rule splitting, one possible complementary
pair of rules is as follows:
Rule2.1.1: (White,All,%,0) (Red,S,<,Value5)→FORWARD→(White,Y,>,Value2) ¬(Red,S,>,Value9)
Rule2.1.2: (White,All,%,0) ¬(Red,S,<,Value5)→FORWARD→¬(White,Y,>,Value2) (Red,S,>,Value9)
Now assume that the robot performed a right turn action from Figure 7a resulting in
Figure 7b. When a forward action is executed from Figure 7b the predictor selects
Rule2.1.1 because the size of red = Value11 is slightly less than Value5. The robot
expects the y-location of white to increase, but it is surprised as white disappears as
shown in Figure 7c.
The analysis of this surprise will compare Figure 7b (surprised-condition) with Figure 6a
(base-condition) and conclude that there are 6 possible causes: (Red, X, >, Value12),
(Red, Y, <, Value13), (Red, S, <, Value14), (White, X, <, Value15), (White, Y, <,
Value16), (White, S, >, Value17). The refined rules are as follows:
Rule2.1.1.1.1: (White,All,%,0) (Red,S,<,Value5) (Red,X,>,Value12)→FWD→(White,Y,>,Value2) ¬(Red,S,>,Value9)
Rule2.1.2.1.2: (White,All,%,0) ¬((Red,S,<,Value5) (Red,X,>,Value12))→FWD→¬(White,Y,>,Value2) (Red,S,>,Value9)
…
Rule2.1.1.6.1: (White,All,%,0) (Red,S,<,Value5) (White,S,>,Value17)→FWD→(White,Y,>,Value2) ¬(Red,S,>,Value9)
Rule2.1.2.6.2: (White,All,%,0) ¬((Red,S,<,Value5) (White,S,>,Value17))→FWD→¬(White,Y,>,Value2) (Red,S,>,Value9)
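A minimal sketch of the refinement step (28)-(31) is shown below (again using a simplified Rule tuple that is an assumption for illustration, not the production representation):

from collections import namedtuple

Rule = namedtuple("Rule", "conditions action predictions")

def refine_pair(rule_a, rule_b, new_causes):
    # When rule_a is surprised, conjoin each new cause cy to its condition and
    # make rule_b's condition the negation of rule_a's extra conditions, per (28)-(29).
    refined = []
    for cy in new_causes:
        a_conditions = rule_a.conditions + [cy]
        c0, extras = a_conditions[0], a_conditions[1:]
        b_conditions = [c0, f"NOT ({' AND '.join(extras)})"]
        refined.append((rule_a._replace(conditions=a_conditions),
                        rule_b._replace(conditions=b_conditions)))
    return refined

# Example: RuleA = c0 ^ cx, RuleB = c0 ^ ~cx, refined with a new cause cy.
rule_a = Rule(["c0", "cx"], "a", ["pz"])
rule_b = Rule(["c0", "NOT (cx)"], "a", ["NOT pz"])
for ra, rb in refine_pair(rule_a, rule_b, ["cy"]):
    print(ra.conditions)   # ['c0', 'cx', 'cy']
    print(rb.conditions)   # ['c0', 'NOT (cx AND cy)']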
3.5 Using a Prediction Model
The purpose of learning is to solve problems. Prediction rules are learned so that they can
be reasoned with by a planner in selecting actions for solving any given goal in the
environment. In general, the prediction rules can be viewed as the “operators” in a
planning system [Shen90], so any standard planner can be used here as long as it is
compatible with the representation of the prediction rules. The job of a planner is to find
a sequence of rules/actions that, if executed, can carry the robot from the
current observation to a goal. Currently, we have tested two planners in SBL.
The simplest is a planner based on the standard goal-regression with a greedy search.
First the planner finds a goal rule, which is a prediction rule that can accomplish the goal
or a part of it. Next it searches backwards from it by chaining the predictions of a
candidate rule to the conditions of the goal rule. The goal rule is then replaced by the
candidate rule and the process repeats until the conditions of the candidate rule satisfy the
current observations. This way, should it exist, a sequence of actions to reach the goal can
be extracted. When there are multiple goals to be achieved with an arbitrary order, our
greedy search method may cause oscillations in selecting which goal to achieve first. This
situation can be detected and corrected, but it may waste several actions in doing so.
Better results, in terms of the number of actions, were obtained by using a Breadth First
Search (BFS) planner on the prediction rules. The BFS planner worked by considering
the outcome of each action by applying the prediction rules on the current observation,
generating the next observation and repeating the process until the goal was reached.
The robot can be assigned a target or goal observation either prior to, or during, or after
the learning process. Typically a goal observation is assigned by placing the robot in a
particular location and prompting it to record the desired observation. Runtime goal
assignment permits the robot to change goals dynamically. In fact, a plan can be
considered a set of sub goals that leads to the goal observation. When planning is
invoked, the next action from the plan subsumes any random action that the robot had
previously decided to explore.
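A rough sketch of a BFS planner over prediction rules is given below; apply_rule and the rule/observation encodings are assumptions standing in for SBL's own representations, so this is a sketch rather than the planner used in the experiments.

from collections import deque

def bfs_plan(start_obs, goal_test, rules, apply_rule, max_depth=20):
    """Breadth-first search for the shortest action sequence to a goal.

    start_obs  -- the current observation (hashable, e.g. a frozenset)
    goal_test  -- callable(observation) -> bool
    rules      -- iterable of rules, each with .applicable(obs) and .action
    apply_rule -- callable(rule, obs) -> predicted next observation
    """
    frontier = deque([(start_obs, [])])
    visited = {start_obs}
    while frontier:
        obs, plan = frontier.popleft()
        if goal_test(obs):
            return plan                       # shortest plan found
        if len(plan) >= max_depth:
            continue
        for rule in rules:
            if not rule.applicable(obs):
                continue
            next_obs = apply_rule(rule, obs)
            if next_obs not in visited:
                visited.add(next_obs)
                frontier.append((next_obs, plan + [rule.action]))
    return None                               # no plan found within max_depth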
3.6 Goal Management (Knowledge Transfer)
During learning, if the robot is shown a “goal observation” and asked to navigate to the
goal observation, it will attempt to generate a plan to reach the goal observation. In this
case, the robot knows what to look for because the goal observation is given.
When a goal is not directly observable (i.e. indirectly observable), then the feedback is
not incremental and immediate but only available when the robot is at the goal location.
For example, suppose the goal is a submerged platform in a body of water and the robot
must swim to look for it. Then, it cannot know what to look for until it is standing on the
platform. To find such “hidden” goals, the best strategy seems to explore and learn the
environment as much as possible while searching for the goal. Once the goal is reached,
the robot should remember the goal location by associating it with as many sensor cues as
possible so that it knows what to look for when it visits the goal again.
A goal management mechanism for making this association is implemented in our
learning robot to test the ability to learn to reach a hidden goal, as well as to transfer
knowledge if the location of the goal is secretly moved. The strategy here is to record the
associations as a set of “goal observations” and maintain them as a dynamic list of goal
observations. A new goal observation is added to the list only if a similar observation is
not currently in the list. In addition to recording the goal observations, two statistics
corresponding to the number of successes and failures for each observation are noted.
The number of successes is incremented each time a similar goal observation is
encountered. Note that the robot is allowed to continue randomly moving in the
environment and learning even after it has successfully located the hidden area. This
provides the robot with an opportunity to record multiple goal observations with varying
levels of success.
At any subsequent time after the initial discovery, if the robot is prompted to track back
to the hidden goal, it will invoke the planner by selecting an observation from the list and
follow the sequence of actions required to make that observation appear. Some
observations are unambiguous meaning that the entities and attributes captured in the
goal observation are unique. For example, a goal observation comprised of a corner
where two colored walls and the floor intersect is unique within the environment; hence,
it can be located by matching the presence of the corresponding blob entities while
considering their size attribute. In contrast, most goal observations are ambiguous due to
the lack of information, such as an observation with just one wall which can be
encountered anywhere along a line parallel to that wall. When there is ambiguity in a goal
observation, the robot could successfully reach it but not receive the feedback that the
goal has been reached. In this case, the number of failures attached to that particular goal
observation is incremented.
It is important to realize that the planner might not be able to find a valid sequence of
actions for some goal observations as there might be insufficient rules in the learned
predictive model at that time (especially because SBL has no a priori knowledge hinting
that actions are reversible), in which case the robot resorts to selecting random actions,
forcing observation changes until a sufficient model is learned to generate a valid plan.
The ratio of successes to failures for a goal observation indicates the probability of
finding the hidden goal with it. Intuitively, the strategy of the planner is to select the best
goal observation by picking the one with the highest probability and following
the planned set of actions to reach the goal. When the goal observation is reached, if the
hidden area is not found, then the probability of that observation is lowered as the failures
increase and the next most probable observation will be selected and tracked.
Maintaining a dynamic list of goals and its statistics while planning and learning makes
SBL robust and adaptive. If the planner tracks every goal observation in the dynamic goal
list and does not come across the hidden goal, SBL concludes that the hidden goal has
moved and proceeds to execute random actions. This automatically facilitates scenarios
where the hidden platform is randomly relocated after it has been discovered during
experimentation.
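The goal list and its success/failure statistics can be sketched as follows (the class name, scoring and similarity test are assumptions; SBL's actual bookkeeping may differ in detail):

class GoalManager:
    # Maintain goal observations together with success/failure counts.

    def __init__(self, similarity):
        self.similarity = similarity        # callable(obs_a, obs_b) -> bool
        self.goals = []                     # entries: [observation, successes, failures]

    def record_goal(self, observation):
        # Add a new goal observation only if no similar one is stored already.
        for entry in self.goals:
            if self.similarity(entry[0], observation):
                entry[1] += 1               # a similar goal observation seen again
                return
        self.goals.append([observation, 1, 0])

    def record_failure(self, observation):
        for entry in self.goals:
            if self.similarity(entry[0], observation):
                entry[2] += 1
                return

    def best_goal(self):
        # Pick the goal observation with the highest success ratio.
        if not self.goals:
            return None
        return max(self.goals, key=lambda e: e[1] / (e[1] + e[2]))[0]

# Toy usage: observations as frozensets of (entity, value) pairs.
gm = GoalManager(similarity=lambda a, b: a == b)
gm.record_goal(frozenset({("corner", 1)}))
gm.record_goal(frozenset({("wall", 1)}))
gm.record_failure(frozenset({("wall", 1)}))     # the ambiguous wall view failed once
print(gm.best_goal())                           # the unambiguous corner observation wins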
3.7 Probabilistic Rules and Forgetting
Adaptation requires both learning new and useful knowledge and forgetting obsolete and
incorrect knowledge. A prediction rule may become obsolete when a robot’s sensors or
actions are changed or damaged, or when the environment poses new situations. To deal
with unpredicted situations, a learning robot should be prepared to consider all possible
hypotheses of the combinations and relationships of its sensors and actions during its
adaptation. Naturally, not all hypotheses are correct and those that are wrong should not
always reside in memory as it is a finite resource. The removal of obsolete data from
memory is commonly referred to as memory management.
Rule forgetting is a memory management technique derived to deal with incorrect or
useless prediction rules. It is achieved by marking inappropriate rules as “rejected” so
that they are not used or refined in future. Yet, they are still kept in memory as a
“reminder” so that they will not be recreated later. Specifically, when new rules are being
constructed SBL checks each new rule against valid rules as well as the rejected rules to
ensure uniqueness prior to adding it to the learned model.
A prediction rule is determined to be inappropriate in two ways: it causes
contradictions, or it consistently and consecutively causes surprises. In the first case, a
contradiction occurs immediately after rule splitting and means that neither of the
complementary rules can describe the surprised consequence correctly. This can be
caused by two reasons: either the rule was created with a wrong C0 in the first place (i.e.
the entity or attribute C0 was irrelevant to the current action), or it was split with an
incorrect prediction PX. Either way, these two rules are self-contradicting and have no
future in learning.
The second way is to keep track of the progress of a pair of complementary rules. Each
time a surprise occurs and the complementary rules are refined, it should imply that on
the subsequent selection of either rule, the robot should produce better and more accurate
predictions. However, if these rules fail consistently, then it can be inferred that the entity
and attribute relations captured in the rule are inappropriate for the given context and that
the rules should be rejected. The ratio between the number of times a rule has been
successful and the number of times it has been used for prediction can be stored against
each rule as its probability of success. For example, when this probability drops below 0.5
it means that the rule is failing more than 50% of the time. Therefore, a fixed cutoff
probability such as 0.3 can be employed to effectively reject inappropriate rules. In
addition, when multiple plans are returned, the probabilities associated with the rules can
be used to discard potentially weak plans that are less likely to accomplish the goal.
Figure 8: a) After 1st/before 2nd action b) After 2nd/before 3rd action c) After 3rd action
To illustrate the procedure of rule forgetting, consider an example where the range sensor
has been ignored and the robot executes a right-turn action in Figure 8a, which results in
Figure 8b. Given that a prediction rule Rule5 was created when Figure 8a was observed,
then rule splitting would occur when Figure 8b is observed. This is because the size of
white remained equal rather than decreasing. The resulting set of complementary
prediction rules are as follows:
Rule5.1.1: (White,All,%,0) (Red,S,<,Value18)→RIGHT→(White,S,<,Value19) ¬ (Red,S,=,Value20)
Rule5.1.2: (White,All,%, 0) ¬(Red,S,<,Value18)→RIGHT→¬(White,S,<,Value19) (Red,S,=,Value20)
Since rule Rule5.1.1 failed, on the next right-turn action the complementary rule
Rule5.1.2 is selected. Yet, this rule also fails because the white decreased as opposed to
being greater than or equal (¬S<), resulting in consecutive failures of the complementary
rules. Hence, Rule5.1.1 and Rule5.1.2 are rejected.
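As a small sketch of this bookkeeping (the cutoff of 0.3 comes from the text; the dictionary layout and counter names are assumptions):

def update_rule_statistics(rule_stats, rule_id, success, cutoff=0.3):
    # Track a rule's success probability and reject it when it falls below the cutoff.
    stats = rule_stats.setdefault(
        rule_id, {"predicted": 0, "succeeded": 0, "rejected": False})
    stats["predicted"] += 1
    if success:
        stats["succeeded"] += 1
    probability = stats["succeeded"] / stats["predicted"]
    if probability < cutoff:
        stats["rejected"] = True            # kept in memory only as a "reminder"
    return stats["rejected"]

# Example: a rule that keeps failing eventually drops below 0.3 and is rejected.
stats = {}
for outcome in (True, False, False, False):
    rejected = update_rule_statistics(stats, "Rule5.1.1", outcome)
print(rejected, stats["Rule5.1.1"])         # True, with 1 success out of 4 predictions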
3.8 Entity & Attribute Relevance
Entity & attribute relevance is critical for a number of capabilities of an autonomous
robot. First, it is critical for learning new knowledge. While learning how to navigate in
the environment, a robot must couple the relevant entities and attributes with the relevant
actions. This could be situation or time dependent. For example when turning, the robot
should use the location attributes rather than the size attribute to avoid ambiguity.
Second, entity & attribute relevance is critical for identifying and recovering from
unpredicted action and sensor changes. Due to unpredicted changes, certain entities and
attributes that were previously relevant to certain actions may become irrelevant. A
learning robot must quickly detect such cases so that it can adapt to new situations. In our
experiments, we tested these cases by deliberately switching the meaning of actions (e.g.
backward to forward and vice versa) and by inverting the robot’s camera (similar to
human wearing inverted vision goggles [LKHS99]) during the learning process.
Third, entity & attribute relevance is critical for the scalability of a learning algorithm. In
real-world situations, the perceived information by a robot can often be overwhelming
and not all perceived entities and attributes are relevant to the current goal. The robot
must ignore the irrelevant and focus on the relevant or else the learned model may never
converge. We call this focus of attention.
Our approach to entity & attribute relevance is as follows. The relevance of entities and
attributes is recorded in a table, in which each row uniquely identifies a sensor, an entity
of that sensor, an attribute of that entity, an action and an indicator corresponding to its
relevance during learning. The relevance flag is updated depending on the following four
scenarios:
1. If an entity or attribute does not change for an action, then despite remaining active it
will not be used in rule creation, splitting or refinement, for that action. E.g. a sensor
reading the constant temperature of the room has no relevance to the robot’s motion.
2. For a given action, if all rules that contain this entity or attribute as its first condition
C
0
have been rejected, then the corresponding entity or attribute is flagged as
irrelevant to that action. For example, selecting a sensor that returns random values
will eventually cause all rules that use it to fail and be rejected.
3. For a given action, when many rules containing the same entity or attribute
consecutively fail then the corresponding entity or attribute will be flagged as
irrelevant to that action. For example, the size attribute of the white (floor) blob entity
will be deemed as irrelevant to the turning action because it results in consecutive
failures as it causes ambiguities when turning at the corner of two walls.
4. When none of the active entities and attributes for a given action register any change
(this could be caused by an unpredicted change in the robot’s hardware), then
reactivate those flagged entities and attributes that indicated some change, for future
evaluation. A reactivated entity or attribute will be considered from the next learning
cycle onwards. We call this reactivation as “forced relevance” which represents
enlarging the attention of the robot. For example, when the camera is unexpectedly
rotated by 180º, the initial confusion will flag many entities and attributes as
irrelevant. At some point the robot recognizes the lack of progress and re-evaluates its
sensors (entities, attributes) and actuators in an attempt to improve the quality of its
learned model.
Entity & attribute relevance works during rule creation and surprise analysis. SBL
performs a lookup on the table and ignores any entities or attributes that are considered
irrelevant during the analysis of surprises. When a rule is newly created, a copy of the
rule (called trace rule) will be recorded to ensure that after the original rule is split,
refined and rejected, it would not be recreated to repeat the same learning process.
As we can see, the combination of rule forgetting, trace rules, and focus of attention on
relevant entities and attributes help to increase the speed of learning, while reducing the
wastage of resources. However, if a forced relevance is invoked, SBL will purge the trace
rules and rejected rules that refer to the toggled entities and attributes to facilitate
relearning.
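The relevance table can be pictured as in the hedged sketch below (the keys and flag handling are assumptions rather than the implemented schema):

# Each row keys (sensor, entity, attribute, action) to a relevance flag.
relevance = {}

def set_relevance(sensor, entity, attribute, action, relevant):
    relevance[(sensor, entity, attribute, action)] = relevant

def is_relevant(sensor, entity, attribute, action):
    # Unknown combinations default to relevant so that new sensors are explored.
    return relevance.get((sensor, entity, attribute, action), True)

def force_relevance(action):
    # Scenario 4: nothing active changes for this action, so reactivate every
    # combination previously flagged irrelevant for re-evaluation.
    for key in list(relevance):
        if key[3] == action and not relevance[key]:
            relevance[key] = True

# Example: the floor's size attribute is flagged irrelevant for turning (scenario 3).
set_relevance("camera", "white", "size", "right", False)
print(is_relevant("camera", "white", "size", "right"))   # False
force_relevance("right")
print(is_relevant("camera", "white", "size", "right"))   # True again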
3.9 SBL Pseudocode
Algorithm 1: Surprise-Based Learning
1: Observe the environment via sensors to initialize observations
2: while (ALIVE) do
3: goal = GoalManagement(); // Get the desired observation with the highest probability or directly
4: if (goal is NULL) then // A goal has not been assigned, or a hidden goal has not been seen yet
5: Action = Choose a random action
6: else
7: if (plan is NULL) then
8: plan = planner(goal); //Find a plan as a sequence of rules that will lead to the goal
9: end if
10: Action = Next action from plan or random if plan is NULL
11: end if
12: predictionRules = Match the current observations and select the matching prediction rules which
forecast the outcome of this action
13: Perform(Action)
14: Observe the environment via sensors
15: if (predictionRules is NULL) then
16: CreateNewRules() // Create unique prediction rules using entity & attribute relevance
17: else
18: for (each Rule in predictionRules) do
19: if (The outcome does not match the prediction) then // A “Surprise” has been detected
20: rejected = RuleForgetting(Rule); // Reject Rule if it is based on success probability
21: if (rejected is FALSE) then
22: possibleCauses = SurpriseAnalysis(Rule) // Possible causes with relevance
23: for (each Cause in possibleCauses) do
24: if (Rule has never been split before) then
25: SplitRule(Rule, Cause) // Create a pair of complementary rules
26: else
27: RefineRule(Rule , Cause) // Refine the complementary pair of rules
28: end if
29: end for
30: end if
31: end if
32: end for
33: if (all predictionRules do not forecast a change in the observation or are rejected) then
34: CreateNewRules() // Create unique prediction rules using entity & attribute relevance
35: end if
36: end if
37: end while
3.10 Summary
This chapter presented a new cyclic lifelong learning technique called Surprise-Based
Learning. The idea is to create a model, which forecasts the expected observation of
executing an action, and to improve this model when contradictions or surprises occur.
SBL has unique and well defined strategies for creating & maintaining a prediction
model, goal management, memory management and entity & attribute relevance. These
strategies facilitate planning to achieve goals, detecting and adapting to goal,
environmental, sensor and action changes respectively. To the best of our knowledge,
SBL is the only integrated learning technique that has built-in mechanisms for detecting
and adapting to simultaneous unpredicted changes in a robot’s sensors, actions, goals and
the environment.
Chapter 4
Evaluation Strategy
SBL was developed and tested in several environments ranging from simple simulated
games to a real-world office. This chapter introduces some of these experimental
environments and the evaluation methods used for verifying the contributions of this
research. Simulated game environments were chosen from standard AI textbooks [RN09]
and related research [DSKM02] for comparison purposes, while other environments were
designed to closely resemble similar experiments in related research.
4.1 Experimental Environments
The objective of the robot in each experimental environment is to learn a predictive
model of its interactions with the environment for the purpose of accomplishing goals
and adapting to unpredicted changes. Some of the experimental environments for SBL
included several simulated games, Morris water maze, a real office room, some classic
data-classification problems and video data analysis. In addition to the constrained box
environment, the following sections describe the environments that were used to verify
the 3 contributions detailed in chapters 5, 6 & 7. The video analysis environment, which
was only used to verify the final contribution, is detailed in chapter 8. The motivation,
challenges and goals within each environment vary depending on the experiment. Thus,
the features of each environment are presented in this chapter, and the above aspects are
addressed within each experiment.
4.1.1 Game Environments
Figure 9: a) Hunter-goal layout b) Hunter-prey layout
The simulated game environments shown in Figure 9 represent grid-worlds which wrap
around at the edges, meaning that when a robot steps outside the boundary of the world it
simply appears at the opposite side. This feature was chosen as it makes the learning
problem more challenging than simply designating the boundaries as impassable walls.
The hunter robot, marked as ‘H’, has four basic actions which allow it to move either
vertically or horizontally. In particular, it can move North (up), South (down), East (right)
or West (left) by one cell per action. Similarly, the prey robot, marked as ‘P’, has four
actions which allow it to move diagonally. That is, the prey can move North-East,
North-West, South-East or South-West by one cell per action. Blocked cells marked as
‘X’ are used to indicate locations that a robot cannot step into; hence an action into them
results in the robot remaining in place. The robots have no a priori information about the
environment and have no models describing the expected outcome of each action. For
example, the names of the actions are labels that have no meaning, and the robots have
no knowledge about the reversibility of the actions, i.e. that North is the reverse of
South, etc.
In the hunter-goal layout the objective of the hunter robot is to reach a stationary goal
location marked as ‘G’. In the hunter-prey layout the objective of the hunter robot is to
reach the roaming prey robot. The hunter is able to precisely sense its global position
with its primary sensor and can sense the position of the prey with its secondary sensor
should it exist, but the goal location can only be sensed when the hunter is directly on top
of it. So these are both transparent environments, with the hunter-prey being a semi-
controllable environment as well. The robots take turns in executing their actions.
The hunter robot makes an observation before and another after the execution of its
action. Each global position sensor is mapped through a null processor to an entity ‘H’
corresponding to the hunter and an entity ‘P’ corresponding to the prey. In some
experiments each entity could have two attributes corresponding to the x and y locations
and in others it could have a single attribute corresponding to a unique number assigned
to each cell.
During observation analysis, values are reasoned about and compared using a set of comparison
operators that are given to the robot at the outset. Here the operators <, <=, =, !=, >=, >
are used to evaluate the values of an attribute.
4.1.2 Real-world Office Environment
Figure 10: Layout of my office room
Figure 10 shows the targeted real-world office environment. The robot is comprised of a
Superbot module and a laptop computer. A color camera provides the sensor data to the
laptop, where the learning system can issue action commands back to the robot via
Bluetooth.
The robot is preloaded with four actions corresponding to the motions of forward,
backward, left and right. Each action is defined as an execution pattern of the motors for
a pre-specified period of time. The forward and backward actions typically result in
moving about 6 inches. The left and right actions are designed to mimic synchronous
turning of about 20 degrees although the steering is Ackerman. As before, the robot has
no a priori knowledge about the expected results of these actions or their reversibility. This
is an uncertain environment as both sensing and actuating are greatly influenced by noise.
The onboard camera allows the robot to make an observation before and another after an
action. Data from the camera is mapped through a SURF object detector [BTG06] with
mean-shift segmentation to several entities that represent physical objects in the
environment. These entities are labeled dynamically and matched at runtime. Each entity
has the attributes size, x-location of the center and y-location of the center, calculated
with respect to the reference frame.
During observation analysis, values are reasoned about and compared using a set of comparison
operators that are given to the robot at the outset. Here the operators %, ~ are used to
evaluate the presence or absence of an entity respectively, while the operators <, <=, !=,
=, >=, > are used to evaluate the values of an attribute. Once again, the environment is a
continuous space with no predefined regions or discretized grids.
4.2 Evaluation Methods
4.2.1 Comparison via Simulation
The game environments were selected for the purpose of comparing SBL against other
learning techniques. As there are many competitive techniques such as those shown in
Table 1, it was decided that running random exploration without learning, Q-learning (RL)
with value iteration and SBL would be sufficient for a fair comparison.
Within each experiment multiple tests were carried out by varying the map size or the
number of grids in the grid-based world. A run is a sequence of actions that resulted in
the hunter robot successfully reaching either the goal location or catching the prey,
depending on the scenario. Each test comprised several runs so as to provide
opportunities for learning the world and also for testing this knowledge. In both scenarios
the starting location referred to the cell in which the hunter was placed at the beginning of
each run, while the goal location referred to the location of the goal cell or the prey,
depending on the scenario.
Note that all these experiments were conducted on a single computer with a processor
clocked at 2.8 GHz. Time was measured in terms of the number of clock cycles
consumed for each experiment. Therefore, the learning time for a test was the average
amount of clock cycles consumed in completing all the runs associated with that test,
which included any associated planning time as well. The number of actions represented
the total number of actions executed by the robot averaged over several tests. When
comparing performance, minimizing the number of actions was deemed more important
than minimizing computation time, because in a real-world robotic application
minimizing the amount of energy spent navigating takes priority over processing time.
Random exploration without learning – The most primitive technique to discover the goal
or catch the prey was to randomly select the hunter’s actions. The random selection of an
action promotes exploration, but the learner does not attempt to learn anything. This
meant that on repeated runs, despite having visited the goal location earlier, the robot
would not be able to draw on previous experiences.
Q-learning with value iteration – Q-learning learns the utility of state transitions and the
best action to take in a particular state. A state refers to an observation or snapshot of the
environment at a particular time. An approximation function or state space reduction
function mapped the observations to a state. The utility is a reward for being in a
particular state, which was calculated using value iteration [Bell57].
In the hunter-goal layout each cell in the grid was assigned a unique identifier such that
each state identified the current location of the robot. Hence, for a map the total number
of states would be bounded by (map_height*map_width). A fixed positive reward was
received each time the goal state was reached and it was propagated to adjacent states
based on the recorded state transitions. On subsequent runs the robot had the choice of
executing the best action according to the policy with a certain probability. If the
probability was 100%, then the robot had to select the learned action, which meant that
exploration was minimized. In contrast, if the probability was 50% then the robot was
equally likely to explore a new action or select the learned action, resulting in possible
improvements to the learned policy over a period of time.
In the hunter-prey layout, two mappings were evaluated. The simplest mapping was to
create unique states for the combination of the locations of the hunter and the prey. In
other words, for a map there would be (map_height*map_width)² states. The
second mapping was to create a hunter-centric view of the world and store the relative
location of the prey as the current state. This reduction shrinks the state space
significantly, thus leaving a maximum of (map_height*map_width) states.
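For reference, the two state mappings could be written as in the following sketch (the coordinate conventions are assumptions; this is not the code used in the experiments):

def absolute_state(hunter, prey, width):
    # Mapping 1: one state per (hunter cell, prey cell) combination,
    # giving (map_height*map_width)^2 states in total.
    hx, hy = hunter
    px, py = prey
    return (hy * width + hx, py * width + px)

def hunter_centric_state(hunter, prey, width, height):
    # Mapping 2: store only the prey's position relative to the hunter
    # (wrapping at the edges), leaving at most map_height*map_width states.
    dx = (prey[0] - hunter[0]) % width
    dy = (prey[1] - hunter[1]) % height
    return (dx, dy)

# Example on a 5x5 wrap-around map:
print(absolute_state((1, 1), (4, 4), 5))           # (6, 24)
print(hunter_centric_state((1, 1), (4, 4), 5, 5))  # (3, 3)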
SBL – The shortest path to the known goal location or the prey was established by using a
BFS planner on the prediction rules.
4.2.2 Feasibility in Real-World
The constrained box environment and real-world office environment were selected to
evaluate the performance of SBL under real-world conditions such as noisy sensing and
actuating. In particular, the box environment was devised for testing with a physical robot
in a controlled environment, while the office environment was a more natural setting for
robotic navigation.
Minimizing the number of actions executed to accomplish a goal was deemed more
important than minimizing computation time because actuating consumed more time and
power than processing. Since the model consisted of prediction rules, resource usage was
monitored in terms of the number of active rules, as it loosely corresponded to the
amount of memory required for storage and processing required for planning purposes.
Note that data gathered during this analysis was not subjected to comparison against
competitive learning techniques. This is primarily due to the lack of data from
competitive learning techniques as well as due to complexities involved in attempting to
apply those techniques in these particular experiments. For example, it is difficult to
ascertain an appropriate number of states or the corresponding approximation function, or
to reset the experiment, when attempting to deploy Q-learning in the office environment.
4.3 Summary
This chapter discussed some of the experimental environments used for evaluating the
contributions of this research. The fully controllable simulated game environments were
selected for comparing SBL against a baseline of random exploration without learning,
and a competitive reinforcement learning approach. The feasibility in real-world robotic
applications could be established via the constrained box and office environments.
Chapter 5
Structure Learning
Structure learning attempts to discover the organization of data. In robotics, structure
learning is required to discover the relations between sensors and actions, whereas
parameter learning can only be applied once these relations are known. This chapter
presents some applications and results for structure learning with SBL.
5.1 Scalability by Exploiting Structure
A structured environment is an environment where there is an orderly relationship
between sensors and actions. This means that the properties that are being monitored by
sensors demonstrate discernible patterns associated with each action. For example, in the
hunter-goal layout structure can be achieved by labeling the x and y axes of the map to be
monotonically increasing with respect to the distance from the origin. The advantage of
structure is that comparison operators could be applied to discover patterns, such as the
relationship between values and a range over which the relationship is valid. This in turn
facilitates the learning of a few prediction rules that precisely describe the environment.
The otherwise perfect structure can be disrupted by inserting obstacles marked by “X” at
random locations. In contrast to a structured environment, in a chaotic environment the
sensor response is erratic with respect to the properties that are being monitored. For
example, in the hunter-goal layout if the x and y axes of the map were labeled completely
randomly such that the axes are neither monotonically increasing nor decreasing with
respect to the distance from the origin, then the environment would be considered chaotic.
5.1.1 Approach
Three sets of experiments were carried out using the hunter-goal layout to investigate the
impact of structure on learning as the size of the environment varied. These were
designed as follows:
1) Structured environment without any obstacles
2) Structured environment with some randomly placed obstacles
3) Chaotic environment
Under each set of experiments numerous tests were performed by varying the size of the
map to evaluate scalability. During each test the hunter was started in a random cell and
tasked to find the goal location, which was a fixed cell within each map. Each time a run
was completed, meaning that the hunter reached the goal, it would be reset to a new
randomly selected starting location and a new run would commence. A test would
terminate when all starting positions were exhausted. Each set of experiments was
repeated at least 20 times to obtain reasonable averages.
By ensuring that all conditions were the same, such as the same goal location in different
maps, same sequence of random starting locations and the same placement of objects, it
was possible to test and compare random exploration without learning, QL with 100%
chance to accept the policy, QL with a 50% chance to accept the policy and SBL.
5.1.2 Results & Discussion
1)
Figure 11: Actions executed to learn each map w/o obstacles from every starting location (average number of actions executed to reach the goal from each starting location; Number of Actions Executed vs. Map Size, 5x5 to 50x50; series: Random, QL100%, QL50%, SBL)
Figure 12: Data from Figure 11 excluding the “Random exploration” series (series: QL 100%, QL 50%, SBL)
In Figure 11 it is visible that on average random exploration without learning executed
the largest number of actions to reach the goal in comparison to QL and SBL. This is
easily explained by the fact that this strategy does not learn anything about the
environment and as such is unable to reuse any past experiences to efficiently explore the
environment on subsequent runs. As there are no obstacles in the environment and the
goal is at a fixed location in each map, following the actions recommended by the policy
(random action if unknown) produced better results as seen in the “QL100%” series in
Figure 12. The “QL50%” series accepted the policy 50% of the time and executed a
random action the rest of the time, so it executed more actions, although this extra
exploration allows it to eventually learn an optimal policy that QL with 100% acceptance
may never reach.
Regardless, once SBL had learned the model through a few random actions it was able to
consistently reach the goal with the least number of actions.
Figure 13: Learning time for each map without obstacles from every starting location (Average Clock Cycles vs. Map Size, 5x5 to 50x50; series: Random, QL 100%, QL 50%, SBL BFS, SBL simple plan)
The learning time for random exploration is negligible as seen in Figure 13. This is
expected as the computation for random exploration is extremely simple. In contrast, QL
performs value iteration, which essentially solves a number of linear equations known as
Bellman equations that are proportional to the number of states. Therefore, as the size of
the map increases the learning time increases exponentially. At large map sizes such as
50x50 QL took close to 3 hours on a 2.8 GHz computer, making time the primary reason
to limit the number of experiments to just over 20 each for the purpose of extracting
reasonable averages. SBL demonstrated exponential learning time as it includes a BFS
planner. However, the series marked as “SBL simple plan” used a straight-line approach
to planning and as such it showed that learning time in SBL scales linearly provided that
the planning overhead can be minimized through the use of a better planner or some a
priori knowledge of the environment.
2)
Figure 14: Actions executed to learn each map with obstacles from every starting location (Number of Actions Executed vs. Map Size, 5x5 to 50x50; series: QL100%, QL50%, SBL)
The introduction of obstacles amounting to roughly 10% of the map size has reduced the
number of cells that can be visited, resulting in lowering the average number of actions
executed in comparison to Figure 12. Nevertheless, the trend remained consistent with
previous expectations.
Figure 15: Learning time for each map with obstacles from every starting location (Average Clock Cycles vs. Map Size; series: Random, QL 100%, QL 50%, SBL)
In comparison to the learning times without obstacles in Figure 13 the learning times with
obstacles in Figure 15 have decreased for QL and increased for SBL. The reduction in
learning time for QL is due to the elimination of unreachable states from the state space
for value iteration. SBL accommodates this information by appending conditions to rules
thereby increasing the processing time during planning.
3)
Figure 16: Actions executed to learn each chaotic map (average number of actions executed to reach the goal from each starting location; Number of Actions Executed vs. Map Size, 5x5 to 50x50; series: Random, QL100%, QL50%, SBL)
Figure 17: Data from Figure 16 excluding the “Random exploration” series (Number of Actions Executed vs. Map Size; series: QL100%, QL50%, SBL)
Figure 18: Learning time for each chaotic map (Average Clock Cycles vs. Map Size; series: Random, QL 100%, QL 50%, SBL)
In a chaotic environment random exploration without learning executed the largest
number of actions to reach the goal just as in a structured environment. In Figure 17
notice that SBL has only two data points corresponding to the number of actions in a 5x5
map and a 10x10 map. Subsequent data points were not acquired due to the large
amount of computation time required. The trend in Figure 18 forecasts runtimes that are
impractical for this problem. In addition, the average number of actions executed by SBL
for each map is higher than the average number of actions executed by QL, but it is much
lower than random exploration.
SBL performs poorly in a chaotic environment because its prediction capabilities rely on
some structure in the environment. When there was no structure, almost every prediction
was incorrect, resulting in a large number of wasted actions and a large number of rule
refinements that caused the rules to become very complex. Consecutive failures and
inconsistent successes of rules caused rule forgetting to
prematurely reject rules and prevented the model from becoming 100% accurate. As is
visible in the graphs, planning with complex and incomplete rules consumed more
computation time and yielded inaccurate results.
An interesting observation in these experiments was that the strategy for selecting an
action to execute when a valid plan was unavailable impacted how quickly the model
converged to an appropriate representation of the environment. In particular, a strategy of
completely random action selection took slightly longer to eliminate surprises than a
strategy of selecting an action at random and then repeating it when a surprise occurred.
In other words, structured execution of actions in a structured environment results in
faster convergence for SBL. Since the choice of repeating an action favored SBL, it was
incorporated into all these experiments with an additional randomized counter that
prevented indefinite execution of the same action.
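A minimal sketch of this repeat-on-surprise selection strategy is shown below; the upper bound on the randomized repetition counter is an assumption of the sketch.

import random

class ExplorationPolicy:
    # Pick a random action and keep repeating it while surprises occur, up to a randomly
    # drawn repetition limit that prevents indefinite execution of the same action.
    def __init__(self, actions, max_repeats=5):
        self.actions = list(actions)
        self.max_repeats = max_repeats
        self.current = None
        self.remaining = 0

    def next_action(self, surprised):
        if surprised and self.current is not None and self.remaining > 0:
            self.remaining -= 1                     # repeat the same action after a surprise
            return self.current
        self.current = random.choice(self.actions)  # otherwise pick a fresh random action
        self.remaining = random.randint(1, self.max_repeats)
        return self.current

policy = ExplorationPolicy(["N", "S", "E", "W"])
print([policy.next_action(surprised=(i % 2 == 0)) for i in range(6)])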
These experiments show that SBL can discover and learn the structure of an environment
through its sensors while executing actions. SBL scales up by exploiting structure in an
environment; thus it performs well in structured environments and suffers when the data
is chaotic.
5.2 Constructing Useful Models with Comparison Operators
A truly autonomous robot, much like a human, must be able to build useful complex
models with the aid of some simple priors. In SBL these priors were provided in terms of
a few comparison operators. For these environments, the comparison operators presence,
absence, less-than, equal-to and greater-than were defined specifically to be used during
rule creation, surprise detection, rule splitting and rule refinement. The
learner was tasked with building a representation of its interactions with the environment
executed via actions that have initially unknown outcomes, and sensors that have initially
unknown correlations to the environment. The crucial aspect of this representation is that
it could be used to achieve goals defined in terms of the sensors.
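Read as predicates over an observation, the five operators might look like the sketch below; the dictionary-based observation format and the names used are hypothetical and only meant to illustrate how the operators are evaluated.

# Illustrative predicate forms of the five comparison operators used as priors.
# `obs` maps entity names to attribute dictionaries; the names below are hypothetical.
def present(obs, entity):
    return entity in obs

def absent(obs, entity):
    return entity not in obs

def less_than(obs, entity, attr, value):
    return present(obs, entity) and obs[entity][attr] < value

def equal_to(obs, entity, attr, value):
    return present(obs, entity) and obs[entity][attr] == value

def greater_than(obs, entity, attr, value):
    return present(obs, entity) and obs[entity][attr] > value

obs = {"S": {"V": 60}}                                     # e.g. a proximity reading of 60
print(present(obs, "S"), greater_than(obs, "S", "V", 10))  # True True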
5.2.1 Approach
In the box and office environments, although the robot was restricted to at most two
physical sensors, numerous entities were identified dynamically. Each colored
blob inside the box environment was an entity which was identified by applying mean
shift segmentation on the acquired camera image. The average hue, saturation and value
parameters for each region were calculated and stored in a dynamic list to uniquely
identify the appearance of that region in all future observations. Given a camera image in
the office environment SURF feature detection was performed to extract all the candidate
features. The candidate features were then tested against a dynamic list of features stored
to identify entities. If at least 10% of the features matched (a much stricter criterion than
the minimum of 3 matches, chosen to minimize false positives), SBL concluded that a
known entity had appeared; otherwise it applied mean shift segmentation to group the
features and then added them to the dynamic list for future reference.
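This feature-based identification could be sketched roughly as below, assuming an OpenCV build with the non-free SURF module enabled (opencv-contrib). The Hessian threshold, the 0.7 ratio test and the omission of the mean shift grouping step are simplifications made for this sketch, not part of the original system.

import cv2

# Sketch of SURF-based entity identification with a dynamic list of known entities.
# A candidate image is declared a known entity when at least 10% of a stored entity's
# features are matched, mirroring the criterion described above.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # requires opencv-contrib with non-free code
matcher = cv2.BFMatcher(cv2.NORM_L2)
known_entities = []                                        # dynamic list of (name, descriptors)

def identify(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, descriptors = surf.detectAndCompute(gray, None)
    if descriptors is None:
        return None
    for name, stored in known_entities:
        pairs = matcher.knnMatch(descriptors, stored, k=2)
        good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
        if len(good) >= 0.1 * len(stored):                 # at least 10% of the stored features matched
            return name                                    # a known entity appeared
    name = "entity_%d" % len(known_entities)               # otherwise remember it for future observations
    known_entities.append((name, descriptors))             # (mean shift grouping of features omitted here)
    return name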
Several experiments were carried out using this simple knowledge consisting of entities,
their attributes, comparison operators and actions to establish if useful models could be
learned for the task of navigating to arbitrary goal locations in the environment.
5.2.2 Results & Discussion
The results are available in the form of videos at http://www.isi.edu/robots/media-surprise.html. All the videos demonstrate learning through random execution of actions
and then planning to goal locations assigned by a user at runtime. Some of the videos
show the environment, the robot’s observation, learned prediction model and action
selection process together. These videos serve as evidence for most of the SBL
experiments presented in this document.
From these results we can conclude that SBL can indeed bootstrap learning from simple
knowledge, and learn to accomplish goals in complex environments with the aid of a few
predefined comparison operators. These comparison operators helped identify structure in
the environment and facilitated planning to achieve goals with the learned prediction
model.
5.3 Impact of Training
The rate at which a robot learns is very important as it must accomplish its goals in a
finite amount of time. In a task such as navigating in an unknown environment, the robot
must simultaneously construct a representation of the environment based on its
interactions and learn how to accomplish its goals, as it is impossible to differentiate
between a learning/training phase and a testing phase. Nevertheless, it is important to
determine how the amount of training impacts the quality of the learned solution. So, an
experiment was devised for this purpose as detailed below.
5.3.1 Approach
The experiment was conducted using the hunter-goal layout. For 3 different map sizes
(5x5, 25x25 and 45x45), numerous tests were conducted by varying the duration of the
training phase. During each test the hunter was started in the top-left-most cell and tasked
to find the goal location, which was the cell at the center of each map. Each time a run
was completed, meaning that the hunter reached the goal, it would be reset to the starting
cell and a new run would commence. The length of the training phase was varied by
altering the number of runs for each test. In a test QL had a 50% chance to accept the
policy during all runs except the last one, in which it had to accept the learned policy. By
ensuring that all conditions were the same, such as the same starting and goal locations in
different maps, it was possible to test and compare the best paths learned by QL and SBL
under varying amounts of training.
5.3.2 Results & Discussion
Figure 19: Actions executed to reach the goal after the specified number of training runs (series: QL and SBL on the 5x5, 25x25 and 45x45 maps)
On all 3 maps, in Figure 19 it is visible that as the amount of training increases the
number of actions to reach the goal gradually reduces in QL until it reaches the optimal
solution. It can also be seen that SBL has a slight improvement with the amount of
training, but the disparity between SBL and QL especially on larger environments is quite
significant. Clearly, SBL has significantly better performance with much less training
than QL in large structured environments.
This is because SBL is able to explore and learn simultaneously; it does not require an
explicit training phase. SBL is able to forecast the outcome of
actions using the currently known model even when the robot is situated in a cell where
an action has not been performed previously, whereas QL must perform every action in
every cell in order to ensure convergence towards the optimal policy.
These results conclude that the impact of training on SBL is lower than that of some
competitive learning techniques. This means that SBL requires less training in structured
environments, and does not require any special training targeted at achieving a specific
goal.
5.4 Summary
This chapter evaluated structure learning with SBL. SBL learned patterns in the sensed
data associated with each action in a structured environment, by creating prediction rules
and updating them when surprises occurred. Several simple predefined comparison
operators were sufficient for SBL to discover the structure, and learn models to achieve
goals. The learning scaled up by exploiting structure, which resulted in high performance
in structured environments. In addition, SBL required less training in structured
environments, and did not require any special training targeted at achieving a specific
goal.
Chapter 6
Learning from Uninterpreted Sensors and Actions
In order to make learning tractable most learning algorithms use preprocessing such as
approximation or reduction functions to discretize raw data. Since these functions are
designed by humans they become a bottleneck for autonomous acquisition of new
uninterpreted sensors and actions. This chapter presents experiments which demonstrate
that SBL can learn from uninterpreted sensors and actions. It also investigates how SBL
can scale by identifying structured sensors & actions, and ignoring irrelevant ones.
6.1 Discretizing a Continuous Sensor
A simple sensor returns a response to a particular type of stimulus. The response may or
may not be limited to a finite number of values. When a sensor has a continuous response
its values are not restricted to a finite set, therefore SBL must discretize the raw data
using comparison operators to establish data ranges and trends with which it could
predict the sensor’s response to an action.
6.1.1 Approach
An experiment to discretize a continuous sensor was conducted using the constrained box
environment. The robot was initially placed with its proximity sensor directly facing a
wall. The backward action was tuned to roughly produce a displacement of 1 cm away
from the previous location. This action was executed several times while recording the
corresponding proximity readings and learned prediction model.
6.1.2 Results & Discussion
Figure 20: Proximity sensor response (sensor value vs. distance in cm)
Table 6: Prediction model discretizing a continuous sensor
Distance | Observation | Predict | Surprise | Prediction Model
1 | S.V=100 | | |
2 | S.V=60 | | | R0: S% –away→ V↓40
3 | S.V=30 | R0 | | R0: S% –away→ V↓40
4 | S.V=15 | R0 | | R0: S% –away→ V↓40
5 | S.V=10 | R0 | | R0: S% –away→ V↓40
6 | S.V=10 | R0 | R0 | R0: S% ⋀ S.V>10 –away→ V↓40⋁~V=10; R1: S% ⋀ ~S.V>10 –away→ ~V↓40⋀V=10
7 | S.V=10 | R1 | | R0: S% ⋀ S.V>10 –away→ V↓40⋁~V=10; R1: S% ⋀ ~S.V>10 –away→ ~V↓40⋀V=10
8 | S.V=10 | R1 | | R0: S% ⋀ S.V>10 –away→ V↓40⋁~V=10; R1: S% ⋀ ~S.V>10 –away→ ~V↓40⋀V=10
Figure 20 shows the proximity sensor readings obtained as the robot moved away from
the wall, while Table 6 describes how the corresponding prediction model is learned.
Although the real distance is unknown to the learner, for this particular proximity sensor
SBL learned to predict that if the value is greater than 10 then the value will decrease on
a subsequent action, else the value will remain the same. The learned prediction model is
a discretized representation of the proximity sensor’s response.
This experiment concludes that SBL is able to discretize raw data from a continuous one-
dimensional sensor with a few simple predefined comparison operators. This eliminates
the need for hand-crafted functions when incorporating new sensors to the robot.
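The final model in Table 6 can be paraphrased as the pair of predicates sketched below; this is only a readable restatement of the learned rules R0 and R1, not SBL's internal rule structure.

def predict_away(value):
    # R0: if sensor S is present and S.V > 10, predict that the value will decrease.
    # R1: otherwise predict that the value will stay the same (saturated at 10).
    # The threshold of 10 is the one discovered for this particular proximity sensor.
    return "decrease" if value > 10 else "no change"

def surprised(previous, current, prediction):
    # A surprise occurs when the observed change contradicts the prediction.
    decreased = current < previous
    return (prediction == "decrease") != decreased

readings = [100, 60, 30, 15, 10, 10, 10, 10]   # the readings from Figure 20
for prev, cur in zip(readings, readings[1:]):
    p = predict_away(prev)
    print(prev, "->", cur, "predicted:", p, "surprise:", surprised(prev, cur, p))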
6.2 Combining Multiple Uninterpreted Sensors
Typically, a state refers to a combination or grouping of values from one or more simple
sensors. In QL this grouping is performed by an approximation function, which is
designed by humans to provide a reduction in the state space to ensure that the learning
problem is tractable. For example, in the hunter-goal layout a function mapped each pair
of x-y grid coordinates to a unique identifier corresponding to the cell or state the robot
was in. SBL does not need preprocessing such as these hand-crafted approximation
functions, as it is able to combine data from multiple uninterpreted sensors with the aid of
comparison operators.
6.2.1 Approach
To investigate the impact of the approximation function, 3 sets of experiments were
carried out using the hunter-prey layout as follows (a sketch of the three mappings
appears after this list):
1) The approximation function returned unique states for the combination of the
location of the hunter and the prey. The hunter entity’s x & y attributes were
merged to form a unique number identifying its position in the environment.
Similarly, a unique number was calculated for the prey, and combined with the
hunter entity’s number, to form a unique identifier for each state. The total
number of states was given as (map_width*map_height)².
2) The approximation function returned unique states corresponding to the relative
location of the prey using a hunter-centric view of the world. The differences
between the x & y attributes of the hunter and prey entities were merged to form a
unique number. The total number of states was given as (map_width*map_height).
3) The approximation function was replaced by raw data, namely the actual x-y grid
coordinates of each robot. The hunter and prey entities each had two attributes x
& y, which were not merged together. The total number of states had to be
learned.
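The three mappings can be sketched as the following functions; the exact way the attributes were merged into a single number is an assumption of the sketch (row-major indexing and, for the relative view, a wrap-around offset).

def unique_state(hunter, prey, map_width, map_height):
    # Experiment 1: one identifier per combination of hunter and prey cells,
    # giving (map_width*map_height)^2 possible states.
    cells = map_width * map_height
    h = hunter[1] * map_width + hunter[0]
    p = prey[1] * map_width + prey[0]
    return h * cells + p

def relative_state(hunter, prey, map_width, map_height):
    # Experiment 2: a hunter-centric view keeping only the prey's relative offset,
    # giving at most (map_width*map_height) states.
    dx = (prey[0] - hunter[0]) % map_width
    dy = (prey[1] - hunter[1]) % map_height
    return dy * map_width + dx

def raw_state(hunter, prey):
    # Experiment 3: no approximation function; the learner receives the raw x-y attributes.
    return {"hunter": {"x": hunter[0], "y": hunter[1]},
            "prey": {"x": prey[0], "y": prey[1]}}

print(unique_state((1, 2), (3, 4), 5, 5), relative_state((1, 2), (3, 4), 5, 5))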
Under each set of experiments numerous tests were performed by varying the size of the
map. During each test the hunter and prey were started in random cells and the hunter
was tasked to catch the prey, which was moving in a fixed direction. By ensuring that all
conditions were the same, such as the same sequence of random starting locations, it was
possible to test and compare random exploration without learning, QL with a 50% chance
to accept the policy, and SBL. Each set of experiments was repeated at least 20 times to
obtain reasonable averages.
6.2.2 Results & Discussion
Figure 21: Actions executed for hunter to catch prey in each map
Random exploration without learning is not affected by the approximation function. In
Figure 21 the series “QL Unique” and “SBL Unique” correspond to experiment 1, which
had unique states for the combination of locations. The series “QL Relative” and “SBL
Relative” correspond to experiment 2, which used the relative location. The series “QL
Raw” and “SBL Raw” correspond to experiment 3, which used raw data directly.
The “QL Raw” series has no data points for large map sizes as the tests were terminated
after taking a large amount of time to accomplish the goal. The reason for this is that QL
assumed that there were (map_width+map_height) states, resulting in cyclic
policies. This is a clear indication that without an approximation function to provide the
mapping from raw data to states, QL is unable to learn an appropriate policy.
All QL experiments yielded a higher number of actions for each map in comparison to
the SBL experiments. There is also a noticeable divergence in the QL results indicating
that as the state reduction improved so did the learned solution. As opposed to this, the
SBL results are largely unaffected by the state reduction. This is primarily due to the fact
that SBL does not rely on preconceived approximation functions, which discretize the
perceptual space statically. Instead, it learns an efficient model of the environment as a
set of prediction rules from the available observations.
Figure 22: Execution time for the hunter-prey solution for each map (Average Clock Cycles vs. Map Size, 5x5 to 25x25; series: Random, QL Unique, QL Relative, QL Raw, SBL Unique, SBL Relative, SBL Raw)
In Figure 22, the execution times for “SBL Raw” and “SBL Unique” are higher than QL
due to two reasons. One reason is that the number of conditions or complexity of the
rules increase as the reduction function becomes less efficient or nonexistent. The other
reason is the invocation of the BFS planner after each action rather than acquiring a
sequence of actions to the goal. This is unavoidable because after the hunter takes an
action, the prey takes an action thereby changing the goal location and forcing re-planning.
Nonetheless, this strategy allows SBL to cope with the prey changing directions
dynamically, as it only alters the goal location but does not affect the hunter’s learned
model. Notice the execution times for “SBL Relative” are lower than QL as the efficient
approximation function produces compact prediction rules that aid quick planning.
These experiments demonstrated SBL learning from two global position sensors, each
containing two attributes. SBL was successful regardless of whether the sensors were
interpreted or uninterpreted inputs. From this, we conclude that SBL can learn from raw
data from multiple uninterpreted sensors, without the aid of sensor models, or
preconceived approximation functions. In addition, SBL is also capable of learning from
interpreted sensors, and utilizing preprocessing such as approximation functions when
available.
6.3 Scalability in the Number of Sensors and Actions
Autonomous robots are required to operate in many different environments equipped
with a variety of sensors and actuators. The success of any learning algorithm depends on
its ability to cope with the complexity of the task, which grows with the size of the
environment, the number of sensors, and the number of actuators or actions.
6.3.1 Approach
In this investigation, scalability was tested by observing the size of the learned model
against the number of actions executed to reach a particular goal, as the number of
sensors and actions were gradually increased. The size of the model was measured in
terms of the number of active prediction rules in memory, which excluded rules that were
flagged as rejected. Three experiments were carried out using the box environment as
follows:
1) A number of tests were conducted by gradually increasing the number of actions
available to the robot. Each test ended when the model could no longer be
improved, which was signaled by the absence of surprises over a predetermined
large period of time. The size of the model or number of active prediction rules
was recorded at the end of each test.
2) A number of tests were conducted by gradually increasing the number of dummy
constant valued sensors attached to the robot. Each test ended after executing a
fixed sequence of 50 actions.
3) A number of tests were conducted by gradually increasing the number of dummy
random valued sensors attached to the robot. Each test ended after executing the
same sequence of 50 actions selected in the previous experiment.
A fourth experiment was carried out using a 5x5 size hunter-goal layout as follows:
4) A number of tests were carried out with valid x, y position sensors and increasing
the number of dummy random valued sensors by 5 in each test. The same
sequence of 150 actions was executed in every test.
6.3.2 Results & Discussion
1)
Figure 23: Impact of increasing the available actions (Average Number of Rules vs. number of available actions, 1 to 6)
In Figure 23 the number of available actions was increased by adding the actions
forward, backward, left, right, large left and large right, in that order. The growth in the
number of rules appears to be linear due to the layout of the environment. However, this
need not be the case under all configurations, but is acceptable as long as the learning can
scale to handle the growth without collapsing as seen here.
2)
Figure 24: Impact of increasing the number of dummy constant valued sensors (Average Number of Rules vs. Number of Actions Executed; series: 1, 2 and 3 constant sensors)
The dummy constant sensors were irrelevant to the learning, so it was the task of the
learner to identify them and ignore their data. The response curves for adding the
constant sensors shown in Figure 24 are almost identical throughout the 3 tests, in which
the 1st test had 1 constant sensor, the 2nd test had 2 constant sensors and the final test had
3 constant sensors. Note that as the 50 executed actions were the same throughout each
test, the slight change in the number of rules was caused by noisy sensing and actuation
of the other sensors onboard the robot. So in general, the results indicate that SBL can
scale to an arbitrary number of constant or failed sensors.
3)
Figure 25: Impact of increasing the number of dummy random valued sensors (Average Number of Rules vs. Number of Actions Executed; series: 1, 2 and 3 random sensors)
Figure 25 shows 3 tests where the number of irrelevant dummy random sensors was
increased by one each time. In each test the response describes how SBL is reacting to
the random data. Notice that the random sensors have no correlation even amongst the 3
tests. This is visible by the fact that the curves have peaks and troughs at different times,
which means that there is no way to reproduce any of the curves on subsequent runs even
with the same number of sensors.
Nevertheless, a trend can be observed in each test. The number of rules grows
exponentially with the number of random sensors during the first few actions. However,
as more actions are taken, contradictions in the predictions force many rules to be
rejected and entity & attribute relevance is invoked to ignore the random sensors during
subsequent learning. This is visible from the relatively linear response of the tail of each
curve.
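The bookkeeping that lets contradictions silence a random sensor can be sketched as below; the consecutive-failure threshold is an assumption chosen for the sketch.

from collections import defaultdict

class RelevanceTracker:
    # Track consecutive prediction failures per (entity, attribute) and flag an attribute
    # as irrelevant once its rules keep contradicting the observations.
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)
        self.ignored = set()

    def record(self, entity, attribute, surprise):
        key = (entity, attribute)
        if surprise:
            self.failures[key] += 1
            if self.failures[key] >= self.threshold:
                self.ignored.add(key)          # stop building rules on this attribute
        else:
            self.failures[key] = 0             # a success resets the streak

    def relevant(self, entity, attribute):
        return (entity, attribute) not in self.ignored

tracker = RelevanceTracker()
for _ in range(3):
    tracker.record("rand_sensor", "value", surprise=True)
print(tracker.relevant("rand_sensor", "value"))   # False: the random sensor is now ignored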
4)
Figure 26: Scalability in the number of irrelevant sensors
In Figure 26 it is visible that the number of relevant prediction rules converges to 8,
which corresponds to a pair of complementary rules per action (direction). When there
are no irrelevant sensors SBL converges quickly. As the number of irrelevant sensors is
increased there is an initial exponential growth in rules, but these irrelevant rules are
gradually forgotten and their sensors ignored. This experiment reinforces the results of
the previous experiment at a larger scale.
From the results of these experiments, we conclude that SBL can scale to a reasonable
number of sensors and actions. Initially, random or irrelevant sensors cause an explosion
of prediction rules, but such sensors will be deemed irrelevant as learning progresses.
6.4 Summary
This chapter investigated the capability of SBL in learning from uninterpreted sensors
and actions. The results demonstrated that SBL can discretize a continuous sensor with
the aid of some predefined comparison operators. SBL did not require hand-crafted
approximation functions to combine multiple sensors, but was able to utilize such
preprocessing when it was available. This ability is useful for fast model convergence on
a robot with high-dimensional sensors, such as cameras. Furthermore, we verified that
with the help of rule forgetting and entity & attribute relevance SBL can indeed scale
appropriately with the number of sensors and actions, in the presence of structure.
Chapter 7
Detecting and Adapting to Unpredicted Changes
Since changes in a robot’s sensors, actions, goals and the environment’s configuration
could occur in an unpredicted manner, several experiments were devised to investigate
how well SBL detects and adapts in different situations. This chapter describes
experiments ranging from independent to simultaneous unpredicted changes.
7.1 Unpredicted Directly Observable Goal Changes
In robotics, changing a robot’s goal in a particular environment is very common as the
robot’s or user’s needs may vary with time. It is important to note that as the robot and
the environment remain the same, changing the goal should not force complete
relearning, as that would waste a large amount of resources. Hence, a competitive
algorithm must be able to transfer learned knowledge so as to handle dynamic goal
changes efficiently. With SBL, when the goal is visible or directly observable, any
changes to it can be accommodated by generating a new plan with the learned prediction
model.
7.1.1 Approach
Considering these game environments there are several ways to test this feature. A
slightly advanced method would be to evaluate the hunter-prey layout as the prey is
constantly moving. In contrast, a simpler method is to take the hunter-goal layout and
alter the goal location periodically.
Using the simpler method, two sets of experiments were carried out, one without any
obstacles and the other with some randomly placed obstacles. Under each set of
experiments several tests were performed by varying the size of the map. During each
test the hunter was started at the same fixed position and tasked to find the goal location
which would change every 5 runs to one of 3 predefined cells randomly. A test would
terminate after completing 30 runs as it was deemed sufficient for this experiment. By
ensuring that all conditions were the same, such as the same starting location, same
sequence of random goal locations and the same placement of objects, it was possible to
test and compare random exploration without learning, QL with 100% chance to accept
the policy, QL with a 50% chance to accept the policy, and SBL. Each set of experiments
was repeated at least 20 times to obtain reasonable averages.
7.1.2 Results & Discussion
Figure 27: Actions to reach dynamic goal locations in each map without obstacles
QL with 100% chance to accept the policy is not marked in Figure 27 as the strategy does
not consistently reach every goal. The explanation is as follows: On the first run the
hunter randomly explores the environment and propagates the reward once the goal is
reached. Using the learned policy it is able to revisit the initial goal on the next 4 runs. At
this point the goal location changes but the policy consistently leads to the previous goal.
So, unless the new goal lies on the previous path, the robot reaches the previous goal. At
this point as there is no reward the policy is updated. Now there are no random
movements to force the robot to explore. This situation can be detected and remedied by
reinitializing the policy or enabling exploration, yet these alterations to QL require some
knowledge of the environment.
QL with 50% chance to accept the policy is able to explore to some extent resulting in the
discovery of the new goal location eventually and gradually updating the policy to lead to
the new goal location. The data indicates a large number of actions on average to visit all
the goals in certain maps. This is caused by the random placement of goals coupled with
gradual reinforcement of a new goal location. If the designers have some knowledge of
the environment in advance, a special pre-conceived approximation function, such as the
relative distance in the hunter-prey problem, can be developed for QL to avoid this
hysteresis effect. In contrast, it is evident that random exploration without learning does
not suffer these artifacts, thus it may have better performance than QL as seen in map
sizes less than 35x35, but this is not guaranteed as seen in larger map sizes.
SBL’s goal management strategy allows it to accommodate dynamic goal changes by re-planning with the learned prediction model, so as to consistently reach the observed goal
location with the least number of actions on average in each map as seen in Figure 27.
Figure 28: Actions to reach dynamic goal locations in each map with some obstacles
As seen in Figure 28 the introduction of obstacles amounting to roughly 10% of the map
size does not seem to affect the trend observed previously. We noticed that the planning
time increased as the complexity of the prediction model increased, due to obstacles
being recorded as conditions in rules, which is consistent with the results in chapter 5.1.
With these results, we conclude that the knowledge transfer mechanisms in SBL allow it
to transfer learned knowledge across goals that are directly observable. Thus, SBL can
deal with unpredicted directly observable goal changes by re-planning with the learned
prediction model.
7.2 Unpredicted Indirectly Observable Goal Changes
When the goal is indirectly observable, like the hidden platform in the Morris water maze,
SBL utilizes its goal management mechanism (detailed in chapter 3.6) to adapt to
unpredicted goal changes. Typically, the water maze task consisted of a rat or mouse
placed in a small pool of opaque water, which contained an escape platform hidden a few
millimeters below the surface of the water. Some visual cues such as colored shapes were
placed around the pool in plain sight of the animal such that it could learn to move to the
location of the platform from any subsequent release location. Experiments proved that
the time taken to reach the platform or latency on subsequent releases decreased,
indicating that the animal had successfully learned the environment. This experiment is
well suited to test learning and adaptation in robotic learning algorithms.
7.2.1 Approach
A simulated version of the water maze environment was created using the box
environment with an overhead camera. The robot was released from different starting
locations within the environment. A hidden platform was marked as a small rectangle on
the overhead camera image such that the robot would detect it via a message received
when it moved into the designated area. So, the robot cannot see the marked area unless it
steps into it. This simulates the platform under water. Also, the visual cues placed around
the pool were simulated by the uniquely colored walls within the box.
Two experiments were carried out to verify learning and knowledge transfer as follows:
1) Learning to reach a fixed hidden goal.
2) Knowledge transfer to reach a new hidden goal.
In both experiments the number of actions executed to reach the platform were recorded
and compared. A minimum of 9 runs were made in each test. Also, the robot was allowed
10 random actions after reaching the hidden area so that it could learn new goal
observations, and reinforce their success probabilities.
Learning to reach a fixed hidden goal – In this experiment the location of the hidden
platform remained unchanged during all tests. Each run was started by placing the robot
at particular starting location within the box and terminated when the robot reached the
hidden platform. After 3 runs the starting location was randomly shifted to a new location
within the box.
Knowledge transfer to reach a new hidden goal – In this experiment the location of the
hidden platform was changed during the course of each test. Each run was started by
placing the robot at the same starting location within the box and terminated when the
robot reached the hidden platform. After 3 runs the hidden platform was randomly shifted
to a new location without informing the robot.
7.2.2 Results & Discussion
1)
Figure 29: Response of fixed hidden platform with new starting locations at run 1, 4 & 7
The results of 3 tests are presented in Figure 29. Once the robot reached the hidden goal it
would store the corresponding observations for tracking on subsequent runs. Hence, all
subsequent visits took fewer actions and after 4 tests the robot learned to go to the hidden
goal directly.
Figure 30: a) Initial b) Run 1 c) Run 2 d) Run 7
Figure 30 depicts the paths traversed by the robot in a few selected runs. Figure 30a
shows the starting location of the robot as well as the location of the hidden platform.
Figure 30b is a trace of the path explored to find the hidden area from the initial release
i.e. run 1. Notice that the robot moved in arcs when it was turning and moved in straight
lines for the forward and backward actions. At times it was subjected to slippage caused
by the non-uniform friction of the floor and also when it impacted walls. Figure 30c is a
trace of the path taken when tracking from the starting position to the hidden platform
during run 2.
As is visible, the robot had learned the location of the hidden platform and was able to
navigate to it using the learned model. Figure 30d details the results of run 7, in which the
robot was placed at a new starting location. SBL tracked and reached the hidden area
very quickly using its learned model and knowledge transfer mechanisms. The
diminishing number of actions on each subsequent run verifies that SBL is able to learn
to reach the hidden goal from any starting location.
2)
Figure 31: Response of fixed start location with hidden platform moving at run 1, 4 & 7
The results of 3 tests are presented in Figure 31. Once the robot reached the hidden goal it
would store the relevant goal observations for tracking on subsequent runs. However, on
run 4 & 7 when the hidden goal was shifted, the robot was forced to test all known goal
observations, conclude that the goal had been shifted and then switch to wandering.
Notice that on run 4 in test 2 the robot happened to locate the shifted goal while moving
to its first found goal location. Thus, it did not proceed to the wandering phase.
Figure 32: a) Initial b) Run 1 c) Run 2 d) Run 7
Figure 32 depicts the paths traversed by the robot in a few selected runs. Figure 32a
shows the starting location of the robot as well as the location of the hidden platform.
Figure 32b is a trace of the path explored to find the hidden area during run 1. Figure 32c
is a trace of the path taken when tracking from the starting position to the hidden platform
during run 2. Figure 32d details the results of run 7, in which the robot was placed at the
same starting location and the hidden platform was randomly relocated.
These experiments conclude that SBL can successfully solve the water maze. When the
robot was first released from any arbitrary starting location it randomly explored and
located the hidden platform. SBL captured several goal observations while inside the
hidden area and stored it in a dynamic list with some success and failure statistics. On
subsequent attempts when the robot was released from any starting location, SBL was
able to plan its way to the hidden platform by cycling through the goal observations
based on the probability computed from previous successes. At some point in time as the
hidden platform was randomly relocated, SBL visited each goal observation in its
dynamic list while continuously updating the statistics for each observation and added
new goal observations if it came across any. This ensured that if it stumbled across the
new location it would reinforce the information pertaining to it. Yet, there were times
when all the observations in the list were visited and the hidden platform was not located.
Then, SBL concluded that the platform had been relocated and switched to random
actions. This verified strategy allows SBL to learn a model and transfer it to handle
dynamic goal changes without prompting relearning.
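The goal management strategy described above can be sketched as follows; estimating the success probability as a simple success ratio, and the optimistic initial count, are assumptions of the sketch, and the goal observation names are hypothetical.

class GoalManager:
    # Maintain the dynamic list of goal observations for an indirectly observable goal.
    # Stored observations are tried in order of estimated success probability; if every
    # one of them fails, the learner concludes the goal has moved and switches to wandering.
    def __init__(self):
        self.stats = {}                                   # observation -> [successes, attempts]

    def add(self, observation):
        self.stats.setdefault(observation, [1, 1])        # optimistic initial statistics

    def record(self, observation, reached):
        s, n = self.stats[observation]
        self.stats[observation] = [s + (1 if reached else 0), n + 1]

    def next_target(self, already_failed):
        candidates = [o for o in self.stats if o not in already_failed]
        if not candidates:
            return None                                   # all goal observations failed: wander randomly
        return max(candidates, key=lambda o: self.stats[o][0] / self.stats[o][1])

gm = GoalManager()
gm.add("near_red_wall"); gm.add("near_blue_corner")       # hypothetical goal observations
gm.record("near_red_wall", reached=False)
print(gm.next_target(already_failed={"near_red_wall"}))   # tries the remaining observation next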
7.3 Unpredicted Configuration Changes in the Environment
Dynamic situations occur in the environment when the configuration of objects changes
unpredictably. This could easily occur when agents other than the learning robot
operating in the same environment add, remove or relocate objects. The learner must be
able to differentiate such situations from sensor and action changes, and update the
learned model accordingly. In the presence of configuration changes, SBL does not
discard the entire prediction model. Instead, it only changes parts of the model that need
to be repaired, thereby preserving its past experiences, and conserving a large amount of
resources.
7.3.1 Approach
To test the adaptivity to such changes, the hunter-prey layout was used.
Figure 33: The hunter cannot traverse the gray obstacles
A fixed map size of 7x7 was selected starting with the hunter and goals located as shown
in Figure 33. Initially 3 obstacles were placed in the dark gray area and the hunter was
tasked with visiting the goals in an alternating manner. A run was considered complete
each time the hunter visited the desired goal location. After 5 runs were complete the
environment was altered by relocating the 3 obstacles to the light gray area. This caused a
change in configuration that the robot was not informed of, forcing it to detect it and
update its prediction model in order to successfully complete 5 more runs prior to
completing the test. At least 20 tests were completed to obtain reasonable averages.
7.3.2 Results & Discussion
Figure 34: Actions executed to reach the goal in each run with environmental changes
Figure 35: Number of surprises encountered during each run with environmental changes
As seen in Figure 34 & 35 during run 1 & 2 the robot is learning the map through random
action execution as these 2 goals have not been seen yet. During runs 3 to 5 SBL
produces good plans to reach the corresponding goal. In run 6 the robot encounters some
surprises, causing it to execute more actions to explore the environment and update its
model. Once again from run 7 onwards the prediction model is up-to-date as no
unpredicted changes in the environment’s configuration have occurred, so the robot is
able to track the goals successfully without any further surprises.
These results conclude that SBL is able to detect unpredicted configuration changes in
the environment through surprises. Then, obsolete rules are flagged by rule forgetting,
and rule maintenance (detailed in chapter 3.4) updates the prediction model to provide
adaptation to unpredicted configuration changes in the environment.
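A simplified sketch of this surprise-driven repair is given below, reusing the (condition, action, prediction) rule abstraction from the earlier planning sketch; the forgetting threshold and the learn_rule callback are hypothetical.

class Rule:
    # Simplified prediction rule: if condition(state) holds and `action` is taken,
    # prediction(state) is the expected next observation.
    def __init__(self, condition, action, prediction):
        self.condition, self.action, self.prediction = condition, action, prediction
        self.consecutive_failures = 0

def detect_and_repair(model, state, action, observed, learn_rule, forget_after=3):
    # Compare each applicable rule's prediction with the observation; on a surprise only
    # the rule that made the wrong prediction is penalised, and after repeated failures it
    # is forgotten and replaced locally, leaving the rest of the model intact.
    for rule in list(model):
        if rule.action != action or not rule.condition(state):
            continue
        if rule.prediction(state) == observed:
            rule.consecutive_failures = 0                        # the rule still holds
        else:
            rule.consecutive_failures += 1                       # surprise: flag this rule only
            if rule.consecutive_failures >= forget_after:
                model.remove(rule)                               # forget the obsolete rule...
                model.append(learn_rule(state, action, observed))  # ...and repair that part of the model
    return model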
7.4 Unpredicted Sensor Changes
Sensors could change unpredictably during runtime due to accidents, wear & tear or even
repairs that are done throughout the lifetime of a robot. A fully autonomous robot should
be able to detect and adapt to such situations without any human intervention. In the
presence of sensor changes, SBL does not discard the entire prediction model. Instead, it
only changes parts of the model that need to be repaired, thereby preserving its past
experiences, and conserving a large amount of resources.
7.4.1 Approach
To test adaptivity to unpredicted sensor changes the hunter-goal layout was used. A fixed
map size of 7x7 was selected starting with the hunter and goals located as shown in
Figure 33. Once again the hunter was tasked with visiting the two goals in an alternating
manner. A run was considered complete each time the hunter visited the desired goal
location. Initially the hunter’s X and Y location attributes responded as prescribed by the
game environment. However, after 5 runs were complete the X & Y attributes were
swapped without informing the robot. The robot had to detect and adapt to this change in
order to complete the 5 remaining runs of the test. At least 20 tests were completed to
obtain reasonable averages.
7.4.2 Results & Discussion
Figure 36: Actions executed to reach the goal in each run with sensor changes
Figure 37: Number of surprises encountered during each run with sensor changes
As seen in Figure 36 & 37 during run 1 & 2 the robot is learning the map through random
action execution as these 2 goals have not been seen yet. During runs 3 to 5 SBL
produces good plans to reach the corresponding goal. In run 6 many surprises occur as
the sensor definitions have changed, so the robot ends up executing more actions before
it can update its model accordingly. From run 7 onwards the prediction model now
reflects the correct outcome for the swapped sensor attributes, so the robot is able to track
the goals successfully without any further surprises.
These results conclude that SBL is able to detect unpredicted sensor changes through
surprises. Then, entity and attribute relevance identifies the sensors that have changed,
and updates the prediction model with rule forgetting and maintenance, to provide
adaptation to unpredicted changes in sensors.
7.5 Unpredicted Action Changes
Actions could change unpredictably during runtime due to accidents, wear & tear or even
repairs that are done throughout the lifetime of a robot. A fully autonomous robot should
be able to detect and adapt to such situations without any human intervention. In the
presence of action changes, SBL does not discard the entire prediction model. Instead, it
only changes parts of the model that need to be repaired, thereby preserving its past
experiences, and conserving a large amount of resources.
7.5.1 Approach
To test adaptivity to unpredicted action changes the hunter-goal layout was used. A fixed
map size of 7x7 was selected starting with the hunter and goals located as shown in
Figure 33. Once again the hunter was tasked with visiting the two goals in an alternating
manner. A run was considered complete each time the hunter visited the desired goal
location. Initially the hunter’s four actions North, South, East & West would result in the
hunter moving in those directions respectively. However, after 5 runs were complete, the
North and East actions were swapped without informing the robot, effectively simulating
an accidental crosswire. The robot had to detect and adapt to this change in order to
complete the 5 remaining runs of the test. At least 20 tests were completed to obtain
reasonable averages.
7.5.2 Results & Discussion
Figure 38: Actions executed to reach the goal in each run with action changes
Figure 39: Number of surprises encountered during each run with action changes
As seen in Figure 38 & 39 during run 1 & 2 the robot is learning the map through random
action execution as these 2 goals have not been seen yet. During runs 3 to 5 SBL
produces good plans to reach the corresponding goal. In run 6 the planned actions yield
surprises, so the robot ends up executing more actions before it can update its model
accordingly. From run 7 onwards the prediction model now reflects the correct outcome
for the swapped actions, so the robot is able to track the goals successfully without any
further surprises.
These results conclude that SBL is able to detect unpredicted action changes through
surprises. Then, entity and attribute relevance identifies the actions that have changed,
and updates the prediction model with rule forgetting and maintenance, to provide
adaptation to unpredicted changes in actions.
7.6 Repairing vs. Rebuilding the Learned Model from Scratch
When SBL detects an unpredicted change it could opt to repair the prediction model or
rebuild it from scratch. In practice, since SBL detects unpredicted changes through
surprises, it is not possible to force it to rebuild the model from scratch as it would
endlessly rebuild. In addition, it is wasteful to rebuild the parts of the model that are still
valid and intact. Surprises enable SBL to identify the parts of the model that have
changed, and only repair those that are necessary, thereby preserving its past experiences,
and conserving a large amount of resources.
7.6.1 Approach
Nevertheless, to establish a comparison between repairing and rebuilding from scratch,
two experiments were conducted with hunter-goal layout. In both experiments, a fixed
map size of 7x7 was selected. The hunter was tasked with visiting the two goals in an
alternating manner. A run was considered complete each time the hunter visited the
desired goal location. Initially the hunter’s four actions North, South, East & West would
result in the hunter moving in those directions respectively. However, after 5 runs were
complete the North and East actions were swapped. In the first experiment, SBL
maintained its default behavior, which was to detect and repair the prediction model,
while in the second experiment it was forced to rebuild the prediction model from scratch
from the beginning of run 6. Each experiment was repeated at least 20 times to obtain
reasonable averages.
7.6.2 Results & Discussion
Figure 40: Actions executed when the model is repaired vs. rebuilt
Figure 41: Surprises encountered with repair vs. rebuild
As seen in Figure 40 the number of actions executed in run 6 is lower when the model is
repaired rather than rebuilt. This is due to the fact that rebuilding from scratch requires
relearning all four actions, while repairing requires the relearning of two actions.
Rebuilding results in random action selection, whereas repairing generates plans that are
partially accurate. As seen in Figure 41 the number of surprises is higher for rebuilding
than it is for repairing, which indicates that more actions would be executed as a result of
rebuilding, as the accuracy of plans decreases when there are surprises.
The results of comparing repairing vs. rebuilding the learned model from scratch
conclude that SBL’s default behavior of identifying and repairing the parts of the
learned model that have changed is well suited for an autonomous robot. This strategy
ensures the preservation of past experiences so as to prevent the wastage of resources in
repeating history.
7.7 Relevance and Unpredicted Sensor & Action Changes
Gracefully adapting to unpredicted sensor and action changes is essential for a lifelong
learning robot. As seen in previous sections there are several situations which require the
detection of unpredicted changes, identification of relevant sensors & actions, and
adaptation of the learned model. In this section, we investigate how SBL performs the
above through a systematic series of experiments.
7.7.1 Approach
Using the box environment, several experiments were devised to test unpredicted sensor
and actions changes in SBL by varying the number of sensors, altering them at runtime
and toggling some of the actions of the robot.
The experiments were designed in increasing complexity as follows:
1) Test the minimal configuration of hardware required to learn the environment for
successful navigation to a goal observation. The robot is given only the correct
and relevant sensors and actions. This serves as the base case for comparing these
experiments.
2) Test the ability to handle sensor changes by adding a sensor reporting a constant
value.
3) Test the ability to handle irrelevant sensors by adding a sensor reporting random
numbers that have no correlation to the environment or actions.
4) Test unpredicted sensor change and subsequent recovery by deliberately rotating
the camera upside down (180º) after SBL learned a good model.
5) Test unpredicted action change and recovery in actions by swapping the left and
right turn actions after SBL learned a good model.
Each experiment is conducted several times by varying the starting location & orientation
of the robot and terminated only after the robot demonstrated successful tracking of
several randomly assigned targets.
7.7.2 Results & Discussion
Table 7: Adapting to action & sensor changes in the constrained box environment
# | Entities | Description | Max. Rules | Avg. Rules | Avg. Actions
1 | Blobs, Proximity | Minimal configuration to test the basic capabilities of SBL | 420 | 148 | 88
2 | Blobs, Proximity, Constant | Added a constant sensor to test sensor relevance | 426 | 152 | 92
3 | Blobs, Proximity, Constant, Random | Added a random sensor to test sensor relevance | 512 | 155 | 103
4 | Blobs, Proximity, Constant, Random | Flipped the camera by 180º to test adaptivity to unpredicted sensor changes | 600 | 144 | 136
5 | Blobs, Proximity, Constant, Random | Swapped the left and right turn actions to test adaptivity to unpredicted action changes | 740 | 139 | 152
Table 7 displays the experiment number, the entities used, and a brief description of the objective, followed by the maximum and average number of rules learned, and the average number of actions executed during the learning process. As rule forgetting ignored rules instead of deleting them, the maximum number of rules indicated the total number of rules explored since the commencement of the experiment, whereas the average number of rules only counted the rules that were active during learning, because these represented the current model.
1)
The average number of rules learned in experiment 1 is 148. The difference between the
maximum number of rules 420 and the average number is evidence that rule forgetting
and focus of attention through entity & attribute relevance is functioning properly. An
interesting observation was that SBL learned to ignore the floor and vertical displacement
for turning due to numerous surprises.
2)
The robot ignored the constant sensor, proving that such failures would not adversely
affect the quality of the learned model or the learning time.
3)
Initially, the algorithm constructed several rules using the random sensor and as soon as a
contradiction occurred or after several consecutive failures the sensor was flagged as
irrelevant and was ignored during subsequent learning.
4)
This immediately caused certain rules, such as the ones associated with the location attributes, to produce contradictions. These attributes were flagged as irrelevant (causing the average number of rules to dip slightly as rules disappeared), but were subsequently toggled by forced relevance because the robot was not making any progress. Thereafter,
SBL successfully learned a new accurate model.
5)
The swapping of the left and right turn actions resulted in several contradictory and
consecutive surprises forcing those rules to be forgotten and new rules reflecting the new
situation to be created. SBL gracefully recovered from these unpredicted changes in
actions.
These results confirm that SBL provides adaptivity to unpredicted changes in actions &
sensors for an autonomous robot, by evaluating their relevance with respect to the
structure of the environment. This further reinforces the conclusions in previous sections.
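As a rough illustration of the relevance mechanism exercised in experiments 2 through 5, the sketch below keeps per-attribute contradiction counts, flags attributes as irrelevant when they keep causing surprises, and toggles them back via forced relevance when the robot stops making progress. The class, method names and thresholds are illustrative assumptions, not SBL's actual implementation.

```python
class RelevanceTracker:
    """Minimal sketch of sensor/attribute relevance bookkeeping (assumed API)."""

    def __init__(self, attributes, contradiction_limit=3, stall_limit=20):
        self.active = {a: True for a in attributes}       # currently relevant attributes
        self.contradictions = {a: 0 for a in attributes}  # consecutive surprise counts
        self.contradiction_limit = contradiction_limit
        self.stall_limit = stall_limit
        self.actions_without_progress = 0

    def record_surprise(self, attribute):
        """Count a contradiction against an attribute; ignore it if it keeps failing."""
        self.contradictions[attribute] += 1
        if self.contradictions[attribute] >= self.contradiction_limit:
            self.active[attribute] = False                # flag as irrelevant

    def record_success(self, attribute):
        self.contradictions[attribute] = 0                # reset on a correct prediction

    def record_progress(self, made_progress):
        """Forced relevance: re-enable ignored attributes if the robot is stuck."""
        self.actions_without_progress = 0 if made_progress else self.actions_without_progress + 1
        if self.actions_without_progress >= self.stall_limit:
            for attribute in self.active:
                if not self.active[attribute]:
                    self.active[attribute] = True         # toggle back for re-evaluation
                    self.contradictions[attribute] = 0
            self.actions_without_progress = 0
```

In the camera-flip experiment, for example, such bookkeeping would first flag the location attributes as irrelevant and later re-enable them once the lack of progress triggers forced relevance.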
7.8 Simultaneous Unpredicted Changes in Sensors, Action & Goals
Detecting and adapting to independent unpredicted changes in sensors, actions and goals
is challenging, but in reality any or all of these could occur at the same time. SBL has
built-in mechanisms to enable a robot to overcome such catastrophic situations
autonomously. In the presence of simultaneous unpredicted changes, SBL does not
discard the entire prediction model. Instead, it only changes parts of the model that need
to be repaired, thereby preserving its past experiences, and conserving a large amount of
resources.
7.8.1 Approach
Figure 42: Goal locations in hunter layout
A 7x7 hunter environment was used with goal locations as marked in Figure 42. The
robot was tasked with reaching a specified goal location, so when it was reached the run
was deemed complete and another goal location would be automatically selected for the
next run. Several unpredicted changes were simultaneously introduced after 5 runs were
completed. In particular, the action definitions for N and W were swapped such that
executing N results in moving one cell West while executing W results in moving one
cell North. The U attribute of the location sensor was forced to a constant value 0
simulating a failure. The goals G1 & G2 were removed while G3 & G4 were added. The
test was terminated after 5 more runs were completed. 10 tests were conducted to obtain
reasonable averages.
7.8.2 Results & Discussion
Figure 43: Actions executed to reach the goal in each run with several changes
Figure 44: Number of surprises encountered during each run with several changes
The “Random” series in Figure 43 describes the performance of the robot executing
purely random actions without any form of learning, while the “SBL” series describes its
performance using simultaneous representation and learning via SBL. The random series
would also reflect the robot’s performance if its model was rebuilt from scratch every
time a surprise occurred.
Initially, the robot was started with standard actions and sensors from the center of the map. The desired goal was G1, but it could not be sensed remotely. Hence, in run 1 SBL constructed its model via random action execution until it wandered onto the goal G1. In Figure 44, the high number of surprises for run 1 as opposed to run 2 is an indication that the model was creating and maintaining prediction rules. At the beginning of run 2 the desired goal was automatically switched to G2, forcing the robot to wander again despite having an accurate model. The similarity between the first two data points of both series in Figure 43 confirms this behavior. Up to run 5 the desired goals alternated between G1 & G2, so SBL was able to use its model to traverse the now known goal locations in far fewer actions.
At the beginning of run 6 the three unpredicted changes were introduced simultaneously to the robot without informing SBL. Despite all of these changes, the robot still managed to visit the previous goal locations and establish that it needed to wander around and locate the desired goal G3. The large number of surprises in run 6 is an indicator that the prediction model was violated by these unpredicted changes, while the lack of surprises in run 7 shows that SBL had adjusted the model accordingly. Once again, from run 8 onwards the robot used the updated model to traverse the now known goal locations G3 & G4 alternately in far fewer actions.
The results from this simulated robot experiment conclude that SBL detects, reacts, and
adapts to simultaneous unpredicted changes in the robot’s actions, sensors and goals in an
unsupervised manner. The comparison to the baseline of random action execution shows that the learned models do indeed contribute to the purposeful behaviors of the robot.
7.9 Simultaneous Unpredicted Changes in Sensors, Action, Goals & the
Environment’s Configuration
Detecting and adapting to simultaneous unpredicted changes in sensors, actions, goals
and the environment’s configuration is one of the most complex situations that an
autonomous robot may face when operating in a real environment without any external
intervention. SBL has built-in mechanisms to enable a robot to overcome such
catastrophic situations autonomously. In the presence of simultaneous unpredicted
changes, SBL does not discard the entire prediction model. Instead, it only changes parts
of the model that need to be repaired, thereby preserving its past experiences, and
conserving a large amount of resources.
7.9.1 Approach
As in previous experiments, in the office environment the robot was tasked with reaching
a specified goal observation, which happened to be a particular book lying on the floor
against a wall. A run was complete when the goal was reached, then a new run would
commence by assigning another desired observation, which was another book placed at a
different location. On subsequent runs these two goals were reassigned alternately. At the
beginning of run 5, all unpredicted changes were introduced simultaneously. The action
definitions for L and R were swapped simulating a cross-wire, the camera was rotated by
90° and displaced by about 15cm along the top of the laptop, and one of the books in the
goal observation was moved to another wall effectively bringing the two goals closer.
The test was terminated when 3 more runs were completed. Due to limitations of battery power, the total number of runs was restricted to 8 and the averages of 3 separate tests were obtained.
7.9.2 Results & Discussion
Figure 45: Actions executed to reach the goal in each run with several changes
Figure 46: Number of surprises encountered during each run with several changes
Initially, the robot was started with standard actions and sensors, with the two goal observations on either side. In run 1 SBL randomly executed actions (as there was no model to start with) until the desired goal was visible, as seen in Figure 45. In Figure 46, the high number of surprises for run 1 as opposed to run 2 indicates that the prediction model continued to be updated even while locating the second goal. In run 3 and run 4 the robot reached the now known goals after executing 12 actions on average. Notice that the lower number of surprises here indicates that the prediction model had become more accurate and no unpredicted changes had occurred.
In run 5, all unpredicted changes were simultaneously introduced. The number of actions
executed increased together with the number of surprises as SBL detected the unpredicted
changes. From run 6 onwards the model was sufficiently up-to-date so the robot executed
9 actions on average to alternate between the goals. A video of a smaller test is accessible
via the following link:
http://www.isi.edu/robots/SBL/movies/unexpected.wmv
The results from this real-world robot experiment conclude that, despite being unable to correctly identify the exact set of unpredicted changes, SBL detects and adapts to simultaneous unpredicted changes in the robot's actions, sensors, goals and environment in an unsupervised manner. Note that sufficient hardware redundancy must be available after simultaneous unpredicted changes in order for the model to be relearned
appropriately. In other words, the robot must maintain enough sensors to identify the
cause of a surprise, and have some actions that demonstrate purposeful behavior.
7.10 Summary
This chapter investigated the capability of SBL in detecting and adapting to unpredicted
changes in a robot’s sensors, actions, goals and the environment’s configuration, in an
unsupervised manner. When an unpredicted goal change occurred and the goal was directly observable, SBL generated a new plan to reach it. However, when the goal was only indirectly observable, the goal management mechanism revisited previous goal observations and then switched to random exploration until the goal was found. When an
unpredicted configuration change in the environment, or a sensor change, or an action
change occurred, SBL detected it via surprises and repaired the learned prediction model
accordingly. Similarly, SBL was able to detect and adapt to simultaneous unexpected
changes in these aspects. Furthermore, SBL performed adaptation by identifying and
repairing the parts of the learned model that have changed, rather than rebuilding it from
scratch. Hence, it preserved past experiences and prevented the wastage of resources in
repeating history.
Chapter 8
Detecting and Reasoning with Unpredicted
Interference
Interference such as noise and gaps occurs unpredictably over a short period of time. This
chapter describes how SBL can detect, reason with, and if need be correct such situations.
In particular, SBL assumes that no other changes would occur after the initial learning
phase. Experimental results of testing this ability on video activity recognition and gap
filling are presented in this chapter.
8.1 Noise & Missing Data
On a robot, interference is a disturbance experienced by its sensors or actions over a short
period of time, which differs from a fault or failure that persists indefinitely. A sensor
could experience interference due to random noise or intermittent sensing errors such as
missing data, caused by occlusions and transmission errors. Since the learner may miss
critical events due to interference it must be able to detect and reason with it without
adapting the learned model, as that would prevent the model from becoming stable over a
period of time.
In applications where the learned model must continuously adapt to unpredicted changes
SBL will detect surprises when noise occurs. This will result in the probabilities of
prediction rules being updated, but the model will not be adapted unless the noise persists
over a long period of time.
In applications where the learned model will not change, it is possible to detect
interference as surprises and perform reasoning even in the presence of missing data or
gaps. For example in a video surveillance application where the learner is trying to
recognize actions carried out by humans, the data stream could be lost for a few seconds
due to interference, yet it may be possible for the learner to reason about what occurred in
the gap based on the observations made before and after it. The rest of this section will
present an extension to SBL designed to deal with unpredicted interference under the
assumption that no further adaptation of the prediction model is allowed after the initial
learning phase.
8.2 Experimental Setup
The robot is a passive observer tasked with recognizing activities or actions executed by
humans observed in video data streams. The publicly available visual intelligence dataset
(http://www.visint.org) consisting of nearly 5,000 videos containing demonstrations of 48
different actions was used. Each video was about 1 minute in duration. SBL focused on
detecting 24 actions, namely “approach”, “arrive”, “bounce”, “carry”, “catch”, “chase”,
“collide”, “drop”, “exchange”, “exit”, “flee”, “go”, “haul”, “leave”, “lift”, “move”,
“pass”, “pick up”, “push”, “put down”, “raise”, “replace”, “throw” and “walk”.
Figure 47: Action recognition system dataflow
An underlying vision module shown in Figure 47 provides a data stream that contains
tracked entities corresponding to actors and objects in each frame and their attributes.
Depending on the experiment, the vision module could be annotated ground truth, the USC vision system, or synthetically generated trends. The annotated ground truth is acquired by human annotators, resulting in precise but sometimes biased data. The USC vision system is comprised of a specialized human detector [SWN08], several Felzenszwalb [FH04] object detectors, background subtraction and a fast moving blob detector [NN08]. These detectors are fairly noisy, resulting in a reasonable amount of true and false tracks. Synthetic trends are generated by observing the ground truth or USC vision tracks and abstracting them to remove noise in order to prevent overfitting.
The detected actor and object tracks serve as entities of the camera sensor with the
attributes x-location, y-location and velocity. New composite entities are also formed by
taking two actors, or two objects, or an actor and an object, with the relative distance
between them stored as an attribute calculated in frame coordinates. The velocity and
relative distances are calculated in 2D pixel coordinates as follows:
velocity = √( (x_t − x_(t−1))^2 + (y_t − y_(t−1))^2 ) (36)

relative distance = √( (x_a − x_b)^2 + (y_a − y_b)^2 ) (37)

where (x_t, y_t) and (x_(t−1), y_(t−1)) are an entity's positions in the current and previous samples, and (x_a, y_a), (x_b, y_b) are the positions of the two entities forming a composite entity.
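A minimal sketch of these two attribute computations is given below; the function names and the track/position representations are assumptions made for illustration, not SBL's actual interfaces.

```python
import math

def velocity(track, t):
    """Displacement of a tracked entity between consecutive samples, in pixels (Eq. 36)."""
    (x1, y1), (x2, y2) = track[t - 1], track[t]
    return math.hypot(x2 - x1, y2 - y1)

def relative_distance(pos_a, pos_b):
    """Euclidean distance between two entities in frame coordinates (Eq. 37)."""
    (xa, ya), (xb, yb) = pos_a, pos_b
    return math.hypot(xa - xb, ya - yb)
```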
SBL is provided with the comparison operators %, ~, <, <=, =, !=, >=, >, ↑, ↓ to analyze
the data.
Figure 48: a) Frame 1 b) Frame 5
Figure 49: a) Frame 10 b) Frame 30
Figure 48 shows an example of an occlusion occurring in frame 1 of the video stream and
a gap occurring in frame 5, while the two frames in Figure 49 are sufficient for the
learner to conclude that the actions “arrive”, “approach”, “exchange” and “move” have
occurred. The goal is to recognize these activities or actions in the presence of noise and
missing data.
8.3 Approach
Figure 50: SBL action recognition process
The SBL action recognizer has three phases, model learning, similarity bounds learning,
and testing as in Figure 50. The system is provided with a few fixed parameters such as
the cut-off level for the confidence of tracks, a smoothing rate to average the readings
over a number of frames, the sampling frequency and the tolerance for equality of values.
Details of each phase are presented in subsequent sections.
8.3.1 Model Learning Phase
In this application SBL makes observations by sampling the video data stream at fixed
intervals such as every 10 frames. Some of the noise in sensing can be reduced by
averaging the sensor data over this sampling period, which is also referred to as a data
segment. A prediction model in SBL is comprised of prediction rules that have a set of
conditions, an action and a set of predictions. While learning exactly one action persists
throughout the video, therefore it is possible to learn a prediction model for an action as
described in chapter 3 by observing data before and after each segment. Next, SBL will
parse the data again with the prediction model to determine which rules are fired in each
segment. If a rule fired successfully then its conditions and predictions were satisfied by
the data stream. States are created by grouping the rules that fired simultaneously. The
data stream can then be represented as a sequence of states. The transitions between these
states are captured in a Markov chain.
During the model learning phase SBL is shown multiple examples possibly reflecting
slightly different exemplars of a given action, such that the prediction model is refined
with each new video. However, once the Markov chain is created it is important to parse
all the training videos again to accurately calculate the state transition probabilities. This
process enables SBL to adjust the state space dynamically during learning, in contrast to competing techniques such as Hidden Markov Models and Neural Networks.
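A rough sketch of this grouping and transition counting is shown below; the function name, the set-of-rule-ids input format and the probability normalization are illustrative assumptions rather than SBL's actual data structures.

```python
from collections import defaultdict

def build_markov_chain(segment_rule_sets):
    """Group rules fired per segment into states and tally state transitions.

    segment_rule_sets: list of sets of rule ids fired in each data segment,
    e.g. [{"R0", "R2"}, {"R0", "R2"}, {"R1", "R3"}, ...].
    """
    state_ids = {}                               # frozenset of rules -> state label
    sequence = []
    for fired in segment_rule_sets:
        key = frozenset(fired)
        if key not in state_ids:
            state_ids[key] = "S%d" % (len(state_ids) + 1)
        sequence.append(state_ids[key])

    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1                # count transitions between consecutive states

    transitions = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        transitions[state] = {s: c / total for s, c in successors.items()}
    return sequence, transitions
```

Applied to the fired-rule sequence of the "replace" example later in this section, such a routine would produce a run of S1 states followed by a run of S2 states, with the transition counts captured in the returned chain.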
SBL can learn from positive and negative examples. For example, "bounce" is a negative example for the "drop" action. When learning the prediction model SBL does not differentiate between positive and negative examples; it simply attempts to learn a model that eliminates surprises in the data stream. However, only positive examples are parsed again to create the Markov chain, which reinforces valid state transitions observed in positive examples and avoids invalid state transitions observed in negative examples. Note that in
this case positive and negative examples share the same variable bindings, e.g. both “drop”
and “bounce” refer to an object with an initial similarity and subsequent difference in the
velocity profiles.
Figure 51: Relationship between examples, models and an action
An action may be represented using multiple models as shown in Figure 51. Models that
are learned from a combination of positive and negative examples contribute additive
results. This means that for a given dataset, the union of the subsets of results returned by
each model identifies the set of results that contain the learned action. Imprecise positive
models could result in some false positive classifications. False positive models can be
learned from these false positive examples such that their results can be subtracted or
removed from the results of the positive models. Note that false positive examples need
not share the same variable bindings as positive and negative examples. Hence, false
positive models can be used to accommodate biases introduced by human judgment
during action recognition.
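The additive/subtractive combination described above amounts to simple set operations; the sketch below assumes, purely for illustration, that each model reports the set of video identifiers it classifies as containing the action.

```python
def combine_detections(positive_model_hits, false_positive_model_hits):
    """Union the detections of positive models, then subtract those of false positive models.

    Each argument is a list of sets of video ids (an assumed representation).
    """
    detected = set().union(*positive_model_hits) if positive_model_hits else set()
    rejected = set().union(*false_positive_model_hits) if false_positive_model_hits else set()
    return detected - rejected
```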
Figure 52: Graphical depiction, sensor data and prediction model for “replace”
Figure 53: Segmented data with fired rules for “replace”
Figure 54: Markov chain for “replace”
Rules learned for "replace" (Figure 52):
R0: C R1 P0.9 [ (REL 0 % 0) AND (REL 1 V < 60) ] [ACTION = 1] [((REL 0 V = 50) OR (REL 1 V != 60) OR (REL 1 ~ 0))]
R1: C R0 P1.0 [ (REL 0 % 0) AND NOT { (REL 1 V < 60) } ] [ACTION = 1] [((REL 0 V != 50) OR (REL 0 ~ 0)) AND ((REL 1 V ↑ 5))]
R2: C R3 P0.9 [ (REL 1 % 0) AND (REL 1 V > 60) ] [ACTION = 1] [((REL 1 V ↓ 5) OR (REL 1 V != 60) OR (REL 1 ~ 0))]
R3: C R1 P1.0 [ (REL 1 % 0) AND NOT { (REL 1 V > 60) } ] [ACTION = 1] [((REL 1 V ↑= 5) OR (REL 1 ~ 0)) AND ((REL 1 V = 60))]
Rules fired in each segment (Figure 53): {R0,R2} in the first seven segments, followed by {R1,R3} in the remaining eight segments.
As an example of the model learning phase consider the “replace” action shown in Figure
52. A human carrying a bag approached a stationary bag, replaced it with the one that she
carried and walked away with the bag that was originally stationary. The image adjoining
the human contains two graphs. The area marked in the lighter color (pink) indicates the
relative distance between the human and the bag she was carrying while the area marked
in the darker color (purple) indicates the relative distance between the human and the bag
that was initially stationary. SBL sampled the data and learned the prediction model
shown beside the graph. The rules fired in each segment were grouped together as shown
in Figure 53 to form the state sequence D_replace = [S1, S1, S1, S1, S1, S1, S1, S2, S2, S2, S2, S2, S2, S2, S2]. The transitions between these states were captured in a Markov chain as shown in Figure 54.
8.3.2 Similarity Metric & Similarity Bounds Learning Phase
Given a video and a learned prediction model, SBL evaluates if the video contains the
learned action. This is performed by applying the data stream to the prediction model and
extracting the rules that fired successfully at each sample segment. Then the data stream
is converted to a state sequence by grouping these rules, e.g. D_test = [S1, S1, S1, S2, S2]. For
each learned Markov chain, a similarity metric is calculated by validating the observed
state sequence against it. The metric applies a predetermined negative penalty for each
invalid start state, end state, and state transition, while each valid state transition is scored
zero.
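A minimal sketch of this penalty-based metric, assuming the Markov chain is stored as a map from each state to its set of valid successors, is shown below; the fixed penalty value and the explicit start/end state sets are illustrative assumptions.

```python
def similarity(observed, chain, start_states, end_states, penalty=-1.0):
    """Score an observed state sequence against a learned Markov chain.

    chain maps a state to its set of valid successor states. Valid transitions
    score zero; each invalid start state, end state or transition adds a fixed
    negative penalty.
    """
    score = 0.0
    if observed[0] not in start_states:
        score += penalty                          # invalid start state
    if observed[-1] not in end_states:
        score += penalty                          # invalid end state
    for current, nxt in zip(observed, observed[1:]):
        if nxt not in chain.get(current, set()):
            score += penalty                      # invalid state transition
    return score
```

Because valid transitions score zero, the best possible similarity is 0, which is why the bounds learning phase described below uses zero as the upper bound.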
The actions of interest could be executed over a variable duration, and could occur at any
time within videos of variable durations. Prior to the development of this similarity
metric, the log-likelihood similarity estimate was computed by parsing the observed state
sequence through the learned Markov chain in the forward and backward directions.
Unfortunately, the log-likelihood estimate varies significantly as the duration of the
action varies due to recurring state transitions. Hence, this similarity metric is important
as it allows SBL to compare actions of varying durations, across videos of varying
durations.
The similarity bounds learning phase improves tolerance to noise by identifying the range
of similarity values that correctly classify the trained action. In this phase SBL is shown
all the training videos that human evaluators have classified as the given action. The
lower bound of the range is set to the lowest similarity value observed in these videos,
while the upper bound is set to zero.
8.3.3 Testing Phase
A variable binding problem arises when evaluating an unclassified video as entities
stored in the prediction model may match several entities in the testing video. For
example when testing the “replace” video in Figure 52, if there was a third bag in the
scene then there would be 6 possible bindings to the two entities in the prediction model
that need to be tested. As some of these mappings are interchangeable SBL still needs to
verify at least 3 bindings. Therefore, the strategy is to establish the similarity for each
unique binding by parsing the video multiple times and keeping the highest value as the
most likely match.
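One way to realize this binding search is sketched below; score_under_binding is a hypothetical helper that would parse the video under a fixed entity binding and return its similarity value.

```python
from itertools import permutations

def best_binding_similarity(model_entities, test_entities, score_under_binding):
    """Try every assignment of test-video entities to the model's entities and
    keep the highest similarity value as the most likely match."""
    best = float("-inf")
    for chosen in permutations(test_entities, len(model_entities)):
        binding = dict(zip(model_entities, chosen))   # model entity -> test entity
        best = max(best, score_under_binding(binding))
    return best
```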
In the testing phase, each video may contain multiple occurrences of an action. When
computing the similarity metric SBL searches for matching pairs of start and end states as
well as orphaned start or end states and reports them as separate detections of an action
with its corresponding similarity value.
8.3.4 Noisy and Gapped Recognition
In a video data stream, noise typically occurs due to temporary interference of sensors
and inaccuracies in the detectors. Noise manifests as incorrect states in the sequence of
observed states. Its short-term nature results in only small penalties on the similarity metric.
SBL learns to tolerate noise by memorizing the acceptable range of similarity values
from training videos during the bounds learning phase.
Figure 55: a) Missing start data b) Expected results for post diction
Figure 56: a) Stream with missing data b) Expected results for interpolation
Figure 57: a) Missing end data b) Expected results for prediction
Gaps could occur in a video data stream due to temporary occlusions or sensor
interference. If untreated these gaps would manifest as undefined states in the observed
state sequence and produce very poor similarity values. SBL flags these gapped states
and replaces them by performing a Breadth First Search originating from known states
with the aid of the Markov chain. In particular, if the gap was at the beginning, as in
Figure 55, it performs postdiction from the first known state. If the gap was in the middle,
as in Figure 56, it performs interpolation between the known states. If the gap was at the
end, as in Figure 57, it performs prediction from the last known state. Note that Figures
55, 56 & 57 depict gaps in a single trend, yet a gap in a video results in all sensor trends
becoming unknown. SBL does not attempt to recover each trend. Instead it uses the
Markov chain to recover the missing states that represent the combination of these trends.
Once gap filling is performed, slight penalties are applied to those filled states when
calculating the similarity. This is to ensure that as more states are filled, SBL decreases
the likelihood that the action is present.
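The interpolation case of this gap-filling step could look roughly like the sketch below, assuming the Markov chain is stored as a map from each state to its successors; postdiction and prediction would search from a single known state in the same manner. The function signature is illustrative.

```python
from collections import deque

def fill_gap(chain, first_known, last_known, gap_length):
    """Fill a run of unknown states between two known states by breadth-first
    search over the Markov chain.

    chain maps a state to an iterable of successor states. Returns a list of
    gap_length states connecting first_known to last_known, or None.
    """
    queue = deque([[first_known]])
    while queue:
        path = queue.popleft()
        if len(path) == gap_length + 2:        # first_known + gap + last_known
            if path[-1] == last_known:
                return path[1:-1]              # only the filled-in states
            continue                           # wrong endpoint; discard this path
        for successor in chain.get(path[-1], []):
            queue.append(path + [successor])
    return None
```

Each state recovered this way would then receive a slight penalty when the similarity is computed, as described above.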
8.4 Results & Discussion
8.4.1 Recognition
The first recognition experiment was conducted using a subset of data comprised of
approximately 1200 videos for which USC object tracks were generated. SBL was tasked
with recognizing 12 actions: approach, carry, catch, collide, haul, move, pickup, push,
run, stop, throw and walk. SBL learned 38 prediction models as most of these actions had
several different exemplars. Each video contained at least 1 of the 12 selected actions.
Excluding the vision module’s processing time, SBL was able to classify 100 videos in
approximately 1 hour.
The videos were broken into an overlap set which was used for bounds learning & testing,
and a novel set which was used for testing only. For each action the precision, recall and
F1 scores were computed as follows:
Precision = (true positives) / (true positives + false positives) (38)
Recall = (true positives) / (true positives + false negatives) (39)
F1 = (2 * Precision * Recall) / (Precision + Recall) (40)
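As a concrete reading of Equations (38)-(40), the helper below computes all three scores from raw counts; the zero-division guards are an added assumption for the degenerate cases.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```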
The positives and negatives for this calculation were obtained by requesting human
annotators (Amazon Mechanical Turks - AMT) to identify which actions they observed
in each video. There was a noticeable amount of correlation error between the annotators
due to the ambiguity of some actions such as “run” and “walk”, and biases introduced by
the context of the scene. In order to understand the implications of these differences, the
output was evaluated against two criteria, namely the exact AMT vote per action per
video and majority AMT votes obtained across all exemplars of an action.
It was observed experimentally that training with a few ground truth files or the
equivalent synthetic data yielded high recognition success, while training from much
noisier data generated by the vision modules tended to over-constrain the prediction model and lower the recognition success. As with most learning algorithms, SBL demonstrated overfitting in the presence of noisy data. For this reason, the results
presented here rely on synthetic data for training. The advantage of synthetic data is that
SBL needs only a few good examples. For example, each action in this experiment was
trained with 2 to 3 examples, each containing 1 to 3 synthetic trends. The disadvantage of
synthetic data is that human intervention is required to create them. However, the
creation of synthetic data can be done by identifying the entities of interest and over-smoothing the noisy ground truth or tracking data.
Figure 58: Action recognition results with positive examples and positive models
The results of the SBL action recognition are presented in Figure 58. Multiple positive
models were learned for each action using positive examples. Good F1 scores are seen
for approach, move, stop and walk, which are all human-centric actions, while actions that rely on object detection such as catch, collide, haul, pickup, push and throw have lower
scores. This behavior was expected as the USC vision module’s human detector had
much higher precision and recall than any other object detector.
Given that there were many correlation errors and the majority of detectors had a
precision of less than 20% the scores seen here are reasonable. This was validated by
comparing the SBL results against results from hand-coded structured activity models
that were also able to recognize these actions using the same USC object tracks. SBL
outperformed the hand-coded models for each action and had an F1-score that was at
least 10% higher. Notice that the disparity between the overlap set and novel set is very
low in most cases indicating that SBL had learned relatively good models of each action
with only a few positive examples.
The second recognition experiment was conducted using a subset of data comprised of
approximately 480 videos for which ground truth object annotations were available. SBL
was tasked with recognizing 8 actions: approach, arrive, carry, chase, drop, exchange,
exit and flee. Ground truth tracks were used to minimize tracking errors, yet perfect F1-
scores were unattainable primarily due to correlation errors.
Figure 59: Action recognition results with different combinations of models
Figure 59 visualizes the impact of adding negative examples and false positive models to positive models learned from positive examples. The positive models for "exit" yielded a
few false negatives, yet they had no false positives. In contrast, "drop" had a few false positives because several instances of "bounce" and "replace" contained information that satisfied the models learned by SBL from the positive examples. The introduction of a "bounce" example as a negative example when training SBL improved the recognition F1-
score from 0.39 to 0.51. The introduction of a false positive model which learned
examples of “replace” further improved the F1-score to 0.69.
In conclusion, these experiments prove that SBL can learn to detect and reason with data
in the presence of interference, provided that no further adaptation of the prediction
model is allowed after the initial learning phase. In particular, it is well suited for
applications such as activity or action recognition in noisy video data streams.
8.4.2 Gap Filling
This experiment was conducted using a dataset comprised of approximately 700 videos
for which ground truth annotations were available. SBL was tasked with recognizing 25
actions: approach, arrive, bounce, carry, catch, chase, collide, drop, exchange, exit, flee,
go, haul, leave, lift, move, pass, pick up, push, put down, replace, run, stop, throw and
walk. SBL learned 68 prediction models. The objective of this experiment was to
introduce a large contiguous gap into each video at a random location and determine if
SBL can correctly identify the action.
Figure 60: Action recognition results with gap filling
The results presented in Figure 60 show the recognition scores when there is no gap, a 10%
gap and 30% gap respectively. The base case to compare is when there was no gap. A 10%
gap meant that no data was available for 10% of the duration of each video, while the 30%
gap meant that almost 1/3 of each video remained blank.
These results conclude that SBL can detect, reason with, and even recover the state of the
missing data as described in section 8.3.4. This reinforces the fact that SBL can learn to
detect and reason with data in the presence of interference. It is interesting to see that in
some cases the gaps result in higher recognition by helping SBL reduce a number of false
positives created by excessive noise in the detectors.
8.5 Summary
This chapter investigated the capability of SBL in detecting and reasoning with
unpredicted interference. Given that no further changes occurred after the initial learning
phase, SBL successfully detected and reasoned with unpredicted interference caused by
noise and missing data in the application of video action recognition. In the presence of
gaps, SBL performed a BFS with the learned prediction model and Markov chain to
recover the missing states.
Chapter 9
Conclusion
9.1 Summary and Contributions
Table 8: Summary of contributions
Structure Learning:
+ Discover the number of states
+ Scalable/not overfitting in structured environments
+ Few training examples produce a good model
+ Learns generic models that can solve specific goals
+ Temporal modeling with predictive capability

Learning from uninterpreted sensors and actions:
+ Preprocessing not required
+ Can use preprocessed data
+ Discretizes continuous data
+ Identifies useful sensors and ignores irrelevant sensors

Detecting and adapting to unpredicted changes:
+ Adapt to action changes
+ Adapt to sensor changes
+ Adapt to environmental changes
+ Adapt to goal changes
+ Adapt to simultaneous action, sensor, environment & goal changes
+ Repairs the model faster and more efficiently than rebuilding it from scratch

Detecting and reasoning with unpredicted interference:
+ Detects noise
+ Detects and fills gaps
+ Quantify similarity between learned model and observed data
+ Reason with data of variable durations
The research presented in this dissertation makes four contributions to the fields of robotics and artificial intelligence, as summarized in Table 8. It is important to note that all four
contributions stem from a single solution.
The first contribution is a new machine learning technique called Surprise-Based
Learning for structure learning in robotics [RS08a] [RS08b]. When the sensors, actions
and environment are structured there are discernible patterns in the sensed data associated
with each action. SBL discovers these patterns, and learns a prediction model, which
forecasts the outcome of an action in the environment given the current observation from
sensors. The prediction model is comprised of prediction rules, which record the structure
in a compact form. SBL detects changes in structure through surprises and attempts to
update its model to eliminate future surprises.
The second contribution is learning directly from uninterpreted sensors and actions
[RS09]. SBL learns the coupling between sensors, actions and the perceived entities of
the environment by analyzing observations made before and after the execution of an
action. A set of predefined comparison operators help discretize continuous sensor data.
The third contribution is detecting and adapting to simultaneous unpredicted changes in
sensors, actions, goals and the environment [RS11]. When an unpredicted change to a
sensor occurs, SBL detects surprises in every action that is coupled with it, resulting in the
sensor being ignored. When an unpredicted change to an action occurs, SBL detects
surprises in every prediction rule associated with it, resulting in the prediction model
updating such that the action will no longer be considered by the planner. When an
unpredicted change to a goal occurs, SBL detects the discrepancy between its expectation
and the current observation. It revisits all previously known observations systematically
and finally switches to random exploration until the goal is discovered. When
unpredicted changes in the environment occur, SBL detects surprises in the
corresponding entities rather than all the entities associated with a sensor. Hence, it
updates the associated rules to reflect the new observations. When simultaneous
unpredicted changes occur SBL adapts using the combination of the independent
strategies despite being unable to precisely determine the changes.
The fourth contribution is detecting and reasoning with unpredicted interference over a
short period of time [RS12]. Noise and gaps are two common forms of interference that
SBL detects as surprises provided that the sensors and actions do not change. During a
training phase SBL learns a prediction model, groups rules to form states and captures the
state transitions in a Markov chain. With these, in the testing phase SBL can reason with,
and even recover noisy or gapped states successfully.
Overall, this dissertation presents Surprise-Based Learning as a promising solution for the
problem of a robot autonomously detecting and adapting to unpredicted changes. It has
successfully been demonstrated on several applications including simulated games, robot
navigation and video surveillance.
9.2 Future Research Directions
SBL does not explicitly tolerate interference during model learning. By this we mean that when prediction rules encounter noise during model learning, SBL immediately splits or refines them. This is not a problem as these prediction rules will be
subjected to further surprises that will result in them being fixed or forgotten. However,
this process wastes resources, and may result in inaccurate predictions over short periods
of time. As this may be undesirable for operating in noisy environments, a future
improvement could be to maintain copies of prediction rules prior to splitting and
refinement. This will allow the original rules as well as the updated ones to coexist in the
model. Rule forgetting will eventually reject the noisy rules once their success
probabilities become unacceptably low. Note that though this strategy may explicitly
tolerate interference, the planner must consider these probabilities to remove invalid plans.
SBL assumes that all states are observable provided sufficient sensor redundancy.
However, in practice hidden states may exist. A feasible future extension could be to
leverage on the knowledge that some actions are reversible and incorporate local
distinguishing experiments to identify hidden states as demonstrated in CDL [Shen90].
Currently, SBL does not expend any effort to abstract prediction rules to improve their
applicability. For example, learning the proximity response to a particular wall could be
abstracted to handle every wall. This is a very important and challenging direction to
improve on. A feasible strategy could be to incorporate hypothesizing and testing by
identifying common structure within the learned prediction rules.
The complexity of surprise analysis grows exponentially with the number of sensors, and
the prediction model is not focused to accomplish a particular goal. An extension of this
research could include goal directed learning thereby limiting the search space of surprise
analysis as well. In addition, the planning overheads could be minimized by generating
policies from the prediction model to achieve common goals.
There are several other interesting research directions such as distributed and nested SBL
learners. For example, in the case of swarm robots with identical sensors and actions, the
robots should be able to merge their distributed prediction models and eliminate surprises
significantly faster. Similarly, by nesting several learners such as a learner that associates
actuators with actions, and a learner that associates sensors with actions, through careful
training a rich hierarchical learner could be formed to scale with the complexity of the
robot.
References
[Angl87] Angluin, D. “Learning regular sets from queries and counter-examples”,
Information and Computation, 75(2), 1987.
[Bell57] Bellman R., “Dynamic Programming”, Princeton University Press, 1957.
[BJP03] Bodor R., Jackson B., Papanikolopoulos N., “Vision-based human tracking
and activity recognition”, Mediterranean Conference on Control and
Automation, 2003.
[BOP97] Brand M., Oliver N., Pentland A., “Coupled hidden Markov models for
complex action recognition”, Computer Vision and Pattern Recognition,
994-999, 1997.
[BSKS07] Busch M., Skubic M., Keller J., Stone E. “A Robot in a Water Maze:
Learning a Spatial Memory Task”, Robotics and Automation, ICRA, 1727-
1732, Apr 2007.
[BTG06] Bay H., Tuytelaars T., Gool L.-V., “Surf: Speeded up robust features”,
European conference on computer vision, 404-417, 2006.
[BZL06] Bongard J., Zykov V., Lipson H., "Resilient machines through continuous
self-modeling", Science, 314: 1118-1121, Nov 2006.
[CAOB97] Cohen P., Atkin M., Oates T., Beal C., “Neo: Learning conceptual
knowledge by sensorimotor interaction with an environment”, International
Conference on Intelligent Agents, 1997.
[Cass99] Cassandra, A., Tony’s POMDP file repository page, 1999,
http://www.cs.brown.edu/research/ai/pomdp/examples/index.html
[CM02] Comaniciu D., Meer P., “Mean Shift: A Robust Approach toward Feature
Space Analysis”, Pattern Analysis and Machine Intelligence, 24: 603-619,
May 2002.
[Coul09] Coulter D., “Sandtrapped Mars Rover Makes Big Discovery”, 2009,
http://science.nasa.gov/science-news/science-at-nasa/2009/02dec_troy/
[Diet02] Dietterich T., “Machine Learning for Sequential Data: A Review”,
“Structural, Syntactic, and Statistical Pattern Recognition”, Springer-Verlag,
15-30, 2002.
[Dres91] Drescher, G., “Made-Up Minds: A Constructivist Approach to Artificial
Intelligence”, MIT Press, 1991.
[DSKM02] Doya K., Samejima K., Katagiri K., Kawato M., “Multiple Model-based
Reinforcement Learning”, Journal of Neural Computation, 2002.
[Ferr94] Ferrell C., “Failure recognition and fault tolerance of an autonomous robot”,
Adaptive Behavior, 2: 375-398, 1994.
[FH04] Felzenszwalb P., Huttenlocher D., “Efficient Graph-Based Image
Segmentation”, International Journal of Computer Vision, 2004.
[Gibs79] Gibson J., “The Ecological Approach to Visual Perception”, Houghton
Mifflin, 1979.
[HJSL05] Horvitz E., Johnson A., Sarin R., Liao L., “Prediction, Expectation, and
Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting
Service”, Conference on uncertainty in Artificial Intelligence, Scotland, July
2005.
[HVBC94] Hamilton D., Visinsky M., Bennett J., Cavallaro J., Walker I., “Fault
tolerant algorithms and architectures for robotics”, Electrotechnical
Conference, 3: 1034-1036, Apr 1994.
[HW02] Huang X., Weng J., “Novelty and reinforcement learning in the value
system of developmental robots”, 2nd Intl. Workshop on Epigenetic
Robotics, 2002.
[IB04] Itti L., Baldi P., “A Surprising Theory of Attention”, IEEE Workshop on
Applied Imagery and Pattern Recognition, Oct 2004.
[KB72] Koslowski B., Bruner J., “Learning to use a lever”, Child Development,
43:790-799, 1972.
[KNGE05] Krichmar J., Nitz D., Gally J., Edelman G., “Characterizing functional
hippocampal pathways in a brain-based device as it solves a spatial memory
task”, National Academy of Science USA, 102 (6): 2111-2116, Feb 2005.
[KTT09] Kececi E., Tang X., Tao G., “Adaptive actuator failure compensation for
redundant manipulators”, Robotica Journal, vol. 27, 19-28, 2009.
[LB04] Lipson H., Bongard J., “An Exploration-Estimation Algorithm for Synthesis
and Analysis of Engineering Systems Using Minimal Physical Testing”,
ASME Design Engineering Technical Conferences, Salt Lake City, UT,
2004.
[LKHS99] Linden D., Kallenbach U., Heinecke A., Singer W., Goebel R., “The myth
of upright vision. A psychophysical and functional imaging study of
adaptation to inverting spectacles”, Perception, (28): 469 – 481, 1999.
[LSS02] Littman M., Sutton R., Singh S., “Predictive representations of state”,
Advances in neural information processing systems, 14: 1555-1561. 2002.
[Morr84] Morris R., “Developments of a water-maze procedure for studying spatial
learning in the rat”, Journal of Neuroscience Methods, 11 (1): 47–60, May
1984.
[Muga10] Mugan J., “Autonomous Qualitative Learning of Distinctions and Actions in
a Developing Agent”, Doctoral dissertation, Computer Science Department,
The University of Texas at Austin, 2010.
[Murp04] Murphy K., “Hidden Markov Model (hmm) toolbox for matlab”, 2004,
http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html
[Murp96] Murphy R. and Hershberger D., “Classifying and Recovering from Sensing
Failures in Autonomous Mobile Robots”, In Proc. of the 13th National
Conference on Artificial Intelligence, 922-929, 1996.
[NF00] Nolfi S., Floreano D., “Evolutionary robotics: The biology, intelligence, and
technology of self-organizing machines”, MIT Press, Cambridge, MA,
2000.
[NN08] Natarajan P., Nevatia R., “View and Scale Invariant Action Recognition
Using Multiview Shape-Flow Models”, International Conference on
Computer Vision and Patten Recognition, 2008.
[OKH07] Oudeyer P., Kaplan F., Hafner V., “Intrinsic Motivation Systems for
Autonomous Mental Development”, IEEE Transactions on Evolutionary
Computation, Vol. 11, 2007.
[Peir1878] Peirce C., “How to make our ideas clear”, Popular Science Monthly, 12:
286-302, 1878.
[Piag52] Piaget J., “The Origins of Intelligence in the Child", Norton, 1952.
[PK97] Pierce D., Kuipers B., “Map learning with uninterpreted sensors and
effectors”, Artificial Intelligence, 92: 169-229, 1997.
[RN09] Russel S., Norvig P., "Artificial Intelligence, A Modern Approach", Prentice
Hall, 21: 830-853, 2009.
[RS93] Rivest R., Schapire R., “Inference of finite automata using homing
sequences”, Information and Computation, 1993.
[RS08a] Ranasinghe N., Shen W-M., “The Surprise-Based Learning Algorithm”,
USC ISI internal publication, April 2008, ISI-TR-651.
[RS08b] Ranasinghe N., Shen W-M, “Surprise-Based Learning for Developmental
Robotics”, Learning and Adaptive Behaviors for Robotic Systems, LAB-RS
08, 65-70, Aug 2008.
[RS09] Ranasinghe N., Shen W-M., “Surprise-Based Learning and experimental
results on robots”, International Conference on Developmental Learning,
ICDL 2009, June 2009.
[RS11] Ranasinghe N., Shen W-M., “Autonomous Adaptation to Simultaneous
Unexpected Changes in Modular Robots”, Workshop on reconfigurable
modular robots, International conference on Intelligent Robots and Systems,
October 2011.
[RS12] Ranasinghe N., Shen W-M., “Autonomous Surveillance Tolerant to
Interference”, Towards Autonomous Robotic Systems conference, TAROS
2012, August 2012.
[Saff97] Saffiotti A., “Handling Uncertainty in Control of Autonomous Robots”,
Lecture Notes in Computer Science, pp. 198–224, 1997.
[Shen89] Shen W.-M., “Learning from the Environment Based on Actions and
Percepts” Ph.D. dissertation, Dept. Computer Science., Carnegie Mellon
University., Pittsburgh, PA, 1989.
[Shen90] Shen W.-M., “Complementary discrimination learning: a duality between
generalization and discrimination”, Eighth National Conference on Artificial
Intelligence, MIT Press. 1990.
[Shen92] Shen W.-M., “Discovering regularities from large knowledge bases”,
International Journal of Intelligent Systems, 7(7), 623-636, 1992.
[Shen93a] Shen W.-M., “Learning finite automata using local distinguishing
experiments”, in the Proceeding of International Joint Conference on
Artificial Intelligence, 1993.
[Shen93b] Shen W.-M., “Discovery as Autonomous Learning from the Environment,”
Machine Learning Journal, vol. 12, 143–165, Aug. 1993.
[Shen94] Shen W.-M., “Autonomous Learning From The Environment”, New York,
W.H. Freeman and Company, 1994.
[Shen95] Shen, W.-M., “The process of discovery”, in Foundations of Science, 1(2),
1995.
[SKB05] Stout A., Konidaris G., Barto G., “Intrinsically Motivated Reinforcement
Learning: A Promising Framework For Developmental Robot Learning”,
AAAI Spring Symposium on Developmental Robotics, 2005.
[SL74] Simon H., Lea G., “Problem solving and rule induction: A unified view”,
Knowledge and Cognition, Hillsdale, NJ, 1974.
[SL09] Schmidt M., Lipson H., “Distilling Free-Form Natural Laws from
Experimental Data”, Science, 324:81-85, 2009.
[SMB07] Schembri M., Mirolli M., Baldassarre G., “Evolving internal reinforcers for
an intrinsically motivated reinforcement-learning robot”, IEEE International
Conference on Development and Learning, 2007.
[SMS06] Salemi B., Moll M., Shen W.-M., “SUPERBOT: A Deployable, Multi-
Functional, and Modular Self-Reconfigurable Robotic System”, Intelligent
Robots and Systems, IROS 2006, 43-52, October 2006.
[Soik97] M. Soika, “A sensor failure detection framework for autonomous mobile
robots”, In Proc. of the international conference on intelligent robots &
systems, vol. 3, pp. 1735-1740, 1997.
[SS89] Shen, W.-M., Simon H., “Rule creation and rule learning through
environmental exploration”, Eleventh Joint Conference on Artificial
Intelligence, Morgan Kaufmann. 1989.
[SS93] Shen, W.-M., Simon H., “Fitness requirements for scientific theories
containing recursive theoretical terms”, British Journal of Philosophy of
Science, 44, 641-652, 1993.
[SS06] Stronger D., Stone P., “Towards Autonomous Sensor and Actuator Model
Induction on a Mobile Robot”, Connection Science, 18(2): 97-119, June
2006.
[SSK08] Stone E., Skubic M., Keller J., “Adaptive Temporal Difference Learning of
Spatial Memory in the Water Maze Task”, Development and Learning,
ICDL ‘08, 85-90, August 2008.
[ST97] Safonov M., Tsao T., “The unfalsified control concept and learning”, IEEE
Transactions on Automatic Control, 42 (6): 843 – 847, Jun 1997.
[ST05] Sutton R., Tanner B., “Temporal-difference networks”, in Advances in
neural information processing systems, 17:1377-1384, 2005.
[SWN08] Singh V., Wu B., Nevatia R., “Pedestrian Tracking by Associating Tracklets
using Detection Residuals”, IEEE Motion and Video Computing, 2008.
[Thru02] Thrun S., “Robotic Mapping: A Survey”, Exploring Artificial Intelligence in
the New Millennium, Morgan Kaufmann, 2002.
[Visi91] Visinsky M., “Fault Detection and Fault Tolerance Methods for Robotics”,
Rice University Thesis, 1991.
[WB09] Webster G., Brown D., “Mars Exploration Rovers, Mission News”, 2009,
http://www.nasa.gov/mission_pages/mer/news/telecon/tel20091110.html
[WJS05] Wolfe B., James M., Singh S., “Learning Predictive State Representations in
Dynamic systems without reset", in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
[WK98] Wolpert D., Kawato M., “Multiple Paired Forward and Inverse Models for
Motor Control”, Neural Networks 11: 1317-1329, Oct 1998.