ACTIVE STATE LEARNING FROM SURPRISES IN STOCHASTIC AND PARTIALLY-OBSERVABLE ENVIRONMENTS

by

Thomas Joseph Collins

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2019

Copyright 2019 Thomas Joseph Collins

To my family.

Acknowledgements

This thesis would not have been possible without the passionate support and guidance of my advisor, Professor Wei-Min Shen. The work laid out in this dissertation is a continuation of his own dissertation research, and I feel honored that he would allow me to build upon a major component of his own work. I would also like to thank Prof. Shen for his patience, kindness, and his willingness to teach me firsthand how to navigate the (occasionally overwhelming) process of rigorous scientific research. I have emerged on the other side a better researcher, a better listener, a better teacher, and a better man.

I would like to thank Prof. Paul Rosenbloom, Prof. John Carlsson, Prof. Ram Nevatia, and Prof. Aiichiro Nakano for donating their time and energy to be on my qualifying exam and defense committees. Their input was instrumental in improving the work laid out in this thesis.

I owe a great deal to my friends in the Polymorphic Robotics Laboratory at the ISI for their support, feedback, encouragement, and time. In particular, I would like to thank Nadeesha Ranasinghe, whose pioneering work on Surprise-based Learning (SBL) was a major source of inspiration for this work. I would also like to thank Luenin Barrios, Chi-An Chen, and Jens Windau, among many others that had a profound impact on my work over the years.

Finally, and most importantly, I would like to thank my wife, Kelsi, for her unwavering support, love, and patience throughout this process, and my daughter, Remy, for lighting up my world and providing a constant source of inspiration.
Additionally, I would like to thank my parents, in-laws, and extended family (on both sides) for their emotional and financial support.

Table of Contents

Acknowledgements
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 The Problem
    1.2.1 Overview and Challenges
    1.2.2 Formal Definition and Key Assumptions
  1.3 Overview of our Approach
  1.4 Scientific Contributions
  1.5 Dissertation Organization
Chapter 2: Related Work
  2.1 On the Use of the Word Surprise
  2.2 Surprise-Based Learning of State and Dynamics
  2.3 Universally Optimal Learning Agents
  2.4 Learning Deterministic Finite State Machines
  2.5 Representation Learning and Deep Learning
  2.6 Temporal and Nonparametric Graphical Models
  2.7 Reinforcement Learning
  2.8 Predictive State Representations
  2.9 Compression and Decision Trees
  2.10 Robotics
  2.11 Genetic Algorithms and Genetic Programming
Chapter 3: Autonomous Learning from Deterministic Environments
  3.1 ALFE in Deterministic Environments
  3.2 Local Distinguishing Experiments (LDEs)
  3.3 Complementary Discrimination Learning (CDL)
Chapter 4: Autonomous Learning from Stochastic Environments
  4.1 ALFE in Stochastic Environments
  4.2 Previous SBL Approaches and Stochastic ALFE
  4.3 Surprise in Stochastic Environments
  4.4 Stochastic Distinguishing Experiments
  4.5 Probabilistic Surprise-based Learning
  4.6 Key Contributions to Surprise-Based Learning
Chapter 5: Predictive SDE Modeling
  5.1 An Alternative Predictive Formulation of SDEs
  5.2 PSDE Models
  5.3 PSBL for Learning Predictive SDE Models
    5.3.1 PSBL Learning by Maximizing Predictive Accuracy
      5.3.1.1 Active Experimentation and Prediction
      5.3.1.2 PSDE Splitting
      5.3.1.3 PSDE Refining
      5.3.1.4 PSDE Merging
    5.3.2 PSBL Learning by Minimizing Surprise
      5.3.2.1 The Limitations of Max Predict
      5.3.2.2 Defining Surprise in PSDE Models
      5.3.2.3 PSBL for Learning PSDE Models
  5.4 Search-based Decision Making in PSDE Models
  5.5 Conclusions
Chapter 6: Latent-Predictive SDE Modeling (sPOMDPs)
  6.1 Rewardless α-ε-POMDP Environments
  6.2 Surprise-Based POMDPs
  6.3 PSBL Learning of sPOMDPs
    6.3.1 Defining Surprise in sPOMDP Models
    6.3.2 Defining Model Error in sPOMDPs
    6.3.3 Optimal sPOMDP Learning
    6.3.4 PSBL for sPOMDPs
      6.3.4.1 sPOMDP Belief Updating
      6.3.4.2 SDE-Based Belief Smoothing in sPOMDPs
      6.3.4.3 sPOMDP Transition Function Updating
      6.3.4.4 sPOMDP One Step Transition Function Posteriors Update
      6.3.4.5 Splitting sPOMDP Model States
      6.3.4.6 Vectorization in PSBL for sPOMDPs
  6.4 Decision Making in sPOMDP Models
    6.4.1 Learning Optimal Policies for sPOMDPs
    6.4.2 Search-Based Decision-Making
  6.5 Conclusions
Chapter 7: Theoretical Results
  7.1 Predictive SDE Theoretical Results
    7.1.1 Introduction
    7.1.2 Scope of the Results and Definitions
      7.1.2.1 Scope and Assumptions
      7.1.2.2 Definitions
    7.1.3 Convergence and Computational Complexity Results
  7.2 sPOMDP Theoretical Results
    7.2.1 Introduction
    7.2.2 Definitions
    7.2.3 Representational Capacity Proofs
  7.3 Conclusions
Chapter 8: Experimental Results
  8.1 Environments Adapted from the Literature
    8.1.1 The Shape Environment
    8.1.2 The Little Prince Environment
    8.1.3 The Circular 1D Maze Environment
    8.1.4 The Shuttle Environment
    8.1.5 Network
    8.1.6 Grid
  8.2 Color World Environments
  8.3 Model Surprise, Error, and Accuracy
    8.3.1 Environments from the Literature
    8.3.2 Color World POMDP Environments
    8.3.3 The Correlation Between Surprise, Error, and Accuracy
  8.4 A Comparative Analysis of Learning Models
    8.4.1 Environments from the Literature
    8.4.2 Color World POMDP Environments
    8.4.3 Summary
  8.5 Decision Making
  8.6 Scalability Analyses
  8.7 Conclusions
Chapter 9: Conclusions and Future Work
  9.1 Summary of Key Contributions
  9.2 Future Work
    9.2.1 Factored Representations of Observation Space
    9.2.2 Integration of Function Approximation Techniques
    9.2.3 Continuous Observations
Reference List
Appendix A: Proofs of Theoretical Results
  A.1 Proofs of Predictive SDE Theoretical Results
  A.2 Proofs of sPOMDP Theoretical Results
  A.3 Evaluation of the Convert procedure

List Of Figures

1.1 Example inverse kinematics solutions for high-DOF SuperBot [130] manipulators (up to 180 DOF) found using our novel Particle Swarm Optimization (PSO [71]) based approach [37].
1.2 A tree of 6 simulated SuperBot [130] modules autonomously performing a pick, carry, and place task to simulate autonomous assembly using our novel message-passing and distributed control algorithm for integrated and adaptive locomotion and manipulation [36].

1.3 Hardware setup for experiments simulating SuperBot [130] modules docking in microgravity (with a manipulator whose base is non-fixed). Autonomous docking is vitally important for building useful and adaptive configurations of self-reconfigurable modules in space applications.

3.1 An illustration of a one-action LDE in the Shape environment [142, 119].

3.2 An illustration of a two-action LDE in the Shape environment [142, 119].

3.3 An illustration of the predict - surprise - identify - revise loop of CDL as applied to a robot in the Shape environment of Figure 3.1.

3.4 An illustration of the second predict - surprise - identify - revise loop of CDL as applied to a robot in the Shape environment of Figure 3.1.

4.1 An illustration of the Shape environment in which the robot's actions and observations are subject to noise, as defined by α and ε. This environment can be modeled as a rewardless POMDP.

4.2 An example of executing the one-step Stochastic Distinguishing Experiment (SDE) {x} from state III in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence. The semi-transparent arrows indicate transitions due to noise that may instead cause the second observation sequence, which is most likely from state IV.
4.3 An example of executing the one-step Stochastic Distinguishing Experiment (SDE) {x} from state IV in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence. The semi-transparent arrows indicate transitions due to noise that may instead cause the second observation sequence, which is most likely from state III.

4.4 An example of executing the two-step Stochastic Distinguishing Experiment (SDE) {y, x} from state I in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence. The semi-transparent arrows indicate transitions due to noise that may instead cause the second observation sequence, which is most likely from state II.

4.5 An example of executing the two-step Stochastic Distinguishing Experiment (SDE) {y, x} from state II in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence. The semi-transparent arrows indicate transitions due to noise that may instead cause the second observation sequence, which is most likely from state I.

4.6 The runloops of Complementary Discrimination Learning [138] (left) and Probabilistic Surprise-Based Learning [38, 39] (right).

4.7 An illustration of the experiment - surprise - identify - revise loop of PSBL as applied to an agent in the stochastic Shape environment of Figure 4.1 with α = ε = 0.99.
4.8 An illustration of the second experiment - surprise - identify - revise loop of PSBL as applied to an agent in the stochastic Shape environment of Figure 4.1 with α = ε = 0.99.

5.1 An illustration of the Local Distinguishing Experiments (LDEs [141]) and their outcome sequences learned by Shen's Complementary Discrimination Learning (CDL [138]) algorithm in the deterministic Shape environment (originally from [119] and adapted by Shen in [142]).

5.2 An illustration of the Shape environment in which the robot's actions and observations are subject to noise, as defined by α and ε. This environment can be modeled as a rewardless POMDP.

5.3 An illustration of the simple PSDE defined in Equation 5.2.

5.4 An illustration of the compound PSDE defined in Equation 5.3 and the possible agent histories that it covers.

5.5 An illustration of a PSDE model for the stochastic Shape environment of Figure 5.2 with α = ε = 0.99. The left side (a) shows the probability distributions defining the PSDE model. The right side (b) shows how these PSDEs can be organized into a tree of agent history that stretches backward in time.

5.6 An illustration of the one step extension experiments of the simple PSDE defined in Equation 5.2. Since |A| = 2 and |O| = 2 in the Shape environment (Figure 5.2), there are |A||O| = 4 such one step extensions.

5.7 An illustration of the example splitting procedure detailed in Section 5.3.1.2, Equations 5.8-5.14.
The left hand side of the figure (a) demonstrates the calculation of the expected gain in predictive accuracy when the simple PSDE conditioned on action x is split into its constituent one-step extensions. The right hand side of the figure (b) demonstrates how this PSDE would be split into one compound PSDE and one simple PSDE, according to the predicted (most likely) observations of each one-step extension. Note that three of the one-step extensions predict one of the two observations, whereas the fourth predicts the other. It is precisely this specialization in predicted observation along different agent histories that leads to the expected predictive accuracy gain.

5.8 An illustration of the example refinement procedure detailed in Section 5.3.1.3, Equations 5.16-5.17. The left hand side of the figure (a) demonstrates the calculation of the expected refinement gain. The right hand side of the figure (b) demonstrates how the compound PSDE would be refined into two PSDEs.

5.9 A simple 2-state environment in which the agent can execute the actions x and y, observing one symbol in state S1 and the other in state S2. Assuming this environment is deterministic and that predictions are made according to the most likely observation of the matching PSDE, both PSDE Model 1 and PSDE Model 2 would predict future observations with perfect accuracy. However, the model error (Equation 5.19) of PSDE Model 1 would be 0.0, while model error would be 0.693 for PSDE Model 2.

5.10 An illustration of computing model surprise (Equation 5.20) for a PSDE model of the stochastic Shape environment of Figure 5.2 with α = 0.95 and ε = 1.0. The left side (a) illustrates the probability distributions (and associated observation counters, red) defining the PSDE model.
The right side (b) illustrates how these probability distributions and observation counters are used to compute the total surprise of the PSDE model.

5.11 An example of building a counterTrie (line 4, Algorithm 5) to organize an agent's experience in its environment. Subfigures (a) - (d) demonstrate adding 4 sequences of history (in order) to an initially empty counterTrie. The left side of each subfigure shows the sequence of history being added. The red arrows and counters in each figure indicate what parts of the trie are changed once this sequence of history is added. Note how counters are maintained at every action node indicating the number of times each observation occurs after the agent encounters that history prefix. The probabilities associated with any of these action nodes can be computed simply by dividing the value of each observation's counter by the sum of the observation counters (over both observations) at that node.

5.12 An illustration of the one step extensions of a simple PSDE. Note that the agent begins with an initial observation at time t in addition to executing x at time t. In PSBL for PSDE models (Algorithm 3), we typically assume that PSDEs begin with an initial observation (as opposed to an initial action).

5.13 An illustration of the search-based decision making procedure for PSDE models detailed in Algorithms 7-8 as applied to the stochastic Shape environment (Figure 5.2). Subfigure (a) illustrates the PSDEs in a PSDE model of the stochastic Shape environment. Subfigure (b) illustrates the use of search-based decision making to take the agent from an initial history to a goal observation via the policy {y}. Note how different choices of actions lead to the agent's projected history matching different PSDEs in the model.
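The caption of Figure 5.11 above describes the counterTrie mechanism concretely enough to sketch in code: a trie over alternating action/observation history, with per-observation counters at every action node and probabilities obtained by normalizing those counters. The following is a minimal illustrative sketch under those assumptions only; the class and method names (`CounterTrie`, `add_history`, `prob`) and the observation labels are hypothetical, not the dissertation's implementation.

```python
class Node:
    def __init__(self):
        self.children = {}    # next history symbol -> Node
        self.obs_counts = {}  # observation -> count (used at action nodes)


class CounterTrie:
    """Hypothetical sketch of the counterTrie of Figure 5.11."""

    def __init__(self):
        self.root = Node()

    def add_history(self, history):
        """history: alternating symbols [a1, o1, a2, o2, ...]."""
        node = self.root
        for i, sym in enumerate(history):
            node = node.children.setdefault(sym, Node())
            # At each action node (even index), count the observation
            # that immediately follows this history prefix.
            if i % 2 == 0 and i + 1 < len(history):
                obs = history[i + 1]
                node.obs_counts[obs] = node.obs_counts.get(obs, 0) + 1

    def prob(self, prefix, obs):
        """P(obs | prefix): normalize the counters at the action node."""
        node = self.root
        for sym in prefix:
            node = node.children[sym]
        total = sum(node.obs_counts.values())
        return node.obs_counts.get(obs, 0) / total if total else 0.0


trie = CounterTrie()
trie.add_history(["x", "square", "x", "diamond"])
trie.add_history(["x", "square", "x", "square"])
print(trie.prob(["x"], "square"))                 # -> 1.0
print(trie.prob(["x", "square", "x"], "square"))  # -> 0.5
```

As in the caption, each probability is simply an observation counter divided by the sum of all observation counters stored at the action node reached by the history prefix.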
6.1 An illustration of the deterministic Shape environment [119, 142] (subfigure (a)) and the α-ε-Shape environment (subfigure (b)), which extends the Shape environment into a stochastic and partially-observable POMDP environment in which the agent experiences the same level of noise on each of its state-to-state transitions and observations.

6.2 An illustration of the α-ε-Shape environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

6.3 An illustration of the α-ε-Little Prince environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

6.4 An illustration of the α-ε-Circular 1D Maze environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

6.5 An illustration of the α-ε-Shuttle environment (adapted from [33]) with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

6.6 α-ε-Shape environment with unlearnable structure when the noise parameters α and ε reach 0.7 and 0.5, respectively.

6.7 Runtimes for Optimal sPOMDP Learning (Algorithm 9) on increasingly large random α-ε-POMDPs with different numbers of hidden states.

6.8 Additional results for Optimal sPOMDP Learning (Algorithm 9) on increasingly large random (minimal) α-ε-POMDPs in which |O| = |Q|/2 and |A| = 2. In these environments, each observation covered a random number of environment states, with the only restriction being that each observation must be emitted as the most likely one in at least one state. The blue bars represent randomly constructed Moore machine environments (α = ε = 1). The red bars represent random α-ε-POMDPs in which α and ε were randomly chosen uniformly from their allowable values (see the discussion above and Equation 6.6). Subfigure (a) provides runtime results.
Subfigure (b) provides the average length of the longest outcome sequence of any state m ∈ M in the agent's model. Subfigure (c) provides the average number of model states |M| learned by the agent. All data points are averages over 50 runs of Algorithm 9.

6.9 An illustration of an sPOMDP model of the α-ε-Shape environment of Figure 6.1(b) (subfigure (a)) and the associated outcome trie (subfigure (b)), which provides the agent with the outcome sequences of model states consistent with its history. If the full agent history is not represented in the outcome trie, the sequences consistent with the last valid traversed trie node are returned. Even though it is not visualized in the figure (to avoid unnecessary clutter), consistent outcome sequences are stored at both action and observation trie nodes.

6.10 An illustration of computing total model surprise (Equation 6.9) for an sPOMDP model of the α-ε-Shape environment (subfigure (a)). In subfigure (b), the numbers above each transition distribution are the current Dirichlet posterior counts at time t for each transition (m, a, m′). In this example, the sPOMDP model M has only two states.

6.11 Continued from Figure 6.10, this is an illustration of computing the gain (Equation 6.10) for the transition (m′, a′) with a′ = x in the α-ε-Shape environment (Figure 6.10 (a)). The normalized entropy of the transition distribution for this transition, which is 0.999, is calculated in Figure 6.10 (b). As was the case in Figure 6.10 (b), the numbers above each one step extension distribution are the current Dirichlet posterior counts at time t for each one step extension (m, a, m′, a′, m′′).

7.1 An illustration of the Borel-Cantelli lemma. Consider a set of Bernoulli random variables {X_n : n ≥ 1} such that P(X_n = 1) = 1/n² (e.g., rolling a 1 on a fair n²-sided die). Define the sequence of events G_n = {X_n = 1}.
In this case, the infinite sum Σ_{n=1}^{∞} P(G_n) can be evaluated in closed form and shown to equal π²/6, which is a finite value. The Borel-Cantelli lemma tells us that we will experience a random sequence of 0s and 1s according to the probabilities P(G_n). In other words, we will successfully roll a 1 sometimes and a 0 other times up until some finite random time n = D. At time D, X_D takes on the value 0, and for all k > D, X_k will remain at 0. This will happen almost surely (that is, with probability 1). In other words, X_n will equal 1 only a finite number of times.

7.2 Visualization of the compound experiment in Equation 7.6.

8.1 α-ε-Shape environment

8.2 α-ε-Little Prince environment

8.3 α-ε-Circular 1D Maze environment

8.4 α-ε-Shuttle environment

8.5 α-ε-Network environment

8.6 α-ε-Grid environment

8.7 Example color world POMDP environments from sizes 2x2 to 5x5.

8.8 The ISI Floor environment

8.9 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Shape environment (Section 8.1.1) under varying environment noise levels.

8.10 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Little Prince environment (Section 8.1.2) under varying environment noise levels.
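The closed-form value quoted in the caption of Figure 7.1 above (Σ 1/n² = π²/6, the Basel problem) can be checked numerically. This is an illustrative sanity check only, not part of the dissertation:

```python
import math

# Partial sums of sum_{n>=1} P(G_n) = 1/n^2 approach pi^2/6, a finite
# value, which is exactly the hypothesis needed by the Borel-Cantelli
# lemma in the caption of Figure 7.1.
partial_sum = sum(1.0 / (n * n) for n in range(1, 100_001))

print(partial_sum)       # close to pi^2/6
print(math.pi ** 2 / 6)  # 1.6449340668...
```

The truncation error after N terms is roughly 1/N, so 100,000 terms already agree with π²/6 to about four decimal places.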
8.11 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Circular 1D Maze environment (Section 8.1.3) under varying environment noise levels.

8.12 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Shuttle environment (Section 8.1.4) under varying environment noise levels.

8.13 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Network environment (Section 8.1.5) under varying environment noise levels.

8.14 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε-Grid environment (Section 8.1.6) under varying environment noise levels.

8.15 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 2x2 color world environments with 2 colors (Section 8.2, Figure 8.7(b)) for a variety of environment noise levels.

8.16 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 2x3 color world environments with 3 colors (Section 8.2, Figure 8.7(c)) for a variety of environment noise levels.
8.17 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 3x3 color world environments with 5 colors (Section 8.2, Figure 8.7(e)) for a variety of environment noise levels.

8.18 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 3x4 color world environments with 6 colors (Section 8.2, Figure 8.7(f)) for a variety of environment noise levels.

8.19 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 4x4 color world environments with 8 colors (Section 8.2, Figure 8.7(g)) for a variety of environment noise levels.

8.20 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 4x5 color world environments with 10 colors (Section 8.2, Figure 8.7(h)) for a variety of environment noise levels.

8.21 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 5x5 color world environments with 13 colors (Section 8.2, Figure 8.7(i)) for a variety of environment noise levels.
8.22 Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to the ISI Floor environment (Section 8.2, Figure 8.8) for a variety of environment noise levels.

8.23 An illustration of the average Pearson correlation between model surprise and model predictive accuracy (subfigures (a) and (b)) and the average Pearson correlation between model surprise and model error (subfigures (c) and (d)) in a number of rewardless POMDP environments under varying noise levels.

8.24 A comparison of different learning algorithms and models in the α-ε-Shape environment (Section 8.1.1).

8.25 A comparison of different learning algorithms and models in the α-ε-Little Prince environment (Section 8.1.2).

8.26 A comparison of different learning algorithms and models in the α-ε-Circular 1D Maze environment (Section 8.1.3).

8.27 A comparison of different learning algorithms and models in random variations of the α-ε-Shape environment (Section 8.1.1).

8.28 A comparison of different learning algorithms and models in the α-ε-Shuttle environment (Section 8.1.4).

8.29 A comparison of different learning algorithms and models in the α-ε-Network environment (Section 8.1.5).

8.30 A comparison of different learning algorithms and models in the α-ε-Grid environment (Section 8.1.6).

8.31 A comparison of different learning algorithms and models in random 2x2 color world environments with 2 color observations (Figure 8.7(b)).
211
8.32 A comparison of different learning algorithms and models in random 2x3 color world environments with 3 color observations (Figure 8.7(c)). 212
8.33 A comparison of different learning algorithms and models in random 3x3 color world environments with 3 color observations (Figure 8.7(d)). 213
8.34 A comparison of different learning algorithms and models in random 3x3 color world environments with 5 color observations (Figure 8.7(e)). 213
8.35 A comparison of different learning algorithms and models in random 3x4 color world environments with 6 color observations (Figure 8.7(f)). 214
8.36 A comparison of different learning algorithms and models in random 4x4 color world environments with 8 color observations (Figure 8.7(g)). 214
8.37 A comparison of different learning algorithms and models in random 4x5 color world environments with 10 color observations (Figure 8.7(h)). 215
8.38 A comparison of different learning algorithms and models in random 5x5 color world environments with 13 color observations (Figure 8.7(i)). 215
8.39 A comparison of different learning algorithms and models in the ISI Floor environment (Figure 8.8). 217
8.40 A comparison of the average relative predictive accuracies and model errors of different modeling approaches in the environments from the literature (subfigures (a) and (c), see Section 8.1) and random color world environments (subfigures (b) and (d), see Section 8.2). 218
8.41 A comparison of the average numbers of model parameters for each model type in different environments (subfigures (a) and (b)) and the average runtime (in seconds) of PSBL as applied to learning both PSDE models and sPOMDP models in different environments.
220
8.42 A comparison of the average number of model states, |M|, in the sPOMDPs learned using PSBL against the ground truth number of environment states, |Q|, in the environments from the literature (subfigure (a), see Section 8.1) and random color world environments (subfigure (b), see Section 8.2). 222
8.43 Decision making in the α-ε Shape environment (Figure 8.1.1) with as the goal observation. 224
8.44 Decision making in the α-ε Circular 1D Maze environment (Figure 8.1.3) with goal as the goal observation. 224
8.45 Decision making in the α-ε Network environment (Figure 8.1.5) with up as the goal observation. 225
8.46 Decision making in random 4x4 color world environments with 8 color observations (Figure 8.7(g)) with red as the goal observation. 226
8.47 Decision making in random 5x5 color world environments with 13 color observations (Figure 8.7(i)) with white as the goal observation. 226
8.48 A comparison of the average decision making performance (in terms of reward accrued over time) by PSDE (red) and sPOMDP (green) models against an optimal agent making decisions with access to the ground truth POMDP environment (yellow). A random policy (blue) is provided as a baseline. 228
8.49 An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when the environment is fully observable (i.e., |O| = |Q|).
231
8.50 An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when the environment is fully observable (i.e., |O| = |Q|). 232
8.51 An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when 5% of environment states are hidden from the agent (i.e., |O| = 0.95|Q|). 233
8.52 An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when |O| = 0.95|Q|, as well as an analysis of the model errors of the learned PSDE models (subfigure (c)) and sPOMDP models (subfigure (d)). 234
8.53 An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when 10% of environment states are hidden from the agent (i.e., |O| = 0.9|Q|). 236
8.54 An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when |O| = 0.9|Q|, as well as an analysis of the model errors of the learned PSDE models (subfigure (c)) and sPOMDP models (subfigure (d)). 238
A.1 Experimental results for the Convert procedure (Algorithm 21) and its extensions to α-ε-POMDPs in Algorithm 22 on increasingly large, random (minimal), rewardless α-ε-POMDP environments with varying numbers of hidden states. 268
A.2 Experimental results for the Convert procedure (Algorithm 21) on increasingly large random (minimal) α-ε-POMDPs in which |O| = |Q|/2 and |A| = 2. In these environments, each observation covered a random number of environment states, with the only restriction being that each observation must be emitted as the most likely one in at least one state.
The blue bars represent randomly constructed Moore machine environments (α = ε = 1). The red bars represent random α-ε-POMDPs in which α and ε were randomly chosen uniformly from their allowable values. Subfigure (a) provides runtime results. Subfigure (b) provides the average length of the longest outcome sequence of any state m ∈ M in the agent's model. Subfigure (c) provides the average number of model states |M| learned by the agent. 269

Abstract

There is an ever-increasing need for autonomous agents and robotic systems that are capable of adapting to and operating in challenging partially-observable and stochastic environments. Standard techniques for learning in such environments are typically fundamentally reliant on an a priori specification of the state space in which the agent will operate. Designing an appropriate state space demands extensive domain knowledge, and even relatively minor changes to the task or agent might necessitate an expensive manual re-engineering process. Clearly, imbuing agents with the ability to actively and incrementally learn task-independent representations of state in such environments directly from experience would reduce the manual effort required to deploy these systems and enable them to adapt to changes in environment, task, or even unexpected disruptions of their own sensing and actuation capabilities. As an important step toward this goal, we address, in this dissertation, the challenging and open problem of actively learning a representation of the state and dynamics of unknown, discrete, stochastic and partially-observable environments directly from a stream of agent experience (i.e., actions and observations). In particular, we present a novel family of nonparametric probabilistic models called Stochastic Distinguishing Experiments (SDEs) and a novel biologically-inspired framework for actively and incrementally learning these models from experience called Probabilistic Surprise Based Learning (PSBL).
SDEs are hierarchically-organized key sequences of ordered actions and expected observations (along with associated probability distributions) that, taken together, form an approximate and task-independent representation of environment state and dynamics. The key idea behind PSBL is that the agent begins with a minimal set of SDEs and continuously designs and performs experiments that test whether extensions to these SDEs result in a model that causes the agent to be surprised less frequently by observations that do not match its predictions. PSBL can be understood as a procedure to minimize this surprise frequency. We provide formal proofs regarding the convergence and computational complexity of the PSBL algorithm for certain classes of SDE models. We formally prove the representational capacity of certain classes of SDE models with respect to deterministic environments and a useful subclass of POMDP environments. These constructive proofs lead to a provably-optimal procedure that enables an agent with perfect sampling capabilities to learn a perfect SDE model of such environments, provided that noise levels are bounded according to certain technical criteria, which we formally derive. Extensive simulation results are provided to validate these theoretical analyses and demonstrate the effectiveness and scalability of PSBL and SDE modeling on a variety of simulated prediction and decision-making tasks in a number of challenging environments.

Chapter 1

Introduction

In this chapter, we provide some key motivations for the work laid out in this dissertation and formally define the problem being solved. We then provide a brief overview of our approach and enumerate the scientific contributions of this work. Finally, we end this chapter with an overview of the organization of this dissertation.
1.1 Motivation

There is an ever-increasing demand for autonomous agents and robotic systems that are capable of adapting to and operating in a range of challenging real-world environments. Advances in autonomous driving [81], such as Waymo's self-driving vehicles (https://waymo.com), have the potential to dramatically increase the safety, reliability, and efficiency of automobile transportation. Search and rescue robots, such as JPL's RoboSimian [69, 59], have the potential to save human lives by surveying disaster areas to find survivors and possibly even physically dragging them away from danger. Underwater robots might soon replace humans in laborious (and sometimes dangerous) tasks such as cleaning up oil spills [167] or removing the buildup of biological material from underwater structures [4]. Some of our previous work took important steps toward addressing the algorithmic and hardware challenges associated with utilizing self-reconfigurable robotic systems for the autonomous assembly and repair of structures in space, which is currently slow and dangerous work that must be performed primarily by humans. In particular, we made important contributions in the areas of high-DOF inverse kinematics [37, 34] (Figure 1.1), autonomous assembly using distributed tree structures of self-reconfigurable robots [36] (Figure 1.2), and autonomous 6D docking (with a non-fixed base), which we demonstrated on real self-reconfigurable robotic hardware [13] (Figure 1.3).

Figure 1.1: Example inverse kinematics solutions for high-DOF SuperBot [130] manipulators (up to 180 DOF) found using our novel Particle Swarm Optimization (PSO [71]) based approach [37].

Figure 1.2: A tree of 6 simulated SuperBot [130] modules autonomously performing a pick, carry, and place task to simulate autonomous assembly using our novel message-passing and distributed control algorithm for integrated and adaptive locomotion and manipulation [36].
The environments in which such agents and systems are expected to operate vary quite substantially, but virtually all environments of interest exhibit significant levels of partial-observability (meaning that the agent cannot directly sense all the relevant aspects of its environment) and stochasticity (meaning that the agent's actions and observations are subject to some inherent level of nondeterminism). The combination of these factors makes autonomous learning particularly challenging, because it is difficult for the agent to disambiguate environmental noise from situations in which its model has failed to capture something important about the latent structure of its environment without significant prior information about that environment.

Figure 1.3: Hardware setup for experiments simulating SuperBot [130] modules docking in micro-gravity (with a manipulator whose base is non-fixed). Autonomous docking is vitally important for building useful and adaptive configurations of self-reconfigurable modules in space applications.

For this reason, manually-engineered features (one of the most common and important of which is a manually-designed state space in which the agent operates) remain some of the most important inputs to active and autonomous learning algorithms in the probabilistic machine learning [156, 103], reinforcement learning [72], robotic learning [67] and (more recently) deep learning literature [78]. For example, particle filters [9] and SLAM algorithms such as FastSLAM [98] typically pre-suppose the existence of a Euclidean state space in which the agent knows or can infer (via an appropriate configuration of sensors, which might not always be available) local transformations. Reinforcement learning techniques such as value iteration [20], Q-learning [162], and SARSA [126] also pre-suppose the existence of a known state space.
When recurrent deep learning approaches such as LSTMs [60] are used to model dynamical systems, the number of hidden units (which is typically fixed manually before learning begins) is directly related to the size and nature of the implicit representation of state formed by the LSTM ([57], Chapter 10). Reliance on human-engineered state spaces makes adaptation to novel situations, tasks, and environments difficult, because it ties the agent's representation of its environment inexorably to strong assumptions which may be incomplete or inaccurate. Furthermore, even relatively minor changes to the environment, task, or the agent's sensing and actuation capabilities might necessitate an expensive manual re-engineering process. This makes it somewhat surprising that relatively little attention has been given to the important problem of actively and autonomously learning the state and dynamics of unknown stochastic and partially-observable environments directly from experience. Clearly, solving this problem is vitally important, because the learned state space and its dynamics could then be used as input to a host of state-of-the-art algorithms solving important problems in robotics, artificial intelligence, and machine learning. The ultimate goal of this work is to take a step toward truly adaptive robots and agents that are able to learn representations of unknown (and possibly even dynamic) environments in a way that allows them to successfully perform a broad range of tasks with minimal reliance on human-designed prior information.
1.2 The Problem

1.2.1 Overview and Challenges

In this dissertation, we address a problem called autonomous learning from the environment (ALFE, first formulated by Shen in [140]), in which an embodied agent is placed in an unknown discrete, partially-observable, stochastic environment and must actively build a task-independent model of the state and state dynamics of this environment from its experience, given only the actions it can take and the observations it can make about its environment. The agent is not able to reset itself to a known state and has no prior knowledge about the number of underlying environment states or the expected results of executing its actions. The most distinguishing aspect of this problem (relative to other work in the literature) is that the agent must decide simultaneously and autonomously both what actions to take in the absence of any external rewards and how much experience is sufficient to detect that it has built a useful model (i.e., it must autonomously decide when to stop learning so that it can perform its tasks). This problem presents a number of important challenges. First, in partially-observable environments, the agent's observations are not necessarily a sufficient statistic of history, meaning that the agent must consider a (theoretically unbounded) number of previous actions and observations in order to make informed predictions and decisions. When the environment size is truly unknown (as is the case in this work), there is no natural limit on the amount of history the agent may need to take into consideration at each time step in order to understand its environment. Second, stochasticity further compounds the problem of partial-observability by making it difficult for the agent to disambiguate true inherent environment noise from situations in which its model is simply not accurate enough to capture the structure of its environment. A third challenge is that of unknown environment size.
Given that a fixed-size model would amount to the same type of strong prior knowledge we explicitly wish to avoid, it is vital that a solution to the problem of ALFE involve a model with a nonparametric form that can grow or shrink naturally as the agent experiences more of its environment. A fourth major challenge presented by ALFE is that of action selection. If the agent is provided with no external rewards or goals to guide its learning (other than a desire to understand its environment), how can the agent judiciously select its actions? One solution is to simply execute random actions, but this ignores the fact that some action sequences are more useful than others in distinguishing identical-looking states and inferring environment structure. Fifth, ALFE requires that the agent learn a representation of state specified directly in terms of its raw actions and observations, which is challenging given that there may not be any obvious direct mapping between its action-observation space and the latent structure of the environment. This state representation must be usable for subsequent planning and decision making, and, ideally, it should also be usable directly in place of human-designed state spaces. The last major challenge of this problem is that of scalability. It is difficult to design representation or model learning algorithms in the space of agent actions and observations that are also reasonably scalable. Advances in representation and feature learning in areas such as object detection and image recognition (see [22] for a review) as well as successes in deep reinforcement learning [96, 79] have demonstrated that, given a sufficiently deep representation and enough training data and training time, representations that scale to high-dimensional spaces are possible (see, e.g., VGG16 [148]).
However, applying these methods directly in an active and online fashion to the ALFE problem remains infeasible due to, among other things, time restrictions, resource restrictions, and a lack of prior information about the size or state space of the environment. ALFE is an important theoretical step toward what we call active, end-to-end, lifelong model learning and decision making in unknown, partially-observable and stochastic environments. By end-to-end, we mean a learning procedure that is designed to facilitate learning directly from raw experience (actions and observations), with minimal human-designed prior knowledge about the environment. By lifelong, we mean a learning procedure designed to be invoked multiple times throughout the lifetime of the agent or robot, such that it can adapt to changes to its environment or possibly even its own sensing and actuation capabilities over time.

1.2.2 Formal Definition and Key Assumptions

The problem of ALFE, as discussed in the previous section, is a very difficult, large, and open problem. Accordingly, we make the following simplifying assumptions and restrict our attention to simulated embodied agents (as opposed to real-world robotic hardware) in order to make important theoretical progress:

1. We assume that the environment is a discrete, rewardless Partially-observable Markov Decision Process (POMDP) [30]. This restriction enables us to perform important theoretical analyses without sacrificing the fundamental difficulties posed by simultaneous stochasticity and partial-observability.

2. We assume that the set of agent actions, A, and the set of agent observations, O, are discrete and finite.

3. We assume that time is discrete; however, the agent's interactions with its environment are non-episodic, and there is no explicit bound on how much experience the agent might have with its environment during the learning process (the agent must decide this for itself).
If we make the aforementioned assumptions, we can formally define the ALFE problem addressed in this dissertation as follows:

Let E = <Q, A, T, O, Ω> be a rewardless partially-observable Markov decision process (POMDP) environment, where Q is a discrete set of states, A is a discrete set of actions the agent can take, T is a discrete set of state transition probabilities, O is a set of discrete observations the agent may receive from its environment, and Ω are the observation (emission) probabilities. Let R be an autonomous embodied agent situated in E that knows only A and O. At each time-step t ∈ {0, 1, ...}, R receives an observation o_t ∈ O and takes an action a_t ∈ A. Let τ be a finite, possibly random, integer inferred or chosen by R. How can R learn an approximate, task-independent representation of E's state, Q, and dynamics, T and Ω, via a finite sequence of interactions o_0, a_0, o_1, a_1, o_2, a_2, ..., o_τ, a_τ with E?

The embodied agent R knows nothing about Q, T, or Ω. Furthermore, it knows nothing about the relationship between O and A (i.e., how its observations change when it performs actions). It knows only the observations it can make about its environment and the actions it can take. It is given no other prior knowledge about its environment, nor is it given a reward function to aid its learning or assist it in choosing actions. The agent is not able to reset or transport itself to a known initial state during the learning process, and learning is not partitioned into separate episodes. The fact that the agent must choose τ means that the agent must be able to identify for itself when the learning process has converged to a solution. Learning may be initiated from any state in the POMDP; there is no designated start state. Finally, the restriction that the model be task-independent means that it must not be tied to any particular task but must rather be a general-purpose representation of E's state and dynamics in terms of the agent's actions and observations.
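To make the interaction loop in this definition concrete, the following minimal sketch simulates a rewardless POMDP from the agent's point of view. The class, its interface, and the toy two-state environment are illustrative assumptions of ours, not part of the dissertation's formalism:

```python
import random

class RewardlessPOMDP:
    """A rewardless POMDP E = <Q, A, T, O, Omega> (a hypothetical sketch).

    T[q][a] maps each next state to its probability; Omega[q] maps each
    observation to its emission probability. The agent never sees the
    underlying state q directly -- only observations drawn from Omega[q].
    """

    def __init__(self, Q, A, T, O, Omega, seed=0):
        self.Q, self.A, self.T, self.O, self.Omega = Q, A, T, O, Omega
        self.rng = random.Random(seed)
        self.q = self.rng.choice(Q)  # no designated start state

    def _sample(self, dist):
        # Draw one outcome from a dict mapping outcome -> probability.
        r, acc = self.rng.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding

    def observe(self):
        return self._sample(self.Omega[self.q])

    def step(self, a):
        self.q = self._sample(self.T[self.q][a])


# A tiny 2-state environment: both states usually emit "black", so a
# single observation cannot distinguish them (partial observability),
# and emissions are noisy (stochasticity).
Q, A, O = ["q0", "q1"], ["x"], ["black", "white"]
T = {"q0": {"x": {"q1": 1.0}}, "q1": {"x": {"q0": 1.0}}}
Omega = {"q0": {"black": 0.9, "white": 0.1},
         "q1": {"black": 0.9, "white": 0.1}}

env = RewardlessPOMDP(Q, A, T, O, Omega)
experience = []
for t in range(6):           # the interaction sequence o_0, a_0, o_1, a_1, ...
    o_t = env.observe()      # agent receives an observation
    a_t = "x"                # agent chooses an action (here: the only one)
    env.step(a_t)
    experience.append((o_t, a_t))
```

Note that the agent's side of this loop sees only the `(o_t, a_t)` stream; everything the ALFE problem asks it to learn (Q, T, Ω) lives behind the `env` abstraction.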
This means, for example, that the same environment model can be used to achieve different goals (e.g., navigate to different target locations).

1.3 Overview of our Approach

Though partial-observability and stochasticity are rampant in real-world environments and pose significant challenges, our planet is teeming with biological life that is capable of learning models of unknown environments and using these learned models to achieve goals. In a famous set of experiments [158], Tolman and Honzik demonstrate that rats do indeed seem to learn models of unknown maze environments from experience, even when there is no explicit reward given to them. When a reward is later offered to them for solving the maze, they exploit these learned models to quickly gather this reward. The Morris water maze [101] and its various extensions demonstrate the ability of rodents to learn a spatial representation sufficient to escape a water container via an invisible submerged platform. In [106], the authors study the spatial abilities of chimpanzees on the Ivory Coast, concluding that these wild chimpanzees likely exploit complex Euclidean representations of their environments built from experience in order to find food. In [25], the authors provide compelling evidence that lobsters use the magnetic field of the earth to localize and orient themselves, even when transported to a completely unfamiliar location. These are only a few of the incredible number of examples of autonomous learning in nature that seem to indicate that biologically-inspired approaches might be a powerful way to approach the problem of ALFE.
In this work, we address ALFE in stochastic and partially-observable environments with a nonparametric probabilistic modeling approach called Stochastic Distinguishing Experiments (SDEs) [38] and a biologically-inspired algorithmic framework for actively and incrementally learning these models from agent experience (actions and observations) called Probabilistic Surprise Based Learning (PSBL) [38, 35, 39]. The main idea behind SDE modeling is to hierarchically organize agent experience into key sequences of ordered actions and associated expected observations (i.e., experiments the agent can perform in its environment) that serve as a means for statistically disambiguating identical-looking environment states or history contexts. Taken together, these SDEs (along with associated probability distributions) form an approximate and task-independent representation of the state and dynamics of the agent's environment.

Probabilistic Surprise Based Learning (PSBL) is an active and incremental procedure that learns SDE models of unknown environments directly from agent experience, without requiring any environment resets or an explicit bound on the maximal length or number of SDEs that can be created. The key idea is that the agent begins with a minimal set of SDEs and continuously designs and performs experiments intended to test whether extensions to these SDEs result in a model in which the agent is likely to be surprised less frequently by observations that do not match its predictions (the implicit assumption being that the agent predicts, at each time step, the most likely observation to be encountered, in order to maximize its predictive accuracy in its environment). A perfect SDE model (a notion we define more rigorously for stochastic and partially-observable environments in Chapter 4) is one in which the agent is surprised no more frequently than the Bayes error rate of the environment, and PSBL proceeds by increasing model complexity only when doing so is expected to reduce surprise frequency. Thus, the approach we take to solving the problem of ALFE in this dissertation is to attempt to find a model that results in minimal surprise frequency. In this dissertation, we present and rigorously analyze two distinct variants of SDE modeling and PSBL and demonstrate their effectiveness in learning high-quality, task-independent state and dynamics representations in a broad range of environments (under a
In this dissertation, we present and rigorously analyze two distinct variants of SDE modeling and PSBL and demonstrate their eectiveness in learning high-quality, task-independent state and dynamics representations in a broad range of environments (under a 2 The implicit assumption, here, is that the agent is making predictions at each time step according to the most likely observation to be encountered (in order to maximize its predictive accuracy in its environment). 3 We more rigorously dene the notion of a perfect model in stochastic and partially-observable environments in Chapter 4. 9 variety of noise levels) in a way that allows for subsequent (near-optimal) decision-making and planning across multiple tasks. 1.4 Scientic Contributions This dissertation has addressed the challenging and open problem of ALFE in stochastic and partially-observable environments, resulting in four main scientic contributions: 1. A novel family of probabilistic, nonparametric models for representing rewardless deter- ministic and POMDP environments called Stochastic Distinguishing Experiments (SDEs). SDEs are a principled generalization of Shen's Local Distinguishing Experiments [141] to stochastic and partially-observable environments. (a) Predictive SDE (PSDE) models are tree-structured approximations of the history- dependent probability of future observation sequences given any sequence of agent actions (conditioned on any possible agent history) which form an implicit predictive representation of environment state and dynamics without needing to explicitly attempt to model latent environment structure. [38]. 
(b) Surprise-based Partially-observable Markov Decision Processes (sPOMDPs) are a hybrid latent-predictive representation of state and dynamics which use SDEs to uncover and explicitly model hidden latent structure in the agent's environment, resulting in a model with an explicit learned state space that can be used directly in place of traditional human-designed POMDPs [35, 39]. These models overcome some theoretical limitations of purely predictive approaches (e.g., Predictive State Representations [84]) by providing a formal relationship between predictive experiments and latent environment structure.

2. A novel biologically-inspired algorithmic framework for actively and incrementally learning SDE representations of unknown, rewardless deterministic and POMDP environments directly from agent experience called Probabilistic Surprise-Based Learning (PSBL). PSBL extends traditional surprise-based learning (SBL) algorithms [137, 141, 117] to stochastic and partially-observable environments and generalizes the definition of surprise (as used in these works) in a way that is applicable to both deterministic and stochastic environments. Deriving this learning procedure required that we generalize the key theory and definitions associated with the problem of ALFE (as originally laid out by Shen in [140] for deterministic environments) such that they became well-suited to stochastic and partially-observable environments while remaining applicable to deterministic ones. This enables us to formally define the ALFE problem as that of minimizing model error and usefully frame our unified solution to it (PSBL), which applies to both PSDE and sPOMDP models, as that of minimizing model surprise (Chapter 4).

3.
Formal mathematical proofs that PSBL learning of PSDE models in rewardless POMDP environments converges to a solution in finite time with probability 1 (with no user-defined bounds required on the maximum length or number of PSDEs allowed in the model) and a rigorous analysis of the worst-case computational complexity of this procedure [38]. Extensive experimental results have demonstrated that PSBL for both PSDEs and sPOMDPs converges to a solution in finite time with no user-defined bounds on the length or number of SDEs in the model (though we do not yet have a formal proof for this in the case of sPOMDP models).

4. Formal mathematical proofs of the representational capacity of sPOMDP models demonstrating that they are capable of perfectly representing minimal deterministic finite automata (DFA) and a useful subclass of POMDP environments (with equivalent compactness) [39]. This presents a solution to a significant open theoretical problem in the SBL literature [142]. These constructive proofs also lead to a provably-optimal sPOMDP learning algorithm, which, given infinite sampling capabilities (in the form of an oracle that gives the correct transition probabilities) and perfect localization, can learn a perfect sPOMDP model (provided environment noise levels are within certain bounds, which we formally derive).

(The way we define surprise in this work differs somewhat from other uses of the word in the literature. Nevertheless, given the heritage of the word in the works that inspired this one, we feel that it is the appropriate term to use. In order to avoid confusion, we discuss the relationship between our use of the word "surprise" and its other uses in the literature in Chapter 2.1.)

1.5 Dissertation Organization

This dissertation is organized into 9 chapters. In Chapter 2, we thoroughly discuss related approaches in order to put this work into the proper context.
In Chapter 3, we discuss SBL approaches to solving the ALFE problem in fully deterministic environments. The limitations of these previous SBL approaches (and their extensions) when applied to stochastic and partially-observable environments motivate the approaches laid out in this dissertation.

In Chapter 4, we generalize Shen's formulation of the ALFE problem in deterministic environments [142] such that it becomes suitable for stochastic, partially-observable environments as well. We also generalize key theoretical concepts in SBL (most notably the definition of surprise) to stochastic and partially-observable environments such that SBL becomes amenable to the use of probabilistic techniques to model noise in a principled fashion. We then provide an overview of the unifying principles behind Stochastic Distinguishing Experiments (SDEs, [38]) and Probabilistic Surprise-Based Learning (PSBL, [38, 39]) and discuss the key ways in which they overcome some of the limitations of previous SBL approaches.

In Chapter 5, we detail Predictive SDE models (PSDEs, [38]) and PSBL variants for learning these models from experience. Predictive SDE models are characterized by the fact that they directly approximate the history-dependent probability of future observations without attempting to infer or model latent structure in the environment. In Chapter 6, we detail surprise-based Partially-Observable Markov Decision Processes (sPOMDPs, [39]) and PSBL variants for learning them from experience. In contrast to PSDE models, sPOMDP models are hybrid latent-predictive models in which latent states are created and uniquely represented by the results of executing SDEs. Chapter 7 presents our key theoretical results, including formal proofs of the convergence and computational complexity of PSBL learning of Predictive SDE models [38] and the representational capacity of sPOMDP models [39].
Chapter 8 presents extensive experimental results validating our theoretical analyses and demonstrating the effectiveness of SDEs and PSBL across a range of environments, tasks, and noise levels. Finally, in Chapter 9, we conclude this dissertation with a summary of the work completed, some remarks about limitations, and notes on future research directions.

Chapter 2

Related Work

In this chapter, we discuss the current state-of-the-art in a number of different fields which are related to the problems addressed and techniques used in this dissertation.

2.1 On the Use of the Word Surprise

The word surprise has been used in a number of interesting ways in the literature, owing, it would seem, to the difficulty inherent in mathematically modeling all the subtleties of such a useful biologically-inspired concept. The definitions of surprise that have thus far been offered fall into roughly two categories [50]. The first category is based on Shannon surprise, surprisal, or information content, which is, intuitively, the amount of information one gains from sampling a random variable. It is defined as the negative log-probability of a random event [135, 110]. Works based on this definition consider unlikely events to be more surprising than more likely events. Note that the Shannon entropy [135] of a random variable is the expected value of its Shannon surprise. The second primary category of surprise in the literature is what is known as Bayesian surprise [64, 12, 11]. At a high level, Bayesian surprise is defined as the distance that a posterior distribution moves away from a prior distribution when new data are conditioned upon. This definition is motivated by a significant limitation of Shannon surprise: Shannon surprise does not explicitly take into account the important fact that new data will have different significance (i.e., be more or less surprising) to different observers of that data depending on their existing (prior) world models.
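These two notions can be made concrete with a small numeric sketch (the distributions below are invented for illustration; Bayesian surprise is computed here as the KL divergence of the posterior from the prior, one common formalization of the "distance" mentioned above):

```python
import math

def shannon_surprise(p_event):
    """Shannon surprise (surprisal): negative log-probability of an event, in bits."""
    return -math.log2(p_event)

def entropy(dist):
    """Shannon entropy: the expected value of the Shannon surprise of a distribution."""
    return sum(p * shannon_surprise(p) for p in dist if p > 0)

def bayesian_surprise(prior, posterior):
    """Bayesian surprise: KL(posterior || prior), how far beliefs moved after seeing data."""
    return sum(q * math.log2(q / p) for p, q in zip(prior, posterior) if q > 0)

# A rare event carries more Shannon surprise than a common one.
print(shannon_surprise(0.5))   # 1.0 bit
print(shannon_surprise(0.01))  # ~6.64 bits

# Two observers condition on the same data; the one whose beliefs
# must move farther experiences more Bayesian surprise.
posterior = [0.7, 0.3]
print(bayesian_surprise([0.5, 0.5], posterior))  # positive: beliefs shifted
print(bayesian_surprise([0.7, 0.3], posterior))  # 0.0: prior already matched
```

Note that the same data yield zero Bayesian surprise for an observer whose prior already matched the posterior, which is precisely the observer-dependence that Shannon surprise fails to capture.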
In other words, the prior beliefs of each observer (or agent) have an important effect on the degree to which newly observed data are considered surprising. Bayesian surprise has interesting applications to, for example, optimal environment exploration and policy learning [153, 83] (though the problem of state space learning is not considered in these works). In [50], Faraji et al. define surprise as a linear combination of Shannon surprise and Bayesian surprise, which has some interesting statistical properties. This definition allows them to cast the problem of learning in unknown environments as the problem of minimizing their definition of surprise (similar, in some ways, to what we do in Chapter 4). They even consider this in the context of color maze environments (similar to what we consider as part of our evaluation in Chapter 8). Crucially, however, they do not consider the vital problem of choosing agent actions. Their work instead focuses exclusively on updating agent beliefs about model parameter values using surprise as a modulation technique under automatic agent transitions. Thus, the environments they consider are Hidden Markov Models (HMMs, [17, 16]), not controlled processes. Even more importantly, the agent is given full knowledge of the underlying environment state space in these HMMs and needs only to localize itself and recover the transition probabilities of the environment. The key benefit of their approach is that the introduction of surprise allows the agent to track gradual or sudden changes to its environment over time. In contrast, in our work on SDE modeling in this dissertation, the agent is responsible for selecting its own actions and for inferring a representation of the state space of the environment itself (in addition to learning the transition and observation probability distributions defining that environment).
We do not, however, consider environment changes over time in our work (this is left as an important future research direction). Our use of the word surprise in this dissertation is more closely related to Shannon surprise than to Bayesian surprise, but it differs from both. The work in this dissertation is based on previous approaches to Surprise-Based Learning (SBL) of state representations (e.g., [137, 117]). In these earlier works, surprises were defined as single events in which the agent's actual and predicted observations differed from one another. In this dissertation, we call such an event a prediction failure. The key idea behind extending prediction failures into surprise in stochastic environments is as follows: though individual prediction failures provide (relatively) little information about the quality of the agent's model by themselves in stochastic environments, the frequencies with which these prediction failures occur under different model structures give us important information about the relative quality of different candidate models. Assuming that the agent makes predictions at each time step according to the most likely observation it expects to receive (in an attempt to maximize predictive accuracy), the lower the frequency of these prediction failures, the more accurately we should expect our agent to predict in its environment. The lower the entropy of the distributions defining the agent's model (provided they are correct), the fewer prediction failures we should expect to receive. The definition of surprise we use in this work (see Chapter 4.3) is a weighted average of the entropies of the probability distributions defining our model. Each such distribution can be viewed as a predictor of sorts.
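As an illustrative sketch of this weighted-average notion (the actual weighting scheme is defined in Chapter 4.3; the uniform weights and two-outcome distributions below are invented for illustration only):

```python
import math

def entropy(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def model_surprise(predictors, weights):
    """Model surprise: a weighted average of the entropies of the
    predictive distributions making up the model."""
    return sum(w * entropy(d) for w, d in zip(weights, predictors)) / sum(weights)

# A "sharp" model whose predictors concentrate probability mass
# versus a "vague" model whose predictors are near chance.
sharp = [[0.9, 0.1], [0.95, 0.05]]
vague = [[0.5, 0.5], [0.6, 0.4]]
uniform = [1.0, 1.0]

print(model_surprise(sharp, uniform))  # low surprise: few expected prediction failures
print(model_surprise(vague, uniform))  # high surprise: predictions near chance
```

Minimizing this quantity favors models whose predictive distributions concentrate probability mass, which, when those distributions are accurate, is exactly when most-likely-observation predictions fail least often.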
Since entropy is itself the expected value of Shannon surprise, we can view the definition of surprise or model surprise in this work (Chapter 4.3) as a weighted average of the expected values of the Shannon surprise associated with each probability distribution defining our agent's model. In this way, minimizing our definition of model surprise has a close relationship to minimizing expected Shannon surprise. Given this relationship and the heritage of the word surprise in the work that inspired the approaches in this dissertation, we argue that this is an appropriate term to use. In the rest of this dissertation, unless otherwise noted, when using the word surprise, we are referring to our definition of model surprise as laid out in Chapter 4.3.

2.2 Surprise-Based Learning of State and Dynamics

The main ideas behind surprise-based learning (SBL) of state and dynamics representations in unknown environments were first presented by Shen and Simon in [143]. These initial ideas were later formalized as Complementary Discrimination Learning (CDL) [137], which exploits the equivalence between discriminating a concept and generalizing the complement of this concept in formal logical languages (e.g., propositional logic) in order to learn a target concept autonomously from training instances. CDL was enhanced to support autonomous state representation learning in deterministic and partially-observable environments through the use of Local Distinguishing Experiments (LDEs) [141, 140], which are sequences of actions the agent can perform in its environment in order to perfectly disambiguate states that appear identical (i.e., emit the same observation) to the agent. These initial forms of SBL were heavily influenced by results in psychology and the philosophy of science, most notably Simon and Lea's dual-space theory of problem solving and induction [147].
At a high level, this theory describes problem solving as intimately related searches in dual spaces: one space consists of all possible general-purpose rules defining the task or environment at hand, while the other space consists of possible problem instances in which these rules are tested and demonstrated. Clearly, these spaces are distinct but also closely related: the testing of rules leads to the generation of new problem instances, which, in turn, cause the learner to re-evaluate the learned rules. SBL was also influenced by Piaget's theory of cognitive development [114], which holds that children model the world around them, observe mismatches (called prediction failures in the SBL literature) between what they currently know and what they observe in their environments, and then update their models of the world accordingly to resolve these discrepancies and bring their representation of the world into alignment with their sensory experiences. Finally, SBL was influenced by the philosophical pragmatism of C.S. Peirce [112], which holds that our concept of any object is defined by the sum total of the practical effects it might conceivably have.

SBL has been applied to a number of simulated and real-world problems since its inception. These include biological gene discovery [136, 144], game playing [140], and learning from large knowledge bases [139]. In [117], Ranasinghe and Shen further refined SBL, extended it to handle continuous observations, and demonstrated its use in facilitating learning on real robots in response to unexpected changes. In [118], Ranasinghe utilized SBL to dynamically learn the state space of a Markov chain that performed human activity recognition from video stream data and demonstrated its resilience to gaps and sensory interference.
These existing SBL techniques all utilize prediction rules based on propositional or first-order logic in order to predict the results of future observations, and they modify their models in response to each prediction failure (in these works, prediction failures are called surprises), which is a single mismatch between what the agent expects to see and what it actually observes. This behavior leads to severe overfitting in noisy environments because of the fundamental uncertainty associated with individual observations. Even with a perfect model of a stochastic environment, prediction failures will still occur. Splitting these rules in response to each prediction failure leads to many unnecessary and useless prediction rules, because environment noise is not being modeled in a principled fashion. Additionally, hidden-state detection and modeling techniques in the SBL literature, most notably Local Distinguishing Experiments (LDEs) [141], are usable only in fully deterministic environments.

Stochastic Distinguishing Experiments (SDEs) and Probabilistic Surprise-Based Learning (PSBL) generalize the surprise-based learning (SBL) framework discussed above to stochastic and partially-observable environments. In contrast to traditional SBL approaches, SDEs and PSBL handle environment noise in a principled way with probabilistic modeling techniques in order to reduce SBL's propensity for overfitting, while still remaining applicable to deterministic environments. This work formally extends the definition of surprise in these works to stochastic environments in a way that allows us to cast the problem of autonomous learning from the environment (ALFE) as a minimization problem (where surprise is the quantity to be minimized). This new formulation of the problem of ALFE allows for a much broader range of techniques to be brought to bear on it, several of which are explored in this dissertation.
Please see Chapter 4.6 for a fuller discussion of the key contributions of the work in this dissertation to the SBL literature.

2.3 Universally Optimal Learning Agents

At the broadest level, one might consider how agents based on universally optimal theories of inductive inference and decision making might fare on the problem of ALFE as defined in this dissertation. AIXI [63] is an extension of Solomonoff's theory of inductive inference [150, 151] to the case of reward-seeking agents that can affect their environments via actions. AIXI is a universally optimal reward-seeking agent (under some technical conditions). Unfortunately, AIXI is also non-computable, though computable approximations such as AIXItl do exist that have interesting theoretical properties of space- and time-limited optimality [63]. Additionally, the broad claims of universal optimality of AIXI agents do not hold in POMDP environments in general (see chapter 3 of [77]). The most significant difference between the problem of ALFE and the problems addressed in these works is that an ALFE agent is not provided with any external reward function to help guide its actions or generate its model of the environment.

2.4 Learning Deterministic Finite State Machines

SBL approaches are related to a number of works in the AI literature that focus on the problem of learning representations of unknown deterministic automata. These include Gold's algorithm for system identification in the limit [55], which is itself an alternative view of work by Arbib and Zeiger on system identification [8]. Gold presents algorithms for learning state representations of unknown deterministic Mealy machines [93], Moore machines [100], and discrete-time linear automata via careful experimentation (automatically crafted by the agent to speed learning), though he assumes that the environment can be reset to a known initial configuration.
However, it is worth noting that this problem is extremely difficult: Gold proves that the problem of finding the smallest automaton consistent with given input/output samples is NP-complete [56], and Pitt and Warmuth proved that even finding high-quality approximations is intractable [116]. Angluin further proved that such learning is intractable in general, even when the agent is capable of choosing the inputs to give to the machine, as opposed to just passively observing them [6].

Despite these discouraging theoretical results, Angluin developed a tractable alternative learning methodology called L* that operates via a combination of active and passive learning. The agent experiments actively in its environment but is also given counterexamples to incorrect hypotheses by a teacher [7]. However, this procedure still relies critically on a method for resetting the environment to a known initial state. Rivest and Schapire [119] improved on the L* learning algorithm by utilizing homing sequences for agent localization, thereby removing the need for an environment reset procedure. They also present algorithms that do not require a teacher to provide counterexamples and still learn perfect models with arbitrarily high probability (where this probability is related to the number of random actions taken in the environment). Shen's D* algorithm [141] extends upon Rivest and Schapire's work by modeling unknown environments using a combination of observable and latent symbols constructed using a representation called Local Distinguishing Experiments (LDEs). These LDEs are learned from prediction failures. In this work, Shen proves that these LDEs form a homing sequence when concatenated together in the order of their creation (preventing the need for an environmental reset) and proves that the learned model is Probably Approximately Correct (PAC) in accordance with the theory laid out by Valiant in [160].
PSBL and the SDE modeling approaches laid out in this dissertation differ fundamentally from the above work in that they are applicable to both deterministic and stochastic environments and they handle noise in a principled fashion using probabilistic modeling techniques. However, this work was inspired heavily by Shen's LDEs and other SBL work in deterministic environments, which, in turn, was inspired by many of the approaches laid out above. Additionally, the above works are important to this dissertation, because they have some relationship to the theoretical results presented in Chapter 7.

2.5 Representation Learning and Deep Learning

ALFE is related to the larger problem of representation learning, which includes automatic feature learning and extraction as sub-problems. This field is currently dominated by research in probabilistic graphical models [73] and deep neural networks [57]. These techniques have achieved tremendous success in signal processing, speech and object recognition, artificial intelligence, and natural language processing, among many others [22]. The recent successes of deep reinforcement learning [82] in learning to play Atari games better than human experts from raw pixel input and defeating professional human Go players are particularly poignant examples of the power of deep representations [95, 96, 146]. Interestingly, neural networks and graphical models have deep connections [57], with some of the most recent connections being discovered in the search for general and universal artificial intelligence algorithms and representations [122, 121]. Though there has been some work that considers the problem of inferring optimal neural network structure [113, 24] or even integrating deep learning with other non-parametric approaches (e.g., Bayesian) [168] to encode useful prior information, deep learning remains primarily focused on parametric models.
The network architecture and number of hidden units/layers in these models often require extensive manual tuning and experimentation to yield high-quality results. In recurrent neural networks (e.g., LSTMs [60], which are designed, among other things, for modeling dynamical systems), the number of hidden units is related to the number of underlying system states postulated by the model (see chapter 10 of [57]). Such information is not available in most real-world situations without extensive human engineering.

Researchers have recently begun to explore the areas of active deep learning [165, 133, 52] and even online deep learning [129]. Thus far, active deep learning has tended to focus its attention on reducing the amount of training data required by judiciously selecting which examples should be used for training from a pool of available training data. Online deep learning has focused its attention on passively processing sequences of training examples as they arrive (rather than having the entire training set available all at once) in order to operate in the many real-world scenarios in which training data are not available all at once and even to track changes to target concepts as they drift over time. To the best of our knowledge, such techniques have not been applied to problems sharing any significant degree of similarity to ALFE. In SDE modeling, including both Predictive SDE modeling (Chapter 5) and surprise-based POMDP modeling (Chapter 6), the environment model is allowed to grow (and even shrink) as more data are observed, and there are no explicit bounds required on the number of SDEs that might make up the agent's model (or their lengths), making the approach fully nonparametric. A nonparametric model is important in ALFE, because the agent is given no prior knowledge about the size or dynamics of its environment. In addition, SDE models can be learned online in a fully active and incremental way using PSBL.
2.6 Temporal and Nonparametric Graphical Models

Dynamic Bayesian Networks (DBNs, [41, 42]), which include Hidden Markov Models (HMMs, [17, 16]) as a special case, also require an a priori specification of the state variables, their possible values, and how they are interconnected over time. Recent work such as the infinite Dynamic Bayesian Network (iDBN) [45] seeks to overcome this issue by using a Bayesian nonparametric model (in the form of a number of stochastic processes) to place a prior over possible DBN structures with an unbounded number of potential variables, values taken on by those variables, and connections between variables. The posterior distribution over possible model structures given the observed data is approximately inferred using Markov Chain Monte Carlo (MCMC) [5] techniques. This work differs fundamentally from the work presented in this dissertation in that it does not consider the role of agent actions in the learning process.

A very similar nonparametric model was proposed for learning state representations of unknown POMDPs called the infinite Partially-Observable Markov Decision Process (iPOMDP, [46, 47]). One crucial difference between iPOMDPs and the approach presented in this dissertation is that PSBL does not require an external reward function for action selection: iPOMDPs require an expensive forward-looking search tree at every time step to select an approximately optimal next action based on a given reward function. This action selection routine also requires the offline solving of candidate POMDPs to estimate the Q values of various actions. In contrast, in our work, the agent uses SDEs to autonomously select actions in a task-independent way that does not require an external reward function or the offline solving of candidate models, and the entire PSBL learning process occurs online in an active and incremental fashion.
2.7 Reinforcement Learning

Much of the learning work regarding Partially-Observable Markov Decision Processes (POMDPs, [68]) focuses on discrete POMDPs and is in the area of reinforcement learning (RL, [154]), where the goal is typically not to construct a state representation but rather to learn an optimal policy (and often the POMDP's dynamics) given a known state space. Traditional exact and approximate techniques for solving POMDPs for optimal policies include [102, 115, 125]. One of the most prominent exceptions is the class of instance-based (IB) RL methods [89, 85, 90, 91, 134], which memorize interactions with their environment and organize them into a suffix tree of actions and observations that approximates an unknown discrete or continuous state space nonparametrically. Crucially, this representation is task-specific and depends fundamentally on the given reward function. Predictive SDE modeling (Chapter 5) can be viewed as building a type of task-independent suffix tree whose suffixes are mutually exclusive and exhaustive, but PSDEs require no memorization of instances (reducing required memory and avoiding the difficult issue of determining a memory size large enough to contain the representative samples needed for good probability estimates) and require no reward function for action selection. Also, in contrast to notable IB methods such as [169] (as well as a large number of RL algorithms), PSBL does not require separate training episodes to learn an environment model.

One shortcoming of traditional RL approaches like Q-learning [161], SARSA [126], and Temporal Difference (TD) learning [154] is that the given reward function is usually associated with a single goal or outcome, making it difficult to transfer learned knowledge between problems or cope with nonstationarity. Multiple Model-based Reinforcement Learning (MMRL) [48] seeks to address this issue by maintaining multiple weighted RL models whose weights change over time in response to the current goal.
None of these approaches addresses the problem of state representation learning, however, as they rely on an a priori specification of the agent's state space. This makes the problem being solved fundamentally different from ALFE. Additionally, SDE modeling differs from MMRL in that it maintains a single environment model which it modifies in response to environmental experience.

There is a line of more recent work in deep reinforcement learning [82] that also leverages mismatches between expected and actual sensory feedback in response to actions. This work can be broadly characterized into two related approaches: reinforcement learning via intrinsic motivation [14, 32, 97, 2] and reinforcement learning via artificial curiosity [152, 53, 111]. The primary unifying theme of these approaches is that, since external rewards are often extremely sparse (if they exist at all), RL agents should use self-defined (intrinsic) and goal-independent metrics/rewards to evaluate their own performance and push themselves into novel situations in order to develop a hierarchy of skills that are broadly useful across a wide range of tasks. However, to the best of our knowledge, these approaches rely on a human-designed state space as input. In [111], an RL algorithm is presented in which a deep neural network generates a curiosity signal that helps an RL agent effectively explore its environment and learn important skills in order to maximize its ability to correctly predict future observations in a learned feature space. The agent is validated in video game environments with high-dimensional visual observation spaces (in this case, frames of a video game emulator). A state space, consisting of bundles of fixed numbers of frames, is provided for the agent a priori.
In [53], a curiosity-driven RL approach was successfully used for motion planning on real-world humanoid robotic hardware, though the authors indicate that the main drawback of their work is the need for the manual engineering of the state-action space in which the robot operates. In [97], a state representation of raw pixel data learned by a deep neural network is used in combination with a novel mutual information optimization algorithm to aid an RL agent in developing useful behaviors with no external rewards. Please see [109, 108] for an overview of curiosity and intrinsic motivation in the RL literature. SDE modeling and this line of RL research are both interested in mismatches between expected and actual observations in response to actions, but these RL approaches focus on using these mismatches to learn general-purpose skills, whereas SDE modeling uses them to learn a task-independent representation of environment state and dynamics.

The vast majority of research that considers learning in POMDP environments assumes that the state space of the POMDP model is known a priori. Even recent approaches that attempt to infer POMDP models in a Bayesian fashion from data make this fundamental assumption and focus on learning transition and observation probability distributions (while localizing the agent) [124, 70]. There are a few works, however, that have considered the problem of learning POMDP state representations explicitly. In [40], the authors develop a novel loss function and use a linear model to build a representation of POMDP state for the mountain car problem [99] (which is a continuous-state POMDP) with success. However, the model is trained offline using randomly selected sample trajectories of fixed size, and the problem of action selection is not considered in this work.
In [87], Mahmud considers the problem of constructing state representations of unknown stochastic and partially-observable environments but limits consideration to a subclass of POMDP environments in which each agent history deterministically maps to one of (finitely) many states (which are a sufficient statistic of agent history). Given this mapping between histories and states, the environment becomes an MDP that can be used for subsequent planning. In ALFE, we do not make such strong assumptions about the nature of the agent's environments. Additionally, this approach is an extension of instance-based RL methods [89, 85, 90, 91, 134], the advantages and disadvantages of which we discussed above. In [61], the authors use ground-truth geometric prior information to abstract a POMDP representation (including learning the state space) from a low-level, continuous geometric state space which is provided as input. They demonstrate the feasibility of this approach on simulated grasping tasks. In ALFE, we are given no information (geometric or otherwise) about the underlying environment state space (even at a low level), which makes such an abstraction infeasible. Clearly, learning state representations of unknown POMDP environments (particularly in an online, active, and incremental fashion) is still very much an open problem. See Chapter 6 for our approach to this problem, which we call sPOMDP modeling.

2.8 Predictive State Representations

Predictive state representations (PSRs) [84] model the state of controlled dynamical systems in terms of the probabilities of tests (experiments), which consist of ordered sequences of actions and expected observations. For every controlled dynamical system, there exists a linearly independent set of tests that represents its state (i.e., is a sufficient statistic for computing the probability of any test) and that is no larger than its minimal POMDP representation.
This establishes PSRs as a non-inferior alternative to the latent representation offered by POMDPs. Algorithms exist for finding a set of core tests that forms a PSR [163, 92] and for learning the parameters of the set of core tests [149, 163, 92]; these two steps generally must be performed separately and are most commonly performed offline using a human-designed or random action-selection scheme. Algorithms for discovering the set of core tests generally proceed in a search-and-check fashion in which subsets of potential tests are repeatedly examined for linear independence using singular value decompositions [26]. Many of these algorithms can only guarantee that the discovered set of tests is linearly independent, not that it is necessarily a valid PSR. More recent techniques, such as Transformed PSRs (TPSRs) [123, 26, 58], alleviate the difficulty of discovering core tests by requiring the use of a large (and likely redundant) set of tests that contains a sufficient subset, or they resort to a heuristic stochastic search [86] for a valid set of core tests. In [26], Boots et al. present some of the most promising PSR results to date, demonstrating the effectiveness of TPSRs on simulated autonomous robotic navigation using simulated vision in a continuous environment. However, the learning is neither active nor incremental, as the tests and associated parameters are learned offline from a fixed number of random trajectories of fixed length. This means that the problem of action selection is not considered (as it is in this dissertation). Additionally, the performance of the algorithm was heavily dependent on the dimensionality of the TPSR model, and the authors state that this quantity was set manually to provide optimal performance. This is a manual specification of fundamental prior information about the nature of the robot's environment (albeit in a somewhat indirect fashion). Nevertheless, these results demonstrate the potential scalability and utility of predictive representations.
Like PSRs and their extensions, SDE models utilize experiments the agent can perform in order to form a representation of the state and dynamics of partially-observable and stochastic environments in terms of quantities that are directly observable and controllable by the agent. However, it is important to note that SDEs are defined and used quite differently than tests in a PSR. PSRs focus on the linear independence of the probabilities of groups of tests, whereas each SDE in an SDE model is utilized due to its ability to statistically disambiguate identical-looking states or histories (contexts). Unlike most work in the PSR literature, SDEs can be discovered and their parameters learned in an integrated fashion, actively and incrementally. Additionally, SDEs need not be linearly independent, saving expensive linear independence checks. This also makes it easier to incrementally grow larger SDEs from smaller ones or even merge SDEs together as the agent's experience indicates a more (or less) complex representation is needed. Finally, it is worth mentioning that SDEs and PSRs share similar motivations. As Littman notes in [84], predictive models (e.g., PSRs and SDEs) built in terms of an agent's actions and quantities it can directly observe are potentially easier to learn and generalize better than those based on unobserved (latent) state spaces. The potential for learnability is due to the fact that the probabilities of experiments can be easily estimated by counting outcomes, while the potential for generalization is due to some results in the theory of learning that suggest that states that are themselves predictions may provide features that aid in making further predictions [84, 157, 18].
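The counting intuition behind such estimates can be sketched with a toy example (the trace, action names, and helper function below are invented for illustration; real PSR learners estimate many tests jointly and condition on history more carefully):

```python
def estimate_test_probability(history, test):
    """Estimate the probability that a test's observation sequence occurs,
    given that its action sequence is executed, by counting outcomes in a
    recorded history. Both arguments are lists of (action, observation) pairs."""
    n = len(test)
    attempts = successes = 0
    for i in range(len(history) - n + 1):
        window = history[i:i + n]
        if [a for a, _ in window] == [a for a, _ in test]:
            attempts += 1
            if [o for _, o in window] == [o for _, o in test]:
                successes += 1
    return successes / attempts if attempts else None

# Hypothetical trace from a two-action, two-observation environment.
trace = [('left', 'odd'), ('right', 'even'), ('left', 'odd'),
         ('right', 'odd'), ('left', 'odd'), ('right', 'even')]

# 2 of the 3 times 'right' was executed, 'even' was observed.
print(estimate_test_probability(trace, [('right', 'even')]))
```

The estimator simply divides successful outcomes by attempts, which is why predictive quantities of this kind are directly learnable from raw experience without any latent-state inference.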
2.9 Compression and Decision Trees

SDE models, in particular the Predictive SDE models presented in Chapter 5, are related to some techniques in the compression and prediction literature, such as Variable Order Markov Models (VMMs) for prediction [19, 43] and recent extensions of these methods to controlled processes [44] for online Bayesian inference and decision making. PSBL learning of Predictive SDE models can be viewed as building a tree that bears similarity to the context and/or suffix trees created by traditional VMM approaches for prediction (though SDEs are created in a much different way, using surprise); however, such approaches do not consider how to select actions to improve the model or even how actions might be included in the model at all. Incorporating actions into these models is a non-trivial task that has been considered by recent extensions of these methods to controlled processes [44]. This work focuses on prediction, learning, and decision making in unknown POMDPs without explicitly modeling the state (though evidence is presented that the state is modeled approximately as a side effect). Additionally, actions are selected based on a given reward function rather than actively selected in a task-independent way in order to try to improve the model (as in PSBL).

Predictive SDE models can also, in some sense, be viewed as decision trees [159] that classify agent histories. However, since there is no limit on the amount of history that any SDE can represent, the number of attributes on which to split is unknown and, in theory, unbounded, making an application of traditional decision tree approaches infeasible.

2.10 Robotics

In the robotics literature, SDE modeling bears some resemblance to Simultaneous Localization and Mapping (SLAM) techniques (see Chapters 10-13 of [155] for a good overview).
SLAM techniques typically rely fundamentally on human-designed geometric prior information (including a geometric formulation of state), features (or landmarks), and an array of specialized sensors, which may not always be available (e.g., in small modular or mobile robots). In ALFE, we do not have access to this prior information or geometric constraints that might simplify learning. Another important distinction is that, in Predictive SDE modeling, no localization is required, because the model is built entirely in terms of the agent's actions and observations. sPOMDP modeling does require localization, but this localization is informed by the results of executing automatically-designed SDEs rather than by monitoring geometric transformations or human-designed landmarks. In [104], a method is proposed for learning a model of unknown POMDP environments for robotic control purposes by clustering fixed-length sequences of actions and observations into states based on trajectory similarities. The length of the sequence that is needed for learning remains an important input (and one that is not required in PSBL or SDE modeling, because the agent automatically decides this length). In [67], Jonschkowski and Brock proposed the notion of robotic priors for learning task-dependent state representations (relative to a given reward function) from high-dimensional sensor data, which they successfully demonstrated on simulated and real-world navigation tasks. However, the underlying state space was assumed to be low-dimensional, which may not be true, e.g., in the case of a robotic manipulator with many degrees of freedom, and a reasonable approximation of the dimensionality of the underlying state space was provided to the robot as input.
While somewhat less related to the present work, SBL is, in general, related to the field of evolutionary robotics [105], in which populations of robotic modules or controllers are evolved over time using a fitness function that selects for better-performing robots. SBL is also related to the field of developmental robotics [10], which is concerned with the open-ended learning of new skills and knowledge over time by individual robots. See [117] for more details on the connection between SBL and developmental robotics.

2.11 Genetic Algorithms and Genetic Programming

Finally, we note that the key operations of splitting, refining, and merging used in PSBL for PSDEs (Chapter 5) bear some similarity to those involved in evolutionary computational strategies such as genetic algorithms [94] and the use of related techniques to automatically generate useful computer programs (often called genetic programming) [74]. It is particularly interesting to compare Predictive SDE modeling (Chapter 5) with genetic programming, given that the latter can effectively search the space of possible computer programs of various (even unbounded) sizes by representing them as recursively evaluated trees. However, there are crucial differences between these methods. First, the set of PSDEs in Predictive SDE models always partitions, in a mutually exclusive and exhaustive fashion, all possible agent histories, whereas the individuals in a population maintained by a genetic algorithm have no such relationship. Second, though there is randomness in the PSBL learning procedure due to the noisiness of the agent's probability estimates (and possibly inherent stochasticity in the environment itself), there is no analog of the pure random mutation procedure (a procedure crucial to genetic algorithms) in PSBL learning. Any such procedure would, with high probability, destroy the mutually exclusive and exhaustive nature of the set of PSDEs, making such a procedure largely ineffectual.
At each step in the PSBL learning algorithm, new PSDEs are derived from old ones in a principled manner based on surprise, in a way that preserves the mutually exclusive and exhaustive nature of the Predictive SDE model.

Chapter 3
Autonomous Learning from Deterministic Environments

In this chapter, we discuss previous Surprise-Based Learning (SBL, [143]) approaches to solving the classical version of the problem of autonomous learning from the environment (ALFE, [140, 142]), in which the environment (and the agent's model of this environment) are assumed to be fully deterministic. This dissertation provides important (and principled) generalizations of both the ALFE problem and SBL approaches to stochastic and partially-observable environments.

3.1 ALFE in Deterministic Environments

We begin this chapter by formalizing the problem of ALFE in deterministic environments (following Shen [142]). We then discuss some important aspects of Shen's Local Distinguishing Experiments (LDEs, [141]) and the Complementary Discrimination Learning (CDL, [141]) algorithm for actively and incrementally learning LDE representations of unknown deterministic (but partially-observable) environments. Understanding the key ideas of these approaches is important, because their limitations when applied to stochastic environments motivated our generalization of LDEs into Stochastic Distinguishing Experiments (SDEs, [38]) and our development of a novel SBL approach to learning SDE models called Probabilistic Surprise-Based Learning (PSBL, [38, 39]). SDEs and PSBL will be discussed in the next chapter (Chapter 4). First, it is important to revisit a few concepts from the literature on learning deterministic finite state machines (FSMs) from agent experience. For clarity, we note that FSMs are sometimes also referred to as deterministic finite automata (DFAs). In particular, we will focus on the problem of learning Moore machine [100] models of unknown Moore machine environments.
We will concentrate on Shen's formulation of Moore machine models and environments as defined in Chapter 4 of [142], which we now present. Shen defines a Moore machine model M of environment E as a tuple (A, P, S, δ, φ, t), in which:

- A is a finite set of agent actions.
- P is a finite set of percepts the agent can receive from its environment.
- S is a finite set of model states.
- δ is the transition function of model M, where δ : S × A → S.
- φ is the appearance (emission) function of the model, where φ : S → P.
- t is the current model state of M.

Shen assumes that the environment E is similarly defined as a tuple (A, P, Q, δ, φ, r), in which:

- A is a finite set of basic actions.
- P is a finite set of environment output symbols.
- Q is a finite set of environment states.
- δ is the transition function of environment E, where δ : Q × A → Q.
- φ is the appearance (emission) function of the environment, where φ : Q → P.
- r is the current environment state of E.

In Shen's work on LDEs [141, 142], he focuses on partially-observable Moore machine environments in which |Q| > |P|, a situation which implies that at least two environment states look identical to the agent (i.e., emit the same symbol p ∈ P). Note that, in order to successfully model such environments exactly, the agent will need to construct a representation of state S that is larger than P, and the model's emission function must be many-to-one, which means that the agent will need to use more than just single percepts to determine its model state. Furthermore, the agent has no a priori knowledge of how many model states to create, because it does not have access to |Q| or δ. Shen focuses on two key problems in such environments:

1. Model construction, which is the problem of deciding when and by what mechanism to construct new model states.
2. Model synchronization, which is the problem of determining the agent's current model state. This problem is also called localization.

When |Q| = |P|, it is easy to see that solving both of these problems becomes trivial.
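The tuple definitions above translate directly into a data structure. Below is a minimal illustrative sketch in Python; the class and attribute names are our own (not from the dissertation's implementation), and the same structure serves for both the model (A, P, S, δ, φ, t) and the environment (A, P, Q, δ, φ, r):

```python
from dataclasses import dataclass

@dataclass
class MooreMachine:
    """Sketch of Shen's Moore machine tuple (A, P, S, delta, phi, t)."""
    actions: set      # A: finite set of actions
    percepts: set     # P: finite set of percepts / output symbols
    states: set       # S (model) or Q (environment): finite set of states
    delta: dict       # transition function: (state, action) -> state
    phi: dict         # appearance (emission) function: state -> percept
    current: object   # t (model) or r (environment): current state

    def step(self, action):
        """Apply one action; return the percept emitted by the new state."""
        self.current = self.delta[(self.current, action)]
        return self.phi[self.current]

    def run(self, action_seq):
        """Execute an action sequence; return the emitted percept sequence."""
        return [self.step(a) for a in action_seq]
```

For example, a two-state machine with `delta = {(0, 'x'): 1, (1, 'x'): 0}` and `phi = {0: 'p', 1: 'q'}` alternates percepts under repeated `x`.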
The agent can build a perfect model by exploring its environment and directly recording its experience. Since the environment is deterministic, a single example of each transition is sufficient to build such a model. Furthermore, the agent can use P directly as a state space, because each environment state emits a unique symbol. The uniqueness of each environment state's appearance also solves the synchronization problem. In partially-observable environments, both of these problems are substantial. In [142], Shen defines the goal of autonomous learning from the environment (ALFE) in the context of deterministic environments as that of building a perfect and synchronized model M of environment E. Before we can formalize the notion of a perfect and synchronized Moore machine model, we need the following definition:

Definition 1. Deterministic visible equivalence. Let E be a Moore machine environment with current state r, and let M be a Moore machine model of E with current state t. We say that model state t and environment state r are visibly equivalent if φ(t) = φ(r).

Simply put, a model state and environment state are visibly equivalent in deterministic environments if they always emit the same observation (i.e., look identical to the agent). We can now formally define a perfect and synchronized Moore machine model:

Definition 2. Perfect and synchronized Moore machine model. Let E be a Moore machine environment, and let M be a Moore machine model of E. We say that M is a perfect and synchronized model of environment E if the current model state t is visibly equivalent to the current environment state r, and, for all possible action sequences b ∈ B of any length, the model state δ(t, b) remains visibly equivalent to environment state δ(r, b).

A perfect and synchronized model is one in which, once the agent has localized itself in this model, it can predict any future sequence of observations perfectly, given any sequence of actions of any length.
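Although Definition 2 quantifies over action sequences of unbounded length, it can be decided in finite time: only |S| × |Q| distinct (model state, environment state) pairs exist, so a breadth-first search over the reachable pairs of the product of model and environment suffices. A minimal sketch, with machines given as (delta, phi, start) triples (our own encoding):

```python
from collections import deque

def is_perfect_and_synchronized(model, env, actions):
    """Check Definition 2 by exploring the product of model and environment.

    model and env are (delta, phi, start) triples, where delta maps
    (state, action) to the next state and phi maps a state to its percept.
    The BFS visits every reachable (model state, env state) pair and fails
    if any pair violates visible equivalence (Definition 1).
    """
    (m_delta, m_phi, t0), (e_delta, e_phi, r0) = model, env
    frontier, seen = deque([(t0, r0)]), {(t0, r0)}
    while frontier:
        t, r = frontier.popleft()
        if m_phi[t] != e_phi[r]:          # visible equivalence violated
            return False
        for a in actions:
            pair = (m_delta[(t, a)], e_delta[(r, a)])
            if pair not in seen:
                seen.add(pair)
                frontier.append(pair)
    return True
```

Because the pair set is finite, termination is guaranteed even though the definition speaks of "any possible length".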
It will never be surprised by an observation that does not match its predictions. Angluin's L* algorithm [7] is primarily directed at solving the problem of model construction, whereas the extensions to L* proposed by Rivest and Schapire [119] primarily addressed the synchronization problem. The idea behind LDEs and the CDL algorithm is to solve both problems actively and incrementally under a unified framework (a predict - surprise - identify - revise loop, discussed below) that does not depend on a teacher to supply counter-examples and does not require that the agent be able to reset itself to a known initial state (as in L*).

3.2 Local Distinguishing Experiments (LDEs)

In this section, we will discuss Shen's Local Distinguishing Experiments [141, 142]. We will enhance the traditional presentation of these ideas with insights gleaned during the preparation of this dissertation. This will serve to motivate the need for extending these ideas into Stochastic Distinguishing Experiments (SDEs) [38] for modeling stochastic environments (Chapter 4), unify some shared aspects of LDEs and SDEs, and also clarify their rather stark differences.
As a minor technical note, when more than two model states emit the same percept, the result of executing an LDE may only be to exclude possible model states, rather than to uniquely determine the agent's model state, because, in the worst case, each LDE distinguishes only two model states. Below, we will discuss how LDEs can be used to solve the full synchronization problem. As an example of LDEs, consider the Shape environment illustrated in Figures 3.1 and 3.2 (as adapted by Shen [142], but originally from [119]), in which a robot exists in a simple 4-state Moore machine environment and observes one percept in states I and II and a second, distinct percept in states III and IV. The agent can execute the actions x and y, which have the effects illustrated in the figures. Note that |Q| = 4, while |P| = 2, so the environment is only partially observable. Assume that the robot's initial model of the environment M consists of model states corresponding exactly to its observations: one model state covering environment states I and II and another covering environment states III and IV (though these associations are initially unknown to the agent). The agent can localize itself in its model M by observing the current percept output by the environment. If this percept is, for example, the one emitted by states III and IV, the agent is localized in the corresponding model state; however, it may actually be in environment state III or IV (both of which are visibly equivalent to that model state).

Figure 3.1: An illustration of a one-action LDE in the Shape environment [142, 119].
Figure 3.2: An illustration of a two-action LDE in the Shape environment [142, 119].

As Figure 3.1 (left) demonstrates, if the robot executes action x from state III, it observes one final percept; in contrast, executing x from state IV results in a different final percept (Figure 3.1, right). Crucially, executing the same action sequence {x} from the same model state at different times resulted in different model states (and, consequently, different percept sequences). The agent can therefore infer the existence of two environment states that look identical but have different dynamics. These states can be uniquely represented by their outcome sequences under the LDE {x}. After executing the action sequence {x}, the agent can infer which of these states it must have been in when it saw the initial percept. Figure 3.2 illustrates how the two-action LDE {y, x} can be used to distinguish environment states I and II, which are both covered by the same model state. Executing the LDE {y, x} from environment state I results in one three-percept outcome sequence, whereas executing the same LDE from state II results in a different one. Again, the agent can infer the existence of two states with differing dynamics that emit the same observation and can use the LDE {y, x} to distinguish between them. Careful readers might have noticed that the suffixes of these outcome sequences are identical to the outcome sequences of the two environment states covered by the other model state (III and IV), as well as the fact that the suffix of this new LDE, {x}, is precisely the LDE used to generate the outcome sequences of states III and IV.
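The mechanics of the example can be reproduced with a small script. The exact percepts and transitions of the Shape environment live in Figures 3.1 and 3.2, so the concrete table below is a hypothetical 4-state machine of our own in the spirit of those figures (percepts 'S' for states I/II and 'D' for states III/IV are our labels):

```python
# Hypothetical Shape-like Moore machine; PHI/DELTA are our illustrative choice.
PHI = {'I': 'S', 'II': 'S', 'III': 'D', 'IV': 'D'}   # two states per percept
DELTA = {
    ('I', 'x'): 'II',   ('I', 'y'): 'III',
    ('II', 'x'): 'I',   ('II', 'y'): 'IV',
    ('III', 'x'): 'I',  ('III', 'y'): 'IV',
    ('IV', 'x'): 'III', ('IV', 'y'): 'III',
}

def outcome_sequence(state, lde):
    """Percepts observed when executing an LDE (action sequence) from
    `state`, including the percept of the starting state itself."""
    seq = [PHI[state]]
    for a in lde:
        state = DELTA[(state, a)]
        seq.append(PHI[state])
    return seq

# The one-action LDE {x} separates the two 'D'-looking states:
assert outcome_sequence('III', ['x']) != outcome_sequence('IV', ['x'])
# The two-action LDE {y, x} separates the two 'S'-looking states:
assert outcome_sequence('I', ['y', 'x']) != outcome_sequence('II', ['y', 'x'])
# Suffix pattern: a suffix of a {y, x} outcome is itself an {x} outcome.
assert outcome_sequence('I', ['y', 'x'])[1:] in (
    outcome_sequence('III', ['x']), outcome_sequence('IV', ['x']))
```

The final assertion illustrates the suffix relationship noted above: the {y, x} outcome sequences extend the {x} outcome sequences by one prepended percept.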
This is a general pattern that can be exploited to learn LDE representations of unknown Moore machine environments using an algorithm called CDL [141], as we now discuss.

3.3 Complementary Discrimination Learning (CDL)

Complementary Discrimination Learning (CDL) [138] is a procedure for active and incremental autonomous learning in unknown deterministic environments. CDL agents model their environments via prediction rules (specified in a formal logic-based language) that describe the expected results of their actions under various conditions. These conditions are defined in terms of the agent's percepts or, in some cases, its experience history. CDL is applicable to both fully-observable environments (see Chapter 4 of [142]) and partially-observable environments (see Chapter 5 of [142]) via an extension called Local Distinguishing Experiments (LDEs, [141], discussed in the previous section). We focus on this latter variant of CDL, as it is the one most relevant to the work in this dissertation. CDL operates as a loop consisting of the following primary steps:

1. Predict: The agent predicts the results of each of its actions based on its current set of prediction rules (the one whose condition is satisfied).
2. Surprise: The agent is surprised if the percept resulting from its action does not match its prediction. In a deterministic environment, this indicates to the agent that its model is not yet perfect and synchronized with respect to its environment.
3. Identify: The agent compares the current situation with those situations that previously satisfied the same prediction rule and led to a correct prediction in order to find the differences that explain the surprise. Such differences are guaranteed to exist in fully-observable and deterministic environments. In partially-observable and deterministic environments, the agent must search some (provably finite) number of steps backwards in its experience history to find such differences.
4.
Revise: The agent uses these differences in condition or experience history to modify its prediction rules such that the encountered surprise will not occur again. Intuitively, in order to achieve this, the agent either splits one prediction rule into multiple prediction rules or refines its prediction rules by making their conditions more specific.

The specifics of the Revise process depend on the application, and there are several variants of CDL and other surprise-based learning (SBL) approaches that operate differently. However, these specifics are not particularly important to the discussion here. Please see [142] for more information about classical CDL approaches to revision and [117, 118] for an extension of these revision approaches to robotic learning and learning with continuous observations. Figures 3.3 and 3.4 illustrate two examples of the predict - surprise - identify - revise loop of the CDL algorithm corresponding to a robot in the Shape environment of Figure 3.1. In Figure 3.3, the CDL robot begins with an initial model, M0, that treats each percept p ∈ P as a model state and directly records the transitions between these model states that it encounters. Note that, even though the number of environment states may be greater than the number of distinct percepts (i.e., |Q| > |P|), such a model may be perfect and synchronized if all the underlying environment states covered by the same percepts (model states) have identical dynamics in model space. However, that is not the case in the Shape environment.

Figure 3.3: An illustration of the predict - surprise - identify - revise loop of CDL as applied to a robot in the Shape environment of Figure 3.1.
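The predict - surprise - identify - revise cycle can be sketched in a few lines. The model below is a deliberate toy simplification of CDL's logic-based prediction rules (one rule per (percept, action) condition, with a crude overwrite standing in for Identify/Revise); it is ours, not the dissertation's implementation, and serves only to show where surprises arise in fully-observable versus partially-observable environments:

```python
import itertools

class TableModel:
    """Toy stand-in for CDL prediction rules: for each (percept, action)
    condition, remember the percept that followed last time."""
    def __init__(self):
        self.rules = {}
        self.surprises = 0

    def update(self, percept, action, observed):
        predicted = self.rules.get((percept, action))    # 1. Predict
        if predicted is not None and predicted != observed:
            self.surprises += 1                          # 2. Surprise
        self.rules[(percept, action)] = observed         # 3/4. crude revision

def run_loop(model, delta, phi, state, actions, steps):
    """Drive the predict-surprise-identify-revise cycle for `steps` steps,
    cycling through the given actions in a deterministic Moore machine."""
    percept = phi[state]
    for action in itertools.islice(itertools.cycle(actions), steps):
        state = delta[(state, action)]
        model.update(percept, action, phi[state])
        percept = phi[state]
    return model.surprises

# Fully observable: one example per transition suffices; no surprises recur.
assert run_loop(TableModel(), {('A', 'x'): 'B', ('B', 'x'): 'A'},
                {'A': 'a', 'B': 'b'}, 'A', ['x'], 20) == 0

# Partially observable: states P and Q share percept 's' but behave
# differently under x, so percept-conditioned rules keep failing. This
# recurring surprise is exactly the signal CDL uses to split a model state.
assert run_loop(TableModel(), {('P', 'x'): 'R', ('Q', 'x'): 'P', ('R', 'x'): 'Q'},
                {'P': 's', 'Q': 's', 'R': 't'}, 'P', ['x'], 12) > 0
```

In the second machine, no rule table keyed only on the current percept can ever stop being surprised; resolving this requires new model states, which is what the LDE-based splitting in Figures 3.3 and 3.4 provides.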
In the top right of Figure 3.3, we see that the robot first experienced and recorded a transition out of one of its model states under action x. Thus, the next time it encounters that model state and executes action x, it predicts that it will encounter the percept it previously recorded. If we study the Shape environment in Figure 3.1, we can see that the robot must have been in environment state III when it first encountered and recorded this transition. When the robot later executes x from environment state IV (which is covered by the same model state), its prediction will not match the actual environment output, and the robot will be surprised (see the bottom right corner of Figure 3.3). In this case, executing the same action (x) from the same model state results in a predicted percept sequence and an actual observation sequence that differ. The robot has identified that this model state is actually incorrectly representing two environment states with differing dynamics, so it splits the model state into two latent states, labeled 0 and 1 in the figure, each uniquely identified by the outcome sequence it produces when the action sequence {x} is executed (see the bottom left of Figure 3.3). The action sequence {x} is called a Local Distinguishing Experiment (LDE) for the two latent environment states that it distinguishes (III and IV, in this case). Note that localization is now required in the model, because 0 and 1 are latent states that can only be disambiguated by a specific action. If the robot's history includes only observing the shared percept and then executing the action y, it could be in either 0 or 1. We will discuss how CDL solves this problem in a moment. The model resulting from this split, M1, is shown in the top left corner of Figure 3.3. Note that the agent is now uncertain about the result of executing x from model state 1 and the result of executing y from the other (unsplit) model state. The robot must experiment further to determine whether these actions lead to 0 or 1.
At the moment, it only knows that these actions lead to states that are visibly equivalent to the split percept. Figure 3.4 continues this example of CDL learning in the Shape environment, beginning this time from model M1. The robot experiments more in its environment, recording an experienced transition from latent state 1 to latent state 0 under x, and a transition from the unsplit model state to latent state 0 under y. The transition from 1 to 0 under x is correct and will not lead to future surprises (there are only two environment states visibly equivalent to that percept). However, by studying the Shape environment again, we can see that the transition from the unsplit model state to 0 under y must have been experienced from environment state I. Note that, in order to determine that this transition into 0 occurred, the full action sequence must have been {y, x}, because, otherwise, the robot would not have been able to conclude that it ended up in 0 as opposed to 1. When the robot later executes {y, x} from environment state II (which is still covered by the unsplit model state), the predicted outcome sequence will not match the actual outcome sequence, and the robot will again be surprised.

Figure 3.4: An illustration of the second predict - surprise - identify - revise loop of CDL as applied to a robot in the Shape environment of Figure 3.1.

It is important to understand that this means that the predicted transition under action y (written in terms of model states rather
than percepts) led into latent state 0, while the actual observed transition was into latent state 1. Thus, again, we see that underlying environment states covered by the same model state exhibit different dynamics in the model. The robot has now identified that the unsplit model state is mistakenly representing multiple environment states and must be split. We form a new LDE for distinguishing these states, {y, x}, by prepending the action that caused the surprise (y) to the beginning of the LDE distinguishing 0 and 1 ({x}), the two different model states being transitioned into from different underlying environment states covered by the unsplit state. The outcome sequences of these new states are similarly formed by prepending the shared percept to the outcome sequences of 0 and 1. This results in two new latent model states, again labeled 0 and 1 in the figure, each with its own three-percept outcome sequence. Once this change is applied to the model, it will induce some uncertainty into some of the transitions, but the robot can learn them through further experimentation. This will result in model M2, which is a perfect model of the Shape environment. The above example demonstrated the model construction process of CDL, but it did not address the problem of model synchronization (localization). In [142], Shen proves that the concatenation of the actions of the current LDEs in the order of their creation always results in a homing sequence [131] for the current model:

Definition 3. Homing sequence. A sequence of actions f is called a homing sequence for Moore machine M if, for every pair of states u, v ∈ S, δ(u, f) ≠ δ(v, f) implies φ(u, f) ≠ φ(v, f), where δ(·, f) denotes the state reached by executing f and φ(·, f) denotes the observation sequence generated along the way.

A homing sequence h is a sequence of actions such that the observation sequence generated by executing h in the model (from any model state) uniquely identifies the final resulting model state. At the beginning of each predict - surprise - identify - revise iteration, CDL first localizes the agent using these homing sequences before proceeding to execute actions and respond to surprises. In [140, 141], Shen proves that the model output by CDL is probably approximately correct [160].
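Definition 3 can be checked directly by enumerating state pairs: an action sequence is homing exactly when no two distinct final states share the same observation sequence. A minimal sketch (the encoding of machines as delta/phi dicts is our own; we include the percept emitted after each action in the observation sequence):

```python
from itertools import product

def execute(delta, phi, state, seq):
    """Return (final state, observation sequence) for executing seq."""
    obs = []
    for a in seq:
        state = delta[(state, a)]
        obs.append(phi[state])
    return state, obs

def is_homing_sequence(delta, phi, states, seq):
    """Definition 3: distinct final states must produce distinct
    observation sequences, so the observations uniquely identify the
    resulting state."""
    for u, v in product(states, repeat=2):
        fu, ou = execute(delta, phi, u, seq)
        fv, ov = execute(delta, phi, v, seq)
        if fu != fv and ou == ov:
            return False
    return True
```

For instance, in a two-state machine whose states emit distinct percepts, the single action {x} is already homing; if both states emit the same percept, it is not, since the observation sequence can no longer identify the final state.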
Chapter 4
Autonomous Learning from Stochastic Environments

In this chapter, we generalize Shen's original formulation of autonomous learning from the environment (ALFE) as laid out in [142] (which applied only to deterministic environments) such that it is applicable to both deterministic and stochastic environments (represented as rewardless POMDPs). We then discuss the limitations of previous Surprise-Based Learning (SBL, [143, 138, 117, 118]) approaches (which were designed primarily to address this original formulation of ALFE) when applied to stochastic and partially-observable environments. To overcome these limitations, we generalize key theory and definitions in the SBL literature and derive SBL approaches that model noise in a principled fashion using probabilistic modeling techniques well suited to stochastic and partially-observable environments. We end this chapter by summarizing our key contributions to the SBL literature.

4.1 ALFE in Stochastic Environments

Recall that Shen [142] defines ALFE as the problem of constructing a perfect and synchronized Moore machine model of an unknown Moore machine environment (see Chapter 3.1). In this work, we focus on building perfect models of unknown rewardless (discrete) Partially-Observable Markov Decision Process (POMDP) [68] environments. We restrict our attention to such environments in order to make important theoretical progress while still addressing the key challenges that arise when attempting to learn in unknown environments that exhibit both stochasticity and partial-observability. There are several equivalent formulations of POMDPs. For the purposes of this work, we define a rewardless, discrete POMDP environment E as a 5-tuple (Q, A, T, O, Ω), where:

- Q is a discrete and finite set of latent states.
- A is a discrete and finite set of actions the agent can take in its environment.
- T is a set of transition probabilities satisfying the Markov property [88]: P(q′ | q, a), for all (q, a, q′) ∈ Q × A × Q.
In words, the probability distribution over the next state can be determined according to the current state and the action taken by the agent in that state.
- O is a discrete and finite set of agent observations.
- Ω is a set of observation probabilities satisfying the (sensor) Markov property: P(o | q), for all (o, q) ∈ O × Q. In other words, the probability distribution over possible observations is defined completely by the current state.

Recall also that Shen [142] defines a perfect and synchronized model M as one in which, once the agent has localized itself in this model, it can predict any future sequence of observations perfectly, given any sequence of actions of any length (see Definition 2 in Chapter 3.1). This definition is clearly inappropriate in stochastic environments, because it is possible for the agent to have perfect knowledge of its environment (i.e., M = E) and still make incorrect predictions about future observations. In particular, this occurs when the Bayes error rate [54] of the environment is greater than 0 (which means that even the classification performance of an oracle in this environment is imperfect). Crucial to Shen's definition of a perfect model is the notion of visible equivalence [142] (see Definition 1 in Chapter 3.1). In deterministic environments with discrete observations, a model state and environment state are visibly equivalent if they always emit the same observation. In stochastic environments, this type of visible equivalence is not particularly useful, because a mismatch between expected and actual observations can occur even when the agent has a perfect model of its environment. The Predictive State Representation (PSR) literature [84] solves this problem by recognizing that, if the agent knows the probabilities of all possible sequences of observations it could encounter (for any given sequence of actions), it knows everything there is to know about its environment.
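The rewardless POMDP tuple defined above translates into a simple generative process: sample the next state from T, then sample an observation from Ω at the new state. A minimal sketch (the class and the dict-of-dicts encoding of T and Ω are our own illustrative choices, not a standard API):

```python
import random

class RewardlessPOMDP:
    """Generative sketch of the 5-tuple (Q, A, T, O, Omega).  T[q][a] and
    Omega[q] map next states / observations to probabilities."""
    def __init__(self, T, Omega, q0, seed=0):
        self.T, self.Omega, self.q = T, Omega, q0
        self.rng = random.Random(seed)

    def _draw(self, dist):
        # Sample one key of `dist` with probability proportional to its value.
        keys = list(dist)
        return self.rng.choices(keys, weights=[dist[k] for k in keys])[0]

    def step(self, action):
        """Sample q' ~ T(. | q, a) (Markov property), then o ~ Omega(. | q')
        (sensor Markov property); return the observation."""
        self.q = self._draw(self.T[self.q][action])
        return self._draw(self.Omega[self.q])
```

With deterministic distributions (all probabilities 0 or 1), this reduces to stepping a Moore machine, anticipating the reduction noted below in connection with Definitions 4 and 5.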
We solve this problem slightly differently by recognizing the following: if the agent, for any history h_t of actions and observations up to time t of any length (and excluding the observation at time t), can always correctly generate the probability distribution over its next observation given that history, P(O_t | h_t), it knows everything there is to know about its environment. Thus, we can generalize the notion of visible equivalence to stochastic environments as follows:

Definition 4. Stochastic visible equivalence. Let E be a rewardless, discrete POMDP environment and let M be a model of this environment. Let h_t represent the history of the agent's actions and observations in its environment up to time t (but excluding the observation at time t), and assume that M can generate the probability distribution over possible observations at time t, P_M(O_t | h_t), for any given agent history h_t of any length. Let b_{E,t} represent a distribution over possible environment states q ∈ Q at time t (before an observation at time t is made) under history h_t. We say that environment E and model M are stochastically visibly equivalent (M ≡ E) at time t if, for all o ∈ O, Σ_{q ∈ Q} P_E(o | q) b_{E,t}(q) = P_M(o | h_t).

Thus, environment E and model M are stochastically visibly equivalent at time t if they induce the same (history-dependent) probability distribution over possible observations at time t. Note that, since the environment E is a rewardless POMDP, a belief distribution over possible states at time t is a sufficient statistic of agent history. We can marginalize out the state at time t to get the ground-truth probability distribution over possible observations at time t, conditioned on the agent's actions and observations up to time t (but excluding the observation at time t), h_t. The particular form of the model M is left intentionally vague in Definition 4. Different SDE
Dierent SDE 46 models generate P (O t jh t ) in dierent ways, and stochastic visible equivalence is independent of the way in which these distributions are computed or represented. In fact, Denition 4 applies equally well to non-SDE environment models. Stochastic visible equivalence allows us to extend Shen's denition of perfect Moore machine models of Moore machine environments [142] (see also Denition 2 in Chapter 3.1) to rewardless POMDP environments: Denition 5. Perfect models of rewardless POMDP environments. LetE be a rewardless, discrete POMDP environment, and letM be a model ofE that is, initially, stochastically visibly equivalent toE. We say thatM is a perfect model ofE if, for all action sequences of any length k, f = (a 1 ;:::;a k ),M andE remain stochastically visibly equivalent after f is executed in both M andE. In other words, ifM is a perfect model of discrete rewardless POMDPE andME initially, M will remain stochastically visibly equivalent toE regardless of the subsequent actions executed by the agent. It is important to note that Denitions 4 and 5 apply to both deterministic and stochastic environments. If a discrete rewardless POMDP has fully deterministic transition and observation probability distributions, it reduces to a Moore machine 1 , and Denitions 4 and 5 reduce to Denitions 1 and 2, respectively. These generalized denitions keep the goal of ALFE the same in stochastic and partially-observable environments as it is in deterministic environments: the agent wishes to build a perfect model of its environment. Thus far, we have not discussed the problem of the agent localizing itself in a probabilistic model of a stochastic and partially-observable environment. In the previous chapter, we dis- cussed the fact that CDL utilizes homing sequences (Denition 3, Chapter 3.3) to provide perfect localization in deterministic environments. 
This ensures that prediction failures are always the result of modeling errors (and never the result of ambiguity in the agent's current model state). In stochastic and partially-observable environments, perfect localization is impossible in general, and the agent must model its residual uncertainty in model state in a principled fashion that allows for effective learning. Localization in the face of stochasticity and partial observability has been extensively studied (see [132, 156] for good overviews). We make use of some of these approaches to localize agents in SDE models that learn latent state spaces (see Chapter 6), although doing so requires some novel extensions.

[1] More precisely, it is trivial to construct a Moore machine from a discrete, rewardless POMDP with fully deterministic transition and observation probability distributions.

One limitation of previous SBL approaches to ALFE in stochastic environments, including Ranasinghe's work [117, 118] and Shen's CDL algorithm [138] and extensions [141, 142], is the lack of a formal metric for quantifying the discrepancy between the agent's model M and its environment E. In this work, we call this model error. In particular, when these previous SBL approaches are applied to stochastic and partially-observable environments, this makes it difficult to demonstrate that model revisions in response to prediction failures are actually making the model a better representation of its environment (as opposed to mistakenly modeling noise). The definition of stochastic visible equivalence (Definition 4) suggests how we might construct such a metric: we can compute the distance between Σ_{q∈Q} P_E(O_t | q) b_{E,t}(q) and P_M(O_t | h_t) (that is, the probability distribution over the next observation as specified by the environment, i.e., ground truth, and by the model, respectively) for possible agent histories and take some statistic of these errors (e.g., the mean) as a measure of the expected per-step model error of M with respect to E.
Unfortunately, computing the model error this way would require considering an infinite number of agent histories. However, we can approximate this modeling error (denoted Ê_{M,E}) to an arbitrarily high degree of accuracy by performing T random actions in both environment E and model M (beginning with M and E being stochastically visibly equivalent) and computing:

Ê_{M,E} = (1/T) Σ_{t=1}^{T} sqrt( Σ_{o∈O} ( P_M(o | h_t) − Σ_{q∈Q} P_E(o | q) b_{E,t}(q) )^2 )    (4.1)

Equation 4.1 provides an estimate of the average per-step observation probability error of M with respect to environment E. The observation probability error at time t is defined as the L2 norm of the vector Σ_{q∈Q} P_E(O_t | q) b_{E,t}(q) − P_M(O_t | h_t), which is a vector in R^|O|. Note that Ê_{M,E} ≥ 0, and a perfect model is one such that Ê_{M,E} = 0 after an infinite number of random actions. The problem of ALFE in rewardless, discrete POMDP environments, a problem which we call Stochastic ALFE, reduces to finding a model, M*, that minimizes an infinitely precise measure of Equation 4.1:

M* = argmin_M E_{M,E}    (4.2)

In practice, of course, we may set T to be a very large number and simply try to find a model M̂ which minimizes the following:

M̂ = argmin_M Ê_{M,E}(T)    (4.3)

Note that we have specified in Equation 4.3 that Ê_{M,E}(T) is a function of the number of simulation time steps T. We have thus defined Stochastic ALFE as a minimization problem (albeit a very difficult one to compute, even approximately). We will now turn to discussing the key limitations of previous SBL approaches when applied to this Stochastic ALFE problem.
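Equation 4.1 can be computed directly once the per-step observation distributions are available. The sketch below uses hypothetical distributions for a two-observation environment and assumes the belief-weighted environment distributions have already been marginalized out as in Definition 4:

```python
import numpy as np

def model_error(env_obs_dists, model_obs_dists):
    """Empirical estimate of Equation 4.1: the mean, over T steps, of the
    L2 distance between the environment's ground-truth next-observation
    distribution and the model's predicted one.
    Both arguments: T x |O| arrays of per-step observation distributions."""
    diffs = np.asarray(model_obs_dists) - np.asarray(env_obs_dists)
    return np.mean(np.linalg.norm(diffs, axis=1))

# Hypothetical two-observation run of T = 2 steps.
env   = [[0.8, 0.2], [0.5, 0.5]]
model = [[0.6, 0.4], [0.5, 0.5]]
print(model_error(env, env))    # 0.0: a perfect model
print(model_error(env, model))  # ~0.1414: mean of ||(-0.2, 0.2)|| and 0
```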
4.2 Previous SBL Approaches and Stochastic ALFE

The first SBL approaches, including Complementary Discrimination Learning (CDL, [138]) and its extension to partially-observable environments using Local Distinguishing Experiments (LDEs, [141]), are crucially reliant on the fact that, in fully deterministic environments, the existence of two (or more) distinct observation sequences resulting from executing the same sequence of actions from the same model state (in our work, we call this a prediction failure so as to distinguish it from the generalized definition of surprise that we present in Section 4.3) necessarily implies the existence of hidden environment states with differing dynamics that are covered by the same model state. It is easy to see that this property does not hold in general in stochastic environments. Executing the same sequence of actions starting at the same underlying environment state two different times may result in different observation sequences. Likewise, sensor noise may result in identical observation sequences being generated from underlying environment states that have radically different dynamics and observation probabilities. In stochastic and partially-observable environments, individual prediction failure events do not give the agent sufficient information to determine whether or how it should modify its model to better represent its environment. CDL and LDEs also depend on some mechanism (e.g., homing sequences [131]) for perfect localization, such that the agent always knows with certainty what model state it is in before making predictions. This requirement is vital, because it ensures that surprises are always the result of an incorrect model (as opposed to the agent's uncertainty as to its current model state). In stochastic environments, localization is still a vital concept, but perfect localization is unobtainable in general.
The agent will almost always have residual uncertainty regarding its actual model state and must still make progress on improving its model in spite of this uncertainty. Indeed, a number of the key theoretical underpinnings of CDL and LDEs do not naturally generalize to stochastic environments and probabilistic models of those environments. Ranasinghe's work on SBL [117, 118] took some steps toward accommodating stochasticity (and, to a lesser extent, partial observability), but his extensions to CDL, though effective, were largely ad hoc and difficult to formally justify. The work laid out in this dissertation represents the first formal generalization of the key theory behind SBL approaches to stochastic and partially-observable environments. This allows us to derive novel SBL approaches that handle stochasticity and partial observability in a principled fashion with probabilistic modeling techniques. All previous SBL approaches have relied on formal logic-based prediction rules and, in practice, suffer from a high degree of overfitting in stochastic environments. This is primarily because, in these previous approaches (e.g., [138, 141, 117, 118]), every prediction failure leads to a modification of the agent's model. The prediction failures resulting from the stochastic nature of the agent's environment (in practice, these types of prediction failures can be quite common) lead to the creation of uselessly complex prediction rules that attempt, unsuccessfully, to capture this noise. In [117, 118], Ranasinghe presents some mechanisms that address this issue to some degree, including the forgetting of poorly performing rules and predictions softened with pre-defined tolerance bounds, but these extensions are largely ad hoc. Furthermore, even with these extensions, the problem of overfitting remains substantial, because environment noise is not modeled in a principled fashion.
Nevertheless, we show, in the next few sections, that the key ideas behind SBL can be usefully generalized in a principled way to stochastic environments.

4.3 Surprise in Stochastic Environments

Model error (Equation 4.1) cannot be estimated directly by the agent, because the agent does not have access to the ground-truth observation probability distributions of environment E. The key unifying idea behind our Stochastic Distinguishing Experiment (SDE) and Probabilistic Surprise-Based Learning (PSBL) approaches in this dissertation is that the agent can effectively use surprise as a proxy for this modeling error, such that reducing surprise (typically) causes a corresponding reduction in model error. Assume that the agent's model M is composed of a finite set of D conditional probability distributions, P(R_1 | c_1), P(R_2 | c_2), ..., P(R_D | c_D), where c_1, ..., c_D are mutually exclusive and exhaustive model conditions, and R_1, ..., R_D are random variables representing the results of satisfying those conditions in environment E. The distributions over possible results are different, in general, under each condition c_i, but the set of possible values R_i can take on is the same for each of R_1, ..., R_D. These conditions c_i may be, for example, history contexts or state-action pairs, while P(R_i | c_i) might represent, for example, the distribution over the next observation or model state, given the condition that has been satisfied. These probability distributions are used to compute P_M(O_t | h_t) for any agent history h_t of any length. The total surprise of model M, denoted S(M), is defined as follows:

S(M) = Σ_{i=1}^{D} w_i H(R_i | c_i)    (4.4)

In the above equation, H(R_i | c_i) refers to the entropy associated with P(R_i | c_i) and w_i is a weight associated with this entropy value.
As a technical detail, we normalize each H(R_i | c_i) by dividing it by the maximum possible entropy of a probability distribution over |R_i| events (i.e., that of a uniform distribution over |R_i| events). This ensures that each entropy value is in the range [0, 1]. We require that Σ_{i=1}^{D} w_i = 1, such that Equation 4.4 forms a weighted average of the normalized entropies of the conditional probability distributions defining M. This implies that S(M) is also in the range [0, 1]. We call the quantity expressed in Equation 4.4 surprise because, in our work, each P(R_i | c_i) is trained according to agent experience and is used to make testable predictions about future events. When these distributions are fit to agent experience, the larger the value of S(M), the less sure the agent is about the results of its actions. When S(M) is very close to 1, the agent is almost completely uncertain about the results of its actions: any particular resulting observation or state is essentially equally likely under all model conditions. It will be surprised very frequently by observations that do not match its predictions (assuming, of course, that the agent makes predictions according to the most likely results of its actions). When S(M) is very close to 0, the agent has almost complete confidence about the results of its actions. It will rarely be surprised by results that do not match its predictions. Though we do not, at this point, have a formal proof that reducing model surprise (Equation 4.4) necessarily results in a reduction in model error (Equations 4.2 and 4.3), the relationship between the two is borne out in our extensive experimental results spanning a broad range of POMDP environments (see Chapter 8).
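The normalization and weighted average above can be sketched as follows; the distributions and weights in the example are hypothetical:

```python
import math

def normalized_entropy(dist):
    """Shannon entropy of a discrete distribution, divided by the entropy
    of a uniform distribution over the same number of outcomes, so that
    the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist)) if len(dist) > 1 else 0.0

def total_surprise(dists, weights):
    """Equation 4.4: weighted average of normalized entropies; the
    weights are assumed to sum to 1."""
    return sum(w * normalized_entropy(d) for d, w in zip(dists, weights))

# Hypothetical model with two conditions, weighted equally.
print(normalized_entropy([0.5, 0.5]))  # 1.0: maximal uncertainty
print(normalized_entropy([1.0, 0.0]))  # zero: fully confident
print(total_surprise([[0.98, 0.02], [0.51, 0.49]], [0.5, 0.5]))
```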
Empirically, reducing surprise almost always leads to a corresponding reduction in model error. Intuitively, such a result makes sense: we should expect that a model that enables the agent to make more accurate and confident predictions about future observations is more closely representing the observation probabilities induced by environment E.

[Figure 4.1: An illustration of the Shape environment in which the robot's actions and observations are subject to noise, as defined by α and ε: intended transitions succeed with probability α, each of the three unintended transitions occurs with probability (1 − α)/3, and the correct observation is seen with probability ε (the incorrect one with probability 1 − ε). This environment can be modeled as a rewardless POMDP.]

4.4 Stochastic Distinguishing Experiments

In order to illustrate the fundamental ideas of Stochastic Distinguishing Experiments (SDEs, [38]), which are an extension of Shen's Local Distinguishing Experiments (LDEs, [141], discussed in detail in Chapter 3.2) to stochastic and partially-observable environments, we return to our Shape environment example from the previous chapter. Recall that, in the Shape environment, the robot moves between states I, II, III, and IV via the actions x and y. It observes □ in states I and II and ◇ in states III and IV (see Figure 4.1). We will consider a stochastic version of this environment in which the agent's actions have the intended effect with probability α. With probability (1 − α)/3, the robot transitions (mistakenly) into one of the three other possible states. Similarly, the robot's sensor is noisy. It sees the correct observation in each state with probability ε. With probability 1 − ε, it sees the incorrect observation. This is also illustrated in Figure 4.1.
The key idea behind Stochastic Distinguishing Experiments (SDEs) is the following: even though it may be possible for an agent to observe any sequence of observations after executing any sequence of actions in a stochastic environment (regardless of which environment state it begins at), some sequences of observations are much more likely than others (depending on the starting state), and these differences in most likely observation sequences under the same action sequences can be used to statistically distinguish states that look identical to the agent. The more deterministic the environment is (i.e., the closer α and ε are to 1), the more useful these observation sequences are for this statistical disambiguation.

[Figure 4.2: An example of executing the one-step Stochastic Distinguishing Experiment (SDE) {x} from state III in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence, {◇, □}, with P({◇, □} | x) = 0.9736. The semi-transparent arrows indicate transitions due to noise that may likely cause the second observation sequence, {◇, ◇}, with P({◇, ◇} | x) = 0.0164, which is most likely from state IV.]

[Figure 4.3: An example of executing the one-step Stochastic Distinguishing Experiment (SDE) {x} from state IV in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely observation sequence, {◇, ◇}, with P({◇, ◇} | x) = 0.9736. The semi-transparent arrows indicate transitions due to noise that may likely cause the second observation sequence, {◇, □}, with P({◇, □} | x) = 0.0164, which is most likely from state III.]

We argue that many important real-world environments are largely deterministic but infused with a manageable amount of stochasticity and partial observability. Robotic navigation [3, 155], manipulation [80], and SLAM [27], for example, are crucially reliant on our ability to model the world kinematically and dynamically and to effectively model (typically low-level) sensor and actuation noise. Real-world environments typically exhibit important and relatively consistent structure that can be modeled and used by robotic systems to achieve their tasks. For these reasons, we argue that the key underlying principles of SDEs are useful and valid. To make these ideas more concrete, we will consider an example in the stochastic Shape environment of Figure 4.1 in which we assume that α = ε = 0.99. This environment is both stochastic and partially-observable, and we can represent it as a rewardless POMDP. Recall from Section 3.2 that, in the deterministic Shape environment, the LDE {x} perfectly disambiguated states III and IV. Executing x from state III always resulted in the observation sequence {◇, □}, whereas executing x from state IV always resulted in {◇, ◇}. When α = ε = 0.99, it is possible for the robot to see either observation sequence when it executes x after observing a ◇. However, by marginalizing out the possible state sequences, we can observe how the probabilities of each of these observation sequences change depending on whether the robot begins at state III (see Figure 4.2) or state IV (see Figure 4.3)².
The probability of observing {◇, □} after executing x from state III is no longer 1 (as it was in the deterministic Shape environment), but it is still very high (Figure 4.2). Importantly, it is much more probable that the robot observes {◇, □} than {◇, ◇}.

[Figure 4.4: An example of executing the two-step Stochastic Distinguishing Experiment (SDE) {y, x} from state I in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely three-observation outcome sequence, which occurs with probability P = 0.9544. The semi-transparent arrows indicate transitions due to noise that may likely cause the second outcome sequence, which occurs with probability P = 0.0193 and is most likely from state II.]

[Figure 4.5: An example of executing the two-step Stochastic Distinguishing Experiment (SDE) {y, x} from state II in the stochastic Shape environment of Figure 4.1 (with α = ε = 0.99). The opaque arrows exemplify transitions that would lead to the most likely outcome sequence from state II, which occurs with probability P = 0.9544. The semi-transparent arrows indicate transitions due to noise that may likely cause the outcome sequence most likely from state I, which occurs with probability P = 0.0193.]

[2] To prevent clutter in the illustrations of Figures 4.2, 4.3, 4.4, and 4.5, only example transitions that would lead to the given observation sequences with maximal probability are shown. In reality, the probabilities of these observation sequences are computed by marginalizing over every possible state sequence the robot may take under the given action(s).
Similarly, the probability of observing {◇, ◇} after executing x from state IV is very high, while the probability of observing {◇, □} is very low (Figure 4.3). A single experiment no longer provides enough information for the robot to conclude with certainty which environment state it was in before executing {x}, but, crucially, if there were some way for the robot to repeat this experiment multiple times from the same environment state, it could become very certain about the starting state of its experiment. The robot does not know this state by name, but it can use the consistent results of this experiment as a unique predictive label for this underlying environment state. Through repeated experimentation, the robot can learn to statistically distinguish between identical-looking states with different predictive labels (and, by definition, distinct dynamics under the same action sequences). We therefore call such action sequences (e.g., {x} in the stochastic Shape environment) Stochastic Distinguishing Experiments or SDEs. More formally:

Definition 6. Stochastic Distinguishing Experiment (SDE). For each state q ∈ Q in rewardless POMDP environment E, define b_q as a belief state that assigns probability 1 to being in state q. Define:

g_q = argmax_{o_1,...,o_{k+1}} Σ_{q_2,...,q_{k+1}} P(o_{k+1} | q_{k+1}) P(q_{k+1} | q_k, a_k) ⋯ P(o_2 | q_2) P(q_2 | q, a_1) P(o_1 | q)    (4.5)

In words, g_q is the most probable sequence of k + 1 ordered observations the agent will encounter upon executing actions (a_1, ..., a_k), in order, beginning at state q (with probability 1). We call this the length-(k + 1) outcome sequence of state q. If the ordered sequence of actions (a_1, ..., a_k) has the property that, when executed from belief states b_{q_1}, ..., b_{q_d}, g_{q_i} ≠ g_{q_j} ∀ i ≠ j, then (a_1, ..., a_k) is called a k-length Stochastic Distinguishing Experiment (SDE) for states q_1, ..., q_d.
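As an illustration of Equation 4.5 for k = 1, the following sketch marginalizes over next states in the stochastic Shape environment. The intended targets of action x (e.g., III to I) are assumptions made for illustration; the resulting sequence probabilities match the values shown in Figures 4.2 and 4.3 and do not depend on which same-class state is the intended target:

```python
# Stochastic Shape environment with alpha = epsilon = 0.99. States I and II
# emit a square; III and IV emit a diamond. The specific intended targets of
# action x below are assumptions for illustration only.
ALPHA, EPS = 0.99, 0.99
STATES = ['I', 'II', 'III', 'IV']
GLYPH = {'I': 'square', 'II': 'square', 'III': 'diamond', 'IV': 'diamond'}
INTENDED_X = {'III': 'I', 'IV': 'III'}  # assumed targets of action x

def trans(q, q_next):
    """P(q_next | q, x): alpha for the intended target, (1-alpha)/3 otherwise."""
    return ALPHA if q_next == INTENDED_X[q] else (1 - ALPHA) / 3

def obs(o, q):
    """P(o | q): epsilon for the state's true glyph, 1-epsilon otherwise."""
    return EPS if o == GLYPH[q] else 1 - EPS

def seq_prob(start, o1, o2):
    """P(o1, o2 | start, action x): marginalize over the next state."""
    return obs(o1, start) * sum(trans(start, q2) * obs(o2, q2) for q2 in STATES)

print(round(seq_prob('III', 'diamond', 'square'), 4))   # 0.9736 (Figure 4.2)
print(round(seq_prob('III', 'diamond', 'diamond'), 4))  # 0.0164
print(round(seq_prob('IV', 'diamond', 'diamond'), 4))   # 0.9736 (Figure 4.3)
```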
Intuitively, the larger the probabilities of g_{q_i} and g_{q_j} given actions (a_1, ..., a_k), the more useful this SDE is in statistically disambiguating states q_i and q_j, because g_{q_i} will necessarily have a low probability of occurring starting from belief state b_{q_j} and g_{q_j} will necessarily have a low probability of occurring starting from belief state b_{q_i}. Figures 4.4 and 4.5 continue this example of SDEs in the stochastic Shape environment by considering the SDE {y, x} for states I and II.

[Figure 4.6: The run loops of Complementary Discrimination Learning [138] (left: predict - surprise - identify - revise) and Probabilistic Surprise-Based Learning [38, 39] (right: experiment - surprise - identify - revise).]

4.5 Probabilistic Surprise-based Learning

In this section, we provide an overview of our novel procedure for learning SDE models of unknown rewardless POMDP environments called Probabilistic Surprise-Based Learning (PSBL, [38, 39]). There are several variants and extensions of PSBL that we discuss in detail in the coming chapters. In this section, we provide a unifying overview of PSBL approaches and discuss the key ideas underlying them. We contrast PSBL with traditional SBL approaches like Complementary Discrimination Learning (CDL, [138]) and its extensions in works such as [141, 117]. In Chapter 3.3, we discussed how the CDL algorithm operates in a predict-surprise-identify-revise loop in which the agent continuously makes predictions, compares those predictions with actual observations, uses mismatches between its predictions and reality (prediction failures) in order to identify faulty model components, and, finally, revises its model based on these mismatches such that it better represents its environment. PSBL operates in a somewhat similar experiment-surprise-identify-revise loop (see Figure 4.6), but the surprise, identify, and revise procedures are radically different in PSBL than they are in CDL:

1. Experiment: The agent performs active experimentation in its environment. It executes the SDEs it currently has in its model that are applicable to the current environment situation, as well as random actions. The results of these experiments are used to train the probability distributions P(R_1 | c_1), P(R_2 | c_2), ..., P(R_D | c_D) defining the agent's model M (see Section 4.3). These experiments are also used to train a number of one-step extension probability distributions of the form P(R_i | h_i, c_i), where h_i represents one more step of agent history (before c_i). In general, each R_i will have multiple such one-step extension distributions (representing the different values that h_i can take on), the set of which we denote as {P(R_i | h_i, c_i)}.

2. Surprise: The agent computes the total surprise of its model (Equation 4.4). This requires that the agent also compute the normalized entropy of each model distribution P(R_1 | c_1), P(R_2 | c_2), ..., P(R_D | c_D). The agent also computes the normalized entropy of each trained one-step extension distribution in {P(R_i | h_i, c_i)} for each R_i.

3. Identify: The agent searches through the distributions P(R_i | c_i) defining its model to find one such that the weighted average of the normalized entropies of its one-step extension distributions {P(R_i | h_i, c_i)} is significantly smaller than the normalized entropy of P(R_i | c_i). Crucially, this indicates that c_i is not a sufficient statistic of agent history (and, thus, not a sufficient representation of state, because the Markov property is violated), and condition c_i needs to be revised in some way to correct this.

4. Revise: Once an insufficient condition c_i has been identified, this condition is split into multiple (typically more specific) conditions that consider one additional step of agent history (i.e., the various values that h_i can take on). The specifics of the revision process depend on the type of SDE model being learned.
This typically causes additional probability distributions to be added to model M and may also cause an expansion in the set of possible values each R_i can take on.

Recall that the purpose of a state (or a belief state) in an environment that satisfies the Markov property [88] is to act as a sufficient statistic of agent history. In other words, the conditional probability of the agent's future experience in the environment is independent of its past experience given its current state (or belief state, in the context of POMDPs [68]). The key idea of PSBL learning is that the agent actively experiments in its environment to test for violations of the Markov property. The set of conditions c_1, ..., c_D in an SDE model M can be understood as an implicit or explicit approximate representation of state (which is derived using the SDEs of model M, as we discuss later). If condition c_i is truly acting as a sufficient statistic of agent history, then adding an additional step of agent history h_i prior to c_i being satisfied should not alter the probability distribution P(R_i | c_i), because R_i and h_i should be conditionally independent given c_i. However, in reality, c_i is only an approximate sufficient statistic of agent history. If we find that P(R_i | h_i, c_i) is significantly different than P(R_i | c_i), we know that we need to modify c_i such that it is a better approximate sufficient statistic of agent history³. In particular, in the identify step, PSBL looks for situations in which adding an additional step of history h_i to c_i causes the agent's predictions P(R_i | h_i, c_i) to become more confident (i.e., normalized entropy is reduced), when a weighted average is taken over the various possible values that h_i can take on. When such a situation is encountered, PSBL splits c_i into more specific conditions that (hopefully) form a better approximate sufficient statistic of agent history.
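The identify test can be sketched numerically. The base distribution, one-step extensions, and weights below are hypothetical; a gain close to 1 is the signal that the condition should be split:

```python
import math

def normalized_entropy(dist):
    """Entropy of a discrete distribution divided by log(#outcomes)."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist))

def gain(base_dist, extension_dists, extension_weights):
    """Reduction in normalized entropy obtained by conditioning on one
    extra step of history h before condition c (the identify test)."""
    avg_ext = sum(w * normalized_entropy(d)
                  for d, w in zip(extension_dists, extension_weights))
    return normalized_entropy(base_dist) - avg_ext

# Hypothetical case: P(R | c) looks like a coin flip, but each one-step
# history extension is nearly deterministic -- strong evidence that c
# violates the Markov property and should be split.
g = gain([0.5, 0.5], [[0.98, 0.02], [0.02, 0.98]], [0.5, 0.5])
print(round(g, 3))  # 0.859: a large positive gain
```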
At the end of the PSBL procedure (we will discuss termination criteria in detail in future chapters), the agent has a set of conditions c_1, ..., c_N that, taken together, form a task-independent, approximate (implicit or explicit) representation of state.

[3] We note that, though we can detect some violations of the Markov property this way, it is not necessarily true that all violations of the Markov property can be detected this way. If P(R_i | h_i, c_i) ≠ P(R_i | c_i), we know that the Markov property must be violated. However, it is not necessarily true that the Markov property is upheld if P(R_i | h_i, c_i) = P(R_i | c_i). It may very well be that adding additional time steps of history eventually causes the Markov property to be violated. In other words, P(R_i | h_i, c_i) = P(R_i | c_i) is a necessary condition for the Markov property to be upheld, but it is not sufficient. Our work on sPOMDPs (see Chapter 6) addresses this problem to some extent by considering one previous model state/action pair rather than one previous observation/action pair. An argument could be made that this implicitly tests adherence to the Markov property on longer sequences of history, as model states are defined by the results of executing SDEs (which may have a length greater than 1). In future work, we may consider more sophisticated tests for violations of the Markov property. See, e.g., [31] for work on testing for the Markov property in time series data.

The probability distributions associated with these conditions, P(R_i | c_i), encode the dynamics of this environment and tell the agent how its observations are expected to change in response to its actions. In particular, the agent can use its learned model M to approximate the history-dependent probability of any future sequence of observations given any sequence of actions. This learned model can then be used for additional prediction, planning, or decision-making tasks, as we demonstrate in future chapters.
High-level pseudocode for PSBL learning is provided in Algorithm 1.

Algorithm 1: PSBL Learning
Input: numActions: number of actions to perform; explore: probability of executing random actions; A: actions; O: observations
Output: M: SDE model of unknown environment E
 1  Initialize model M with one model state per observation o ∈ O
 2  foundSplit := true
 3  while foundSplit do
 4      for i ∈ {1, ..., numActions} do
 5          if policy.empty() then
                /* Add actions of an SDE to the policy or random actions */
 6              policy := updatePolicy(explore)
 7          action := policy.pop()
 8          observation := E.takeAction(action)
            /* Train model distributions on experiment results */
 9          M.updateModelParameters(action, observation)
        /* Compute model and transition surprise */
10      M.computeSurprise()
        /* Try to find violations of Markov property on which to split M */
11      foundSplit := M.trySplit()
12  return M

The PSBL learning algorithm operates as a single large while loop (lines 3-11). The agent maintains a policy which tells it the action to execute next. If policy is empty, then it is filled with either the actions of a current SDE in model M (one that is consistent with the current environment situation) or a series of random actions of equal length (lines 5-6). The probability with which policy is updated with the actions of an SDE (rather than random actions) is controlled by the explore hyperparameter. Typically, the value of explore does not have a significant impact on learning, and it is usually set to 0.5 (such that SDE actions and random actions are equally likely). In lines 4-9, the agent performs active experimentation in its environment by executing numActions (another hyperparameter) actions according to its policy. The current parameters of model M (i.e., the probability distributions P(R_i | c_i) and their one-step extension distributions) are updated in line 9 according to the action taken and the resulting observation from the environment.
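A minimal, runnable sketch of Algorithm 1's control flow follows. The StubEnvironment and StubModel classes and their behaviors are hypothetical stand-ins: a real SDE model would estimate the distributions P(R_i | c_i) in updateModelParameters, evaluate Equation 4.4 in computeSurprise, and test for Markov violations in trySplit.

```python
import random

class StubEnvironment:
    """Hypothetical stand-in for the unknown environment E: returns a
    random observation in response to each action."""
    def __init__(self, observations):
        self.observations = observations
    def take_action(self, action):
        return random.choice(self.observations)

class StubModel:
    """Hypothetical stand-in for an SDE model M: records experience and
    allows a fixed number of splits, mimicking Algorithm 1's termination."""
    def __init__(self, observations, max_splits=2):
        self.states = list(observations)  # one model state per observation
        self.experience = []
        self.splits_left = max_splits
    def update_model_parameters(self, action, observation):
        self.experience.append((action, observation))
    def compute_surprise(self):
        pass  # would evaluate Equation 4.4 over the model's distributions
    def try_split(self):
        if self.splits_left > 0:
            self.splits_left -= 1
            return True
        return False

def psbl(env, model, actions, num_actions=100, explore=0.5):
    """Skeleton of Algorithm 1. In a full implementation, `explore` would
    choose between SDE actions and random actions when refilling the
    policy; this stub policy is always a single random action."""
    found_split = True
    while found_split:
        policy = []
        for _ in range(num_actions):
            if not policy:
                policy = [random.choice(actions)]  # stand-in for updatePolicy
            action = policy.pop()
            observation = env.take_action(action)
            model.update_model_parameters(action, observation)
        model.compute_surprise()
        found_split = model.try_split()
    return model

m = psbl(StubEnvironment(['square', 'diamond']),
         StubModel(['square', 'diamond']), actions=['x', 'y'])
print(len(m.experience))  # 300: three passes of 100 actions each
```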
The agent then updates its total model surprise S(M) and the normalized entropies of the probability distributions P(R_i | c_i) defining M (along with their one-step extensions) in line 10. Finally, in line 11, the agent searches for violations of the Markov property in its current model M (as described above) and splits the model if it finds such a violation. This causes additional probability distributions to be added to M and, in some cases, expands the number of possible values that each random variable R_i can take on. foundSplit is set to true if such a violation is found and the model is, in fact, split. Otherwise, foundSplit takes on the value false and the algorithm terminates.

[Figure 4.7: An illustration of the experiment - surprise - identify - revise loop of PSBL as applied to an agent in the stochastic Shape environment of Figure 4.1 with α = ε = 0.99. The initial model M0 has one state per observation; its estimated transition distributions include P({□, ◇} | ◇, x) ≈ {0.51, 0.49}, with normalized entropy 0.997, total surprise S(M0) = 0.33, and gain G(◇, x) = 0.84, prompting a split of ◇ into ◇0 and ◇1.]

Figures 4.7 and 4.8 demonstrate two iterations of the main PSBL learning run loop on the stochastic Shape environment with α = ε = 0.99. For this example, we concentrate on SDE models (e.g., sPOMDPs, see Chapter 6) that infer explicit representations of latent environment state. In the next chapter (Chapter 5), we will demonstrate how SDEs can be used in a purely predictive fashion such that they estimate the conditional probability of future observations directly without explicitly modeling latent environment structure. In Figure 4.7 (top right), we see that the agent begins with an SDE model M0 that treats each of its observations as a model state.
In other words, it assumes that its observations do, in fact, satisfy the Markov property (and thus form a sufficient statistic of agent history). Note that, in this case, M0 has mutually exclusive and exhaustive conditions c_1 = {□, x}, c_2 = {□, y}, c_3 = {◇, x}, and c_4 = {◇, y}. The set of possible results of satisfying these conditions is simply the set of current model states, {□, ◇}; R_1, ..., R_4 can take on values in this set of possibilities. After performing active experimentation (Figure 4.7, top right), the agent has estimates of the probability distributions over its next model state, given any current model state and action (i.e., P(R_i | c_i) for i ∈ {1, ..., 4}). Though not shown in this figure, the agent also estimates the probabilities of the one-step extensions of each P(R_i | c_i), as discussed above. The agent then computes the total surprise of its model, S(M0), which equals 0.33 in this case (Figure 4.7, bottom right). This indicates that the agent is still (on average) quite unsure about the results of its actions. The agent also calculates the normalized entropies of each of its model transition probability distributions P(R_i | c_i). It finds relatively low levels of entropy on all these transition distributions, with the exception of P({□, ◇} | ◇, x), which is extremely close to 1 (the maximum possible value of normalized entropy). The reason that this entropy occurs is vital to understanding PSBL. When the agent observes ◇, it is very likely to be in either state III or state IV (though, due to the agent's imperfect sensor, this is not guaranteed). When x is executed from state III, the most likely result is that the agent will see a □, whereas, from state IV, the most likely result is that the agent will see a ◇ after executing x.
If the agent is doing a sufficient job of exploration in its environment and a sufficient job of executing SDEs consistent with its current observations, we should expect that the agent will end up in state III and state IV with about equal probability and, thus, will execute x from state III and state IV about the same number of times. The agent will also occasionally execute x after a mistaken observation in states I or II, but these events should manifest as low levels of noise, because α and ε are close to 1. Since the agent has not yet distinguished between the two environment states that are covered by the same observation, their differing dynamics manifest as entropy in the agent's transition probabilities. In other words, the high normalized entropy in this transition distribution is due primarily to the fact that two underlying environment states (III and IV) covered by the same model state transition, with high probability, to environment states currently covered by visibly distinct model states. Intuitively, the agent's model state space is not correctly capturing the underlying state space of its environment. How does the agent determine that this entropy is likely the result of incorrect modeling rather than environment noise? In Figure 4.7 (bottom left), the agent uses its one-step extension distributions to compute the gain G (i.e., reduction in normalized entropy) that occurs when the agent considers one additional model-state/action pair of history before observing the model state and executing x (averaged over all possible values of this model-state/action pair). Gain is always less than or equal to 1, and a large positive gain (0.84, in this case) indicates that considering an additional step of history makes the agent much more certain about the resulting model state. In this case, the offending model state is not acting as a good approximate sufficient statistic of agent history (the Markov property is egregiously violated).
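The entropy and gain quantities in this example can be reproduced with a short sketch. This is our own illustrative code, not the dissertation's implementation; `normalized_entropy` and `gain` are hypothetical helper names, and the distributions are the ones shown in Figure 4.7.

```python
import math

def normalized_entropy(dist):
    """Shannon entropy of a discrete distribution, normalized to [0, 1]
    by dividing by the log of the alphabet size."""
    nonzero = [p for p in dist if p > 0]
    if len(dist) <= 1 or len(nonzero) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in nonzero)
    return h / math.log(len(dist))

def gain(h_current, one_step_dists, weights):
    """Average reduction in normalized entropy obtained by conditioning on
    one additional model-state/action pair of history."""
    h_ext = sum(w * normalized_entropy(d) for d, w in zip(one_step_dists, weights))
    return h_current - h_ext

# The near-uniform transition distribution from Figure 4.7.
h = normalized_entropy([0.51, 0.49])   # about 0.997, as in the figure
# Conditioning on one more step of history yields two peaked distributions,
# here weighted equally.
g = gain(h, [[0.98, 0.02], [0.03, 0.97]], [0.5, 0.5])
# g is about 0.83, close to the gain of 0.84 reported in the figure
```

A gain near zero would instead indicate that the residual entropy is environment noise rather than a Markov property violation.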
To remedy this, the agent splits the offending model state into two new latent model states. The outcome sequence of each new state is formed by concatenating the split state's observation and the action executed (x) to the outcome sequence of one of the two model states most likely to be transitioned into. The agent has thus inferred the existence of two latent environment states covered by a single model state. These outcome sequences are the observation sequences most likely to be observed upon executing the SDE {x} after making the split state's observation.

Figure 4.8: An illustration of the second experiment - surprise - identify - revise loop of PSBL as applied to an agent in the stochastic Shape environment of Figure 4.1 with α = ε = 0.99.

The agent now has a new, three-state model M1, and it must again experiment to estimate the transition probabilities between these three model states. Note that, in this case, both the number of conditions c_i (and associated distributions P(R_i | c_i)) in the agent's model and the possible values that each R_i can take on have increased in model M1. Note also that the agent now faces a non-trivial localization problem, because it needs to consider multiple steps of history in order to determine its most likely model state.
The problem of learning model-state-to-model-state transition probabilities is now also substantially more difficult, because the agent must consider the possible model state transitions that could have occurred based on its current model belief state and the observation sequences resulting from executing its actions. These issues are discussed in detail in Chapter 6.

Figure 4.8 continues this example for one additional experiment-surprise-identify-revise loop of PSBL. The agent again finds substantial normalized entropy along one of its transitions (though it should be noted that S(M1) < S(M0)) and finds a substantial reduction in normalized entropy (gain) on average when considering one additional model-state/action pair of history. This time, a different model state clearly violates the Markov property, and the agent splits it into two new latent model states. The outcome sequences of these model states are formed by concatenating the observation of the insufficient model state and the action that led to the Markov property violation (y) to the outcome sequences of the two most likely model states to be transitioned into. After one additional PSBL loop (not visualized), the agent is unable to find any gain along any model transitions, and all of its transition distributions have low normalized entropy. Thus, the procedure terminates. S(M2) is also substantially lower than S(M1). The resulting model M2 is a probabilistic equivalent of the LDE model of the deterministic Shape environment learned by CDL in Chapter 3.3.
The ratio of the predictive accuracy of this model to that of an oracle in the same environment that knows the underlying environment perfectly (we call this relative accuracy) is approximately 0.996, and its model error (averaged over 10,000 time steps) is approximately 0.02, meaning that M2 induces nearly the same history-dependent observation probability distributions as the ground-truth stochastic Shape environment.

4.6 Key Contributions to Surprise-Based Learning

In this section, we summarize the key contributions of the work laid out in this dissertation to the Surprise-Based Learning (SBL, [143, 138, 117]) literature. We feel that this is important, given that this work is a generalization of previous SBL approaches.

1. A generalization of the problem of autonomous learning from the environment (ALFE), the problem on which SBL techniques are most typically applied, to stochastic and partially-observable environments. This included developing generalizations of key ALFE concepts (that were only applicable in deterministic environments), including the definitions of visible equivalence and perfect models. These generalizations allowed these concepts to become useful in stochastic environments (while still remaining applicable to deterministic environments). We also developed an error metric to quantify the discrepancy between an SBL agent's model and its environment. These generalizations enabled us to frame stochastic ALFE in a novel and useful way as a minimization of model error.

2. A principled generalization of the definition of surprise, the key concept in SBL, to stochastic environments. This generalized definition of surprise facilitates the use of probabilistic techniques within the SBL framework to model noise in a principled fashion.

3.
A generalization of Local Distinguishing Experiments (LDEs, [141]) for deterministic environments called Stochastic Distinguishing Experiments (SDEs, [38]), which can be applied in a principled way to both stochastic and deterministic (partially-observable) environments. SDEs are notable for being the first probabilistic models in the SBL literature.

4. A novel family of probabilistic SBL learning procedures called Probabilistic Surprise-Based Learning (PSBL [38, 39]) which operate by seeking to minimize model surprise (as a proxy for model error). PSBL is the first SBL algorithm to adopt a statistical approach to eliminating surprises. Previous SBL algorithms responded to every prediction failure by modifying the agent's model.

Chapter 5

Predictive SDE Modeling

In this chapter, we discuss how Stochastic Distinguishing Experiments (SDEs, [38]) can be used as a purely predictive modeling technique (similar, in some ways, to Predictive State Representations, or PSRs [84]) that directly approximates the history-dependent probability distribution over future observations for any given sequence of agent actions. These models are called Predictive SDE (PSDE) models. Importantly, PSDE models approximate the probabilities of future observations without explicitly modeling latent environment structure. In this way, they can be understood as providing an implicit (approximate) representation of environment state (and dynamics).

5.1 An Alternative Predictive Formulation of SDEs

Recall that Shen's Local Distinguishing Experiments (LDEs, [141], discussed in detail in Chapter 3.2) can be understood as ordered sequences of actions and expected observations that disambiguate states sharing the same observation that have inconsistent dynamics. Two states q_1, q_2 ∈ Q in environment E covered by the same LDE model state m have inconsistent dynamics if there exists an action a such that the states reached by executing a from q_1 and q_2 are covered by distinct model states m'_1 and m'_2.
LDE model states m'_1 and m'_2 are associated with distinct outcome sequences (see the discussion on Complementary Discrimination Learning [138], or CDL, in Chapter 3.3). It is often the final observations of these outcome sequences that are distinct(1), because, in CDL, shorter LDEs are used to incrementally (and recursively) construct longer LDEs.

Figure 5.1: An illustration of the Local Distinguishing Experiments (LDEs [141]) and their outcome sequences learned by Shen's Complementary Discrimination Learning (CDL [138]) algorithm in the deterministic Shape environment (originally from [119] and adapted by Shen in [142]).

Figure 5.1 provides an example of this in the deterministic Shape environment from the previous chapter. States III and IV are disambiguated by the LDE {x}, leading to distinct two-element outcome sequences for State III and State IV. Note that these outcome sequences differ in their final observations. Similarly, States I and II are disambiguated by the LDE {y, x}, leading to distinct three-element outcome sequences for States I and II. Again, these outcome sequences differ in their final observations because they are recursively constructed from the outcome sequences for States III and IV.

(1) A difference in final observation is typical in many environments, but it is not necessarily guaranteed. In general, if the states reached by executing a from q_1 and q_2 are covered by distinct model states m'_1 and m'_2, we know that m'_1 and m'_2 must have distinct outcome sequences (for, otherwise, they would not be distinct model states). However, the difference between the outcome sequence of m'_1 and that of m'_2 may, in general, occur at any point. This is simply a motivating example intended to convey the intuition behind our use of the predictive SDE modeling techniques laid out in this chapter.
When learning a set of LDEs, CDL assumes that executing the same action sequence from the same environment state any number of times will result in the same observation sequence, and it uses the historical differences in trajectories originating in different states to create and modify LDEs (see Chapter 3.3). As discussed in detail in the previous chapter, stochastic transitions break this fundamental assumption, because executing the same sequence of actions multiple times from the same environment state may result in different observation sequences (and, thus, different final observations). In stochastic and partially-observable environments, LDEs are no longer sufficient to form a principled model of the agent's environment. One way to overcome this limitation of LDEs in stochastic environments is to soften them into history-dependent probability distributions over possible final observations. In other words, we transform each LDE into a probability distribution over possible final observations, given that the actions and observations of the agent were consistent with that LDE through its final action. Intuitively, the more peaked this distribution is on one observation, the more like an LDE it is, and, thus, the more useful it is for localizing the agent in its environment so that it can predict future observations accurately. We call these softened LDEs Predictive Stochastic Distinguishing Experiments (PSDEs). Interestingly, though LDEs are used by CDL to discover latent structure in the agent's environment that allows the agent to better predict future observations, it turns out that PSDEs can be used in a purely predictive fashion to directly approximate the history-dependent probability of future observations without needing to directly model latent environment structure (thus forming an approximate, predictive representation of E's state and dynamics).
In the next chapter, we discuss how a slightly different formulation of Stochastic Distinguishing Experiments (SDEs [39]) can also be used to model latent environment structure (analogously to the way in which LDEs model latent environment states). Formally, a Predictive SDE (PSDE) is defined as follows:

Definition 7. Predictive Stochastic Distinguishing Experiment (PSDE). A Predictive Stochastic Distinguishing Experiment (PSDE) of length k (denoted v_k) is a time-invariant conditional probability distribution over the next observation given a finite, variable-length ordered sequence of actions and expected observations up to the present:

v_k = P(O_{t+1} | a_{t-k+1}, o_{t-k+2}, ..., a_{t-1}, o_t, a_t)    (5.1)

The sequence of actions and observations behind the conditioning bar is called the experiment of the PSDE, denoted e_{v_k}, and we say that e_{v_k} covers trajectories through the environment E ending in the (ordered) k actions and k - 1 observations matching those in e_{v_k}. k is called the length of the PSDE and its experiment. We say that an experiment succeeds if executing its actions in order results in the expected ordered sequence of observations up to time t. PSDEs can also have compound experiments, in which case the agent has a choice of actions to perform at one or more time-steps, with each choice corresponding to a set of one or more possible expected observations at the next time-step. Such a PSDE is called a compound PSDE (in contrast to a simple PSDE as defined in Equation 5.1). Note that PSDEs can also be defined such that the agent also considers an observation at time t - k + 1 (as opposed to beginning with an initial action at time t - k + 1 and making an observation first at time t - k + 2). The definition in Equation 5.1 simplifies the notation of the theoretical analyses and proofs presented in Chapter 7, but the results apply to either form of the definition of a PSDE.
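Definition 7's notions of covering and success can be illustrated with a small sketch. This is our own code, not the dissertation's implementation; the observation tokens `o1` and `o2` are hypothetical stand-ins for the environment's pictorial observation symbols.

```python
def experiment_succeeds(experiment, history):
    """Return True if the PSDE experiment matches the tail of the agent's
    history. An experiment is a list alternating actions and expected
    observations, beginning and ending with an action (Definition 7);
    a compound step is given as a set of allowed symbols. The history is
    the full alternating action/observation sequence, ending with the
    action just executed at time t."""
    if len(history) < len(experiment):
        return False
    tail = history[-len(experiment):]
    for expected, actual in zip(experiment, tail):
        allowed = expected if isinstance(expected, (set, frozenset)) else {expected}
        if actual not in allowed:
            return False
    return True

# A 2-action simple PSDE experiment shaped like Equation 5.2:
# action y, then an expected observation, then action x.
simple = ["y", "o1", "x"]
assert experiment_succeeds(simple, ["x", "o2", "y", "o1", "x"])
assert not experiment_succeeds(simple, ["x", "o2", "y", "o2", "x"])

# A compound step offers a choice of actions at one time-step.
compound = [{"x", "y"}, "o1", "x"]
assert experiment_succeeds(compound, ["y", "o1", "x"])
```

In a PSDE model (Section 5.2), mutual exclusivity guarantees that exactly one such check succeeds for any sufficiently long history.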
As a specific example, consider the stochastic version of the Shape environment in Figure 5.2 (first introduced in Chapter 4) in which the transitions marked by the arrows have probability α = 0.925, with the remaining probability mass divided equally amongst transitions to the other 3 states, and ε = 1.0 (such that the agent's sensor is perfect). An example 2-action simple PSDE generated using the PSDE learning algorithm of Section 5.3.1 is:

Figure 5.2: An illustration of the Shape environment in which the robot's actions and observations are subject to noise, as defined by α and ε. This environment can be modeled as a rewardless POMDP.

P(O_{t+1} = {·, ·} | y, ·, x) = {0.386, 0.614}    (5.2)

In words, in this environment, the probability that the observation at time t + 1 is the first of the agent's two observations, given that the agent's history ends by executing the action y at time t - 1, making a particular expected observation at time t, and executing action x at time t, is 0.386. The probability of the other observation is therefore 0.614. See Figure 5.3 for an illustration of this PSDE and the agent histories that it covers. An example 4-action compound PSDE from a different execution of the algorithm in Section 5.3.1 is (see Figure 5.4 for an illustration):

P(O_{t+1} = {·, ·} | y, ·, {x, y, y}, {·, ·, ·}, y, ·, x) = {0.308, 0.692}    (5.3)

Figure 5.3: An illustration of the simple PSDE defined in Equation 5.2.

The semantics of this PSDE are similar to those of the simple PSDE except that, at time t - 2, the agent has a choice among three action/observation pairs: performing action x with one expected observation at time t - 1, or performing action y with either of two expected observations at time t - 1.
Figure 5.4: An illustration of the compound PSDE defined in Equation 5.3 and the possible agent histories that it covers.

5.2 PSDE Models

A set of N PSDEs M is called a Predictive SDE model if it has the property that the experiments of the simple and compound PSDEs that make up M partition the space of possible histories the agent could encounter in a mutually exclusive and exhaustive fashion. In the parlance of Chapter 4.3, the experiments of these PSDEs form the set of mutually exclusive and exhaustive model conditions c_1, ..., c_N; each R_i can take on values from the set of possible agent observations O; and each P(R_i | c_i) is a probability distribution over possible observations at time t + 1, given the agent's experience up to time t. Thus, it is often simpler to denote the distributions in a PSDE model as P(O | c_i) for i ∈ {1, ..., N} (where c_i is understood to be a context of agent history up through time t), since these distributions are time-invariant. If D is the length of the longest PSDE in M, then the last D actions and D - 1 observations the agent experiences will uniquely satisfy one (and only one) of the conditions c_i defining model M. In other words, the experiment of exactly one PSDE will succeed, and the associated probability distribution P(O | c_i) directly specifies the probability of any observation the agent might encounter at the next time step. Since the agent always knows which model condition is satisfied (provided it keeps D time steps' worth of history), a formal localization procedure is not required. Indeed, in such a predictive model, localization is not a useful concept, because there is no latent model state space.
In Chapter 6, we will discuss SDE models that do maintain such a latent state space and, as such, do require that the agent localize itself within its model. Recall that the primary unifying concept of the Predictive State Representation (PSR) literature [84] is that, if the agent knows the probabilities of all possible sequences of observations it could encounter (for any given sequence of actions under any agent history), it knows everything there is to know about its environment. We can use PSDE models to approximate the probability of any sequence of observations for any sequence of agent actions of any length (following any agent history h of at least D actions and D - 1 observations) as follows:

P(o_1, ..., o_k | h, a_1, ..., a_k) = ∏_{i=1}^{k} P(o_i | c_i)    (5.4)

In Equation 5.4, c_i is the model condition satisfied by the D previous actions and D - 1 previous observations encountered by the agent when computing the probability of observation o_i. The similarity in motivation between PSRs and PSDE models is why we call the latter a predictive modeling approach. (It should be noted, however, that there are significant differences between PSRs and PSDE models. Please see Chapter 2.8 for more details.) PSDE models directly predict the observations resulting from agent actions without attempting to model underlying latent environment structure. In this way, PSDE models form an implicit and approximate representation of environment state and dynamics.
Figure 5.5: An illustration of a PSDE model for the stochastic Shape environment of Figure 5.2 with α = ε = 0.99. The left side (a) shows the probability distributions defining the PSDE model. The right side (b) shows how these PSDEs can be organized into a tree of agent history that stretches backward in time.

Figure 5.5 illustrates an example PSDE model for the stochastic Shape environment of Figure 5.2 in which α = ε = 0.99. For simplicity, we assume that the agent's history begins with an observation at time t - k + 1 (rather than beginning with an action). This model consists of 10 PSDEs (and thus 10 history-dependent conditional probability distributions over the agent's two possible observations), which are enumerated in Figure 5.5 (a). Note that the majority of these probability distributions are very peaked on a single observation, indicating that the agent is very sure about the results of satisfying the associated model conditions (PSDE experiments). Three of these PSDEs correspond to model conditions that are exceedingly rare, and their probability estimates remain very close to their initial values of {0.5, 0.5}. This leaves only one PSDE whose covered trajectories have high probability but whose distribution has very high entropy (the agent is very unsure about the next observation). However, by expanding that PSDE's history one time-step into the past, the agent was still able to increase its ability to accurately predict future observations. We will demonstrate how the agent makes these determinations in the next section.
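Equation 5.4's chaining of PSDE predictions can be sketched as follows. This is our own minimal illustration: the model is a dictionary from experiments (tuples of actions and observations, read oldest to newest) to next-observation distributions, and `o1`/`o2` are hypothetical observation tokens rather than the Shape environment's actual symbols.

```python
def match(model, history):
    """Return the distribution of the unique PSDE whose experiment succeeds
    on the current history suffix (mutual exclusivity guarantees at most one)."""
    for experiment, dist in model.items():
        if list(experiment) == history[-len(experiment):]:
            return dist
    raise KeyError("no PSDE covers this history")

def sequence_probability(model, history, actions, observations):
    """Approximate P(o_1, ..., o_k | h, a_1, ..., a_k) as the product of the
    matched PSDEs' next-observation probabilities (Equation 5.4)."""
    prob = 1.0
    h = list(history)
    for a, o in zip(actions, observations):
        h.append(a)
        prob *= match(model, h).get(o, 0.0)   # P(o_i | c_i)
        h.append(o)
    return prob

# A toy two-PSDE model whose experiments condition only on the last action.
toy = {
    ("x",): {"o1": 0.9, "o2": 0.1},
    ("y",): {"o1": 0.2, "o2": 0.8},
}
p = sequence_probability(toy, ["o1"], ["x", "y"], ["o1", "o2"])
# p = 0.9 * 0.8 = 0.72
```

A real PSDE model would contain variable-length simple and compound experiments, but the chaining logic is the same: after each predicted observation, the satisfied condition c_i is recomputed from the updated history.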
Figure 5.5 (b) shows how this PSDE model can be organized into a context tree of agent history that extends backward in time. In Section 5.3.2, we demonstrate how such a representation can be used to efficiently update the counters of multiple, variable-length PSDEs to reflect the results of agent experiments and to efficiently identify the current model condition c_i based on the agent's previous D time-steps of history. Note the one-to-one correspondence between the PSDEs in Figure 5.5 (a) and the leaves of the PSDE tree in Figure 5.5 (b). In the next section, we discuss how to actively learn PSDE models of unknown environments directly from agent experience.

5.3 PSBL for Learning Predictive SDE Models

In this section, we detail our Probabilistic Surprise-Based Learning (PSBL, see Section 4.5) approach for actively and incrementally learning PSDE models of unknown partially-observable and stochastic environments (assumed to be formulated as rewardless POMDPs [68]). We will begin with our original formulation of PSBL for learning PSDE models [38] (Section 5.3.1), which incrementally learned increasingly complex simple and compound PSDEs in order to maximize the agent's predictive accuracy. We then demonstrate how this procedure can be reformulated as one that minimizes surprise and also made significantly more efficient with a trie-based implementation (Section 5.3.2). This reformulation serves to unify this approach with the hybrid latent-predictive SDE modeling and learning approaches detailed in Chapter 6 and the general PSBL framework for minimizing surprise introduced in Chapter 4.5.

5.3.1 PSBL Learning by Maximizing Predictive Accuracy

We originally proposed PSBL in [38], in which it was formulated as a procedure that sought to maximize an agent's predictive accuracy in its environment.
In this original formulation, which we call Max Predict, the agent continuously made predictions about the results of its actions and gathered statistics about the demonstrable predictive accuracy of each of the PSDEs in its model. The term surprise was used in this work to indicate statistical events that caused the agent to conclude that modifying its model in certain ways was likely to increase the overall predictive accuracy of that model. In Section 5.3.2, we demonstrate how this procedure can be reformulated as a procedure that minimizes model surprise as defined in Chapter 4.3 (thus unifying it with the generalized formulation of PSBL in Chapter 4.5).

Algorithm 2: PSDE Learning by Maximizing Accuracy (Max Predict)
Input:
  A: a discrete set of actions the agent can perform
  O: a discrete set of observations the agent can observe
  sgain, rgain, mgain: gain parameters
  numExp: number of experiments to perform
  convergeTol: number of times each PSDE must try splitting before convergence
Output:
  M: a Predictive SDE model
 1  Function LearnPSDEs()
 2    M := {P(O | null)};
 3    successCounts := {P(O | null): 0};
 4    while successCounts[v] < convergeTol for one or more v ∈ M do
 5      select random PSDE, v, with successCounts[v] < convergeTol;
 6      if v is a compound PSDE with multiple possible first actions then
 7        v := RefineBySurprise(v, rgain)                  /* Section 5.3.1.3 */
 8      for v_o ∈ {M \ v} do
 9        {didMerge, v} := MergeBySurprise(v_o, v, mgain)  /* Section 5.3.1.4 */
10        if didMerge then
11          break;
12      E := getExperiments(v)                             /* Section 5.3.1.1 */
13      oneSteps := performExperiments(E, numExp)          /* Section 5.3.1.1 */
14      didSplit := SplitBySurprise(v, oneSteps, sgain)    /* Section 5.3.1.2 */
15      if didSplit == false then
16        successCounts[v] := successCounts[v] + 1

Algorithm 2 gives the high-level pseudocode of the Max Predict algorithm, which enables a learning agent R to actively build a PSDE model M of the state and dynamics of an unknown, rewardless POMDP E directly from experience.
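The control flow of Algorithm 2 can be sketched in Python as follows. This is a simplified skeleton under our own assumptions: the refine/merge/split procedures of Sections 5.3.1.2-5.3.1.4 are passed in as callables and default to no-ops here, so only the experiment-generation and convergence bookkeeping are exercised.

```python
import random

def one_step_extensions(experiment, actions, observations):
    """Line 12: prepend every action/observation pair to v's experiment
    (for the initial null PSDE, prepend each action alone)."""
    if experiment == ("null",):
        return [(a,) for a in actions]
    return [(a, o) + experiment for a in actions for o in observations]

def learn_psdes(actions, observations, converge_tol=3,
                refine=None, merge=None, split=None):
    """Skeleton of Max Predict's main loop (Algorithm 2)."""
    refine = refine or (lambda v: v)                      # lines 6-7 stub
    merge = merge or (lambda v_other, v: (False, v))      # lines 8-11 stub
    split = split or (lambda v, one_steps: False)         # line 14 stub
    model = [("null",)]               # line 2: the single null-experiment PSDE
    success_counts = {("null",): 0}   # line 3
    while any(c < converge_tol for c in success_counts.values()):   # line 4
        v = random.choice([u for u in model
                           if success_counts[u] < converge_tol])    # line 5
        v = refine(v)
        for v_other in [u for u in model if u != v]:
            did_merge, v = merge(v_other, v)
            if did_merge:
                break
        one_steps = one_step_extensions(v, actions, observations)
        # line 13 would perform numExp experiments here to estimate the
        # one-step PSDEs' observation distributions and predictive accuracies
        if not split(v, one_steps):                       # lines 14-16
            success_counts[v] += 1
    return model

# With no surprises, the model converges immediately to the null PSDE.
final_model = learn_psdes(["x", "y"], ["o1", "o2"])
# final_model == [("null",)]
```

A full implementation would have `split` replace v in the model with its merged one-step extensions (adding entries to `success_counts` for the new PSDEs), which is how the model grows beyond the null PSDE.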
The key idea of the Max Predict algorithm is that the agent R continuously designs experiments to perform in its environment, performs those experiments, predicts the results of those experiments based on the most likely observation of the matching PSDE (i.e., argmax_o P(o | c_i), where c_i is the current model condition satisfied), and compares the actual result to the predicted result (active experimentation, Section 5.3.1.1). Counters are kept to estimate the predictive accuracy and conditional probabilities over the next observation associated with each PSDE. These statistics trigger surprises that lead to the model being modified via three key operations: splitting, refining, and merging.

PSDE splitting (Section 5.3.1.2) refers to the process of increasing the length of a PSDE by one action and observation (i.e., one additional time-step into the past) in order to increase the estimated predictive accuracy of the PSDE model. Note that this also requires increasing the number of PSDEs in the model, because each possible action and observation at this new past time step must be considered to maintain the mutual exclusivity and exhaustiveness of the model. Intuitively, splitting a PSDE occurs when the agent has reason to believe that its model is too coarse along the possible agent histories covered by that PSDE. Such a situation is triggered by a splitting surprise.

PSDE refining (Section 5.3.1.3) refers to the process of separating out a simple or compound experiment from a compound PSDE into its own PSDE when its probability distribution over the next observation differs significantly from the distributions associated with other experiments in that compound PSDE. This increases the number and specialization of PSDEs of the same size. Like splitting, the process of refining a compound PSDE is also triggered when the agent has reason to believe its model is too coarse along the possible agent histories covered by that compound PSDE.
Such a situation is triggered by a refinement surprise.

PSDE merging (Section 5.3.1.4) refers to the process of combining two statistically similar PSDEs into a compound PSDE that represents both of them (i.e., covers all the trajectories covered by either of them) in order to create as compact a model as possible. Merging all possible first actions and observations of a (k + 1)-length PSDE results in the first action and observation being dropped, returning the PSDE to a k-length PSDE. In contrast to splitting and refining, merging is triggered when the agent has reason to believe that its model is too fine-grained along certain possible agent trajectories and that these trajectories may be considered as part of the same PSDE in its model of the environment. Such a situation is triggered by a merging surprise. The algorithm hyperparameters sgain, rgain, and mgain control the statistical thresholds necessary to trigger splitting, refining, and merging surprises, respectively. To summarize, PSDE splitting and refining increase the number of PSDEs in the model (i.e., the number of distributions P(O | c_i)) as well as the number of distinct model conditions c_i. Merging, on the other hand, reduces both the number of PSDEs and the number of model conditions in the model.

Algorithm 2 consists of a while loop (line 4) that runs until no PSDE has a successCount less than convergeTol. The successCount of each PSDE is incremented when the PSDE splitting procedure (line 14) fails to find sufficient evidence that increasing the length of PSDE v will increase the model's predictive accuracy (i.e., no splitting surprises occur). In line 5, a random PSDE v with successCounts[v] < convergeTol is selected for experimentation. If v is compound with multiple choices of action at the first step (line 6), line 7 performs PSDE refining to separate out individual experiments covered by v into their own PSDEs if refinement surprises are detected.
Next, the PSDE merging procedure (lines 8-11) searches through the other PSDEs in the model and attempts to merge v with a statistically similar PSDE to make the model more compact. Merging is only performed if a merging surprise occurs, and only one successful merge is allowed per while-loop iteration (see lines 10-11). In line 12, we expand k-action PSDE v into a set of |A||O| PSDEs with length k + 1 (oneSteps) by appending each possible pair of action and observation to the beginning of v's experiment. We perform the experiments of these oneSteps (by executing their actions in order) uniformly at random numExp times (line 13), actively predicting and using the results of these experiments to estimate the conditional observation probabilities and predictive accuracies of the PSDEs in oneSteps. When possible, we use the sub-trajectories generated by this experimentation to update estimates of these quantities for the PSDEs currently in model M. Note that, at this stage in the algorithm, the PSDEs in oneSteps are not yet part of M. Finally, in line 14, after experimentation has been completed, PSDE splitting is used to partition k-action PSDE v into a set of two or more (k + 1)-action PSDEs (created by merging its oneSteps PSDEs together according to predicted observation and replacing v in PSDE model M with these new PSDEs) when such a split causes an increase in estimated predictive accuracy, as indicated by the occurrence of a splitting surprise.
5.3.1.1 Active Experimentation and Prediction

Recall that the key idea of the Max Predict algorithm is that the learning agent continuously designs new experiments, predicts the results of these experiments based on the most likely observation of the matching PSDE, and performs these experiments in its environment, comparing the actual result of each experiment to the predicted result and gathering statistics about the frequencies of observations and prediction failures, which are defined as mismatches between actual and predicted observations. These statistics generate surprises that are used to update the agent's world model via splitting (Section 5.3.1.2), refining (Section 5.3.1.3), or merging (Section 5.3.1.4) in a way that increases the estimated predictive accuracy of the model or allows for further specialization of the model along specific agent histories. In line 12 of Algorithm 2, the function getExperiments returns the |A||O| one-step extension experiments (each with length k + 1 and stored in the list E) formed by appending each possible pair of a single action and observation to the beginning of v's k-length experiment. Recall that v is the PSDE selected in line 5 of Algorithm 2 and that the experiments of PSDEs extend backward in time until an initial action at time t - k + 1 and an initial observation at time t - k + 2.

Figure 5.6: An illustration of the one-step extension experiments of the simple PSDE defined in Equation 5.2. Since |A| = 2 and |O| = 2 in the Shape environment (Figure 5.2), there are |A||O| = 4 such one-step extensions.
The (k+1)-length |A||O| one-step extension experiments represent all the possible experiments that begin with performing one of the |A| possible actions at time t−k, making one of the |O| possible observations at time t−k+1, and then matching v's k-length experiment exactly from time t−k+1 to time t. The exception to this is that, at the beginning of the algorithm, only one PSDE with a null experiment exists (which trivially covers all trajectories and tracks the prior distribution over observations), and the one-step extension experiments are formed by appending each possible action to this null experiment. Temporary PSDEs are created for each one-step extension experiment and stored in the list oneSteps returned by performExperiments (line 13). Note that these oneSteps PSDEs are not added to the model M at this point. As an example, the one-step extension experiments for the PSDE in Equation 5.2 would be:

1. {x, □, y, ○, x}
2. {x, ○, y, ○, x}
3. {y, □, y, ○, x}
4. {y, ○, y, ○, x}

Note that these four one-step extension experiments do not cover all possible trajectories that can occur by the agent executing action x or y at time t−2 followed by action y at time t−1 and x at time t. In line 13, the agent performs numExp of these experiments (where numExp is a user-defined hyperparameter) in its environment. Performing an experiment consists of selecting one of these one-step extension experiments uniformly at random and executing the actions of the selected one-step extension experiment in order. If the generated trajectory causes the experiment of one of the PSDEs in oneSteps to succeed, the agent predicts the next observation will be the most likely observation according to that PSDE's current estimated probability distribution over the next observation. The actual observation is compared to the predicted observation. If it matches, the successful and total prediction counters of the PSDE are both incremented. If it does not match, only the total prediction counter is incremented.
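The generation of one-step extension experiments described above can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's implementation; the function name and the string symbols for actions and observations are mine. An experiment is a list alternating actions and observations, beginning and ending with an action, and an extension prepends one (action, observation) pair.

```python
def one_step_extensions(experiment, actions, observations):
    """Return all |A||O| one-step extensions of `experiment`.

    The initial null experiment ([]) is a special case: it is extended
    by single actions only, as described in the text.
    """
    if not experiment:  # null experiment: extend with each action alone
        return [[a] for a in actions]
    # Prepend each possible (action, observation) pair to the experiment.
    return [[a, o] + experiment for a in actions for o in observations]

# Shape environment: actions x, y; observations square, circle.
exts = one_step_extensions(['y', 'circle', 'x'], ['x', 'y'],
                           ['square', 'circle'])
```

For a 2-action experiment this yields the four (k+1)-length experiments enumerated above, and for the null experiment it yields one 1-action experiment per action.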
The observation counter associated with the actual observation is also incremented. The observation counters are all set to the same floating point value at the beginning of the algorithm. This value is greater than 0 in order to avoid zero-count problems². The estimated conditional probability of each observation given the success of the PSDE's experiment is the value of its observation counter divided by the sum of the observation counters of all observations (a categorical distribution), while its estimated predictive accuracy is the number of times it successfully predicted the next observation divided by the total number of times it was used for prediction (each time the PSDE's experiment succeeded). Though not mentioned in Algorithm 2, it is often useful to introduce some random actions (for the purposes of exploration) in between performing these experiments, particularly in larger environments.

² As we discuss in the next chapter, these initial counter values can be formally viewed as Dirichlet prior distributions over the observation probability distributions of each PSDE. Since the initial counter values are all set to the same value, this prior distribution encodes the (usually untrue) assumption that all observations are equally likely under every model condition. We argue that such an assumption is the only reasonable one when the environment is completely unknown. However, the magnitude of these initial counter values is also important, because it encodes how strongly these initial assumptions should be weighted against actual future observations. Intuitively, the smaller these initial values are, the more peaked we expect these observation distributions to be, and the more quickly the data (actual observations) overwhelm this prior.
The magnitude of these values is of more concern in the formulations of PSBL that minimize model surprise (see Section 5.3.2), because these values have a direct impact on the entropies of the observation probability distributions being learned (particularly when few data samples have yet been seen).

This helps prevent the repeated performing of the same or similar experiments from causing the agent to remain in only one area of the environment during this experimentation procedure. Regardless of whether the trajectory generated by each such experiment causes the experiment of one of the PSDEs in oneSteps to succeed, sub-trajectories of this trajectory from length 0 (just observations) to length k are used to make predictions and update the observation counters and prediction counters of any matching PSDEs currently in the model M of length k or less in the same way. This allows parameter learning of existing PSDEs in M to continue opportunistically, even when they are not being explicitly evaluated.

5.3.1.2 PSDE Splitting

The key idea behind PSDE splitting is that the learning agent should be surprised when the one-step extension PSDEs of the selected PSDE, taken together, have a significantly greater estimated predictive accuracy than the selected PSDE, because it means that considering one more time-step of history in the past provides significantly more information about accurately predicting the next observation. Intuitively, this indicates to the learning agent that its model is too coarse along the possible agent trajectories covered by the selected PSDE (i.e., the model condition c_i represented by this PSDE is not a good approximate sufficient statistic of agent history).

Assume, for the moment, that agent R is given the probabilities of every possible trajectory of actions and observations through environment E. In this case, R can easily calculate the conditional probability distribution associated with any k-length simple or compound PSDE for any length k.
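The counter-based parameter estimates described in Section 5.3.1.1 can be sketched as a small Python class. This is an illustrative sketch rather than the dissertation's implementation; the class and method names are mine. Observation counters start at a shared positive pseudo-count, and the categorical distribution and predictive accuracy are recovered by normalization, exactly as described above.

```python
class PSDECounters:
    """Per-PSDE statistics: observation counters (initialized to a small
    positive pseudo-count to avoid zero-count problems) plus successful
    and total prediction counters."""

    def __init__(self, observations, init_count=1.0):
        self.obs_counts = {o: init_count for o in observations}
        self.successful_predictions = 0
        self.total_predictions = 0

    def record(self, predicted_obs, actual_obs):
        """Update counters after one prediction event."""
        self.total_predictions += 1
        if predicted_obs == actual_obs:
            self.successful_predictions += 1
        self.obs_counts[actual_obs] += 1

    def obs_distribution(self):
        """Estimated categorical distribution over next observations."""
        total = sum(self.obs_counts.values())
        return {o: c / total for o, c in self.obs_counts.items()}

    def predictive_accuracy(self):
        """Fraction of prediction events that succeeded."""
        if self.total_predictions == 0:
            return 0.0
        return self.successful_predictions / self.total_predictions
```

For example, after one correct and one incorrect prediction with unit pseudo-counts, the estimated distribution is (2/4, 2/4) and the estimated predictive accuracy is 0.5.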
Consider any PSDE v_k in some PSDE model M defined by these known probabilities. During the testing of this model (using random actions), agent R will make T_{v_k} predictions about the next observation using this PSDE, where T_{v_k} represents the number of times the experiment of v_k succeeds. We make the common assumption that all these prediction events are independent and identically distributed³. Let ô_i be the observation predicted by PSDE v_k in the ith prediction event, and let o_i be the actual observation produced by the environment in the ith prediction event. We wish for each PSDE v_k to minimize the following 0-1 loss function over all its prediction events:

L(ô, o) = (1 / T_{v_k}) · Σ_{i=1}^{T_{v_k}} I(ô_i ≠ o_i)    (5.5)

The loss function over the entire PSDE model could therefore be defined by averaging these losses over all PSDEs in the model. It is a matter of elementary statistics to show that this overall loss function is minimized when the prediction of each PSDE v_k is always the most likely observation according to its conditional probability distribution over observations. We define this value as p_{v_k}. p_{v_k} is also often called the maximum a posteriori or MAP estimate. We also call p_{v_k} the predictive accuracy of v_k because, by the strong Law of Large Numbers [62], as the number of predictions made by v_k increases toward infinity, the ratio of successful to total predictions will converge almost surely to p_{v_k}. Note that this 0-1 loss function is defined in terms of prediction failures, which we are trying to minimize. For this reason, during training and testing, agent R always predicts that the next observation will be the most likely observation according to the PSDE whose experiment succeeded.

³ This is not strictly true, because these prediction events are not generated according to completely randomly sampled trajectories of agent history. They are generated sequentially via a random walk through the environment.
However, we argue that it is a reasonable assumption in the environments that we consider, and this appears to be borne out in the experimental results presented in both this chapter and Chapter 8.

The set of |A||O| one-step extension PSDEs (oneSteps) covers the same trajectories as the original PSDE v_k in a mutually exclusive and exhaustive fashion but allows for specialization (differing predictions) amongst sub-trajectories that had the same prediction in v_k. Defining p_{v_{i,k+1}} analogously to p_{v_k} for all one-step extension PSDEs, and defining w_{i,k+1} as the conditional probability of one-step extension v_{i,k+1}'s experiment succeeding given that v_k's experiment succeeded, an increase in predictive accuracy occurs when:

Σ_{i=1}^{|A||O|} p_{v_{i,k+1}} · w_{i,k+1} > p_{v_k}    (5.6)

In other words, an increase in predictive accuracy occurs when the mixture model consisting of the predictions of the various one-step extensions, weighted by the probabilities of those one-step extensions occurring, outperforms the predictive accuracy of v_k. We do not know p_{v_k}, p_{v_{i,k+1}}, or w_{i,k+1}. However, p_{v_k} can be estimated using the prediction counters of v_k in model M, p_{v_{i,k+1}} can be estimated using the prediction counters of the one-step extension PSDEs (oneSteps), and we can estimate w_{i,k+1} as the number of times one-step PSDE v_{i,k+1}'s experiment succeeded divided by the total number of times any one-step PSDE's experiment succeeded during the active experimentation procedure described in the previous section. A splitting surprise occurs when:

( Σ_{i=1}^{|A||O|} p̂_{v_{i,k+1}} · ŵ_{i,k+1} ) / p̂_{v_k} > sgain    (5.7)

where sgain is an algorithm hyperparameter whose value is greater than or equal to 1, and the hat symbols indicate that we have substituted in our empirical estimates for the actual probabilities. When a splitting surprise occurs, v_k is replaced in M with at most |O| simple and compound PSDEs formed by merging its one-step extension PSDEs according to predicted observation.
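The claim that always predicting the MAP observation minimizes the expected 0-1 loss of Equation 5.5 can be checked with a small numerical sketch. This is my own illustration (the function name and the distribution values are illustrative); for a fixed categorical distribution p, the expected loss of always predicting observation o is 1 − p(o), which is smallest for the most likely observation.

```python
def expected_01_loss(p, predicted):
    """Expected fraction of prediction failures when always predicting
    `predicted` against observations drawn i.i.d. from distribution p."""
    return 1.0 - p[predicted]

# A categorical distribution over two observations (square, circle).
p = {'square': 0.777, 'circle': 0.223}

# The MAP observation and the expected loss of each constant prediction.
map_obs = max(p, key=p.get)
losses = {o: expected_01_loss(p, o) for o in p}
```

Here the MAP prediction (square) has expected loss 0.223, while predicting circle has expected loss 0.777; no other constant prediction rule can do better.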
v_k is only eligible for splitting if all of its one-step extension experiments succeed at least once during experimentation. In larger environments, it is often useful to relax this to a requirement that some user-defined percentage of one-step extension experiments succeed at least once during experimentation. This change does not affect the theoretical analysis of the algorithm. Due to the fact that larger PSDEs are more difficult to gather statistics for, it is often desirable to increase sgain with PSDE length. For example, sgain might be defined as 1 + 0.05k, where k is the length of the PSDE being tested for splitting surprises. Such a definition of sgain intuitively acts as a prior that favors smaller models over larger ones.

As a specific example (generated using the Max Predict algorithm in the stochastic Shape environment), consider the following PSDE:

P(O_{t+1} = {□, ○} | x) = {0.777, 0.223}    (5.8)

and its one-step extensions:

P(O_{t+1} = {□, ○} | x, □, x) = {0.942, 0.058}    (5.9)
P(O_{t+1} = {□, ○} | x, ○, x) = {0.913, 0.087}    (5.10)
P(O_{t+1} = {□, ○} | y, □, x) = {0.943, 0.057}    (5.11)
P(O_{t+1} = {□, ○} | y, ○, x) = {0.368, 0.632}    (5.12)

The estimated probability distribution over each of these one-step extensions occurring given that the original PSDE's experiment succeeded (ŵ_{i,k+1}) is {0.409, 0.136, 0.209, 0.246}, which causes a splitting surprise for sgain = 1.1 because:

( Σ_{i=1}^{|A||O|} p̂_{v_{i,k+1}} · ŵ_{i,k+1} ) / p̂_{v_k} = 0.862 / 0.777 = 1.11

Splitting v_k according to this splitting surprise would lead to it being replaced by two new PSDEs in model M:

P(O_{t+1} = {□, ○} | {x, x, y}, {□, ○, □}, x) = {{0.942, 0.058}, {0.913, 0.087}, {0.943, 0.057}}    (5.13)

P(O_{t+1} = {□, ○} | y, ○, x) = {0.368, 0.632}    (5.14)
Figure 5.7: An illustration of the example splitting procedure detailed in Section 5.3.1.2, Equations 5.8-5.14. The left hand side of the figure (a) demonstrates the calculation of the expected gain in predictive accuracy when PSDE P({□, ○} | x) is split into its constituent one-step extensions: the weighted average one-step predictive accuracy is 0.409·0.942 + 0.136·0.913 + 0.209·0.943 + 0.246·0.632 = 0.862, giving a gain of 0.862/0.777 = 1.11. The right hand side of the figure (b) demonstrates how P({□, ○} | x) would be split into one compound PSDE and one simple PSDE, according to the predicted (most likely) observations of each one-step extension. Note that one-step extensions {x, □, x}, {x, ○, x}, and {y, □, x} all predict □, whereas one-step extension {y, ○, x} would predict ○. It is precisely this specialization in predicted observation along different agent histories that leads to the expected predictive accuracy gain.

Notice that the individual probability distributions (and their associated observation counters) are kept associated in the compound PSDEs with the appropriate experiments for now, which is crucial for the refining operation discussed next, which attempts to separate out these experiments from compound PSDEs when they begin to behave statistically differently. However, only one set of prediction counters is kept. The successCounts of these new PSDEs are set to 0.
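The splitting-surprise test of Equation 5.7 can be sketched using the worked Shape-environment numbers above. The function and variable names below are mine, not the dissertation's; the inputs are the estimated MAP probabilities of the parent PSDE and its one-step extensions and the estimated weights ŵ.

```python
def splitting_surprise(p_parent, p_exts, w_exts, sgain):
    """Return (gain, surprised): the estimated predictive-accuracy gain
    of the one-step extensions over the parent PSDE (Equation 5.7), and
    whether that gain exceeds the sgain threshold."""
    mixture = sum(p * w for p, w in zip(p_exts, w_exts))
    gain = mixture / p_parent
    return gain, gain > sgain

# Parent PSDE P({square, circle} | x) has predictive accuracy 0.777.
# Extension MAP probabilities (Equations 5.9-5.12) and weights w_hat.
gain, surprised = splitting_surprise(
    0.777,
    [0.942, 0.913, 0.943, 0.632],
    [0.409, 0.136, 0.209, 0.246],
    sgain=1.1,
)
```

With these estimates the gain evaluates to roughly 1.11, exceeding sgain = 1.1, so the split would be performed.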
When a compound PSDE with multiple choices of action at the first step, such as in Equation 5.13, is split, a normalized average over each observation over all the combined experiments in the first step is used in the above splitting procedure, which, in the case of Equation 5.13, would be {0.933, 0.067}. p_{v_k} would be equal to 0.933 for this PSDE, which would be compared with the predictive accuracies of its one-step extensions according to the above procedure. (Note that its one-step extension PSDEs would be compound as well in that case.) The sgain parameter controls how much of a predictive accuracy increase justifies making the current PSDE one step longer. Finally, we note that splitting is the only procedure by which the length of PSDEs increases. In other words, PSDE splitting is the only means by which model conditions c_i increase in length. The example split described above is illustrated in Figure 5.7.

5.3.1.3 PSDE Refining

The key idea behind PSDE refining is that the learning agent should be surprised when different experiments covered by the same compound PSDE have significantly different conditional probability distributions over the next observation, as this provides evidence that multiple experiments being treated as a single PSDE (and thus a single model condition c_i) have quite different dynamics, even though they lead to the same prediction of the next observation. Refining a k-length compound PSDE with multiple choices of action at the first time step refers to the process of separating out one of its experiments into its own k-length PSDE, reducing the number of experiments covered by the original compound PSDE. Note that the grouping done by the splitting procedure discussed in the previous section ensures that the observation with the maximum probability will be the same for all probability distributions in a compound PSDE. Let p_avg denote the average of these maximum probability values over all experiments in the compound PSDE.
In the PSDE specified by Equation 5.13, for example, p_avg = 0.933. Let p_max be the individual probability value with the largest absolute difference from p_avg. In the case of Equation 5.13, p_max = 0.913. A refinement surprise is defined as the event that:

max(p_max / p_avg, p_avg / p_max) > rgain    (5.15)

where rgain ≥ 1 is a refinement gain hyperparameter. If rgain = 1.02, the above example results in a refinement surprise, and the PSDE in Equation 5.13 would be replaced in M with the following two PSDEs (with their successCounts set or reset to 0) formed by refining out the middle trajectory:

P(O_{t+1} = {□, ○} | {x, y}, {□, □}, x) = {{0.942, 0.058}, {0.943, 0.057}}    (5.16)

P(O_{t+1} = {□, ○} | x, ○, x) = {0.913, 0.087}    (5.17)

Figure 5.8: An illustration of the example refinement procedure detailed in Section 5.3.1.3, Equations 5.16-5.17. The left hand side of the figure (a) demonstrates the calculation of the expected refinement gain: p_avg = (0.942 + 0.913 + 0.943)/3 = 0.933, p_max = 0.913, and the gain is p_avg/p_max = 0.933/0.913 = 1.022. The right hand side of the figure (b) demonstrates how P({□, ○} | {x, x, y}, {□, ○, □}, x) would be refined into two PSDEs: P({□, ○} | {x, y}, {□, □}, x) and P({□, ○} | x, ○, x).

The definition of refinement surprise above works particularly well when there are few observations. In the case of many observations, it is usually preferable to define refinement surprises in terms of more sophisticated statistical measures. As an example, refinement surprises can be usefully defined as a KL divergence [76, 75] from the average probability distribution above some rgain threshold (which would be greater than or equal to 0).
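The ratio-based refinement test of Equation 5.15 can be sketched on the per-experiment MAP probabilities of a compound PSDE. This is an illustrative sketch with names of my choosing, reproducing the worked numbers from Equation 5.13.

```python
def refinement_surprise(max_probs, rgain):
    """Return (index, ratio, surprised) for the experiment whose MAP
    probability deviates most from the compound PSDE's average
    (Equation 5.15)."""
    p_avg = sum(max_probs) / len(max_probs)
    # Experiment with the largest absolute deviation from the average.
    idx = max(range(len(max_probs)),
              key=lambda i: abs(max_probs[i] - p_avg))
    p_max = max_probs[idx]
    ratio = max(p_max / p_avg, p_avg / p_max)
    return idx, ratio, ratio > rgain

# Compound PSDE of Equation 5.13: MAP probabilities 0.942, 0.913, 0.943.
idx, ratio, surprised = refinement_surprise([0.942, 0.913, 0.943],
                                            rgain=1.02)
```

Here the middle experiment (MAP probability 0.913) deviates most from p_avg = 0.933, the ratio is about 1.022, and a refinement surprise occurs for rgain = 1.02, so that experiment would be refined out.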
In other words, any probability distribution too "far away" from the average distribution (where "distance" is computed using KL divergence) would lead to the associated experiment being refined out into its own PSDE.

Note that refinement increases the number of PSDEs in the model by 1 and that the new PSDEs it creates have the same length as the original PSDE (i.e., no model condition c_i increases in length). Furthermore, we note that refinement does not necessarily have an immediate impact on the model's estimated predictive accuracy, because it does not change the observation predicted by any PSDE. Rather, it allows for increased specialization of the refined-out trajectory later in the algorithm. Figure 5.8 illustrates the example refinement process described above.

5.3.1.4 PSDE Merging

The key idea behind PSDE merging is that the learning agent should be surprised when the conditional probability distributions represented by two different PSDEs start to behave similarly. In contrast to splitting and refining, this indicates to the agent that its model may be unnecessarily complex along certain trajectories and represents an opportunity to make the model more compact by combining PSDEs. Merging reduces the size of PSDE model M by 1 PSDE by combining two simple or compound PSDEs into a single PSDE that covers the trajectories covered by both of the original PSDEs. Two PSDEs are only eligible to be merged if they are the same size (length k), their experiments differ only in the first action (at time t−k+1) and observation (at time t−k+2) pair, and they predict the same observation. Merging allows the model to forget improper refines and splits as it learns more about the environment. Let v_1 and v_2 be two PSDEs that meet the above eligibility criteria.
Let p_{max,1} be the probability of the most likely observation in v_1 (possibly averaged over multiple distributions if v_1 is compound with multiple choices of action at the first step) and let p_{max,2} be the probability of the most likely observation in v_2 (possibly similarly averaged over multiple distributions). A merging surprise occurs when:

max(p_{max,1} / p_{max,2}, p_{max,2} / p_{max,1}) < mgain    (5.18)

where mgain ≥ 1 is a merging gain hyperparameter. As with the definition of refinement surprises, this definition of merging surprises works particularly well when there are few observations. With many observations, it is usually preferable to define merging surprises in terms of statistical measures such as KL divergence. A merging surprise can be usefully defined as the KL divergence between the average probability distributions of v_1 and v_2 being below threshold mgain (which, in this case, would be greater than or equal to 0). Merging may be considered an optional procedure in the Max Predict algorithm because, while it makes the model smaller (which is important in many applications), it may do so at the expense of the model's predictive accuracy. Finally, we note that it is possible that all |A||O| one-step extension PSDEs are merged together after being split and/or refined out. In this case, the probabilities of each observation are averaged and normalized into a single probability distribution, and the first action and observation of the experiment are dropped, returning the merged PSDE to a k-length PSDE from a (k+1)-length one. Though it is not explicitly mentioned in Algorithm 2, the pairs of experiments involved in all splitting, refining, and merging operations performed are saved (via a hash map) and, before performing any new operations, the algorithm ensures that the same operation has not been performed before on the same experiments. This prevents possible infinite loops of merging and splitting or merging and refining the same PSDEs.
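The merging-surprise test of Equation 5.18 and the averaging rule just described can be sketched as follows. The names are illustrative (the dissertation's own data structures differ); the distribution values come from Equations 5.13 and 5.14.

```python
def merging_surprise(p_max_1, p_max_2, mgain):
    """Equation 5.18: a merging surprise occurs when the two MAP
    probabilities are within a factor of mgain of each other."""
    return max(p_max_1 / p_max_2, p_max_2 / p_max_1) < mgain

def merge_distributions(dists):
    """Average the probability of each observation over all merged
    distributions, then renormalize the result."""
    n = len(dists)
    avg = [sum(d[j] for d in dists) / n for j in range(len(dists[0]))]
    total = sum(avg)
    return [v / total for v in avg]

# Averaging the four distributions of Equations 5.13 and 5.14:
merged = merge_distributions(
    [[0.942, 0.058], [0.913, 0.087], [0.943, 0.057], [0.368, 0.632]]
)
```

The averaged distribution is approximately {0.792, 0.208}. Note also that MAP probabilities of 0.942 and 0.943 trigger a merging surprise at mgain = 1.02, whereas 0.942 and 0.632 do not.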
As an example, merging the PSDEs in Equations 5.13 and 5.14 would result in the PSDE in Equation 5.8 but with a different probability distribution: {0.792, 0.208}, which is obtained by averaging the 4 probability values associated with each observation and ensuring the resulting distribution is normalized. Note that these two PSDEs are not technically eligible for merging, as they predict different observations, but the example serves to illustrate how the merging operation is performed and, in some cases, it may be preferable to allow merging even when different observations are predicted. As another example, merging the PSDEs in Equations 5.9, 5.10, and 5.11 results in the PSDE in Equation 5.13. The successCount of the merged PSDE is reset to 0. Merging can typically be seen as the inverse of refinement (with the exception being cases in which all |A||O| one-step extension PSDEs are merged back together). For example, the refinement process illustrated in Figure 5.8 (b) can be reversed to illustrate the merging of Equations 5.16 and 5.17 into Equation 5.13.

5.3.2 PSBL Learning by Minimizing Surprise

5.3.2.1 The Limitations of Max Predict

Max Predict has two primary shortcomings. First, surprise is not defined and used in a principled fashion. In Max Predict, splitting, refining, and merging surprises are disparate statistical events that are unified only in the sense that they all indicate that a change to the agent's model is warranted. Max Predict takes important steps toward generalizing surprise to stochastic environments, but it does not do so in a unified fashion that is applicable to different types of models. Second, predictive accuracy is a limited measure of the quality of an agent's environment model. To see why, it is useful to recall the definition of model error from Chapter 4.1:

Ê_{M,E} = (1/T) · Σ_{t=1}^{T} √( Σ_{o∈O} ( P_M(o | h_t) − Σ_{q∈Q} P_E(o | q) · b_{E,t}(q) )² )    (5.19)

Recall that the goal of stochastic ALFE is to find a model that makes Equation 5.19 as small as possible.
Note that, provided the agent maintains an amount of history equal in length to its longest PSDE, P_M(O | h_t) is directly provided by the PSDE whose experiment succeeds given h_t. Thus, Equation 5.19 is very easy to calculate for PSDE models. Consider the following simple 2-state environment and PSDE models of that environment (Figure 5.9). In Figure 5.9, PSDE Model 1 and PSDE Model 2 both predict the next observation (either □ or ○) with perfect accuracy (assuming the environment is fully deterministic), because the most recent action (x or y) is a sufficient statistic of agent history up to that point. However, the model error of PSDE Model 1 (see Equation 5.19) is 0.0, whereas it is 0.693 for PSDE Model 2. Clearly, there is an important relationship between the predictive accuracy of a PSDE model M and its model error with respect to environment E (e.g., a PSDE model M with a model error of 0.0 should be expected to predict as well as an oracle would in environment E). However, in addition to desiring that the agent make correct predictions in environment E, we also want a model M that makes these predictions with the appropriate confidence (according to the nature of environment E).

Figure 5.9: A simple 2-state environment in which the agent can execute the actions x and y and observe □ in state S1 and ○ in state S2. PSDE Model 1: P({□, ○} | y) = {1.0, 0.0}, P({□, ○} | x) = {0.0, 1.0}. PSDE Model 2: P({□, ○} | y) = {0.51, 0.49}, P({□, ○} | x) = {0.49, 0.51}. Assuming this environment is deterministic and that predictions are made according to the most likely observation of the matching PSDE, both PSDE Model 1 and PSDE Model 2 would predict future observations with perfect accuracy. However, the model error (Equation 5.19) of PSDE Model 1 would be 0.0, while model error would be 0.693 for PSDE Model 2.
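The model-error computation of Equation 5.19 for the Figure 5.9 example can be sketched numerically. This is my own illustration, under the simplifying assumption that the environment is deterministic, so the belief state puts probability 1 on the correct state and the true next-observation distribution after action y is {1.0, 0.0} and after action x is {0.0, 1.0}; all names below are illustrative.

```python
import math

def model_error(model_dists, true_dists):
    """Equation 5.19: average Euclidean distance between the model's
    predicted observation distribution and the environment's true
    belief-weighted observation distribution over T time steps."""
    T = len(model_dists)
    return sum(
        math.sqrt(sum((pm - pe) ** 2 for pm, pe in zip(m, e)))
        for m, e in zip(model_dists, true_dists)
    ) / T

# A trajectory alternating actions y, x (true distributions are certain).
true = [[1.0, 0.0], [0.0, 1.0]] * 5
model_1 = [[1.0, 0.0], [0.0, 1.0]] * 5      # PSDE Model 1 of Figure 5.9
model_2 = [[0.51, 0.49], [0.49, 0.51]] * 5  # PSDE Model 2 of Figure 5.9
err1 = model_error(model_1, true)
err2 = model_error(model_2, true)
```

Model 1 achieves error 0.0, while Model 2's per-step error is sqrt((0.51−1)² + (0.49−0)²) ≈ 0.693, matching the figure, even though both models predict with perfect accuracy.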
5.3.2.2 Defining Surprise in PSDE Models

The generalization of surprise that we presented in Chapter 4.3 addresses both of these shortcomings of Max Predict by providing the agent with a continuous estimate of the quality of the probability distributions defining its model, not just the expected predictive accuracy of an agent making predictions according to these probability distributions. The key idea, which is validated by our experimental analysis presented in Chapter 8, is that surprise and model error will (in many important environments) be positively correlated, such that reducing model surprise will cause a corresponding decrease in model error. Thus, the agent can use surprise to drive the probability distributions defining its model closer to those defining the underlying environment.

As discussed in Section 5.2, a PSDE model M consists of N PSDEs of the form P(O | c_i) for i ∈ {1, ..., N}, where c_1, ..., c_N are a set of mutually exclusive and exhaustive model conditions. In the case of PSDE models, each c_i is a suffix of agent history (i.e., a sequence of actions and associated ordered observations up to time t), and each P(O | c_i) is a separate probability distribution over possible agent observations at time t+1, given that the agent's history ended by matching c_i. Therefore, model surprise (see Chapter 4.3) has the following form for PSDE models:

S(M) = Σ_{i=1}^{N} w_i · H(O | c_i)    (5.20)

Recall that H(O | c_i) refers to the normalized Shannon entropy of P(O | c_i). We normalize the entropy of a probability distribution over |O| elements by dividing it by the maximum possible entropy over |O| events (i.e., the entropy of a uniform distribution over |O| events). Typically, w_i is set to be proportional to the number of times the associated model condition c_i (which is an agent history suffix) has been satisfied. This is done primarily because some agent histories are much more probable than others, particularly in environments that are highly deterministic.
In deterministic environments, some agent histories may not be possible at all. In environments that are very close to deterministic, any agent history may be possible, but some histories may be extremely improbable. The PSDEs associated with such agent histories are likely to have very high normalized entropy, even after extensive agent experimentation, because each PSDE begins with a uniform prior distribution over observations that is only updated when that PSDE's experiment succeeds. Setting w_i proportionally to the number of times in which c_i was satisfied ensures that the agent is not punished for having uncertainty about histories it could not possibly hope to experience. Furthermore, we argue that such a weighting scheme is reasonable in the case of this work, because we ensure that the agent spends a significant proportion of its time executing random actions and randomly selected SDEs. This effectively prevents the agent from skewing all of its experience toward PSDEs about which it is already most certain in order to minimize surprise by favorably tweaking these weights.
Figure 5.10: An illustration of computing model surprise (Equation 5.20) for a PSDE model of the stochastic Shape environment of Figure 5.2 with α = 0.95 and ε = 1.0. The left side (a) illustrates the probability distributions (and associated observation counters, shown in red) defining the PSDE model: the seven PSDEs have observation counters (771, 29), (16, 650), (413, 16), (10, 8), (126, 10), (126, 227), and (7, 4). The right side (b) illustrates how these probability distributions and observation counters are used to compute the total surprise of the PSDE model: the counter totals sum to 2413, giving the distribution over PSDEs {0.332, 0.276, 0.178, 0.007, 0.056, 0.146, 0.005}; the normalized PSDE entropies are {0.225, 0.163, 0.230, 0.991, 0.379, 0.940, 0.946}; and the model surprise is 0.332·0.225 + 0.276·0.163 + 0.178·0.230 + 0.007·0.991 + 0.056·0.379 + 0.146·0.940 + 0.005·0.946 = 0.331.

More specifically, let n_i represent the sum of the observation counters associated with PSDE P(O | c_i), and let n = Σ_{i=1}^{N} n_i. We set each w_i = n_i / n for i ∈ {1, ..., N}. Note that this causes Σ_{i=1}^{N} w_i = 1, meaning that Equation 5.20 represents a weighted average of PSDE observation probability normalized entropies. Intuitively, {w_i} represents a probability distribution over PSDEs in which the probability of each PSDE is proportional to the total number of times that its experiment succeeded (i.e., the number of times the corresponding model condition was satisfied). Figure 5.10 illustrates how surprise (Equation 5.20) is calculated for an example PSDE model of the stochastic Shape environment (Figure 5.2) with α = 0.95 and ε = 1.0.

5.3.2.3 PSBL for Learning PSDE Models

With the computation of surprise formally specified for PSDE models, we can now detail how PSDE models can be learned within the general PSBL framework specified in Chapter 4.5. Algorithm 3 provides the high-level pseudocode of PSBL for learning PSDE models by minimizing model surprise.
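The model-surprise computation of Equation 5.20 can be sketched directly from the observation counters of the Figure 5.10 example. The helper names are mine, not the dissertation's; the counters are taken from the figure.

```python
import math

def normalized_entropy(counts):
    """Normalized Shannon entropy of the categorical distribution implied
    by `counts`: entropy divided by the maximum entropy over the same
    number of outcomes (that of a uniform distribution)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts))

def model_surprise(counter_sets):
    """Equation 5.20: weighted average of normalized PSDE entropies, with
    each weight proportional to that PSDE's total observation count."""
    totals = [sum(c) for c in counter_sets]
    grand = sum(totals)
    return sum(
        (t / grand) * normalized_entropy(c)
        for t, c in zip(totals, counter_sets)
    )

# Observation counters of the seven PSDEs in Figure 5.10.
counters = [(771, 29), (16, 650), (413, 16), (10, 8),
            (126, 10), (126, 227), (7, 4)]
surprise = model_surprise(counters)
```

This reproduces the figure's result: the total model surprise comes out to approximately 0.331.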
Algorithm 3: PSBL Learning of PSDE Models
Input: numActions: number of actions to perform; explore: probability of executing random actions; A: actions; O: observations
Output: M: PSDE model of unknown environment E

       /* Algorithm 4 */
 1  M := initializeModel(A, O)
 2  policy := queue()
 3  foundSplit := true
 4  while foundSplit do
 5      for i ∈ {1, ..., numActions} do
 6          if policy.empty() then
                /* Add actions of an SDE to the policy or random actions */
 7              policy := updatePolicy(M, explore, policy)
 8          action := policy.pop()
 9          prevOb := E.currentObservation()
10          nextOb := E.takeAction(action)
            /* Algorithm 5 */
11          M.updateModelParameters(action, prevOb, nextOb)
        /* Compute model and transition surprise */
12      M.computeSurprise()
        /* Algorithm 6 */
13      foundSplit := M.trySplit()
14  return M

We begin in line 1 of Algorithm 3 by initializing the PSDE model. The details of this initialization are provided in Algorithm 4. In lines 1-2 of Algorithm 4, we initialize a new (empty) PSDE model. In lines 3-11 of Algorithm 4, we generate unique PSDEs for every possible action-observation pair a, o ∈ A × O (line 7). The experiment of each of these PSDEs is simply its corresponding action-observation pair. We store these PSDEs in M via two hash maps. The first, stringsToPSDEs, hashes each PSDE on the unique string representation of its experiment (line 8). The second, idsToPSDEs, hashes each PSDE on its unique ID number (line 9). In line 10, we put each new PSDE's experiment in reverse order into a trie data structure (see Figure 5.5 for an example).
Algorithm 4: Initialize PSDE Model
Input: A: actions; O: observations
Output: M: Initialized PSDE model
1  M := PredictiveSDEModel()
2  sdeCount := 0
3  for o ∈ O do
4    for a ∈ A do
5      newT := {o.id, a.id}
6      newTString := toString(newT)
7      newPSDE := PSDE(sdeCount, newT)
8      M.stringsToPSDEs[newTString] = newPSDE
9      M.idsToPSDEs[sdeCount] = newPSDE
10     M.PSDETrie.addSequence(newT.reverse())
11     sdeCount += 1
12 return M

It turns out that maintaining a suffix trie to organize the agent's PSDEs is vital to making this algorithm as efficient as possible (though its worst-case runtime is still exponential in the length of the longest PSDE; see Chapter 7), because it allows us to determine the matching PSDE for any given sequence of agent history in time linear in the length of this history. Assume that the agent's longest PSDE is of length D and that the agent has access to the previous D actions and D observations of its history. Then, we can reverse this history (which is of total size 2D) and traverse M.PSDETrie in order to find the longest matching PSDE experiment for this history. The longest matching PSDE experiment consistent with this history is the one that succeeds.

Returning to Algorithm 3, after model initialization, we enter the main while loop of the PSBL learning algorithm (lines 4-13). We continue executing this while loop (each time splitting an existing PSDE into its one-step extensions, line 13) until we can find no more evidence that splitting any existing PSDE is likely to reduce model surprise. In line 6, we determine whether we still have actions available in our policy, which is organized as a queue of actions to take. If we do not have any actions left in our policy, we update this policy in line 7. This policy is updated either with the actions of a random one-step extension of a random PSDE consistent with the current environment observation or a sequence of random actions of the same length.
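The reversed-experiment trie lookup described above can be sketched as follows. This is an illustrative Python sketch using nested dicts; the "$id" end-marker key is an assumption of this sketch, not part of the thesis implementation:

```python
def add_reversed(trie, experiment):
    # Insert a PSDE experiment (a list of observation/action symbols) in
    # reverse order, mirroring M.PSDETrie.addSequence(newT.reverse()).
    node = trie
    for sym in reversed(experiment):
        node = node.setdefault(sym, {})
    node["$id"] = tuple(experiment)  # mark the end of a stored experiment

def longest_matching_psde(trie, history):
    # Walk the trie along reversed history; the deepest end marker reached
    # identifies the longest PSDE experiment consistent with this history.
    node, best = trie, None
    for sym in reversed(history):
        if sym not in node:
            break
        node = node[sym]
        if "$id" in node:
            best = node["$id"]
    return best
```

Because the walk consumes one history symbol per step, the matching PSDE is found in time linear in the length of the history, as described above.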
The probability of appending a random action sequence to policy (rather than the actions of a PSDE) is explore, which is an algorithm hyperparameter. Based on our extensive experimentation, this learning procedure does not seem to be very sensitive to the particular value of explore, provided it is not extremely close to 1 or extremely close to 0. In practice, we typically leave this value at its default of 0.5. In lines 5-11 of Algorithm 3, we perform numActions actions from our policy in the environment to further train the PSDEs in model M (and their one-step extension distributions). After each action is taken and each current observation (prevOb) and resulting observation (nextOb) is recorded (lines 8-10), we update the parameters of model M based on this new agent history (line 11). The pseudocode of the model parameter update of line 11 is provided in Algorithm 5.

Algorithm 5: Update PSDE Model Parameters
Input: a: action taken; prevOb: previous observation; nextOb: resulting observation
1  M.history.append(a)
2  M.history.append(prevOb)
3  if M.history.length() ≥ M.maxPSDELength + 2 then
4    M.counterTrie.updateCountersProbabilities(M.history + nextOb)
5    M.history.popleft()
6    M.history.popleft()

In Algorithm 5, we see that PSDE model M maintains a deque called history of the agent's previous actions and observations. The agent maintains an amount of history 1 time-step (1 action-observation pair) longer than the longest PSDE in M (by length, here, we mean the number of actions plus the number of observations in a PSDE's experiment). This enables the agent to recall enough experience to learn the probabilities associated with the one-step extensions of any PSDE in model M. In lines 1-2 of Algorithm 5, the agent appends the most recent action taken (a) and current environment observation (prevOb) to the end of history.
Once a sufficient amount of experience has built up in history (tested in line 3), we can pop off the oldest action and observation from the left side of this deque (lines 5-6), after we have used this history to update the counters and probability distributions of the PSDEs (and their one-step extensions) in model M (line 4).

It turns out that by organizing the agent's experience in its environment into a specific kind of trie data structure (distinct from the M.PSDETrie trie discussed previously), we can perform the update in line 4 very efficiently (in time linear in the length of M.history). This data structure is called the counterTrie in line 4 of Algorithm 5. counterTrie is initially empty. Whenever a never-before-seen agent history is experienced, it is added to the trie in regular order (with observations at the first level, actions at the second level, observations again at the third level, etc.). This trie stretches forward in time, and, as PSDEs get longer throughout the course of the algorithm, this tree will become deeper. At each action-level node in counterTrie (i.e., every other level), we maintain a set of counters, one for each possible agent observation. These counters, initialized to the same non-zero value as in Max Predict (Section 5.3.1), maintain the number of times the agent has seen each of these observations after having experienced a history consistent with the counterTrie up to this action node. The observation probabilities associated with this action node can be obtained by simply dividing each observation counter value by the sum total of the observation counters at that node. Thus, in line 4 of Algorithm 5, we can simply traverse counterTrie according to M.history. At each action node at level i already in counterTrie consistent with M.history through index i, we increment the counter associated with the observation M.history[i + 1] (the agent's observation after its history ending at action i).
When traversing M.history, we create new nodes consistent with M.history if they are not currently in counterTrie. We make sure to create observation counters for each new action node added (initialized to the appropriate default nonzero values). Crucially, for any history of any length, counterTrie enables us to update the counters associated with the resulting observations of all the prefixes of this history (including the full history) with only a single pass over the actions and observations of history. In practice, this results in a significant speedup over Max Predict (though it is important to recognize that this trie-based implementation is applicable to Max Predict as well).

Figure 5.11: An example of building a counterTrie (line 4, Algorithm 5) to organize an agent's experience in its environment. Subfigures (a)-(d) demonstrate adding 4 sequences of history (in order) to an initially empty counterTrie. The left side of each subfigure shows the sequence of history being added. The red arrows and counters in each figure indicate what parts of the trie are changed once this sequence of history is added. Note how counters are maintained at every action node indicating the number of times each observation occurs after the agent encounters that history prefix. The probabilities associated with any of these action nodes can be computed simply by dividing the value of each observation's counter by the sum of the observation counters (over both observations) at that node.
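The single-pass counter update just described can be sketched as follows. This is an illustrative Python sketch using nested dicts (the "#" counter key is an assumption of this sketch); the thesis implementation initializes counters for every possible observation, whereas this sketch lazily starts unseen counters at a nonzero default for the same smoothing effect:

```python
def update_counter_trie(trie, history):
    # Single pass over an alternating observation/action history that ends in
    # an observation: at every action node along the path, bump the counter of
    # the observation that followed it.
    node = trie
    for i, sym in enumerate(history[:-1]):
        node = node.setdefault(sym, {})
        if i % 2 == 1:  # levels alternate: observation, action, observation, ...
            counters = node.setdefault("#", {})
            nxt = history[i + 1]
            counters[nxt] = counters.get(nxt, 1) + 1  # nonzero default of 1

def observation_probs(action_node):
    # Probabilities at an action node: each counter over the sum of counters.
    counters = action_node["#"]
    total = sum(counters.values())
    return {o: c / total for o, c in counters.items()}
```

One traversal of the history therefore updates the counters for the resulting observations of every prefix of that history, which is the source of the speedup over Max Predict noted above.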
An example of building a counterTrie from scratch in the Shape environment is provided in Figure 5.11. Returning again to Algorithm 3, in line 12, we compute the total model surprise of the current PSDE model (see Equation 5.20). Finally, in line 13, we attempt to split PSDE model M by replacing one of its PSDEs with the one-step extensions of that PSDE (when such a replacement is expected to reduce total model surprise). The specifics of the splitting procedure are given in Algorithm 6.

Algorithm 6 gives the pseudocode for attempting to split a PSDE model. In the case of PSBL for learning PSDE models (Algorithm 3), splitting means replacing a current PSDE with PSDEs representing all its possible one-step extensions (that is, all possible ways of extending the selected PSDE one action-observation pair into the past). In PSBL for learning PSDE models, we typically assume that the agent begins with an initial observation at time t − k + 1. This is a slight difference from how Max Predict (Section 5.3.1) was presented, but both procedures work equally well regardless of whether the agent's history begins with an action or an observation. Nevertheless, for the sake of clarity, Figure 5.12 visualizes the one-step extensions of the PSDE P(O_{t+1} = {·,·} | ·, x), in which the agent begins by making an initial observation (rather than beginning with an action).

Figure 5.12: An illustration of the one-step extensions of the PSDE P(O_{t+1} = {·,·} | ·, x). Note that the agent begins with an initial observation of · at time t in addition to executing x at time t. In PSBL for PSDE models (Algorithm 3), we typically assume that PSDEs begin with an initial observation (as opposed to an initial action).

In Algorithm 6, we begin by selecting a PSDE currPSDE (line 1).
M maintains a round-robin counter called sdeCounter, which ranges from 0 to |M| − 1, where |M| denotes the number of PSDEs currently in M. It is incremented (modulo |M|) every time Algorithm 6 is called such that it visits every PSDE in turn repeatedly until all of them have a tryCount greater than or equal to some limit (M.tryLimit, which is an algorithm hyperparameter).

Algorithm 6: Attempt to Split PSDE Model
Output: foundSplit: true if non-converged PSDEs still exist, false otherwise.
1  currPSDE := M.idsToPSDEs[M.sdeCounter]
2  oneSteps := createOneStepExtensions(currPSDE)
3  oneStepProbabilities := (|A||O|) × |O| 2D array of zeros
4  oneStepCounts := (|A||O|) × |O| 2D array of zeros
5  validCount := 0
6  for i in {0, ..., |A||O| − 1} do
7    node := counterTrie.nodeIfExists(oneSteps[i])
8    if node then
9      oneStepCounts[i, :] = node.observationCounters
10     oneStepProbabilities[i, :] = node.observationProbabilities
11     validCount += 1
12 currNode := counterTrie.nodeIfExists(currPSDE.trajectory)
13 if validCount == 0 or not currNode or currNode.observationCounters.sum() < M.minCountAllowed or max_i oneStepCounts[i, :].sum() < M.minOneCountAllowed then
14   currPSDE.tryCount += 1
15   M.sdeCounter += 1 (modulo |M|)
16   return checkConvergence()
17 remove all-zero rows in oneStepCounts and oneStepProbabilities
18 oneEntropies := normalized entropies of distributions remaining in oneStepProbabilities
19 oneWeights := counts of each remaining one-step extension in oneStepCounts divided by sum of all counts
20 currEntropy := normalized entropy of currNode.observationProbabilities
21 gain := currEntropy − (oneWeights * oneEntropies).sum()
22 if gain > M.minGain then
23   remove currPSDE from M.idsToPSDEs and M.stringsToPSDEs
24   for one in oneSteps do
25     add new PSDE with trajectory one to M.idsToPSDEs and M.stringsToPSDEs
26     M.PSDETrie.addSequence(one.reverse())
27   M.maxPSDELength = max(M.maxPSDELength, oneSteps[0].length)
28 else
29   currPSDE.tryCount += 1
30 M.sdeCounter += 1 (modulo |M|)
31 return checkConvergence()
The function checkConvergence() (lines 16 and 31) returns true if there exists any PSDE with a tryCount less than M.tryLimit and false otherwise. In line 2, we create the one-step extensions (oneSteps) of currPSDE by prepending each possible action-observation pair to the beginning of currPSDE's experiment (denoted currPSDE.trajectory). In lines 3-4, we create 2D arrays of zeros of size (|A||O|) × |O| to store the probabilities (oneStepProbabilities) and observation counters (oneStepCounts) associated with each one-step extension. validCount, which is initialized to 0 in line 5, is used to count how many of the one-step extensions in oneSteps have actually been experienced by the agent (and are thus in experience trie counterTrie). In lines 6-11, we loop through each one-step extension in oneSteps (recall from previous sections that, for any PSDE, there are |A||O| one-step extensions, which correspond to every possible pair of action and observation that could precede that PSDE's experiment). If that one-step extension has been experienced, it will exist in counterTrie, and we extract the final action node of that one-step extension (line 7) and copy the observation counters and observation probabilities associated with that trie node into the appropriate rows of oneStepCounts and oneStepProbabilities (lines 9-10). We increment validCount for every such one-step extension that exists in counterTrie (line 11). In line 12, we extract the final action node (currNode) matching currPSDE.trajectory, if such a node exists in counterTrie. There are four main conditions (checked in line 13) that must be met in order to continue evaluating (and possibly splitting) currPSDE:

1. validCount must be greater than 0.
2. currNode must exist in counterTrie (it must not be null).
3. The sum of the observation counters of currNode must be larger than or equal to hyperparameter value M.minCountAllowed.
4.
The sum of the observation counters of some one-step extension must be larger than or equal to hyperparameter value M.minOneCountAllowed.

Conditions 3 and 4 involve hyperparameters that help control how much data is necessary in order to consider splitting currPSDE. If any of these conditions is violated, currPSDE.tryCount is incremented by 1 (line 14), M.sdeCounter is incremented (modulo |M|) by 1 (line 15), and the result of the checkConvergence() procedure is returned (line 16). Recall that checkConvergence() returns true if there exists at least one PSDE whose tryCount is less than M.tryLimit and false otherwise.

In line 17, the all-zero rows (corresponding to one-step extensions that did not exist in counterTrie) are removed from both oneStepCounts and oneStepProbabilities. In line 18, the normalized entropies of the remaining observation probability distributions in oneStepProbabilities are computed and stored in oneEntropies. The normalized entropy of the observation probability distribution associated with currPSDE is computed and stored in currEntropy (line 20). In line 19, each remaining one-step extension is assigned a weight proportional to the sum of its observation counters in oneStepCounts. The sum of the observation counters associated with each remaining one-step extension is divided by the sum of all the observation counters over all one-step extensions. These weights are stored in oneWeights. Since oneWeights must sum to 1 (by construction), this can be viewed as a probability distribution over valid one-step extensions. This computation is completely analogous to the process of generating a probability distribution over the PSDEs in M in order to compute total model surprise (illustrated in Figure 5.10).

In line 21, we compute the gain associated with replacing currPSDE with its one-step extensions. Let w_o represent the K × 1 vector of weights in oneWeights, where K is the number of valid one-step extensions.
Let h_o represent the K × 1 vector of normalized entropies associated with these K valid one-step extensions (computed using their associated observation probability distributions). Let h_c represent the normalized entropy of the observation probability distribution associated with currPSDE. Then, the estimated gain of replacing currPSDE with its one-step extensions can be expressed as:

gain = h_c - w_o^T h_o    (5.21)

Equation 5.21 computes an estimate of the reduction in normalized entropy the agent should expect when it splits currPSDE into its constituent one-step extensions. If the value of gain is very small (or negative), splitting currPSDE is not expected to further decrease the agent's uncertainty about future observations. On the other hand, if gain is very large, the agent expects that splitting currPSDE will dramatically decrease its uncertainty about future observations (at least under the model condition c_i associated with currPSDE). Equation 5.21 is similar, in some ways, to the definition of a splitting surprise (Equation 5.7) in the Max Predict algorithm (Section 5.3.1). However, here, the agent uses the learned observation probability distributions of its PSDEs (and one-step extensions) directly rather than empirical estimates of predictive accuracy (which turn out to be an unnecessary complication). If the computed gain is larger than hyperparameter M.minGain (line 22), we split currPSDE by replacing it in M with new PSDEs representing all of its one-step extensions (even those that have not yet been experienced; lines 23-26). currPSDE has to be replaced with all of its one-step extensions to ensure that the updated model has mutually exclusive and exhaustive model conditions. We add these new PSDEs to M.idsToPSDEs and M.stringsToPSDEs. We also add a reversed version of each one-step extension to M.PSDETrie.
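The gain computation of Equation 5.21 (lines 17-21 of Algorithm 6) can be sketched as follows. This is an illustrative Python sketch; the list-of-lists layout for the counts and distributions is an assumption of this sketch:

```python
import math

def normalized_entropy(probs):
    # Shannon entropy divided by log(support size), so the result lies in [0, 1].
    k = len(probs)
    if k < 2:
        return 0.0
    return -sum(p * math.log(p) for p in probs if p > 0.0) / math.log(k)

def split_gain(curr_probs, one_step_counts, one_step_probs):
    # gain = h_c - w_o^T h_o: the entropy of the current PSDE's observation
    # distribution minus the count-weighted average entropy of its (observed)
    # one-step extensions.  Inputs should already exclude all-zero rows.
    totals = [sum(c) for c in one_step_counts]
    grand = sum(totals)
    weights = [t / grand for t in totals]
    h_one = [normalized_entropy(p) for p in one_step_probs]
    return normalized_entropy(curr_probs) - sum(w * h for w, h in zip(weights, h_one))
```

For instance, a maximally uncertain PSDE whose one-step extensions are each strongly peaked yields a large positive gain, while extensions no sharper than the current distribution yield a gain near zero or negative.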
In line 27, the length of the longest PSDE in M is increased if these newly added PSDEs (which all have the same length) are longer than any PSDEs currently in M. Note that a successful split increases |M| by |A||O| − 1 PSDEs (the −1 is due to the fact that we are replacing currPSDE with |A||O| new PSDEs). If the computed gain is not large enough to warrant splitting, we increment currPSDE.tryCount by 1 (line 29). Either way, we increment M.sdeCounter by 1 (modulo |M|, line 30) and return the results of the checkConvergence() procedure (line 31).

As a final note, Algorithms 3-6 do not make use of compound PSDEs, PSDE refining, or PSDE merging. While it is possible to introduce these concepts and procedures into Algorithm 3, we have found that they actually degrade the performance of Algorithm 3 in many important cases. Furthermore, the highly-optimized trie-based implementation of Algorithm 3 presented in this section has enabled this procedure to scale to very large environments, even when compound experiments are not used to reduce the number of PSDEs (see Chapter 8). We believe that this is due to the fact that, in many environments (particularly large environments with relatively high determinism), it is only possible for the agent to experience a small percentage of the possible one-step extensions of any PSDE. Even though these impossible (or highly improbable) one-step extensions are added to the model when a PSDE is split, the conditions in line 13 of Algorithm 6 prevent them from further splitting. Intuitively, this dramatically reduces the effective branching factor of the agent's PSDE model, even when |A||O| is very large.

5.4 Search-based Decision Making in PSDE Models

A number of methods have been proposed in the literature for planning and decision-making in Predictive State Representation (PSR, [84]) models (see, e.g., [1, 66]).
It remains to be seen whether these methods can be extended to support PSDE models, which are also predictive models that share important similarities with PSRs. (The similarities and differences between PSRs and PSDE models are discussed extensively in Chapter 2.8.) However, when the environment is relatively deterministic (not too noisy), a direct search in the space of actions and most likely observations can be shown to produce policies that perform well on decision-making tasks (see Chapter 8). We assume, here, that the agent's goal is to take a series of actions that leads it from an initial observation (along with some history of previous actions and observations) to a goal observation. Since PSDE models are general purpose (i.e., task-independent), this goal observation can be any one of the agent's observations and can be easily changed. The key idea of search-based decision making in PSDE models is to perform a breadth-first search from a starting PSDE through the space of PSDEs that are most likely to be encountered under different sequences of future agent actions. The search terminates when the goal observation is the most likely one to be observed after some agent action is taken. The full procedure is given in Algorithms 7 and 8.

Algorithm 7: Search-based Decision Making for PSDEs
Input: goalOb: goal observation; maxLength: max lookahead length; M: learned PSDE model
1  history := deque()
2  policy := deque()
3  while E.currentObservation() != goalOb do
4    history.append(E.currentObservation())
5    if history.length() < M.maxPSDELength then
6      action := randomAction()
7      history.append(action.id)
8    else
9      if policy.empty() then
         /* Algorithm 8 */
10       policy := actionsToGoal(goalOb, history, maxLength, M)
11     action := policy.popleft()
12     history.append(action)
13     history.popleft()
14     history.popleft()
15   E.transition(action)

In lines 1-2 of Algorithm 7, we initialize empty deques for both agent history and the current agent policy.
Lines 3-15 form the main while loop of the algorithm. This loop continues executing until the current environment observation matches the goal observation (goalOb, checked in line 3). In line 4, we append the current environment observation to the agent's history. If the agent does not currently have enough history (it requires history equal in length to its longest PSDE), it executes a random action (lines 6-7 and line 15). Once the agent has accumulated enough history, it generates a policy that is expected to lead it from its current model condition to goalOb (line 10, which utilizes Algorithm 8). Note that, in this procedure, we assume that PSDE model M has already been learned by the agent. In lines 11-15, the agent takes the next action in policy and updates its history by removing the oldest action and observation. This keeps history the same size at every iteration (once a sufficient amount of history has built up). This is directly analogous to how history is maintained in the per-action update procedure (Algorithm 5) discussed in the previous section.

Algorithm 8: Actions to a Goal Observation in PSDE Models
Input: goalOb: goal observation; history: agent history; maxLength: max lookahead length (in timesteps); M: learned PSDE model
Output: actions: sequence of actions to reach goalOb.
1  originalLength := history.size()
2  open := queue()
3  open.push(history)
4  while not open.empty() do
5    hist := open.pop()
     /* Account for the fact that history had originalLength elements */
6    if hist.size() > 2·maxLength + originalLength then
7      continue
8    for a ∈ A do
9      histCopy := copy(hist)
10     histCopy.append(a.id)
11     matchingPSDE := M.PSDETrie.find(histCopy.reverse())
12     matchingNode := M.counterTrie.nodeIfExists(matchingPSDE.trajectory)
13     if matchingNode then
14       mostLikelyObservation := arg max_o matchingNode.observationProbabilities
15       histCopy.append(mostLikelyObservation.id)
16       open.push(histCopy)
17       if mostLikelyObservation == goalOb then
18         histCopy := histCopy[originalLength − 1:]
19         return histCopy[1::2]
20 return a random action

The specifics of how policy is updated in line 10 of Algorithm 7 are given in Algorithm 8. In line 1 of Algorithm 8, we save the original length of history, which we use later to separate the agent's history of already-taken actions from the actions that this procedure expects will lead it to the goal observation (goalOb). In lines 2-3, we create a new queue called open and put history on that queue. Lines 4-19 perform a breadth-first search in the space of the most likely PSDEs resulting from the agent's actions. In line 5, we pop the next element to be expanded, hist, from the open queue. Since it is possible that the agent won't be able to find a sequence of actions that ends with a most likely observation of goalOb (e.g., if one doesn't exist), this procedure is provided with a maximum number of time steps the agent is allowed to look into the future (maxLength). If hist becomes longer than 2·maxLength + originalLength (line 6), we do not expand hist and instead move to the next element in open. Notice that we multiply maxLength by 2 because each time step of history includes both an action and an observation. We add originalLength because we want the agent to be able to look maxLength time steps past the given history.
If hist is not too long, we look at each possible action a ∈ A that the agent can take (line 8). We first make a copy of hist, called histCopy, in line 9, because we want to consider the results of each possible action starting from hist. We append the ID number of action a to histCopy in line 10. Since M is a learned PSDE model, histCopy must be uniquely consistent with one PSDE in M. We determine the trajectory of this matching PSDE in line 11 (using M.PSDETrie) and find the final associated action node in M.counterTrie in line 12. If such a matchingNode exists in M.counterTrie (line 13), we use matchingNode.observationProbabilities to determine the observation that is most likely to occur at the next time step (mostLikelyObservation, line 14). In line 15, we append the ID number of this observation to histCopy and, in line 16, we add histCopy to the open queue. If mostLikelyObservation is equal to the goal observation (goalOb, checked in line 17), then we terminate the search. We use originalLength to isolate the part of histCopy that was appended to the original history during the search process (line 18), and we return every other element of this subsequence of histCopy, beginning with element 1 (line 19). These are precisely the actions expected to lead the agent to goalOb.

The procedures detailed in Algorithms 7 and 8 work well in practice (as we demonstrate in Chapter 8), but they are not guaranteed to find a policy that leads the agent to the goal observation (let alone an optimal policy). It is possible for the agent to become stuck in a PSDE from which there is no predictive path through most likely observations to goalOb. However, in our experience, such situations are rare and can be ameliorated to some degree by introducing random actions when the agent notices that it is stuck. This randomness is typically enough to place the agent in a new PSDE from which there is a valid path to goalOb through most likely observations.
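The core of Algorithm 8 can be sketched as follows. This is an illustrative Python sketch; `predict` is a stand-in for the PSDETrie/counterTrie lookups of lines 11-14, returning the most likely next observation for a projected history ending in an action (or None if that extension was never experienced):

```python
from collections import deque

def actions_to_goal(history, goal_ob, max_len, actions, predict):
    # Breadth-first search through the most likely futures (Algorithm 8 sketch).
    orig = len(history)
    open_q = deque([list(history)])
    while open_q:
        hist = open_q.popleft()
        if len(hist) > 2 * max_len + orig:  # line 6: lookahead bound
            continue
        for a in actions:
            cand = hist + [a]
            ob = predict(cand)  # most likely next observation under the model
            if ob is None:
                continue
            cand.append(ob)
            open_q.append(cand)
            if ob == goal_ob:
                return cand[orig:][::2]  # every other element: the actions
    return None  # caller falls back to a random action (line 20)
```

As in the text, each search step appends one action and one projected observation, so the bound `2 * max_len + orig` limits the lookahead to max_len time steps past the given history.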
We also note that, in practice, we maintain a closed queue in Algorithm 8 such that we can detect repeated PSDEs, and we keep pointers from each PSDE to the one that discovers it, such that the path to goalOb can be reconstructed without having to generate increasingly longer sequences of projected agent experience. We present the simpler version here to convey the main ideas without unnecessary detail. Figure 5.13 provides an illustration of this search-based decision-making procedure.

Figure 5.13: An illustration of the search-based decision making procedure for PSDE models detailed in Algorithms 7-8 as applied to the stochastic Shape environment (Figure 5.2). Subfigure (a) illustrates the PSDEs in a PSDE model of the stochastic Shape environment. Subfigure (b) illustrates the use of search-based decision making to take the agent from an initial history of {·, y, ·} to a goal observation of · via the policy {y}. Note how different choices of actions lead to the agent's projected history matching different PSDEs in the model.

5.5 Conclusions

In this chapter, we detailed how Stochastic Distinguishing Experiments (SDEs, [38]) could be used to create predictive SDE (PSDE) models that directly estimate the probability distribution over future observation sequences for any given sequence of agent actions without directly modeling latent environment structure.
We detailed a learning procedure called Max Predict (Section 5.3.1), which learned simple and compound PSDEs by attempting to directly maximize the agent's predictive accuracy in its environment. We then discussed how this learning procedure could be dramatically improved and simplified by reframing it as a procedure to minimize model surprise (Section 5.3.2) in a way that is consistent with the PSBL framework presented in Chapter 4.5 and by appropriately utilizing trie data structures for efficiency. We concluded this chapter with a discussion regarding how PSDE models could be used for decision making. We focused on situations in which the agent wishes to get to a goal observation from its current environment condition and presented a search-based planning technique for PSDE models that works well in practice (for relatively deterministic environments). In the next chapter, we detail how SDEs can be used to create hybrid latent-predictive models of stochastic and partially-observable environments. In these models, SDEs are used to discover and uniquely represent latent environment states.

Chapter 6
Latent-Predictive SDE Modeling (sPOMDPs)

As we have discussed in previous chapters, standard techniques for autonomous learning in partially-observable and stochastic environments are typically fundamentally reliant on human-engineered features, one of the most important of which is an a priori specification of the latent state space of the agent's environment. Designing an appropriate state space demands extensive domain knowledge, and even minor changes to the task or the agent might necessitate an expensive manual re-engineering process.
The limitations of these traditional, latent-state approaches have motivated the use of end-to-end, predictive learning approaches, such as Predictive State Representations (PSRs, [84]) and our Predictive Stochastic Distinguishing Experiments (PSDEs, [38]; see Chapter 5), which learn a representation of state encoded in the probabilities of key sequences of raw actions and observations (i.e., experiments the agent can perform in its environment). However, discovering these experiments remains a key challenge, in part because existing techniques lack a formal relationship between predictive experiments and latent environment structure.

In this chapter, we discuss how to extend our PSDE representation from the previous chapter into a hybrid latent-predictive model, called a Surprise-based Partially Observable Markov Decision Process (sPOMDP, [39, 35]), which is a partially-observable Markov decision process (POMDP, [68]) in which each latent model state is uniquely identified and represented by a maximally-probable predictive sequence of observations created by executing an associated Stochastic Distinguishing Experiment (SDE) from that model state. Differences in these observation sequences can be used to statistically disambiguate identical-looking states. sPOMDPs, like other predictive models (e.g., PSRs [84] and PSDEs [38]), are grounded in the raw actions and observations of the agent, enabling end-to-end learning that requires few, if any, human-engineered features; however, in contrast to other predictive models, sPOMDPs can also be used as traditional POMDPs, so state-of-the-art POMDP planning, decision-making, and reinforcement learning techniques can be applied straightforwardly to them.
We present PSBL variants that enable agents to actively and incrementally learn sPOMDP models of unknown stochastic and partially-observable environments directly from experience and detail how such models can be exploited to make decisions and generate policies.

6.1 Rewardless α-ε-POMDP Environments

Figure 6.1: An illustration of the deterministic Shape environment [119, 142] (subfigure (a)) and the α-ε-Shape environment (subfigure (b)), which extends the Shape environment into a stochastic and partially-observable POMDP environment in which the agent experiences the same level of noise on each of its state-to-state transitions and observations. In subfigure (b), the most likely transition has probability α and each of the three remaining transitions has probability (1 − α)/3; the most likely observation has probability ε and the remaining observation has probability 1 − ε.

Before detailing sPOMDPs, we must first discuss an important subclass of rewardless POMDP environments called α-ε-POMDPs. These environments are important both because they motivated the sPOMDP modeling approach, and because they have an important relationship to the representational capacity of sPOMDP models. In Chapter 7, we prove that deterministic Moore Machine [100] environments and rewardless α-ε-POMDP environments can be perfectly represented by sPOMDPs no larger than the minimal representations of these environments. As we discuss in Chapter 7, this result answers an important open theoretical question regarding the representational capacity of our Stochastic Distinguishing Experiments (SDEs, [38]) and Shen's Local Distinguishing Experiments (LDEs, [141]). Recall that, in the deterministic version of the Shape environment (Figure 6.1 (a)), a robot exists in a simple 4-state environment in which it observes one shape (in states I and II) and another shape (in states III and IV). The agent can execute the actions x and y, which have the effects illustrated in the figure.
Now, consider "softening" the deterministic Shape environment of Figure 6.1 (a) into a rewardless POMDP in which the arrows represent the most likely transitions between states (each with probability α) and the shapes of each state represent the most likely observation to be emitted (with probability ε). The remaining transition probability mass is divided equally amongst transitions to the other possible states, and the remaining observation probability mass is divided equally amongst the other possible observations. Figure 6.1 (b) illustrates this conversion for the action x from state I and the observation in state I. We call this a rewardless α-ε-POMDP environment:

Definition 8. Rewardless α-ε-POMDP. A rewardless α-ε-POMDP E is a seven-tuple (Q, A, T, O, Ω, α, ε), where Q is a discrete set of states, A is a discrete set of actions, O is a discrete set of observations, T is the set of transition probabilities (satisfying the Markov property), and Ω is the set of observation (emission) probabilities (satisfying the observation Markov property). T is defined such that ∀ (q, a) ∈ Q × A, P(q_a^max | q, a) = α, where q_a^max is the most likely state to be transitioned into from state q under action a. Additionally, ∀ q, a, and q′ ≠ q_a^max, P(q′ | q, a) = (1 − α)/(|Q| − 1). Similarly, Ω is defined such that ∀ q, P(o_q^max | q) = ε, where o_q^max is the most likely observation to be emitted in state q. Additionally, ∀ q and o ≠ o_q^max, P(o | q) = (1 − ε)/(|O| − 1).

As Mahmud notes in [87], learning a model of an unknown POMDP is NP-hard, even under severe restrictions [128], and one potential way to overcome this is to consider simpler but still useful subclasses of the class of POMDP environments that might permit more efficient solutions. This is the motivation behind specifically considering α-ε-POMDPs in detail in this chapter. (As we discuss in later sections and demonstrate experimentally in Chapter 8, however, sPOMDPs can also be used effectively in practice to model rewardless POMDPs that are not α-ε-POMDPs.)
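The "softening" in Definition 8 is mechanical enough to sketch in code. The following is a minimal illustration (the function name `soften` and the particular wiring of the Shape environment's transitions are our own assumptions for demonstration, not taken from the text):

```python
import numpy as np

def soften(det_trans, det_obs, n_obs, alpha, epsilon):
    """Convert a deterministic Moore machine into an alpha-epsilon POMDP.

    det_trans[q][a] is the deterministic successor state q_a^max;
    det_obs[q] is the deterministic observation o_q^max (an int < n_obs).
    Returns T[a, q, q'] and Omega[q, o] per Definition 8.
    """
    n_q, n_a = len(det_trans), len(det_trans[0])
    # non-maximal transitions share the remaining (1 - alpha) mass equally
    T = np.full((n_a, n_q, n_q), (1 - alpha) / (n_q - 1))
    Omega = np.full((n_q, n_obs), (1 - epsilon) / (n_obs - 1))
    for q in range(n_q):
        for a in range(n_a):
            T[a, q, det_trans[q][a]] = alpha
        Omega[q, det_obs[q]] = epsilon
    return T, Omega

# Shape-like environment: states I-IV = 0-3, actions x, y = 0, 1;
# observations square, diamond = 0, 1.  The wiring below is an
# illustrative guess at Figure 6.1 (a), not taken from the text.
shape_trans = [[1, 2], [3, 2], [3, 0], [1, 0]]
shape_obs = [0, 0, 1, 1]
T, Omega = soften(shape_trans, shape_obs, n_obs=2, alpha=0.99, epsilon=0.99)
assert np.allclose(T.sum(axis=2), 1.0) and np.allclose(Omega.sum(axis=1), 1.0)
```

Each row of T and Omega remains a proper probability distribution, and setting alpha = epsilon = 1 recovers the deterministic machine.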
As a final technical note, when α = ε = 1, the definition of an α-ε-POMDP (Definition 8) reduces to the definition of a deterministic Moore machine [100].

6.2 Surprise-Based POMDPs

At this point, it is necessary to recall the definition of a Stochastic Distinguishing Experiment (SDE) from Chapter 4.4:

Definition 9. Stochastic Distinguishing Experiment (SDE). For each state q ∈ Q in rewardless POMDP environment E, define b_q as a belief state that assigns probability 1 to being in state q. Define:

g_q = argmax_{(o₁, …, o_{k+1})} Σ_{q₂, …, q_{k+1}} P(o_{k+1} | q_{k+1}) P(q_{k+1} | q_k, a_k) ⋯ P(o₂ | q₂) P(q₂ | q, a₁) P(o₁ | q)    (6.1)

In words, g_q is the most probable sequence of k + 1 ordered observations the agent will encounter upon executing actions (a₁, …, a_k), in order, beginning at state q (with probability 1). We call this the length-(k + 1) outcome sequence of state q. If the ordered sequence of actions (a₁, …, a_k) has the property that, when executed from belief states b_{q₁}, …, b_{q_d}, g_{q_i} ≠ g_{q_j} ∀ i ≠ j, then (a₁, …, a_k) is called a k-length Stochastic Distinguishing Experiment (SDE) for states q₁, …, q_d.

Intuitively, the larger the probability of g_{q_i} and g_{q_j} given actions (a₁, …, a_k), the more useful this SDE is in statistically disambiguating states q_i and q_j, because g_{q_i} will necessarily have a low probability of occurring starting from belief state b_{q_j} and g_{q_j} will necessarily have a low probability of occurring starting from belief state b_{q_i}. In the deterministic Shape environment (Figure 6.1 (a)), for example, we saw in Chapter 3.2 that the SDE (y, x) disambiguates states I and II, with g_I = (□, ◊, ◊) and g_II = (□, ◊, □). g_I has probability 1 when (y, x) is executed from belief state b_I and probability 0 when (y, x) is executed from belief state b_II. Similarly, g_II has probability 1 when (y, x) is executed from belief state b_II and probability 0 when (y, x) is executed from belief state b_I.
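Equation 6.1 can be evaluated exactly for small environments by enumerating every observation sequence and marginalizing over the hidden state paths with a forward pass. The sketch below is our own illustrative implementation (the function name `outcome_sequence` and the toy two-state environment are assumptions, not from the text); its cost is |O|^(k+1) forward passes, so it is only for intuition:

```python
import itertools
import numpy as np

def outcome_sequence(q, actions, T, Omega):
    """Compute g_q (Equation 6.1): the most probable sequence of k+1
    observations when executing `actions` from belief state b_q.
    T[a, q, q'] and Omega[q, o] follow Definition 8.  Brute force:
    only suitable for small k and small |O|."""
    n_obs = Omega.shape[1]
    best_seq, best_p = None, -1.0
    for obs_seq in itertools.product(range(n_obs), repeat=len(actions) + 1):
        belief = np.zeros(T.shape[1]); belief[q] = 1.0
        # weight by P(o_1 | q), then alternate transition / emission terms
        p_vec = belief * Omega[:, obs_seq[0]]
        for a, o in zip(actions, obs_seq[1:]):
            p_vec = (p_vec @ T[a]) * Omega[:, o]
        p = p_vec.sum()
        if p > best_p:
            best_seq, best_p = obs_seq, p
    return best_seq, best_p

# Toy 2-state alpha-epsilon POMDP: action 0 swaps states; state i emits
# observation i with probability 0.9 (illustrative numbers only).
alpha, eps = 0.9, 0.9
T = np.array([[[1 - alpha, alpha], [alpha, 1 - alpha]]])
Omega = np.array([[eps, 1 - eps], [1 - eps, eps]])
seq, p = outcome_sequence(0, [0, 0], T, Omega)  # expect observations 0, 1, 0
```

From state 0, the maximally-probable outcome follows the "backbone" of most likely transitions and emissions, exactly as the intuition behind SDEs suggests.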
The key idea behind SDEs is that, provided the environment exhibits some level of determinism (even if it is not fully deterministic), the probability of observing g_{q_i} from belief state b_{q_i} is often much higher (in many environments) than that of any other observation sequence, and the agent can use the differences in these maximally-probable observation sequences to statistically disambiguate identical-looking states. The closer the probability of each g_{q_i} (from belief state b_{q_i}) is to 1, the easier (and more successful) we should expect this statistical disambiguation to be. As discussed previously, one of the primary limitations of PSDEs [38] and other purely predictive modeling approaches such as Predictive State Representations (PSRs, [84]) is that the relationship between predictive experiments and latent environment structure is not formally defined. This is because predictive models directly estimate the probabilities of possible agent experiments without attempting to model latent environment structure. This makes it difficult, for example, to directly apply the rich literature of POMDP solution techniques to predictive models (for planning and decision-making) without extensive reformulations of these procedures (e.g., [1, 66]). Nevertheless, predictive representations offer important advantages, such as the fact that the probabilities of events of interest can usually be estimated via simple counting procedures. Surprise-based Partially Observable Markov Decision Processes (sPOMDPs) are a hybrid latent-predictive model designed to combine the strengths of these modeling techniques. An sPOMDP is a traditional POMDP augmented with a set of Stochastic Distinguishing Experiments (SDEs), each of which is associated with and statistically distinguishes at least two latent states emitting the same most likely observation.
In contrast to traditional POMDPs, each latent state in an sPOMDP is uniquely identified by the most likely sequence of k + 1 ordered observations (called the outcome sequence of the model state) the agent will encounter when executing the associated k-length SDE from that state (more specifically, from a belief state that puts probability 1 on being in that model state).

Definition 10. sPOMDP. A rewardless surprise-based partially-observable Markov decision process (sPOMDP) is a 6-tuple (M, S, A, T, O, Ω), where M is a discrete set of model states, S is a discrete set of SDEs, A is a discrete set of actions, O is a discrete set of observations, T is the set of transition probabilities (satisfying the Markov property), and Ω is the set of observation (emission) probabilities (satisfying the observation Markov property). For each SDE s ∈ S, there is a set of associated model states {m_s} such that each m_s^i is identified by its outcome sequence g_{m_s^i} upon executing s. ∪_s {m_s} = M and ∩_s {m_s} = ∅ (i.e., each model state is associated with exactly one SDE), and ∀ m₁ ≠ m₂ ∈ M × M, g_{m₁} ≠ g_{m₂} (i.e., no two outcome sequences are identical, so they can be used as unique identifiers for each model state).

The name surprise-based POMDP comes from the fact that these models can be learned in an active and incremental fashion using a variant of the generalized PSBL framework introduced in Chapter 4.5 (see Section 6.3 for more details). Figures 6.2-6.5 show example sPOMDP models learned from POMDP environments in the literature (adapted to be α-ε-POMDP environments) using the PSBL learning algorithm for sPOMDPs detailed in Section 6.3.4. α and ε were both set to 0.99 in each example environment, but these parameters were unknown to the learner, as was
Figure 6.2: An illustration of the α-ε-Shape environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

Figure 6.3: An illustration of the α-ε-Little Prince environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

the structure and number of states in each example environment. Each state in the original α-ε-POMDP on the left side of each figure (subfigure (a)) is labeled with its most likely observation, and each sPOMDP model state on the right side of each figure (subfigure (b)) is labeled both by its outcome sequence and the corresponding environment state in the α-ε-POMDP. To reduce clutter, only the most likely transitions between states and the probabilities of these transitions are shown. It should be noted that, due to factors including the ability of sPOMDPs to model certain environments very compactly and the possibility of approximation errors in the sPOMDP learning process, there may not always be a one-to-one correspondence between learned sPOMDP model states and underlying environment states. We provide examples in which there is such a one-to-one correspondence here in order to make the key ideas behind sPOMDPs more clear.
Figure 6.4: An illustration of the α-ε-Circular 1D Maze environment with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

As a specific example of an sPOMDP, consider Figure 6.2 (b). From this figure, it is clear that this sPOMDP has two SDEs in the set S, s₁ = (x) and s₂ = (y, x). The agent has a set of 4 model states M = {m₁, m₂, m₃, m₄}. m₁ is associated with the outcome sequence {□, y, ◊, x, ◊}, while m₂ is associated with the outcome sequence {□, y, ◊, x, □}. m₁ and m₂ are disambiguated using SDE s₂. m₃ is associated with the outcome sequence {◊, x, ◊}, whereas m₄ is associated with the outcome sequence {◊, x, □}. m₃ and m₄ are disambiguated according to SDE s₁.
Note how each model state is associated with exactly one SDE which distinguishes it from at least one other model state. Also, note that each model state is associated with a unique outcome sequence. Additional illustrations and details regarding this particular example can also be found in Chapters 4.4 and 4.5, which provide an overview of SDEs and PSBL, respectively.

Figure 6.5: An illustration of the α-ε-Shuttle environment (adapted from [33]) with α = ε = 0.99 (a) and a learned sPOMDP model of this environment (b).

6.3 PSBL Learning of sPOMDPs

We now turn to the more interesting problem of learning sPOMDP representations of unknown environments actively from agent experience by minimizing model surprise (utilizing the Probabilistic Surprise Based Learning framework introduced in Chapter 4.5). To develop the full PSBL algorithm for learning sPOMDPs in Section 6.3.4, we will assume initially that the environment is an α-ε-POMDP, the agent knows the values of α and ε, and that the agent has access to an oracle that gives it the transition probabilities between any pair of model states under any action (regardless of the changes it makes to its model).
This leads to an optimal PSBL learning procedure for learning perfect sPOMDP models of α-ε-POMDPs, which we detail in Section 6.3.3. In Section 6.3.4, we remove the need for this oracle via active experimentation and remove the need for the agent to know α and ε via a model surprise gain test. PSBL for sPOMDPs, then, becomes applicable to any discrete rewardless POMDP environment (not just α-ε-POMDP environments).

6.3.1 Defining Surprise in sPOMDP Models

Before discussing how the PSBL framework can be applied to learning sPOMDP models, we must first formally define surprise in sPOMDP models. In contrast to the PSDE models presented in Chapter 5, sPOMDPs have a true latent model state space M. This means that, in the parlance of Chapter 4.3, the pairs (m, a) ∈ M × A form a set of mutually exclusive and exhaustive model conditions c₁, …, c_{|M||A|}. Each R₁, …, R_{|M||A|} takes on possible values in the set of model states M and represents the model state resulting from executing action a in model state m. Thus, sPOMDP models can be described by a set of probability distributions P(M′_{ma} | m, a), where M′_{ma} is a random variable representing the model state m′ ∈ M resulting from taking action a in model state m. Then, total model surprise can be defined as follows in sPOMDP model M:

S(M) = Σ_{(m,a) ∈ M×A} w_{ma} H(M′_{ma})    (6.2)

H(M′_{ma}) is the normalized entropy of the probability distribution P(M′_{ma} | m, a). w_{ma} is a set of weights with the property that each w_{ma} ≥ 0 and Σ_{(m,a) ∈ M×A} w_{ma} = 1, such that Equation 6.2 is a weighted average over the normalized entropies of the model state to model state transition distributions defining M. As before, these entropies are normalized by dividing each by the maximum possible entropy of a distribution over |M′_{ma}| = |M| elements, such that H(M′_{ma}) is always in the range [0, 1]. w_{ma} can be set such that it induces a uniform distribution over model transition entropies.
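Equation 6.2 is straightforward to compute from the model's transition distributions. The sketch below is our own illustrative implementation (the name `total_surprise` and the array layout are assumptions, not from the text); it supports both the uniform and the count-proportional weighting strategies discussed here:

```python
import numpy as np

def total_surprise(T, counts=None):
    """Equation 6.2: weighted average of normalized transition entropies.

    T[m, a] is the distribution P(M'_ma | m, a) over |M| successor model
    states.  If visit counts[m, a] are given, w_ma is set proportional to
    experience; otherwise the weights are uniform over (m, a) pairs."""
    n_m, n_a, _ = T.shape
    max_ent = np.log(n_m)  # entropy of a uniform distribution over |M| states
    with np.errstate(divide="ignore", invalid="ignore"):
        # normalized entropy of each P(M'_ma | m, a), guaranteed in [0, 1]
        ent = -np.where(T > 0, T * np.log(T), 0.0).sum(axis=2) / max_ent
    if counts is None:
        w = np.full((n_m, n_a), 1.0 / (n_m * n_a))
    else:
        w = np.asarray(counts, dtype=float)
        w = w / w.sum()
    return float((w * ent).sum())
```

A fully deterministic transition structure yields surprise 0, while uniformly random transitions yield the maximum surprise of 1.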
Alternatively, w_{ma} can be set proportionally to the number of times the agent has experienced the transition from model state m under action a. In practice, weighting each transition w_{ma} proportionally to the number of times the agent experienced it seems to work better than a uniform weighting strategy.

6.3.2 Defining Model Error in sPOMDPs

Recall the definition of model error from Chapter 4.1 (repeated here):

Ê_{M,E} = (1/T) Σ_{t=1}^{T} √( Σ_{o∈O} ( P_M(o | h_t) − Σ_{q∈Q} P_E(o | q) b_{E,t}(q) )² )    (6.3)

This can be easily extended to sPOMDP models. We evaluate the error of our sPOMDP model M against environment E via a simulation of T random actions applied to both the environment and the model. Let b_{E,t}(q) and b_{M,t}(m) represent the current belief distributions over states in the environment and model, respectively, at time t. At time step 1, assume E and M are visibly equivalent (i.e., they induce the same probability distribution over possible observations). See definition 8 in Section 7.2.2 for additional details. The estimated per-step model error between M and E is defined as:

Ê_{M,E} = (1/T) Σ_{t=1}^{T} √( Σ_{o∈O} ( Σ_{m∈M} P_M(o | m) b_{M,t}(m) − Σ_{q∈Q} P_E(o | q) b_{E,t}(q) )² )    (6.4)

More specifically, since the model state space M is designed to be a sufficient statistic of agent history, we can compute the term P_M(o | h_t) in Equation 6.3 by maintaining a belief state over model states at each time step (b_{M,t}(m)) and marginalizing out over possible model states to compute the history-dependent probability distribution over possible observations at time step t according to model M. This is completely analogous to the way in which environment states are marginalized out at each time step in Equation 6.4.

6.3.3 Optimal sPOMDP Learning

As was mentioned above, we begin our discussion of learning sPOMDPs by assuming that the agent knows the environment is an α-ε-POMDP, knows the values of α and ε, and has access to an oracle that gives it the transition probabilities between any pair of model states under any action.
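The estimator in Equation 6.4 can be sketched as a pair of parallel belief filters driven by the same random action/observation stream. The code below is our own illustration (the helper names `belief_step` and `model_error` and the toy environment are assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def belief_step(b, a, o, T, Omega):
    """One POMDP belief update: predict through T[a], condition on o."""
    b = (b @ T[a]) * Omega[:, o]
    return b / b.sum()

def model_error(T_e, Om_e, T_m, Om_m, b_e, b_m, steps=200):
    """Estimate Equation 6.4: simulate random actions in the environment
    (T_e, Om_e), filter beliefs in both environment and model (T_m, Om_m),
    and average the per-step distance between observation distributions."""
    n_a = T_e.shape[0]
    q = rng.choice(len(b_e), p=b_e)  # sample the true environment state
    err = 0.0
    for _ in range(steps):
        # per-step observation distributions under model and environment
        err += np.sqrt((((b_m @ Om_m) - (b_e @ Om_e)) ** 2).sum())
        a = rng.integers(n_a)
        q = rng.choice(T_e.shape[1], p=T_e[a, q])
        o = rng.choice(Om_e.shape[1], p=Om_e[q])
        b_e = belief_step(b_e, a, o, T_e, Om_e)
        b_m = belief_step(b_m, a, o, T_m, Om_m)
    return err / steps

# Toy check (illustrative numbers): a perfect model has zero error,
# since both filters track identical beliefs at every step.
alpha, eps = 0.9, 0.9
T_toy = np.array([[[1 - alpha, alpha], [alpha, 1 - alpha]]])
Om_toy = np.array([[eps, 1 - eps], [1 - eps, eps]])
b0 = np.array([1.0, 0.0])
perfect = model_error(T_toy, Om_toy, T_toy, Om_toy, b0.copy(), b0.copy())
```

When the model's dynamics exactly match the environment's, the two belief states coincide at every time step and the estimated error is identically zero, mirroring the empirical result reported later in this chapter.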
We also assume that each observation o ∈ O is most likely in at least one environment state. In other words, for each o ∈ O, there exists at least one environment state q ∈ Q such that P(o | q) = ε. With this information, the agent knows that its environment can be modeled by some unknown number of states, each of which transitions to some other most likely state with probability α under each possible action. Let m ∈ M be a state in the agent's model M and let m_a^max represent the most likely model state to be transitioned into under action a. If P(m_a^max | m, a) < α, the agent can infer that the (unknown) environment states currently sharing m's outcome sequence (the oracle, unbeknownst to the agent, keeps a mapping of the environment states currently covered in this way by each model state in order to compute model transition probabilities) are transitioning with probability α into states covered by m_a^max and at least one other model state m′_a ≠ m_a^max, indicating that the set of current model states is insufficient to perfectly capture the dynamics of environment E. Intuitively, this means that a model state m is covering identical-looking environment states with inconsistent dynamics under action a, and this manifests as entropy in P(M′_{ma} | m, a). On the other hand, if P(m_a^max | m, a) ≥ α, clearly the agent can infer that all the environment states covered by model state m are transitioning into environment states covered by m_a^max under action a. The procedures utilized in the constructive proofs of the sPOMDP representational capacity theorems (detailed in Chapter 7.2.3) can be extended into a provably-optimal, oracle-based sPOMDP learning procedure based on the above reasoning if consistency restrictions are placed on the allowable values of α and ε in environment E (discussed next).
Intuitively, these restrictions must enforce the following rather natural environment property: at each time step t, regardless of the agent's current model state and the action it takes, the most likely observation to be received should be equivalent to the most likely observation emitted in the most likely environment state to be transitioned into. We discuss below how violations of this property can lead to unlearnable environment structure. Define η ≜ (1 − α)/(|Q| − 1), which is the probability of making any non-maximal environment state to environment state transition. More formally, η is the probability of transitioning from some environment state q into any environment state q′ under action a that is not q_a^max = argmax_{q″} P(q″ | q, a).

Figure 6.6: α-ε-Shape environment with unlearnable structure when α ≤ 0.7 and ε ≤ 0.5.

Let L ≜ |Q| − |O| + 1, which is the maximum number of environment states that can emit a single observation as the most likely one under our assumption that all observations are most likely in at least one state. L is also the maximum number of environment states ever covered by a single model state (see Chapter 7.2.3). Then, we clearly require that α > 1/|Q| and α > Lη, for otherwise a model state's most likely transition might be obscured. For example, in Figure 6.6, the action x from state I (covered, initially, by the same model state as states III and IV, because the model is initialized with one model state per observation, see Algorithm 9) leads to a different model state with probability α, differentiating state I under the action x from states III and IV. However, if L = 3 and α ≤ Lη, then the agent is at least as likely to accidentally transition back into the shared model state under x as it is to transition into the differentiating model state, obscuring the unique dynamics of state I. Solving for α, we have the condition: α > L/(L + |Q| − 1). We actually require a stronger
We actually require a stronger 124 condition to guarantee learnability: must be large enough that the probability of accidentally transitioning into a model state covering the maximum possible number of environment states (L) is smaller than the probability of purposefully transitioning into a model state covering L environment states under some most likely state to state transition: L< L + L 1 L (6.5) Expressing this inequality with on the left hand side, we have the condition: > L 2 + 1L L 2 +jQjL L L +jQj1 (6.6) Finally, we require that > 1=jOj, such that each environment state has a unique maximally- probable observation. These restrictions on the allowable values of and are sucient to ensure that the most likely observation following each action will always be equivalent to the most likely observation emitted in the most likely environment state to be transitioned into under each action. Therefore,P (m max a jm;a)< implies that model statem needs to be split, as it is covering environment states with inconsistent dynamics. If, instead, P (m max a jm;a), we know that all the environment states covered by model statem have consistent dynamics, andm does not need to be split. Using the procedures in the constructive proofs of the representational capacity of sPOMDP models presented in Chapter 7.2.3, we can create the following oracle-based learning algorithm (Algorithm 9), which provably learns a perfect model of a given minimal -POMDP (or Moore machine, if = = 1) environmentE 1 . This algorithm has a central while loop (lines 5-21) that executes until8 m;a P (m max a jm;a), where m max a = arg max m 0 2M P (m 0 jm;a). Given the constraints we impose on and , such a situation must imply that, for every pair (m;a) 2 M A and for all q 2 fq m g, q max a = 1 As a technical note, a minimal -POMDP is an-POMDP \softened" from a minimal Moore machine [100]. See Sections 6.1 and 7.1.2 for more details. 
Also, see Chapter 7.1.2 for more details regarding the definition of a perfect sPOMDP model.

Algorithm 9: Optimal sPOMDP Learning
Input: E: minimal α-ε-POMDP; α: transition noise; ε: observation noise; A: actions; O: observations
Output: M: perfect sPOMDP model of E
 1  S = {}
 2  Initialize M with one model state per observation o ∈ O
    /* Ask oracle for environment parameters η and L, see Section 6.3.3 */
 3  η, L ← getEnvironmentNoiseParameters()
 4  change ← true
 5  while change do
 6      change ← false
 7      foreach m ∈ M do
 8          foreach a ∈ A do
                /* ask oracle for P(M′_ma | m, a) */
 9              P(M′_ma | m, a) ← getTransProbabilities(m, a)
                /* A hidden state must exist */
10              if max_m′ P(m′ | m, a) < α then
11                  Find m′₁ ≠ m′₂ ∈ M′_ma s.t. m.firstOb + a + m′₁ and m.firstOb + a + m′₂ match up to a first difference in observation, P(m′₁ | m, a) > Lη, and P(m′₂ | m, a) > Lη
12                  m_new1 ← m.firstOb + a + m′₁
13                  m_new2 ← m.firstOb + a + m′₂
14                  M.append(m_new1)
15                  M.append(m_new2)
16                  S.append(actionsToFirstDiffInObservation(m_new1, m_new2))
17                  M.erase(m)
18                  change ← true
19                  break
20          if change then
21              break
    /* get transition probabilities from oracle for all (m, a) pairs */
22  T ← getTransitionProbabilities()
23  Ω ← computeObservationProbabilities()
24  return M = (M, S, A, T, O, Ω)

argmax_{q′ ∈ Q} P(q′ | q, a) is covered by m_a^max, where {q_m} denotes the set of environment states covered by model state m. Intuitively, the underlying environment states covered by each model state agree on their most likely transitions under each action, allowing the probability of each most likely transition between model states to meet or exceed α. In Chapter 7.2.3, Lemmas 4 and 6 prove that any sPOMDP model with this property is a perfect model of Moore machine or α-ε-POMDP environment E. In line 2, the agent initializes its model state set M with one model state per observation o ∈ O. The outcome sequence of each model state is simply the observation with which it is initialized.
The oracle provides the agent with environment parameters η and L in line 3 and the model state to model state transition probability distribution for a given (m, a) pair in line 9. If P(m_a^max | m, a) < α (line 10), the agent knows that it needs to split model state m, as it is representing at least two environment states with inconsistent dynamics. To perform this split (line 11), we find a pair of model states m′₁ and m′₂ whose outcome sequences (with m.firstOb + a prepended) match one another up until a first difference in observation. The incremental construction of the SDEs and associated outcome sequences ensures that such a pair of model states exists whenever a model state must be split. The agent requires that P(m′₁ | m, a) and P(m′₂ | m, a) are both greater than Lη, in order to ensure that m′₁ and m′₂ both represent maximal transitions from environment states covered by model state m under action a, rather than simply environment noise. We create new outcome sequences m_new1 and m_new2 (lines 12-13) by prepending the first observation of m's outcome sequence and the action a to the outcome sequences of m′₁ and m′₂, respectively. m_new1 and m_new2 are disambiguated by a new SDE formed by concatenating a and the actions shared by the outcome sequences of m′₁ and m′₂ up to their first difference in observation (i.e., action a prepended to the SDE disambiguating m′₁ and m′₂). m_new1 and m_new2 are added to the set of model states M (lines 14-15), while the SDE distinguishing them is added to S (line 16). We then erase model state m (line 17, replaced by m_new1 and m_new2) and set change to true (line 18), indicating that we modified the model and should perform the while loop again. Lines 22-23 compute the transition and observation probabilities (T and Ω) of the sPOMDP.
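The split in lines 11-16 is mostly symbolic bookkeeping on outcome sequences, and can be sketched compactly. The helper below is our own illustration (the function name `split_state` and the tuple encoding of outcome sequences are assumptions, not from the text); observations sit at even indices and actions at odd indices:

```python
def split_state(first_ob, action, outcome1, outcome2):
    """Sketch of lines 11-16 of Algorithm 9: given model state m's first
    observation, the splitting action, and the outcome sequences of two
    successor model states, build the two new outcome sequences and the
    SDE (action list) that distinguishes them."""
    new1 = (first_ob, action) + tuple(outcome1)
    new2 = (first_ob, action) + tuple(outcome2)
    # actions shared by outcome1/outcome2 up to their first differing
    # observation, with `action` prepended (observations at even indices)
    sde = [action]
    for i in range(0, min(len(outcome1), len(outcome2)), 2):
        if outcome1[i] != outcome2[i]:
            break
        sde.append(outcome1[i + 1])
    return new1, new2, sde

# Splitting the square model state under y with successors (diamond, x,
# diamond) and (diamond, x, square) reproduces the Shape example: the new
# outcome sequences of states I and II, distinguished by the SDE (y, x).
n1, n2, sde = split_state("square", "y", ("diamond", "x", "diamond"),
                          ("diamond", "x", "square"))
```

This matches the construction of m₁ and m₂ in the Figure 6.2 example: both new sequences begin with the square observation and action y, and differ only at their final observation.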
The oracle directly gives us T, while the observation probabilities are easily computed by assigning probability ε to the most likely observation to be emitted at each model state (the first observation in its outcome sequence) and (1 − ε)/(|O| − 1) to other observations. Algorithm 9 runs in O(|Q|⁴|A|) time. In Figure 6.7, we evaluate the performance of our Optimal sPOMDP Learning procedure (Algorithm 9) on increasingly large, random (minimal) α-ε-POMDP environments where |Q| increased from 4 to 104. Each data point is an average over 200 random α-ε-POMDPs, and each

Figure 6.7: Runtimes for Optimal sPOMDP Learning (Algorithm 9) on increasingly large random α-ε-POMDPs with different numbers of hidden states.

curve represents random α-ε-POMDP environments in which each observation was most likely in increasingly large numbers of states (2, 4, 8, and 12). The red line, for example, represents random environments in which |O| = |Q|/2, |A| = |Q|/2, and each observation was most likely in exactly two environment states. α and ε were chosen uniformly at random from their allowable values.
Figure 6.8 provides additional results for the Optimal sPOMDP Learning procedure of Algorithm 9 on randomly generated minimal Moore machines (where α = ε = 1) and α-ε-POMDPs with |O| = |Q|/2 and |A| = 2 (and α and ε again chosen from their allowable values, as discussed above). In contrast to Figure 6.7, in Figure 6.8, each observation could be most likely in a random number of states, with the only restriction being that each observation had to be most likely in at least one environment state. All data points are averages over 50 random α-ε-POMDP environments. Note that the vertical axis of subfigure (c) in Figure 6.8 is on a log scale. The most important takeaways from these results are: 1) average runtime is clearly O(|Q|⁴|A|), which is consistent

Figure 6.8: Additional results for Optimal sPOMDP Learning (Algorithm 9) on increasingly large random (minimal) α-ε-POMDPs in which |O| = |Q|/2 and |A| = 2. In these environments, each observation covered a random number of environment states, with the only restriction being that each observation must be emitted as the most likely one in at least one state. The blue bars represent randomly constructed Moore machine environments (α = ε = 1). The red bars represent random α-ε-POMDPs in which α and ε were randomly chosen uniformly from their allowable values (see the discussion above and Equation 6.6). Subfigure (a) provides runtime results. Subfigure (b) provides the average length of the longest outcome sequence of any state m ∈ M in the agent's model. Subfigure (c) provides the average number of model states |M| learned by the agent. All data points are averages over 50 runs of Algorithm 9.
with our theoretical analysis (see Chapter 7.2.3); 2) the average longest outcome sequence in the model is always far below the provable upper bound of length |Q| (again see Chapter 7.2.3), and no outcome sequence ever exceeded this theoretical upper bound in any trials; 3) the agent always learns exactly the same number of model states as environment states, which is expected because the environments are guaranteed to be minimal Moore machine and α-ε-POMDP environments. Though not shown in the figures, the model error of each learned model (Equation 6.4) was 0.0 over 10000 timesteps. In other words, empirically, these models induced exactly the same probability distributions over observations at each time step as the environment. This is in accordance with our theoretical results in Chapter 7.2.3 that this is a provably-optimal procedure for learning a perfect model.

6.3.4 PSBL for sPOMDPs

We now discuss how to actively and incrementally learn approximate sPOMDP models of unknown Moore machine and rewardless POMDP environments without resorting to an oracle that gives the agent the probabilities of transitioning between model states (or the values of α, ε, and L) and without requiring that the environment be an α-ε-POMDP. We discuss how several key operations involved in learning sPOMDPs can be usefully framed as approximate Bayesian inference. This framing provides us with per-step updates that can be applied inside the general PSBL framework introduced in Chapter 4.5. The high-level pseudocode for applying the PSBL framework to learning sPOMDP models is provided in Algorithm 10. In line 1, we initialize sPOMDP model M. The specifics of this initialization are provided in Algorithm 11. In line 1 of Algorithm 11, we create an empty sPOMDP model. In line 2, we create an empty trie, which we will use to organize the outcome sequences of the model states in M for efficient prefix searching.
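A trie over outcome-sequence tokens supports the prefix queries this bookkeeping needs. The class below is a minimal sketch of such a structure (the name `OutcomeTrie` and its interface are our own assumptions; the text does not specify the trie's implementation):

```python
class OutcomeTrie:
    """Minimal trie over outcome-sequence tokens (observations and
    actions interleaved), sketching the prefix index described for
    Algorithm 11 (M.outcomeTrie)."""

    def __init__(self):
        self.children = {}
        self.state_id = None  # model state whose outcome sequence ends here

    def insert(self, outcome, state_id):
        node = self
        for tok in outcome:
            node = node.children.setdefault(tok, OutcomeTrie())
        node.state_id = state_id

    def states_with_prefix(self, prefix):
        """Return all model states whose outcome sequence starts with prefix."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        out, stack = [], [node]
        while stack:  # collect every model state below this node
            n = stack.pop()
            if n.state_id is not None:
                out.append(n.state_id)
            stack.extend(n.children.values())
        return out

# The Shape example's model states, stored in the forward direction:
trie = OutcomeTrie()
trie.insert(("square", "y", "diamond", "x", "diamond"), "m1")
trie.insert(("square", "y", "diamond", "x", "square"), "m2")
trie.insert(("diamond", "x", "diamond"), "m3")
trie.insert(("diamond", "x", "square"), "m4")
```

Storing the sequences in the forward direction means a query by the current observation (the first element of each outcome sequence) immediately narrows the candidate model states.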
The outcome sequences will be stored in M.outcomeTrie in a forward direction this time, rather than a reverse direction, as in Chapter 5.3.2. In line 3, we initialize the agent's belief state over its current model state, M.beliefState. In lines 5-14, we create a new model state newMState for every unique observation o ∈ O. The outcome sequence of each of these states is simply the observation they represent (line 6). We add these new model states to the set of model states M in line 8 and hash them on stateCount in M.idsToStates (line 10) and a string representation of their trajectory in M.stringsToStates (line 11). In line 9, we add the outcome sequence of newMState to M.outcomeTrie. In lines 12-13, we assign probability 1 to the model state matching the current environment observation in M.beliefState.

Algorithm 10: PSBL Learning of sPOMDP Models
Input: numActions: number of actions to perform; explore: probability of executing random actions; A: actions; O: observations; patience: number of splits to wait before a new minimum surprise model.
Output: M: sPOMDP model of unknown environment E
 1  M := initializeModel(A, O)    /* Algorithm 11 */
 2  minSurpriseModel := null
 3  minSurprise := maxFloatValue()
 4  splitsSinceMin := 0
 5  policy := queue()
 6  foundSplit := true
 7  while foundSplit do
 8      for i ∈ {1, …, numActions} do
 9          if policy.empty() then
                /* Add actions of an SDE to the policy or random actions */
10              policy := updatePolicy(M, explore, policy)
11          action := policy.pop()
12          prevOb := E.currentObservation()
13          nextOb := E.takeAction(action)
14          M.updateModelParameters(action, prevOb, nextOb)    /* Algorithm 13 */
15      newSurprise := M.computeSurprise()
16      if newSurprise < minSurprise then
17          minSurprise := newSurprise
18          minSurpriseModel := M.copy()
19          splitsSinceMin := 0
20      else
21          splitsSinceMin += 1
22      if splitsSinceMin > patience then
23          break
24      foundSplit := M.trySplit()    /* Algorithm 18 */
25  return minSurpriseModel
In line 15, we initialize transition counters, M.TCounts, for all possible triples (m, a, m′) ∈ M × A × M. All these counters are set to 1, initially. In line 16, we initialize model transition probability distributions, M.T, which are computed easily from M.TCounts (line 19, see Algorithm 12). In line 17, we initialize one-step extension counters, M.OneTCounts. In line 18, we initialize associated one-step extension probability distributions M.OneT, which are calculated using M.OneTCounts (line 20) in a way that is analogous to Algorithm 12. Both are empty initially, because they are filled in dynamically during the learning procedure (as discussed later). They are used in the computation of gain for model splitting purposes in a way that is somewhat similar to the computation of gain in Chapter 5.3.2. In lines 21-23, we initialize deques for M.actionHistory, M.observationHistory, and M.beliefHistory to maintain sliding windows of the agent's actions, observations, and belief states, respectively. M.observationHistory is initialized with the current environment observation (line 22), and M.beliefHistory is initialized with a copy of the agent's current belief state, M.beliefState (line 23). Returning to Algorithm 10, after model initialization, we initialize minSurpriseModel (line 2) and minSurprise (line 3) to track the model of minimal surprise found during the PSBL learning procedure. In line 4, we initialize splitsSinceMin to 0. This counter tracks the number of model splits that have occurred since minSurpriseModel was updated to a new model. If this quantity exceeds our patience in line 22 (given as an input hyperparameter), we terminate learning. This can be thought of as a procedure analogous to early stopping in training deep neural networks [21]. Next, we enter the main while loop of the PSBL learning algorithm (lines 7-24).
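The minimum-surprise tracking and patience-based termination just described can be sketched as follows. This is a minimal illustrative Python skeleton, not the dissertation's implementation; `train_step` and `compute_surprise` are hypothetical callables standing in for the model-update and surprise computations of Algorithm 10.

```python
import math

def learn_with_patience(models, compute_surprise, patience):
    """Skeleton of the patience logic in Algorithm 10: keep the model of
    minimal surprise seen so far; stop once `patience` consecutive splits
    fail to produce a new minimum (analogous to early stopping)."""
    min_surprise = math.inf
    best_model = None
    splits_since_min = 0
    for model in models:                 # one model per split attempt
        surprise = compute_surprise(model)
        if surprise < min_surprise:
            min_surprise = surprise
            best_model = model
            splits_since_min = 0
        else:
            splits_since_min += 1
            if splits_since_min > patience:
                break                    # patience exceeded: terminate
    return best_model

# Toy run: surprise dips at the third model (index 2), then rises again.
surprises = [0.9, 0.6, 0.4, 0.5, 0.7, 0.8]
best = learn_with_patience(range(len(surprises)),
                           lambda m: surprises[m], patience=2)
```

With patience 2, learning stops after the two rising surprise values that follow the minimum, and the model at the minimum is returned.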
We continue executing this while loop (each time splitting an existing sPOMDP model state into two new model states, line 24) until we can find no more evidence that splitting any existing model state in M is likely to reduce model surprise (or our patience is exceeded). In line 9, we determine whether we still have actions available in our policy, which is organized as a queue of actions to take. If we do not have any actions left in our policy, we update this policy in line 10. This policy is updated either with the actions of a random SDE consistent with the current environment observation or a sequence of random actions of the same length. The probability of appending a random action sequence to policy (rather than the actions of an SDE) is explore, which is an algorithm hyperparameter. Based on our extensive experimentation, this learning procedure does not seem to be very sensitive to the particular value of explore, provided it is not extremely close to 1 or extremely close to 0. In practice, we typically leave this value at its default of 0.5.
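The policy refill of line 10 can be sketched as follows; this is a simplified stand-in (the argument names and the flat action pool are assumptions for illustration, not the dissertation's updatePolicy signature), showing only the explore coin flip between SDE actions and a random sequence of the same length.

```python
import random

def update_policy(sde_actions, action_pool, explore, rng=random.Random(0)):
    """Sketch of the policy refill in line 10 of Algorithm 10.
    With probability `explore`, queue a random action sequence of the
    same length as the SDE; otherwise queue the SDE's own actions."""
    if rng.random() < explore:
        return [rng.choice(action_pool) for _ in sde_actions]
    return list(sde_actions)

# With explore = 0.0 the SDE's actions are always used verbatim.
policy = update_policy(['x', 'y', 'x'], ['x', 'y'], explore=0.0)
```

Either branch yields a sequence of the same length, so the amount of experience gathered per refill does not depend on the coin flip.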
Algorithm 11: Initialize sPOMDP Model
Input: A: actions; O: observations
Output: M: initialized sPOMDP model
 1  M := sPOMDPModel()
 2  M.outcomeTrie := Trie()
 3  M.beliefState := 1 × |O| array of zeros
 4  stateCount := 0
 5  for o ∈ O do
 6      trajectory := [o.id]
 7      newMState := ModelState(stateCount, trajectory)
 8      M.M.addState(newMState)
 9      M.outcomeTrie.add(trajectory)
10      M.idsToStates[stateCount] := newMState
11      M.stringsToStates[toString(trajectory)] := newMState
12      if o == E.currentObservation() then
13          M.beliefState[stateCount] := 1.0
14      stateCount += 1
15  M.TCounts := |M| × |A| × |M| array of ones
16  M.T := |M| × |A| × |M| array of zeros
17  M.OneTCounts := {}
18  M.OneT := {}
19  updateTransitionProbabilities()          /* Algorithm 12 */
20  updateOneStepProbabilities()
21  M.actionHistory := {}
22  M.observationHistory.append(E.currentObservation())
23  M.beliefHistory.append(M.beliefState.copy())
24  return M

Algorithm 12: Update Transition Probabilities
 1  for m in M.M do
 2      for a in M.A do
 3          total := 0.0
 4          for m′ in M.M do
 5              total += M.TCounts[m][a][m′]
 6          for m′ in M.M do
 7              M.T[m][a][m′] := M.TCounts[m][a][m′] / total

In lines 8-14 of Algorithm 10, we perform numActions actions from our policy in the environment to further train the sPOMDP model transition distributions (M.T and their one-step extension distributions M.OneT). After each action is taken and each previous observation (prevOb) and resulting observation (nextOb) is recorded (lines 11-13), we update the parameters of model M based on this new agent experience (line 14). The pseudocode of the model parameter update of line 14 is provided in Algorithm 13.
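Algorithm 12's count normalization can be sketched in Python as follows. This is an illustrative rendering using nested dicts in place of the 3D count array; the data layout is an assumption, not the dissertation's data structure.

```python
def update_transition_probabilities(t_counts):
    """Normalize Dirichlet hyperparameter counts into per-(m, a)
    categorical transition distributions, as in Algorithm 12.
    t_counts[m][a][m2] holds soft counts (initialized to 1)."""
    T = {}
    for m, by_action in t_counts.items():
        T[m] = {}
        for a, row in by_action.items():
            total = sum(row.values())            # lines 3-5
            T[m][a] = {m2: c / total for m2, c in row.items()}  # lines 6-7
    return T

# One state-action pair with counts 1 and 3 yields probabilities 0.25, 0.75.
counts = {'s0': {'x': {'s0': 1.0, 's1': 3.0}}}
T = update_transition_probabilities(counts)
```

Because the counts start at 1 (the uniform Dirichlet prior), every transition probability remains strictly positive, avoiding zero-count problems.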
Algorithm 13: Update sPOMDP Model Parameters
Input: a: action taken; prevOb: previous observation; nextOb: resulting observation
 1  M.actionHistory.append(a)
 2  M.observationHistory.append(nextOb)
 3  history := M.actionHistory + M.observationHistory interleaved in action-observation pairs
 4  if history.length() ≥ M.maxOutcomeLength + 6 then
 5      M.beliefHistory := smoothBeliefHistory(history, M.beliefHistory)   /* Section 6.3.4.2, Algorithm 15 */
 6      updateTransitionFunctionPosteriors(a, nextOb, M.beliefHistory)     /* Section 6.3.4.3, Algorithm 16 */
 7      updateOneStepFunctionPosteriors(history, M.beliefHistory)          /* Section 6.3.4.4, Algorithm 17 */
 8      M.actionHistory.popleft()
 9      M.observationHistory.popleft()
10  M.beliefState := updateBeliefState(M.beliefState, a, nextOb)           /* Section 6.3.4.1, Algorithm 14 */
11  M.beliefHistory.append(M.beliefState.copy())
12  if M.beliefHistory.size() > M.actionHistory.size() then
13      M.beliefHistory.popleft()

In lines 1-2 of Algorithm 13, the agent appends the most recent action a and the resulting observation nextOb to M.actionHistory and M.observationHistory, respectively. In line 3, these action and observation deques are concatenated and interleaved into history such that history consists of time-ordered action-observation pairs. Once we have built up a sufficiently long history (3 time steps, or 6 elements, past the longest model state outcome sequence, M.maxOutcomeLength), we can smooth the agent's belief state history, M.beliefHistory (line 5, discussed in detail in Section 6.3.4.2 and Algorithm 15). Intuitively, we smooth the agent's beliefs over its previous model states by considering which model state outcome sequences are consistent with the actions and observations it subsequently experienced in history. In line 6, we perform an update of our model transition probability distributions using approximate Bayesian inference. This is discussed in detail in Section 6.3.4.3 and Algorithm 16.
In line 7, we perform a similar update for one-step extension probability distributions (Section 6.3.4.4 and Algorithm 17). In lines 8-9, we remove the oldest action and observation from M.actionHistory and M.observationHistory, in order to maintain a fixed-size window of agent experience history. In line 10, we perform an approximate Bayesian update of the agent's current M.beliefState based on the action a and resulting observation nextOb. We discuss this procedure in Section 6.3.4.1 and Algorithm 14. Finally, we append a copy of this updated belief state to M.beliefHistory and pop the oldest belief state from M.beliefHistory (if necessary) in order to ensure that M.beliefHistory, M.actionHistory, and M.observationHistory are always the same size (lines 11-13). The next four subsections detail the procedures outlined in Algorithm 13, several of which can be usefully framed as approximate Bayesian inference.

6.3.4.1 sPOMDP Belief Updating

The sPOMDP model relies on the probabilities of transitions between model states represented by predictive sequences of actions and observations; it therefore seems a reasonable approximation to consider learning only the transitions between model states, with the unreliability of our sensors simply adding uncertainty to these transitions. We incrementally learn these transition parameters while simultaneously localizing the agent using an approximate Bayesian update executed every time an action is taken and an observation is made. We place uninformative prior distributions over the agent's initial model state (belief distribution b_{M,0}) and the transition functions P(T_0^{ma}) for each pair (m, a) ∈ M × A under the rather common assumption [124] that these parameters are all mutually independent of one another.
b_{M,0} is defined as a uniform categorical distribution over possible model states (|O| of them, since |M| = |O| initially), while each of the |M| · |A| transition function priors P(T_0^{ma}) = Dir(|M|; α_0^{mam_1} = 1, ..., α_0^{mam_|M|} = 1) is a Dirichlet distribution with hyperparameters all initialized to 1 (indicating uniform uncertainty over transition functions)².

² It should be noted, here, that it is not a requirement that all Dirichlet hyperparameters be set to 1. If we have reason to suspect that the agent's environment will have high determinism, for example, we can set these initial parameters to be much less than 1 (but still equal to one another). This encodes a prior that is quickly overwhelmed by the data (while still avoiding zero count problems), which is justifiable if, in fact, the environment exhibits high determinism. These hyperparameter magnitudes become more important as the environment becomes larger, because it is more difficult for the agent to experience a large number of transitions many times to gather statistics.

In practice, performing an exact Bayesian update of the agent's model state beliefs b_{M,t} after taking action a and observing o is too expensive, even in small environments. Instead, we use the following approximation:

    b_{M,t}(m′) ∝ 1_{o∼m′}(m′) · Σ_{m∈M} T_t^{mam′} · b_{M,t−1}(m)    (6.7)

1_{o∼m′}(m′) is the indicator function for the event o ∼ m′, which is 1 when o is visibly equivalent to m′ (matches the first observation of its outcome sequence) and 0 otherwise. Let α_t^{ma} = Σ_{i=1}^{|M|} α_t^{mam_i} represent the sum of the hyperparameter counts of Dirichlet posterior P(T_t^{ma}). Let T_t^{ma} represent a categorical random variable in which the probability of each m′ ∈ M (denoted T_t^{mam′}) is defined by the expected value of the corresponding coordinate in P(T_t^{ma}), α_t^{mam′} / α_t^{ma}. Equation 6.7 is an approximation of the traditional POMDP belief update equation in which we replace the probability of making observation o in model state m′ with an indicator function that zeros out the belief of being in states inconsistent with the current observation. Thus, we define our model M's emission probabilities as P(o|m′) = 1_{o∼m′}(m′) for all m′ ∈ M, o ∈ O. This ignores the probability of being in some state and seeing one of its least likely observations (e.g., being in states I or II of the α-ε Shape environment of Figure 6.1(b) yet seeing the unlikely observation), but it is a reasonable approximation when observation probabilities are not too noisy (i.e., the agent's sensors are fairly reliable). Pseudocode for this update is provided in Algorithm 14. In lines 1-4 of Algorithm 14, we compute the joint distribution over possible model states at times t−1 and t (given the agent's history) by multiplying the probability of transitioning from state m to state m′ under action a (M.T[m][a][m′]) by the current belief that the agent is in model state m (given its history), b[m]. We do this for all (m, m′) ∈ M × M. In lines 5-8, we marginalize out over the state at time t−1, leaving an updated distribution over possible model states at time t (after taking into account that the agent executed action a), b′. In lines 9-11, we apply our observation model approximation. For each m ∈ M, we multiply b′[m] by 1 if observation o (also given) matches the first observation of m's outcome sequence (i.e., if o is visibly equivalent to m). Otherwise, we multiply b′[m] by 0. To reiterate, this is an approximation that zeros out the agent's belief of being in any model states whose most likely observations are inconsistent with o. Lines 12-16 simply re-normalize the result into a probability distribution over model states at time t (after having taken into account both the agent's action a at time t−1 and resulting observation o at time t).

Algorithm 14: sPOMDP Belief Update
Input: b: current model state beliefs; a: action; o: observation
Output: b′: updated belief state
 1  joint := |M| × |M| 2D array
 2  for m in M.M do
 3      for m′ in M.M do
 4          joint[m][m′] := M.T[m][a][m′] · b[m]
 5  b′ := 1 × |M| array of zeros
 6  for m′ in M.M do
 7      for m in M.M do
 8          b′[m′] += joint[m][m′]
 9  for m in M.M do
10      multFactor := int(m.firstObservation == o)
11      b′[m] *= multFactor
12  total := 0.0
13  for m in M.M do
14      total += b′[m]
15  for m in M.M do
16      b′[m] /= total
17  return b′

6.3.4.2 SDE-Based Belief Smoothing in sPOMDPs

As it turns out, Equation 6.7 (and Algorithm 14) are sufficient when applying an already-learned sPOMDP model to decision making or planning tasks, but they are not enough to break symmetry between identical-looking states during the learning process. This is to be expected, because sPOMDP model states are associated with (in general) multi-action SDEs. It is the results of these SDEs (outcome sequences) that statistically distinguish between identical-looking states. Thus, we need a way for the agent to observe the results of these SDEs before updating the parameters of its model. The way to solve this problem is to maintain a history of agent actions and observations (similar to what was done in the PSBL for PSDEs learning procedure detailed in Chapter 5.3.2). The primary difference here is that, not only do we maintain a history of previous agent actions and observations, but we also maintain the agent's belief state at each point in this history (according to Equation 6.7 and Algorithm 14). We then smooth each of these belief states (starting at the one corresponding to the first observation in history) by zeroing out the agent's belief that it is in any model state whose outcome sequence is not consistent with the subsequent agent actions and observations in history. We use these smoothed belief estimates to update model parameters (lines 6-7, Algorithm 13). We call this process SDE-based belief smoothing.
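The predict-correct structure of Algorithm 14 can be sketched compactly in Python. This is a minimal sketch (the state names and dict-based layout are illustrative assumptions, not the dissertation's implementation); `first_obs[m]` stands in for the first observation of m's outcome sequence.

```python
def update_belief(b, a, o, T, first_obs):
    """Approximate Bayesian belief update of Equation 6.7 / Algorithm 14.
    b: dict state -> belief; T[m][a][m2]: transition probabilities;
    first_obs[m2]: first observation of m2's outcome sequence."""
    states = list(b)
    # Predict: marginalize the joint over the previous state (lines 1-8).
    b_pred = {m2: sum(T[m][a][m2] * b[m] for m in states) for m2 in states}
    # Correct: zero out states visibly inconsistent with o (lines 9-11).
    b_new = {m2: (p if first_obs[m2] == o else 0.0)
             for m2, p in b_pred.items()}
    # Renormalize (lines 12-16).
    total = sum(b_new.values())
    return {m2: p / total for m2, p in b_new.items()}

T = {'s0': {'x': {'s0': 0.1, 's1': 0.9}},
     's1': {'x': {'s0': 0.8, 's1': 0.2}}}
first_obs = {'s0': 'square', 's1': 'diamond'}
b = update_belief({'s0': 1.0, 's1': 0.0}, 'x', 'diamond', T, first_obs)
```

Starting certain in s0, taking x, and seeing the diamond observation leaves all belief on s1, since s0 is visibly inconsistent with the observation.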
In order to facilitate this type of smoothing efficiently, the agent maintains a trie of model state outcome sequences that is very similar to the PSDETrie maintained by PSBL for PSDEs in Algorithm 3 (and illustrated in Figure 5.5). We call this an outcome trie (maintained in M.outcomeTrie, see Algorithm 11). The difference, here, is that, at each node of this trie, we maintain a list of all the outcome sequences of model states consistent with traversing the trie up to that trie node. Additionally, these outcome sequences are not reversed as they are in Algorithm 3 in Chapter 5.3.2, because we are interested here in a forward-looking consistency of subsequent (not previous) actions and observations (i.e., a check for matching prefixes). Thus, in time linear in the number of steps of history maintained, the agent can retrieve the outcome sequences of all model states consistent with its history. It can then zero out belief in all model states whose outcome sequences are not in this list of consistent outcome sequences. Figure 6.9 illustrates an example of this kind of trie in the α-ε Shape environment (Figure 6.1(b)). Algorithm 15 provides the pseudocode for SDE-based belief smoothing. In Algorithm 15, we process the agent's oldest 3 belief states. Analogously to the previous chapter (Chapter 5), the agent maintains history and beliefHistory as fixed-size windows. The
The 138 { y x } { y x } m1: m2: { x } { x } m3: m4: {x} s1: {y x} s2: Model States: SDEs: Transition Distributions: P({m1, m2, m3, m4}|m1, x) P({m1, m2, m3, m4}|m2, x) P({m1, m2, m3, m4}|m3, x) P({m1, m2, m3, m4}|m4, x) P({m1, m2, m3, m4}|m1, y) P({m1, m2, m3, m4}|m2, y) P({m1, m2, m3, m4}|m3, y) P({m1, m2, m3, m4}|m4, y) P({ }|m1) Observation Distributions: P({ }|m2) P({ }|m3) P({ }|m4) (a) sPOMDP model of the -Shape environment t+2 t+1 t x { x } { x } { x } { x } y x { y x } { y x } { y x } { y x } { y x } { y x } Input: { y } (b) Trie with consistent sequences at each node Figure 6.9: An illustration of an sPOMDP model of the -Shape environment of Figure 6.1(b) (subgure (a)) and the associated outcome trie (subgure (b)) which provides the agent with the outcome sequences of model states consistent with its history. If the full agent history is not represented in the outcome trie, the sequences consistent with the last valid traversed trie node are returned. Even though it is not visualized in the gure (to avoid unnecessary clutter), consistent outcome sequences are stored at both action and observation trie nodes. Algorithm 15: Smooth Belief History Input: history: history of agent actions and observations; beliefHistory: history of agent belief states. Output: smoothedBeliefHistory: smoothed belief states. 1 for i inf0; 1; 2g do 2 savedBeliefs := beliefHistory[i].copy() 3 matching :=M.outcomeTrie.consistentSequences(history[2i:]) 4 beliefHistory[i] := 1jMj array of zeros 5 for match in matching do 6 matchingState :=M.stringsToStates[match] 7 beliefHistory[i][matchingState] := savedBeliefs[matchingState] 8 total := 0:0 9 for m inM.M do 10 total += beliefHistory[i][m] 11 for m inM.M do 12 beliefHistory[i][m] /= total 13 return beliefHistory reason for processing 3 belief states will become clear in Sections 6.3.4.3 and 6.3.4.4 when we discuss inferring transition (and one step extension) probability distribution posteriors. 
For each of these 3 belief states, we first make a copy (line 2) and use M.outcomeTrie (see Figure 6.9) to retrieve the list of model state outcome sequences consistent with agent history from index 2i to the end of history (line 3). For example, to process beliefHistory[0] (which, according to Equation 6.7, has already been processed to assign non-zero probability only to model states matching the observation at history[0]), we traverse M.outcomeTrie (starting at history[0]) until we reach the end of history or until we reach a point in M.outcomeTrie where there are no longer valid nodes in the trie consistent with history. We return the outcome sequences consistent with the last valid node encountered during this traversal (see Figure 6.9(b) for an example). We do the same for beliefHistory[1] beginning at history[2] and beliefHistory[2] beginning at history[4]. We ensure that history is at least 3 time steps (6 elements) longer than the longest model state outcome sequence (M.maxOutcomeLength), such that history extends M.maxOutcomeLength elements past beliefHistory[2], which is the last belief state smoothed at each time step. This is the reason for the +6 in line 4 of Algorithm 13. In this way, if the agent's subsequent experience from any of these 3 belief states does exactly match a model state outcome sequence, the agent will be able to detect this and assign probability 1 to being in the associated model state in the appropriate beliefHistory. This is crucial to using SDEs as a localization aid. Line 4 zeros out all the entries in beliefHistory[i], but they have been saved in savedBeliefs. In lines 5-7, we iterate over all the consistent model outcome sequences in matching. For each such match, we find the model state matchingState ∈ M.M associated with this outcome sequence via a map from outcome sequence strings to model states (M.stringsToStates, line 6).
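The prefix-consistency check that the outcome trie performs can be sketched as follows. For brevity this sketch uses a linear scan over outcome sequences rather than an actual trie, and the interleaved observation/action encoding is an illustrative assumption; the trie achieves the same result in time linear in the history length.

```python
def consistent_sequences(outcome_sequences, history):
    """Return the outcome sequences consistent with the agent's
    subsequent history, as M.outcomeTrie.consistentSequences would.
    Both arguments are interleaved observation/action lists; a sequence
    is consistent if it matches history element-by-element up to the
    length of the shorter of the two (a prefix match)."""
    matches = []
    for seq in outcome_sequences:
        n = min(len(seq), len(history))
        if seq[:n] == history[:n]:
            matches.append(seq)
    return matches

# Hypothetical one- and three-element outcome sequences ('sq'/'dia'
# observations, 'x'/'y' actions) checked against a 5-element history.
seqs = [['sq'], ['dia'], ['dia', 'x', 'dia']]
hist = ['dia', 'x', 'sq', 'y', 'dia']
matching = consistent_sequences(seqs, hist)
```

Here only ['dia'] survives: ['sq'] conflicts at the first observation, and ['dia', 'x', 'dia'] conflicts at its third element. Belief in the states whose sequences were filtered out would then be zeroed, as in lines 4-7 of Algorithm 15.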
We then copy the value in savedBeliefs associated with matchingState back to beliefHistory[i] in line 7. After the for loop ending in line 7, the only model states still with non-zero probability in beliefHistory[i] are those whose outcome sequences are consistent with history (from element 2i to the last valid trie node in M.outcomeTrie). However, beliefHistory[i] is no longer necessarily normalized. Lines 8-12 renormalize beliefHistory[i] such that it is a valid probability distribution over model states (i.e., a belief state). The smoothed beliefs are returned in line 13.

6.3.4.3 sPOMDP Transition Function Updating

As was the case with belief updating (Equation 6.7), an exact Bayesian update of transition function posteriors is also impractical. We instead perform the following online update, which approximates the posterior over the transition function associated with each (m, a) pair by considering the possible transitions between model states that could have led to observing o after taking action a and weighting each proportionally to how probable it is (according to our current model parameters). Recall that P(T_{t−1}^{ma}) = Dir(|M|; α_{t−1}^{mam_1}, ..., α_{t−1}^{mam_|M|}). We perform what we call a soft count update for each possible m to m′ transition under a:

    α_t^{mam′} = α_{t−1}^{mam′} + η · 1_{o∼m′}(m′) · T_{t−1}^{mam′} · b_{M,t−1}(m)    (6.8)

Equation 6.8 increments the counter of each possible transition from m to m′ under a by a weight proportional to the likelihood of this transition. The factor η normalizes these soft counts to sum to 1, and P(T_t^{ma}) = Dir(|M|; α_t^{mam_1}, ..., α_t^{mam_|M|}). Note that only the components α_t^{mam′} such that m′ is visibly equivalent to o will actually be changed. Algorithm 16 gives the pseudocode for performing this soft count update of sPOMDP transition function parameters, M.TCounts (which maintains the set of counts α_t^{mam_i} for all pairs (m, a) ∈ M × A at time t).
Algorithm 16: Transition Function Posteriors Update
Input: a: action taken; o: observation; beliefHistory: history of belief states.
 1  counts := an |M| × |M| array of zeros
 2  totalCounts := 0.0
 3  for m in M.M do
 4      for m′ in M.M do
 5          multFactor := int(m′.firstObservation == o)
 6          counts[m][m′] := multFactor · M.T[m][a][m′] · beliefHistory[0][m]
 7          totalCounts += counts[m][m′]
 8  for m in M.M do
 9      for m′ in M.M do
10          counts[m][m′] /= totalCounts
11          M.TCounts[m][a][m′] += counts[m][m′]
12  updateTransitionProbabilities()   /* Update M.T using new values of M.TCounts, Algorithm 12 */

Algorithm 16 is a straightforward implementation of Equation 6.8. Lines 3-7 compute the unnormalized soft counts associated with each possible transition from m to m′ under action a. If m′ is visibly equivalent to observation o, the soft count associated with the pair (m, m′) is proportional to M.T[m][a][m′] · beliefHistory[0][m], which is an approximation of the likelihood of the agent making the transition from model state m to m′ under action a conditioned on its history (according to the current transition probabilities, M.T). Note the use of our first smoothed belief state beliefHistory[0] in line 6 (see Section 6.3.4.2 and Algorithm 15). In lines 8-11, we normalize counts by dividing by the sum total of all the soft counts and add each normalized soft count in counts to the appropriate entry in M.TCounts. In line 12, we use the new expected values of the Dirichlet posterior components in M.TCounts to update transition probabilities M.T for each (m, a) pair in M × A (Algorithm 12). As a final note, in practice, setting multFactor to be beliefHistory[1][m′] in line 5 generally performs better than Algorithm 16 as presented. This is difficult to justify in a completely formal way, but it makes intuitive sense. Like beliefHistory[0], beliefHistory[1] has been smoothed according to future agent actions and observations (see Algorithm 15).
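The soft count update of Equation 6.8 / Algorithm 16 can be sketched in Python as follows. This is a minimal illustrative sketch with dict-based counts and hypothetical state names, not the dissertation's implementation.

```python
def soft_count_update(t_counts, b_prev, a, o, T, first_obs):
    """Soft count update of Equation 6.8: each plausible m -> m2
    transition under a receives a normalized share of one new count,
    weighted by its likelihood under the current model."""
    states = list(b_prev)
    raw = {(m, m2): (T[m][a][m2] * b_prev[m]
                     if first_obs[m2] == o else 0.0)   # indicator 1_{o~m2}
           for m in states for m2 in states}
    total = sum(raw.values())                          # eta normalizer
    for (m, m2), w in raw.items():
        t_counts[m][a][m2] += w / total                # counts sum to +1
    return t_counts

counts = {m: {'x': {'s0': 1.0, 's1': 1.0}} for m in ('s0', 's1')}
T = {'s0': {'x': {'s0': 0.1, 's1': 0.9}},
     's1': {'x': {'s0': 0.8, 's1': 0.2}}}
first_obs = {'s0': 'square', 's1': 'diamond'}
counts = soft_count_update(counts, {'s0': 1.0, 's1': 0.0},
                           'x', 'diamond', T, first_obs)
```

With the agent certain it was in s0 and the diamond observation ruling out s0 as a destination, the full unit of count mass lands on the s0 -> s1 entry; all other counts keep their prior value of 1.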
Multiplying by beliefHistory[1][m′] in line 6 effectively restricts the agent's transition probability updates at each time step to pairs of model states (m, m′) ∈ M × M consistent with all its known subsequent actions and observations (recorded in history). Such an update makes full use of the results of SDE-based smoothing. This substitution is particularly effective when the environment has a high level of determinism.

6.3.4.4 sPOMDP One-Step Transition Function Posteriors Update

Recall from Chapter 5.3.2.3 (and, more specifically, Equation 5.21) that one of the most important aspects of PSBL is the idea of using one-step extension distributions to determine which parts of agent model M should be changed in order to reduce model surprise. The key insight to solving this problem for sPOMDPs is to recall that the purpose of each model state m′ ∈ M is to act as a sufficient statistic of agent history. The probability that the agent transitions from model state m′ to model state m″ under action a′ should not be altered by considering the model states, actions, and observations encountered by the agent before it transitioned into state m′. In other words, the state space M of sPOMDP M should satisfy the transition Markov property [88] as closely as possible. Intuitively, the agent can use deviations from the transition Markov property that suggest a likely reduction in total model surprise to guide model splitting. Before discussing this, however, we must formally define model surprise for sPOMDP models in terms of the model parameters discussed in Section 6.3.4.3. Recall that we define α_t^{ma} = Σ_{i=1}^{|M|} α_t^{mam_i} as the sum of the hyperparameter counts of Dirichlet posterior P(T_t^{ma}), and let α_t = Σ_{m,a∈M×A} α_t^{ma}. Recall also that we define T_t^{ma} as a categorical random variable in which the probability of each m′ ∈ M is defined by the expected value of the corresponding coordinate in P(T_t^{ma}), α_t^{mam′} / α_t^{ma}.
The total model surprise of model M, denoted S(M), is defined as:

    S(M) ≜ Σ_{m,a∈M×A} H̄(T_t^{ma}) · (α_t^{ma} / α_t)    (6.9)

H̄(T_t^{ma}) is the normalized entropy of T_t^{ma} (the categorical probability distribution defined by the expected values of the components in Dirichlet posterior P(T_t^{ma})), where the normalization is performed by dividing each entropy H(T_t^{ma}) by the entropy of a uniform distribution over |M| elements, where |M| is the number of states currently in model M (and, thus, the number of possible values each random variable T_t^{ma} can take on). Let T_t^{mam′a′} represent an estimate at time t of the probability distribution over model states m″ ∈ M resulting from executing the action a′ in model state m′, given that the state-action pair that caused the transition to model state m′ was (m, a). We call this a one-step extension of the transition distribution corresponding to the model state-action pair (m′, a′), denoted T_t^{m′a′} as in the previous section. There are |M| · |A| such one-step extensions for each such model state-action pair (m′, a′) ∈ M × A. Let w_{ma} denote the probability that one-step extension (m, a) causes the transition into state m′. We define the one-step transition gain for this transition as:
A negative gain indicates that the agent becomes more unsure about the resulting model statem 00 when considering another step of its history (which can happen with poor probability estimates). A positive gain indicates that the expected uncertainty the agent faces over the resulting model state m 00 decreases when the agent considers another step of history, providing strong evidence that state m 0 is not a good approximate sucient statistic of agent history and that splitting this state is likely to result in a reduction in the entropy of this transition (and, likely, a reduction in total model surprise, S(M)). WhenG m 0 a 0 is larger than a gain threshold hyperparameter (M.minGain) the agent splits model state m 0 into two new model states (discussed next in Section 6.3.4.5). Since the magnitudes of the entropies calculated in equation 6.10 depend on the current size of the model state setjMj, we normalize each entropy value to lie in the range [0; 1] by dividing it by the entropy of a uniform distribution overjMj possible values (i.e., the distribution overjMj possible values with maximal entropy). This allows us to use a xed value ofM.minGain that does not need to increase asjMj increases. These one-step distributions T mam 0 a 0 t can be learned in a Bayesian fashion by a procedure that is analogous to the way in which transition function posteriors were approximately inferred in Section 6.3.4.3 (Algorithm 16). For each (m 0 ;a 0 )2MA pair, we initialize an uninformative Dirichlet prior over each of itsjMjjAj one-step extensions, P (T mam 0 a 0 0 ) = Dir(jMj; mam 0 a 0 m1 0 = 144 1;:::; mam 0 a 0 m jMj 0 = 1). We again assume that all these one-step extension parameters are inde- pendent of one another and independent of the transition function and model belief state param- eters learned by the procedures in Sections 6.3.4.1 and 6.3.4.3. 
The soft count update equation for approximately inferring one-step extension posteriors after the agent takes actions a and a′ and makes observations o and o′ is (analogously to Equation 6.8) simply:

    α_t^{mam′a′m″} = α_{t−1}^{mam′a′m″} + η · 1_{o′∼m″}(m″) · 1_{o∼m′}(m′) · T_t^{m′a′m″} · T_t^{mam′} · b_{M,t−2}(m)    (6.11)

Equation 6.11 increments the count associated with each possible sequence of transitions between three model states (m, m′, and m″) under actions a and a′ by a weighted value proportional to the likelihood of this sequence of transitions (according to the current transition probabilities of the model). Again, η normalizes these count updates to sum to 1, and:

    P(T_t^{mam′a′}) = Dir(|M|; α_t^{mam′a′m_1}, ..., α_t^{mam′a′m_|M|})

As with Equation 6.8, the only components α_t^{mam′a′m″} updated in this approximate Bayesian update will be those corresponding to model states m′ visibly equivalent to o and m″ visibly equivalent to o′. The model state beliefs of the agent used in this expression must be for time step t−2 (when the agent began this sequence of transitions), while the most up-to-date transition parameter estimates at time t are used (which we update at each time step directly before performing the update in Equation 6.11, see Algorithm 13). In practice, rather than storing Dirichlet distributions for all possible one-step extensions of each model state transition, we can generate them dynamically as the agent experiences them and only consider the experienced one-step extensions in Equations 6.10 and 6.11. This approach can dramatically reduce the amount of storage by ignoring one-step extensions that likely have low probability of occurring (and would thus likely not change the results of Equations 6.10 and 6.11 significantly). The pseudocode for the update in Equation 6.11 is provided in Algorithm 17.
Algorithm 17: One-Step Transition Function Posteriors Update
Input: history: history of agent action-observation pairs; beliefHistory: history of belief states.
 1  o := history[0]
 2  a := history[1]
 3  o′ := history[2]
 4  a′ := history[3]
 5  counts := an |M| × |M| × |M| array of zeros
 6  totalCounts := 0.0
 7  for m in M.M do
 8      for m′ in M.M do
 9          for m″ in M.M do
10              multFactor1 := int(m′.firstObservation == o)
11              multFactor2 := int(m″.firstObservation == o′)
12              counts[m][m′][m″] := multFactor2 · multFactor1 · M.T[m′][a′][m″] · M.T[m][a][m′] · beliefHistory[0][m]
13              totalCounts += counts[m][m′][m″]
14  for m in M.M do
15      for m′ in M.M do
16          for m″ in M.M do
17              counts[m][m′][m″] /= totalCounts
18              M.OneTCounts[m][a][m′][a′][m″] += counts[m][m′][m″]
19  updateOneStepProbabilities()   /* Update M.OneT using new M.OneTCounts, analogous to Algorithm 12 */

In lines 1-4 of Algorithm 17, we extract actions a and a′ and observations o and o′ from the agent's history. In lines 7-13, we compute the unnormalized soft counts for all possible sequences of 3 model states (m, m′, m″) ∈ M × M × M under the actions a and a′, analogously to Algorithm 16. In lines 10-11, we enforce that only counts corresponding to triples (m, m′, m″) such that m′ is visibly equivalent to o and m″ is visibly equivalent to o′ will be non-zero (similar, again, to Algorithm 16). In lines 14-18, we normalize these soft counts to sum to 1 and add the normalized soft counts to the appropriate entries in M.OneTCounts. Finally, in line 19, we update the corresponding one-step probability distributions M.OneT using the updated values of M.OneTCounts via a procedure analogous to that provided in Algorithm 12 (but that takes into account that some one-step extensions may not yet have been experienced). For the same reasons discussed in Section 6.3.4.3, setting multFactor1 to be beliefHistory[1][m′] and setting
For the same reasons discussed in Section 6.3.4.3, setting multFactor1 to be beliefHistory[1][m′] and setting multFactor2 to be beliefHistory[2][m″] typically leads to better performance, so we generally make this substitution in practice, even though it is difficult to justify in a completely formal sense.

6.3.4.5 Splitting sPOMDP Model States

We can now finally finish our discussion of Algorithm 10. In line 15, we compute the current model surprise of sPOMDP model M (see Equation 6.9 in Section 6.3.4.4). If it is less than minSurprise (line 16), we update minSurprise (line 17) and minSurpriseModel (line 18), and we reset splitsSinceMin to 0 (line 19). Otherwise, we increment splitsSinceMin in line 21, because a new minimal surprise sPOMDP was not found. If splitsSinceMin exceeds our patience for finding a new minimal surprise model, we exit the learning procedure and return the minimum surprise model found (lines 22-23, 25). The final remaining component of Algorithm 10 to be discussed is the model splitting procedure of line 24. The key idea behind this procedure is to find some model transition (m, a) ∈ M × A such that the estimated gain on that transition, G_{ma} (Equation 6.10), is larger than hyperparameter M.minGain. This indicates that the transition Markov property for state m ∈ M is likely being violated (to a high degree) under action a, and addressing this violation is likely to reduce total model surprise (Equation 6.9). To address this violation, we split state m into two latent states sharing the same first observation and action in a way that is very similar to the procedure in lines 12-17 of Algorithm 9. The details of this splitting procedure are given in Algorithm 18. In line 1, we compute the gain G associated with each (m, a) pair in M × A, and, in line 2, we sort these gains in increasing order of the length of m's outcome sequence. The idea is to generate shorter SDEs first, such that they can be used to discover longer ones.
It also has the effect of encouraging the agent to wait for more data before splitting states with longer outcome sequences. In lines 3-37, we iterate over these gains (and associated (m, a) pairs) in order.

Algorithm 18: sPOMDP Model State Splitting
Output: true if model was successfully split, false otherwise.
   /* Compute gain for each (m, a) according to Equation 6.10 */
1  G := {((m, a), G_{m,a}) for m, a ∈ M × A}
2  G := sort G in order of increasing length of outcome sequence m.trajectory
3  for gs in G do
4    state := gs[0][0]
5    action := gs[0][1]
6    gainValue := gs[1]
7    if gainValue > M.minGain then
8      m1, m2 := twoMostLikelyStates(state, action)
9      newOutcome1 := state.firstObservation + action + m1.trajectory
10     newOutcome2 := state.firstObservation + action + m2.trajectory
11     outcomesToAdd := {}
12     if newOutcome1 not in M.M.outcomeSequences then
13       outcomesToAdd.append(newOutcome1)
14       M.outcomeTrie.add(newOutcome1)
15     if newOutcome2 not in M.M.outcomeSequences then
16       outcomesToAdd.append(newOutcome2)
17       M.outcomeTrie.add(newOutcome2)
18     M.maxOutcomeLength := max(M.maxOutcomeLength, newOutcome1.length(), newOutcome2.length())
19     if outcomesToAdd.length() > 1 then
20       Remove state from M.M, M.stringsToStates, and M.idsToStates
21       newState1 := ModelState(state.id, newOutcome1)
22       M.idsToStates[state.id] := newState1
23       M.stringsToStates[newOutcome1] := newState1
24       M.M.addState(newState1)
25       newState2 := ModelState(|M.M|, newOutcome2)
26       M.idsToStates[|M.M|] := newState2
27       M.stringsToStates[newOutcome2] := newState2
28       M.M.addState(newState2)
         /* Reinitialize for new model size |M| */
29       reinitializeModel()
30       return true
31     else if outcomesToAdd.length() == 1 then
32       newState := ModelState(|M.M|, outcomesToAdd[0])
33       M.idsToStates[|M.M|] := newState
34       M.stringsToStates[outcomesToAdd[0]] := newState
35       M.M.addState(newState)
         /* Reinitialize for new model size |M| */
36       reinitializeModel()
37       return true
38 return false

If we find a gain, gs, whose gainValue is larger
than M.minGain (line 7), we proceed with splitting the associated model state on the associated action. In line 8, we find the two most likely model states to be transitioned into from state under action. The idea here is similar to that of Optimal sPOMDP Learning (Algorithm 9): we suspect that states m1 and m2 represent the conflicting results of underlying environment states covered by state that have inconsistent dynamics under action. This indicates that state needs to be split into two new latent model states representing these inconsistent results under action. As in Algorithm 9, we form the outcome sequences of these new model states by prepending state.firstObservation + action to the outcome sequences of states m1 and m2 (lines 9-10). In lines 12-17, we add newOutcome1 and newOutcome2 to outcomesToAdd and M.outcomeTrie only if they do not match outcome sequences already in model state space M. In line 18, we update M.maxOutcomeLength to the length of newOutcome1 or newOutcome2 if either is larger than the current value of M.maxOutcomeLength. If both newOutcome1 and newOutcome2 are unique (as is typically the case, checked in line 19), we create new model states for them and add both of these new model states to M.M, M.idsToStates, and M.stringsToStates (lines 21-28). We remove state from the same data structures in line 20. If only one of the new outcome sequences is unique (not already in M.M, checked in line 31), we add a new model state associated with the unique outcome sequence to M.M, M.idsToStates, and M.stringsToStates (lines 32-35). Note that, in the case of only one unique outcome sequence, we don't delete state. In either case (see lines 29, 36), we reinitialize M.T, M.TCounts, M.OneT, M.OneTCounts, M.beliefState, M.beliefHistory, M.actionHistory, and M.observationHistory to accommodate the change in model state space M (which is now larger).
More specifically, when the model is split, the agent's beliefs about its current model state are reset to indicate uniform uncertainty over all new model states that are visibly equivalent to the current environment observation. The transition probability distributions for all (m, a) pairs are reset to Dir(|M|; α^{m a m_1}_t = 1, …, α^{m a m_{|M|}}_t = 1), where |M| is the new size of model state set M after the split. The exception to this is that the values of all α^{m a m′}_t such that model states m and m′ are unchanged in the new model are preserved in these expanded Dirichlet distributions to speed up the learning process. Similarly, one step extension counters α^{m a m′ a′ m″}_t involving triples of model states m, m′, m″ that are all unchanged can also be preserved. In either case, true is returned in line 30 or line 37 to indicate the model was successfully split. If no such successful split occurs after considering every possible (m, a) pair, false is returned in line 38. As a final note, in PSBL for sPOMDPs (Algorithm 10), we don't explicitly maintain a list of SDEs. Rather, we store the outcome sequences associated with each model state. The actions of these outcome sequences are usually SDEs, but, due to the possibility of sampling errors and the agent's lack of knowledge about true environment noise levels, the guarantees of the Optimal sPOMDP Learning procedure (Algorithm 9) no longer apply to ensure this. Thus, when we update the policy in line 10 of Algorithm 10 with the actions of an SDE, what we actually do is append the actions of the outcome sequence of a random model state consistent with the current environment observation. The set of SDEs learned by Algorithm 10, then, is simply the actions of the outcome sequences of the learned model states (with any duplicates removed).
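To make the split test above concrete, the following sketch computes model surprise (Equation 6.9) as a count-weighted sum of normalized transition entropies, and gain (Equation 6.10) as the reduction in normalized entropy obtained by conditioning a transition on its one-step extensions, in the style of the worked examples in Figures 6.10 and 6.11. This is an illustrative reading of those equations, not the dissertation's implementation; all function names and data layouts are hypothetical:

```python
import math

def normalized_entropy(dist):
    """Shannon entropy of dist, normalized to [0, 1] by dividing by log(len(dist))."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist)) if len(dist) > 1 else 0.0

def model_surprise(counts):
    """counts: dict mapping (m, a) -> list of Dirichlet posterior counts over m'.
    Surprise = sum over transitions of (transition mass) * (normalized entropy)."""
    totals = {k: sum(v) for k, v in counts.items()}
    grand = sum(totals.values())
    surprise = 0.0
    for k, v in counts.items():
        dist = [c / totals[k] for c in v]
        surprise += (totals[k] / grand) * normalized_entropy(dist)
    return surprise

def gain(trans_counts, one_step_counts):
    """Estimated gain for one transition (m', a'): its normalized entropy minus
    the count-weighted normalized entropies of its one-step extensions."""
    base = normalized_entropy([c / sum(trans_counts) for c in trans_counts])
    totals = [sum(v) for v in one_step_counts]
    grand = sum(totals)
    weighted = sum((t / grand) * normalized_entropy([c / t for c in v])
                   for v, t in zip(one_step_counts, totals))
    return base - weighted
```

A transition whose distribution is nearly uniform but whose one-step extensions are nearly deterministic yields a gain close to 1, which is exactly the signature of a violated transition Markov property that triggers a split.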
Chapter 4.5 provides an extended example of this splitting procedure in the α-ϵ Shape environment (though the computations of gain in Equation 6.10 and model surprise in Equation 6.9 are not considered in detail). Examples of computing model surprise and gain are provided in Figures 6.10 and 6.11, respectively. Before concluding this section, we discuss how vectorization can be used to dramatically enhance the efficiency and scalability of the procedures explained in the previous sections.

6.3.4.6 Vectorization in PSBL for sPOMDPs

In Python, which is the programming language used to implement all the supporting code for this dissertation, NumPy [107] provides an exceptionally powerful tool for scientific computing,

[Figure 6.10: panels (a) "The α-ϵ Shape environment" and (b) an example model surprise computation. In (b), the distribution over model transitions is {800/2413, 666/2413, 429/2413, 518/2413} = {0.33, 0.28, 0.18, 0.21}, the entropies of the transition distributions are {0.224, 0.163, 0.228, 0.999}, and the model surprise is 0.224·0.33 + 0.163·0.28 + 0.228·0.18 + 0.999·0.21 = 0.37.]

Figure 6.10: An illustration of computing total model surprise (Equation 6.9) for an sPOMDP model of the α-ϵ Shape environment (subfigure (a)). In subfigure (b), the numbers above each transition distribution are the current Dirichlet posterior counts α^{m a m′}_t. In this example, the sPOMDP model M has only two states (□ and ◇).
[Figure 6.11: an example gain computation. The distribution over one-step extensions is {514/2018, 505/2018, 486/2018, 513/2018} = {0.26, 0.25, 0.24, 0.25}, the entropies of the one-step extension distributions are {0.081, 0.204, 0.146, 0.140}, the weighted sum of one-step entropies is 0.081·0.26 + 0.204·0.25 + 0.146·0.24 + 0.140·0.25 = 0.142, and the gain with respect to x is 0.999 − 0.142 = 0.857.]

Figure 6.11: Continued from Figure 6.10, this is an illustration of computing the gain (Equation 6.10) for the transition (m′, a′) with a′ = x in the α-ϵ Shape environment (Figure 6.10 (a)). The normalized entropy of this transition distribution, which is 0.999, is calculated in Figure 6.10 (b). As was the case in Figure 6.10 (b), the numbers above each one step extension distribution are the current Dirichlet posterior counts α^{m a m′ a′ m″}_t.

and, in particular, numerical computations involving multi-dimensional arrays. One of the best-known and most widely utilized features of NumPy is called vectorization, which refers to the automated use of highly optimized Fortran and C code for numerical computations involving multidimensional arrays. Much of this Fortran and C code also makes use of SIMD architectures in modern processors (when available) to perform these numerical computations in parallel (at the hardware level). The more procedures we can cast as operations on multi-dimensional arrays (rather than explicit for loops), the more efficient we can expect our implementation to be. As it turns out, nearly all the key operations in PSBL for sPOMDP Learning (Algorithm 10) can be vectorized in this way. Some of these vectorizations are quite intuitive and obvious (e.g., the updating of transition and one step probabilities from counters in Algorithm 12 and the updating of the agent's belief state in Algorithm 14).
Others are actually quite nuanced (e.g., vectorizing the soft count updates of Algorithms 16 and 17 and computing the gain in Equation 6.10 for all (m, a) pairs). We refer the reader to the code itself for more details on the extensive vectorization utilized in this code base³. This vectorization is key to the scalability of PSBL for sPOMDP learning, which we demonstrate in Chapter 8.

³ The code can be found at https://github.com/tjcollins/StochasticDistinguishingExperiments

6.4 Decision Making in sPOMDP Models

6.4.1 Learning Optimal Policies for sPOMDPs

As was discussed in Section 6.2, sPOMDPs can also be used directly as traditional POMDPs. An sPOMDP M has a traditional latent state space M, and the transition probabilities T associated with each pair (m, a) ∈ M × A satisfy the Markov property. We define the observation probabilities of M such that, for each m ∈ M and o ∈ O, P(o|m) = 1 if o is visibly equivalent to the first observation of m's outcome sequence and 0 otherwise. These observation probabilities also satisfy the (observation) Markov property. By defining a reward function R (and setting a discount factor γ) for M, we can utilize any existing exact or approximate POMDP solution technique to learn an optimal policy for the agent with respect to reward function R. Some exact and approximate techniques for solving POMDPs for optimal policies include [102, 115, 125]. In Chapter 8, we demonstrate this by using Cassandra's famous pomdp-solve software⁴ (which implements a number of exact and approximate POMDP solution techniques) to solve sPOMDPs for optimal policies that perform well on simulated decision-making tasks.

6.4.2 Search-Based Decision-Making

When the environment is relatively deterministic and the agent's goal is to navigate from its current environment situation to a goal observation, it is possible to use a search-based decision making approach for sPOMDPs that is very similar to the one developed for PSDEs in Chapter 5.4 (Algorithm 7).
Algorithm 19 gives the full breadth-first-search (BFS) based procedure for navigating from a starting sPOMDP model state to a goal observation.

Algorithm 19: Actions to a Goal Observation in sPOMDP Models
Input: goalOb: goal observation; startState: starting model state; M: learned sPOMDP
Output: actions: sequence of actions to reach goalOb.
1  mostLikelyOb := startState.firstObservation
2  if mostLikelyOb == goalOb then
3    return []
4  open := queue()
5  open.push((startState, []))
6  closed := {m : false for m in M.M}
7  while not open.empty() do
8    state, hist := open.pop()
9    closed[state] := true
10   for a ∈ A do
11     histCopy := copy(hist)
12     histCopy.append(a)
13     nextState := arg max_{m′} M.T[state][a][m′]
14     if not closed[nextState] then
15       open.push((nextState, histCopy))
16       mostLikelyObservation := nextState.firstObservation
17       if mostLikelyObservation == goalOb then
18         return histCopy
19 return a random action

⁴ http://www.pomdp.org/code/

Algorithm 19 takes in a goal observation (goalOb), a starting model state (startState), and a learned sPOMDP model M. In lines 1-3, we detect whether the goal observation is the most likely observation to be emitted in startState. If so, we return an empty sequence of actions (i.e., the agent should stay where it is in order to receive the goal observation as quickly as possible). In lines 4 and 6, we create an open queue for the BFS search and a closed hash map (which maps model states to true or false) to monitor for loops in the BFS search. This ensures the while loop of lines 7-18 will terminate. In line 5, we push startState onto the open queue along with an associated empty list (which will be used to incrementally build the sequence of actions from startState to goalOb). In line 8, we pop the next element off the open queue. We set closed[state] := true in line 9 to indicate that state has now been explored. We then loop through each possible action a ∈ A in lines 10-18.
We create a copy of hist in line 11 and append the action a to this copy (histCopy) in line 12 to indicate that the action a was taken at this point in the BFS search tree. In line 13, we find the most likely state (nextState) to be transitioned into from state under action a according to the transition probabilities of sPOMDP M. If nextState has not already been expanded (line 14), we add it to open (line 15) along with histCopy, which represents the sequence of actions that led to discovering nextState. In line 16, we determine the mostLikelyObservation in nextState, which is simply the first observation in nextState's outcome sequence. If mostLikelyObservation is equivalent to goalOb (line 17), we return histCopy (line 18), which is precisely the sequence of actions from startState to a model state whose most likely observation is goalOb. If the while loop in lines 7-18 terminates without a solution, we return a random action in line 19. There are a number of ways in which an agent could use Algorithm 19 as part of a decision-making strategy. Algorithm 20 provides one (high-level) example implementation. The key idea is for the agent to explore in its environment by taking random actions until it has become sufficiently localized and then to use its most likely current state as the startState for Algorithm 19 when it needs to generate actions to take. More specifically, the agent initializes a beliefState in line

Algorithm 20: Search-based Decision Making for sPOMDPs
Input: goalOb: goal observation; M: learned sPOMDP model
1  beliefState := Uniform belief in model states matching E.currentObservation()
2  policy := deque()
3  while E.currentObservation() !
= goalOb do
4    if H(beliefState) is too high then
5      action := randomAction()
6    else
7      if policy.empty() then
8        startState := state with maximal probability in beliefState
         /* Algorithm 19 */
9        policy := actionsToGoal(goalOb, startState, M)
10     action := policy.popleft()
11   ob := E.transition(action)
     /* Algorithm 14 */
12   beliefState := updateBeliefState(M, beliefState, action, ob)

1 that assigns uniform probability to all model states that are visibly equivalent to the current environment observation. The agent then enters the while loop of lines 3-12, which executes until the current environment observation is equivalent to the goal observation goalOb. If the agent deems that it is not yet sufficiently localized (e.g., by considering the entropy of its beliefState, line 4), it decides to take a random action (line 5). Otherwise, if its policy is empty (line 7), it determines its most likely model state, startState (the one with maximal probability in beliefState, line 8). In line 9, the agent invokes Algorithm 19 from startState to generate a new policy. In line 10, the agent pops the next action off this policy as the next action to take. Regardless of how action is generated, it is taken in the environment E in line 11, and the resulting observation ob is recorded. In line 12, the agent updates beliefState using Algorithm 14 according to the action taken and resulting observation ob. Instead of using purely random actions to localize itself (line 5), the agent could also make use of the SDEs in M to suggest sequences of actions that are expected to help the agent localize itself quickly. These could be executed in conjunction with random action sequences as well (as in Algorithm 10).

6.5 Conclusions

In this chapter, we detailed how Stochastic Distinguishing Experiments (SDEs) can be used to create a hybrid latent-predictive model called a Surprise-based Partially-Observable Markov Decision Process (sPOMDP).
sPOMDPs, like traditional POMDPs, have a latent state space and transition and observation probabilities satisfying the Markov property. However, sPOMDPs also share important similarities with purely predictive models (e.g., PSDEs [38] and PSRs [84]), as each latent state in an sPOMDP is associated with and uniquely identified by the result of a predictive experiment. Each of these predictive experiments (SDEs, automatically designed by the agent) statistically distinguishes at least two latent states in the sPOMDP. We derived an optimal sPOMDP learning procedure, which provably learned a perfect sPOMDP model of the given minimal α-ϵ POMDP environment by using an oracle that provided perfect knowledge of transition probabilities and environment noise levels. We extended this optimal learning procedure into an approximate sPOMDP learning procedure that utilized the PSBL framework introduced in Chapter 4.5 and required no oracle (and no prior knowledge about its environment). This approximate learning procedure, PSBL for sPOMDPs, is applicable to any rewardless POMDP environment (not just α-ϵ POMDP environments). We finished this chapter by discussing how learned sPOMDP models could be used for decision making either via traditional POMDP solution techniques in the literature or a BFS search through the model state space.

Chapter 7
Theoretical Results

In this chapter, we state our key theoretical results regarding Predictive Stochastic Distinguishing Experiment (PSDE) models [38] (see Chapter 5), Surprise-based POMDP (sPOMDP) models [35, 39] (see Chapter 6), and the use of Probabilistic Surprise Based Learning (PSBL, see Chapter 4.5) for learning such models in unknown, rewardless POMDP environments. We relegate the proofs of these results to Appendix A, due to their length. In this chapter, we focus instead on the nature and scope of these results and why they are important in the context of the work laid out in this dissertation.
7.1 Predictive SDE Theoretical Results

7.1.1 Introduction

In this section, we prove that the algorithms for learning PSDE models detailed in Chapter 5 terminate in finite time with a solution without any user-defined explicit limit on the length or number of PSDEs. We also provide an analysis of the worst-case computational complexities of these algorithms. More precisely, these theoretical results apply to both the Max Predict algorithm (Algorithm 2, Chapter 5.3.1), which learned PSDE models by attempting to maximize the agent's predictive accuracy directly, and the PSBL for PSDEs algorithm (Algorithm 3, Chapter 5.3.2), which utilized the PSBL framework to learn PSDE models by minimizing model surprise. Recall from Chapter 1.2.1 that a key challenge related to the problem addressed in this dissertation is the agent's complete lack of prior knowledge regarding its environment. We argued that this necessitates a nonparametric learning approach that does not force hard limits on the number of parameters the agent can use to model its environment. The theoretical results in this section demonstrate that PSDE models can, in fact, be learned in a truly nonparametric fashion without any explicit restrictions on the length or number of PSDEs. The length and number of PSDEs learned arise naturally from a combination of the nature of the environment being learned and the amount of experimentation the agent is able to perform in that environment. This also relates to another key challenge discussed in Chapter 1.2.1: the theoretically unbounded amount of history the agent may need to consider when its observations do not satisfy the transition Markov property [88] (as is typically the case in the environments we consider in this dissertation).
The theoretical results presented in this section indicate that the PSDE learning procedures of Chapter 5 allow the agent to automatically select a provably finite amount of history, D, to serve as an (approximate) sufficient statistic of its full history. We demonstrate the high quality of this approximation across a number of different environments in Chapter 8.

7.1.2 Scope of the Results and Definitions

7.1.2.1 Scope and Assumptions

In the theoretical results presented in this section, we limit our consideration to discrete, rewardless POMDP environments (denoted E). This is in accordance with the problem definition for this dissertation (see Chapter 1.2.2) and all the theoretical and experimental results in the rest of this dissertation. Additionally, we note that the proofs in this section rely on a result from probability theory called the Borel-Cantelli lemma [49, 28]. We state and provide intuition for this well-established lemma in the next subsection. Finally, we note that the results in this section also rely on the rather mild assumption that no observation probabilities in rewardless POMDP environment E be fully deterministic. In other words, we require that ∀s ∈ S and ∀o ∈ O, 0 < P(o|s) < 1. In practice, in our extensive experimentation, violating this assumption has never prevented the algorithm from terminating, but it is a required technical condition for the proofs of the lemmas and theorems below. We also limit our analysis to environments with more than one possible action and more than one possible observation to avoid technical problems that might be caused by degenerate environments.

7.1.2.2 Definitions

[Figure 7.1 depicts a random sequence of 0s and 1s over time steps n = 1, 2, 3, …, D, D + 1, …, generated by the random events G_1, G_2, G_3, …, with Σ_{n=1}^∞ P(G_n) = π²/6 < ∞.]

Figure 7.1: An illustration of the Borel-Cantelli lemma. Consider a set of Bernoulli random variables {X_n : n ≥ 1} such that P(X_n = 1) = 1/n² (e.g., rolling a 1 on a fair n²-sided die).
Define the sequence of events G_n = {X_n = 1}. In this case, the infinite sum Σ_{n=1}^∞ P(G_n) can be evaluated in closed form and shown to equal π²/6, which is a finite value. The Borel-Cantelli lemma tells us that we will experience a random sequence of 0s and 1s according to the probabilities of P(G_n). In other words, we will successfully roll a 1 sometimes and a 0 other times up until some finite random time n = D. At time D, X_D takes on the value 0, and for all k > D, X_k will remain at 0. This will happen almost surely (that is, with probability 1). In other words, X_n will equal 1 only a finite number of times.

The Borel-Cantelli lemma is a well-established result in probability theory. The lemma states that if G_1, G_2, … is a sequence of events in a probability space and the sum of their probabilities is finite (i.e., Σ_{n=1}^∞ P(G_n) < ∞), then the probability that infinitely many of these events occur is 0. As an example of the Borel-Cantelli lemma (based on [145]), consider the set of Bernoulli random variables {X_n : n ≥ 1} such that P(X_n = 1) = 1/n² (e.g., rolling a 1 on a fair n²-sided die). Define the sequence of events G_n = {X_n = 1}. Since, in this case, Σ_{n=1}^∞ P(G_n) = π²/6 < ∞, {X_n = 1} only occurs a finite number of times before X_n takes on the value 0 with probability 1. Thus, there must exist a random time D such that X_n = 0 for all n ≥ D. We use this lemma to prove that any simple or compound experiment fails with probability 1 if it is longer than some finite (but random) length. Figure 7.1 illustrates this example of the Borel-Cantelli lemma.

Recall that we define environment E as a discrete, rewardless POMDP with states S, actions A, observations O, transition probabilities T, and observation probabilities Θ, where θ^o_s denotes the probability of observing observation o in state s. Let R be a learning agent situated in environment E.
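The die-rolling example illustrated in Figure 7.1 can also be checked numerically with a short, purely illustrative script: the partial sums of P(G_n) = 1/n² approach π²/6, and a simulated sequence of such rolls produces a 1 only finitely often.

```python
import math
import random

# Partial sums of P(G_n) = 1/n^2 converge to pi^2/6, a finite value;
# the tail beyond n = 100000 is roughly 1/100000, well under 1e-4.
partial = sum(1.0 / n**2 for n in range(1, 100001))
print(abs(partial - math.pi**2 / 6) < 1e-4)  # True

# Simulate X_n = 1 with probability 1/n^2 and record the last time a 1 occurs.
random.seed(0)
last_success = 0
for n in range(1, 1000001):
    if random.random() < 1.0 / n**2:
        last_success = n
# Borel-Cantelli: with probability 1 there is a finite D with X_n = 0 for all n > D;
# in simulation, last_success is almost always a very small n.
print(last_success)
```

Running the simulation repeatedly with different seeds gives different (but always small) values of last_success, matching the lemma's guarantee that only finitely many successes occur.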
We define a k-action simple experiment e_k performed by R on E as a sequence of k ordered actions beginning at time t with an initial action:

(a_t, …, a_{t+k−1})    (7.1)

with k ordered expected observations beginning at time t + 1:

(o_{t+1}, …, o_{t+k})    (7.2)

Let E_k be a Bernoulli random variable that takes on the value 1 when the result of R executing the k actions in experiment e_k in order beginning at time t is the k expected observations in order, starting at time t + 1, and takes on the value 0 otherwise. Then:

P(E_k = 1) = P(o_{t+1}, …, o_{t+k} | a_t, …, a_{t+k−1})    (7.3)

P(E_k = 1) is the probability that simple experiment e_k succeeds. P(E_k = 0) = 1 − P(E_k = 1) is, then, the probability that simple experiment e_k fails. Note that we condition on the actions in a simple experiment since these values are fixed by the agent rather than drawn from a distribution. Actions are not random variables in a simple experiment because there is only one choice of action at each time step in a simple experiment.
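For a known POMDP, the probability in Equation 7.3 can be computed by forward propagation over the hidden state: each action propagates the state distribution through T, and each expected observation reweights the surviving probability mass. The following sketch demonstrates this on a hypothetical two-state environment; the environment, its numbers, and all names are illustrative, not taken from the dissertation:

```python
import numpy as np

def simple_experiment_success(b0, T, Theta, actions, observations):
    """P(E_k = 1): probability that executing `actions` from state
    distribution b0 yields exactly the expected `observations`.

    b0:    (|S|,) initial distribution over hidden states
    T:     (|S|, |A|, |S|) transition probabilities
    Theta: (|S|, |O|) observation probabilities, Theta[s, o] = P(o | s)
    """
    alpha = np.asarray(b0, dtype=float)    # unnormalized forward message
    for a, o in zip(actions, observations):
        alpha = alpha @ T[:, a, :]         # propagate through action a
        alpha = alpha * Theta[:, o]        # keep only mass emitting o
    return float(alpha.sum())

# Hypothetical 2-state, 2-action, 2-observation environment:
# action 0 stays put, action 1 swaps the two states.
T = np.zeros((2, 2, 2))
T[:, 0, :] = [[1.0, 0.0], [0.0, 1.0]]
T[:, 1, :] = [[0.0, 1.0], [1.0, 0.0]]
Theta = np.array([[0.9, 0.1], [0.2, 0.8]])
p = simple_experiment_success(np.array([1.0, 0.0]), T, Theta, [1, 1], [1, 0])
print(round(p, 6))  # 0.72
```

Because every entry of Theta lies strictly between 0 and 1 in the setting of this section, each observation step multiplies the surviving mass by a factor bounded below 1, which is the intuition behind the claim (Lemma 1 below) that sufficiently long simple experiments fail almost surely.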
Were this not the case, any observation would be allowed at time step i + 1 for all possible selections of allowed action at time step i, and k-length compound experiment e c;k would be equivalent to a k 1-length experiment that ignored the action at time step i and corresponding observation at time step i + 1. Let E c;k be dened analogously to E k for compound experiment e c;k . In a compound experi- ment, the actions are modeled as random variables because, at each time stepi, the learning agent chooses uniformly at random between the available actions. Consider the following example of a compound experiment in the -Shape environment (Figure 5.2), beginning at time t, written in alternating action and observation form: 161 Learning sPOMDPs Information Sciences Institute Learning sPOMDPs Motivation Problem Challenges Related Work SDEs sPOMDPs Results Future t t+1 y t+2 P( { }|{x, y} {{ ) },{ {x, y} {{ }} },{ } x y x y x { } Figure 7.2: Visualization of the compound experiment in Equation 7.6. (fx; yg;ff; g;fgg;fx;yg;ffg;fgg) (7.6) If the action x is executed at time t, either observation in the setf; g is allowed at time t + 1. In contrast, ify is executed at timet, then only is allowed at timet + 1. Likewise, either x ory may be executed at timet + 1, and, in either case, only is allowed to be observed at time t + 2. Figure 7.2 provides a visualization of this compound experiment. We now state the main theoretical results of this section (leaving the full proofs to Appendix A.1). Lemma 1 proves that any simple experiment longer than some nite random length fails with probability 1. Lemma 2 proves the same result for compound experiments. Theorem 1 proves that all PSDEs have nite length with probability 1. Theorem 2 bounds the worst-case computational complexity of the Max Predict and PSBL for PSDEs algorithms (Algorithms 2 and 3 in Chapter 5). 
7.1.3 Convergence and Computational Complexity Results

For all time-steps i, let A_i be a random variable representing the chosen action, let O_i be a random variable representing the observation, and let S_i be a random variable representing the state. Note that all A_i are independent of one another, because each action is chosen uniformly at random from the available actions at that time step, regardless of the actions taken at previous or future time steps. Subscripted lowercase letters, e.g., o_{t+3}, a_{t+1}, refer to fixed but not necessarily known values assigned to these random variables at specific time steps. Summations over all possible values or all possible combinations of values of a random variable or sequence of random variables at certain time steps are denoted by putting the subscripted lowercase letter(s) underneath the summation sign. For example, Σ_{a_t} means to sum over all possible assignments to A_t, and Σ_{s_{t+1},…,s_{t+4}} means to sum over all possible combinations of values that could be assigned to random variables S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}. In the proof of Lemma 2 (see Appendix A.1), A_i can take on the value of any action at any time step. Specific actions at time-step i are referred to with superscripts, a^1_i, …, a^{|A|}_i. Thus, as we do in the proof of Lemma 2, we can decompose summations over all possible values of A_i into separate summations over specific action values at time i. For example, Σ_{a_i} P(a_i) is equivalent to Σ_{j=1}^{|A|−1} P(a^j_i) + P(a^{|A|}_i). Recall that we assume POMDP E is defined such that |A| > 1 and |O| > 1.

Lemma 1 (Any simple experiment longer than some finite random length fails with probability 1). If, for all o ∈ O and s ∈ S, 0 < θ^o_s < 1, then, for some finite random integer D, k-action simple experiment e_k will fail for all k > D with probability 1 (almost surely).

Proof. Please see Appendix A.1.

Lemma 2 (Any compound experiment longer than some finite random length fails with probability 1).
If, for all o ∈ O and s ∈ S, 0 < P(o | s) < 1, then, for some finite random integer D, the k-action compound experiment e_{c,k} will fail for all k > D with probability 1 (almost surely).

Proof. Please see Appendix A.1.

Since splitting is the only means by which PSDE experiments increase in length, and we only split a PSDE v (in the most lenient case) if at least one of its one-step extensions succeeds at least once, Lemmas 1 and 2 can be used to prove that all PSDEs learned by the Max Predict and PSBL for PSDEs algorithms will have finite length. This is because Lemmas 1 and 2 guarantee the existence of a finite experiment length D past which all one-step extensions of all PSDEs will fail with probability 1. This argument is formally proved in the following theorem:

Theorem 1 (All PSDEs have a finite length). Assume that, for all o ∈ O and s ∈ S, 0 < P(o | s) < 1. M is a PSDE model of rewardless POMDP environment E learned by either Max Predict or PSBL for PSDEs (Algorithms 2 and 3, respectively, in Chapter 5). Let V denote the set of PSDEs in M, and let |v_i| denote the length of the experiment e_{v_i} of v_i ∈ V. With probability 1 (almost surely), there exists a finite random integer D such that |v_i| ≤ D for all v_i ∈ V.

Proof. Please see Appendix A.1.

Once we have such a finite upper bound, D, on the length of any PSDE in the agent's model M, it becomes possible to place an upper bound on the worst-case computational complexity of the Max Predict and PSBL for PSDEs algorithms (Algorithms 2 and 3, respectively, in Chapter 5). The computational complexity of these procedures is bounded in the following theorem:

Theorem 2 (Computational Complexity). The worst-case computational complexity of the Max Predict and PSBL for PSDEs algorithms (Algorithms 2 and 3 in Chapter 5) is O(D · |A|^{2D+1} · |O|^{2D+2}), where D is the maximum length of any PSDE in the agent's environment model M.

Proof. Please see Appendix A.1.
7.2 sPOMDP Theoretical Results

7.2.1 Introduction

In this section, we present theoretical results regarding the representational capacity of sPOMDP models. By representational capacity, we mean the class or family of functions that an sPOMDP model can represent. (The formal learnability of these models in specific environments is a separate matter that we do not specifically consider in this dissertation.) In the case of this work, we are concerned with functions representing the probability distributions over future agent observation sequences given any agent history (i.e., any sequence of actions and observations of any length). More specifically, we are interested in models, such as POMDPs and sPOMDPs, that allow for the exact computation of these history-dependent observation sequence probabilities using a finite number of parameters via a state space with transition and observation probability distributions satisfying the Markov property [88]. We are concerned, in this section, with what types of environments can be perfectly represented by sPOMDPs and what guarantees we can make about the size of the sPOMDPs needed for this perfect representation.

While Shen was able to prove that Local Distinguishing Experiments (LDEs, [141]) could be used in conjunction with Complementary Discrimination Learning (CDL, [138]) to build environment models that were probably approximately correct (PAC, [160]) according to the amount of time the agent spent experiencing its environment, no formal proof was presented that LDEs could be used to represent any deterministic Moore machine environment. In other words, for any deterministic Moore machine environment, does there exist an LDE model that perfectly represents that environment? If so, can we make guarantees regarding the size of the LDE model (and the length of the LDEs) needed to represent this environment? This is an important open question in the SBL literature [142, 140] that we answer in this section.
Recalling that SDEs reduce to LDEs in deterministic Moore machine [100] environments, we prove, in this section, that, for any deterministic Moore machine environment E, there exists a perfect sPOMDP model M that is just as compact as the minimal representation of environment E. We do this via a constructive argument that provides a procedure that converts any minimal deterministic Moore machine environment E into an equivalent sPOMDP M (in polynomial time). We then extend this result into a constructive proof that, for any rewardless α-ε POMDP such that α > 1/|Q| and ε > 1/|O|, there exists a perfect sPOMDP model that is just as compact as the minimal representation of this α-ε POMDP. These results, along with sPOMDP modeling itself (Chapter 6), serve to address the key challenge (discussed in Chapter 1.2.1) of formally representing agent state using predictive sequences of actions and observations.

7.2.2 Definitions

Before presenting the main theoretical results of this section, we first provide some key definitions regarding deterministic Moore machine [100] and α-ε POMDP environments (Definition 8, Chapter 6.1). These definitions are vital for formalizing the proofs of the representational capacity of Surprise-based Partially Observable Markov Decision Processes (sPOMDPs, Definition 10, Chapter 6.2) in this section. A Moore machine [100] is a type of deterministic finite automaton (DFA) characterized by the fact that the symbols it outputs are determined only by its current state. This differentiates Moore machines from other types of DFAs, e.g., Mealy machines [93], whose output depends on both the current state and input (action). The following definitions are adapted from [131, 142]. We define a Moore machine environment as follows:

Definition 3.
Moore machine environment. A Moore machine environment E is a deterministic finite automaton (DFA) defined as a 6-tuple (A, O, Q, δ, φ, r), where A is a finite set of input actions, O is a finite set of observations, Q is a finite set of environment states, δ : Q × A → Q is the environment's transition function, φ : Q → O is the environment's appearance (emission) function, and r is the current state of the environment. If the agent knows that its environment E is a Moore machine, we can define its model of the environment, M, as a Moore machine (A, O, M, τ, θ, t), where A is a finite set of input actions, O is a finite set of output symbols (observations), M is a finite set of model states, τ : M × A → M is the model's transition function, θ : M → O is the model's appearance (emission) function, and t is the current state of the model. Note that the sets of actions A and observations O are the same for both the model and the environment. When useful, we can also define the agent's model of a Moore machine environment to be an sPOMDP with deterministic transition and observation functions. We assume that the learner can apply any basic action a ∈ A to both E and M and can observe the current output o = φ(r) ∈ O from the environment. We say that a model state m ∈ M is visibly equivalent to an environment state q if φ(q) = θ(m). We assume |O| ≤ |Q|, such that the environment is either fully transparent or there exists at least one observation representing more than one environment state. We also assume that each o ∈ O is emitted by at least one state in E and that |O| > 1, |A| > 1, and |Q| > 1 to avoid degenerate environments. Autonomous learning from the environment is concerned with learning a perfect model M of some environment E.

Definition 4. Perfect Moore machine model. A perfect model M of Moore machine environment E is one in which the current model state t is visibly equivalent to the corresponding current environment state r and, for all action sequences f = (a_1, ..., a_k) of any length k, δ(r, f) remains visibly equivalent to τ(t, f).
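Definition 3 translates almost directly into code. The following minimal sketch (class and field names are ours, not the dissertation's) implements a Moore machine environment whose output depends only on the current state:

```python
class MooreEnv:
    """Deterministic Moore machine environment (A, O, Q, delta, phi, r)."""

    def __init__(self, A, O, Q, delta, phi, r):
        self.A, self.O, self.Q = A, O, Q
        self.delta = delta      # dict: (state, action) -> next state
        self.phi = phi          # dict: state -> observation
        self.r = r              # current environment state

    def observe(self):
        # Moore property: output is a function of the current state only.
        return self.phi[self.r]

    def step(self, action):
        self.r = self.delta[(self.r, action)]
        return self.observe()

# A tiny 2-state example: action 'a' toggles states, 'b' stays put.
env = MooreEnv(
    A={"a", "b"}, O={"0", "1"}, Q={"q0", "q1"},
    delta={("q0", "a"): "q1", ("q1", "a"): "q0",
           ("q0", "b"): "q0", ("q1", "b"): "q1"},
    phi={"q0": "0", "q1": "1"},
    r="q0",
)
```

An agent's model M would be a second such machine driven by the same action stream, with visible equivalence checked by comparing the two emitted observations.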
Our theoretical proofs in this section require three other definitions applicable to DFAs:

Definition 5. Minimal DFA. A DFA E is called minimal if, for every DFA E′ such that |Q′| < |Q|, L(E′) ≠ L(E), where L denotes the language recognized by a DFA.

Intuitively, a DFA is called minimal if no DFA with fewer states performs the same task.

Definition 6. Separating sequence. An action sequence f is called a separating sequence for two states u, v ∈ Q in Moore machine environment E if φ(u, f) ≠ φ(v, f).

A separating sequence is a sequence of actions that, when executed from states u ∈ Q and v ∈ Q for some u ≠ v, deterministically leads to a different sequence of observations. Importantly, any two states in a minimized DFA are guaranteed to have a separating sequence with no more than |Q| − 1 actions [131].

Definition 7. Homing sequence. A sequence of actions f is called a homing sequence for Moore machine E if, for every pair of states u, v ∈ Q, δ(u, f) ≠ δ(v, f) =⇒ φ(u, f) ≠ φ(v, f).

In words, whenever the final environment state transitioned into under action sequence f is not the same starting at states u and v, the sequence of observations generated starting at u (φ(u, f)) and the sequence of observations generated starting at v (φ(v, f)) will also be distinct. One can uniquely determine the agent's final state by looking at the sequence of observations generated by executing this homing sequence in the environment.

Definition 8. Stochastic visible equivalence. Let E be a rewardless α-ε POMDP environment and let M be an sPOMDP. Let b_{E,t} represent a distribution over possible environment states q ∈ Q at time t. Let b_{M,t} represent a distribution over possible model states m ∈ M at time t. We say that environment E and model M are visibly equivalent at time t if, for all o ∈ O, ∑_{q∈Q} P(o | q) b_{E,t}(q) = ∑_{m∈M} P(o | m) b_{M,t}(m). In words, environment E and model M are visibly equivalent at time t if they induce the same probability distribution over possible observations at time t (conditioned on the same agent history).

Definition 9.
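The separating- and homing-sequence definitions can be checked by brute force on small machines. The sketch below (helper names are ours) follows the convention that φ(u, f) denotes the sequence of observations read out after each transition:

```python
# Toggle machine from before: 'a' swaps q0/q1, 'b' is the identity.
delta = {("q0", "a"): "q1", ("q1", "a"): "q0",
         ("q0", "b"): "q0", ("q1", "b"): "q1"}
phi = {"q0": "0", "q1": "1"}
Q = {"q0", "q1"}

def run(delta, phi, state, actions):
    """Apply an action sequence; return (final state, emitted obs tuple).
    One observation is read after each transition."""
    obs = []
    for a in actions:
        state = delta[(state, a)]
        obs.append(phi[state])
    return state, tuple(obs)

def is_separating(delta, phi, u, v, f):
    # f separates u and v iff the emitted observation sequences differ.
    return run(delta, phi, u, f)[1] != run(delta, phi, v, f)[1]

def is_homing(delta, phi, Q, f):
    # f is homing iff the observation sequence uniquely determines the
    # final state: distinct final states must yield distinct obs sequences.
    seen = {}
    for q in Q:
        final, obs = run(delta, phi, q, f)
        if obs in seen and seen[obs] != final:
            return False
        seen[obs] = final
    return True
```

For larger machines one would search sequences of length up to |Q| − 1, which the cited bound guarantees is sufficient for separating sequences in a minimal DFA.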
Perfect sPOMDP model. Let E be a rewardless α-ε POMDP environment, and let M be an sPOMDP that is stochastically visibly equivalent to E initially. We say that M is a perfect model of E if, for all action sequences f = (a_1, ..., a_k) of any length k, M and E remain stochastically visibly equivalent after f is executed in both M and E.

7.2.3 Representational Capacity Proofs

We can now state our key theorems regarding the representational capacity of sPOMDP models. As we did in the previous section, we leave the proofs of these results to Appendix A.2. We begin by proving that any Moore machine environment E can be represented perfectly (Definition 4) by an sPOMDP no larger than the minimal Moore machine representation of environment E. We then generalize these results to α-ε POMDP environments using the definition of a perfect sPOMDP model (Definition 9).

Theorem 3. For any Moore machine environment E, there exists a perfect sPOMDP model M where |M| ≤ |Q|, |S| ≤ |Q| − 1, and each SDE s ∈ S has, at most, |Q| − 1 actions, where |Q| is the number of environment states in the minimal representation of E.

Proof. Please see Appendix A.2.

Theorem 3 is proved in Appendix A.2 by means of a constructive argument. We present an algorithm called Convert (Algorithm 21), which provably produces, in polynomial time, a perfect sPOMDP model of any given deterministic (minimal) Moore machine environment E. Further, this theorem tells us that the resulting sPOMDP model will have no more than |Q| states (where |Q| is the number of states in the minimal representation of E) and no more than |Q| − 1 SDEs. It also tells us that each SDE will have no more than |Q| − 1 actions. This is what we mean by saying that sPOMDPs are provably equivalently as compact as deterministic Moore machine environments. The full argument is nuanced and the resulting proof is quite long. For this reason, we break up the argument into the following three lemmas, which, taken together, complete the proof of Theorem 3.

Lemma 3.
The Convert algorithm (Algorithm 21) terminates in O(|Q|^4 · |A|) time with a model M with no more than |Q| model states (|M| ≤ |Q|) and no more than |Q| − 1 SDEs (|S| ≤ |Q| − 1), with each SDE having no more than |Q| − 1 actions and each outcome sequence associated with each model state having no more than |Q| observations.

Proof. Please see Appendix A.2.

Importantly, Lemma 3 establishes that the Convert procedure of Algorithm 21 terminates in time polynomial in the number of states and actions in Moore machine environment E. It also establishes bounds on the length and number of SDEs in model M as well as the number of states in model M. Note that a bound on the length of any SDE also bounds the length of the longest outcome sequence of any model state in M.

Lemma 4. Let {q_m} denote the set of environment states covered by model state m. If, for every triple (m, a, m′) ∈ M × A × M in model M, q′ = δ(q, a) is covered by m′ = τ(m, a) for all q ∈ {q_m} and all q′ ∈ {q_{m′}}, then M is a perfect model of environment E.

Proof. Please see Appendix A.2.

Intuitively, Lemma 4 establishes that any model M of environment E with the property that all the underlying environment states covered by each model state agree on their transitions in model space (for all actions) is a perfect model of E. In other words, in a perfect model, we cannot have a situation in which two environment states covered by the same model state disagree on the model state to which they will transition under the same action.

Lemma 5. The Convert algorithm (Algorithm 21) terminates in polynomial time with a deterministic sPOMDP model M that is a perfect model of input Moore machine environment E.

Proof. Please see Appendix A.2.

Lemma 5 establishes that, by construction, the model M produced by the Convert algorithm (Algorithm 21) upholds the property established in Lemma 4 and is thus a perfect model of given environment E. Lemmas 3-5 complete the proof of Theorem 3.

Theorem 4.
For any rewardless α-ε POMDP environment E in which α > 1/|Q| and ε > 1/|O|, there exists a perfect sPOMDP model M where |M| ≤ |Q|, |S| ≤ |Q| − 1, and each SDE s ∈ S has, at most, |Q| − 1 actions, where |Q| is the number of environment states in the minimal representation of E.

Proof. Please see Appendix A.2.

Theorem 4 directly extends the results of Theorem 3 to rewardless α-ε POMDP environments E and sPOMDP models M that have nondeterministic transition and observation functions. This is done by means of an extension to the Convert procedure (Algorithm 21), which is specified in Algorithm 22. The key idea is to use an explicit mapping from model states to covered environment states in order to calculate the appropriate transition and observation probability distributions for model M according to the known values of α and ε. In addition, it turns out that the most likely transitions between pairs of environment or model states can be used in place of the deterministic transitions present in Moore machine environments in the Convert algorithm. This extension leads directly to the following corollary of Lemma 3.

Corollary 1. The Convert algorithm (Algorithm 21) extended to rewardless α-ε POMDP environments E terminates in O(|Q|^4 · |A|) time with an sPOMDP model M with no more than |Q| model states (|M| ≤ |Q|) and no more than |Q| − 1 SDEs (|S| ≤ |Q| − 1), with each SDE having no more than |Q| − 1 actions and each outcome sequence associated with each model state having no more than |Q| observations.

Lemma 6. Let {q_m} denote the set of environment states covered by model state m. If transition probabilities T and emission probabilities are calculated according to Algorithm 22 and, for every pair (m, a) ∈ M × A in model M and for all q ∈ {q_m}, q_a^max = argmax_{q′∈Q} P(q′ | q, a) is covered by m_a^max = argmax_{m′∈M} P(m′ | m, a), then M is a perfect model of environment E.

Proof. Please see Appendix A.2.

Lemma 6 extends the results of Lemma 4 to α-ε POMDP environments.
Intuitively, it establishes that any sPOMDP model M of α-ε POMDP environment E with the property that all the underlying environment states covered by each model state agree on their most probable transitions in model space (for all actions) is a perfect model of E (provided we calculate transition and observation probabilities appropriately). This works because all the most likely transitions in an α-ε POMDP have the same probability (α) and all the least likely transitions also have the same probability ((1 − α)/(|Q| − 1)). Lemma 7 proves that the Convert algorithm (Algorithm 21) extended to rewardless α-ε POMDP environments via Algorithm 22 produces sPOMDP models that uphold the property of Lemma 6, meaning that they are perfect models of their environments.

Lemma 7. The Convert algorithm (Algorithm 21) extended to rewardless α-ε POMDP environments terminates in polynomial time with an sPOMDP model M that is a perfect model of input rewardless α-ε POMDP environment E.

Proof. Please see Appendix A.2.

As a final note, Appendix A.3 provides experimental validation of the Convert procedure (Algorithm 21) and its extensions (Algorithm 22).

7.3 Conclusions

In this chapter, we presented extensive theoretical results regarding Predictive Stochastic Distinguishing Experiment (PSDE) models [38] (see Chapter 5), Surprise-based POMDP (sPOMDP) models [35, 39] (see Chapter 6), and the use of Probabilistic Surprise Based Learning (PSBL, see Chapter 4.5) to learn these models. We presented proofs of the convergence (termination in finite time) and worst-case computational complexity of the PSDE learning procedures detailed in Chapter 5. We also presented proofs that sPOMDP models can perfectly represent any deterministic Moore machine or α-ε POMDP environment with equivalent compactness. We discussed, in detail, how these theoretical results address some of the most important challenges involved in solving the stochastic ALFE problem (see Chapter 1.2.1).
In the next chapter, we discuss our extensive experimental results regarding PSDEs, sPOMDPs, and PSBL.

Chapter 8

Experimental Results

In this chapter, we present the results of extensive experimentation evaluating the performance of Probabilistic Surprise-Based Learning (PSBL, Chapter 4.5) on learning Predictive SDE (PSDE) models (Chapter 5) and surprise-based POMDP (sPOMDP) models (Chapter 6) of unknown rewardless POMDP environments from experience. First, we present results across a broad range of rewardless α-ε POMDP and general POMDP environments that clearly demonstrate a strong positive correlation between model surprise (Chapter 4.3, Equation 4.4) and model error (Chapter 4.1, Equation 4.1), as desired. These results also establish a strong negative correlation between model surprise and model predictive accuracy (i.e., how well the model can predict future observations), such that reducing surprise typically causes an increase in model predictive accuracy, also as desired.

We then compare the predictive accuracies (normalized relative to an oracle's performance), runtimes, model sizes, and model errors of learned PSDE and sPOMDP models in a number of rewardless α-ε POMDP environments (adapted from environments in the literature) against fixed-length history-based (k-order Markov [19]) models and Long Short-Term Memory (LSTM) [60] neural network models. We perform this analysis under a number of different environment noise levels (i.e., settings of α and ε). We perform a similar comparison in scalable 2D grid-based POMDP environments in which the agent observes colors associated with each grid cell/state, and there are significantly fewer colors than grid cells (meaning there are a significant number of hidden states). It is important to note that these color world environments are not α-ε POMDPs. Thus, these results demonstrate the effectiveness of PSDE and sPOMDP models on general rewardless POMDP environments representing grid-based navigation tasks.
Almost without exception, PSDE models perform at least as well as (and quite often better than) the fixed-size, parametric models with which they are compared in terms of both model predictive accuracy and model error. (These parametric models are trained offline on large, fixed-size datasets in order to maximize their performance.) When the environment is almost fully deterministic, sPOMDPs can typically match (and sometimes exceed) the performance of PSDEs while requiring dramatically fewer model parameters (particularly as environment size increases). Additionally, these results demonstrate that PSBL for sPOMDPs typically learns sPOMDPs with model state spaces whose size is a close approximation to the number of true underlying environment states.

Next, we compare the performance of PSDE models and sPOMDP models against ground truth POMDPs on decision-making tasks in the same types of environments (again under a variety of different noise levels). The agent is tasked with making decisions that will lead it to a goal observation that is tied to a scalar reward. These results demonstrate that we can close the loop between model learning and planning in a way that supports high-quality decisions that are (typically) nearly optimal. Additionally, as we would expect, in almost every environment tested, sPOMDP modeling matches or outperforms PSDE modeling in terms of decision-making (again, while requiring a much smaller number of model parameters). This is due to the fact that we can use powerful existing exact or approximate solvers to solve an sPOMDP exactly like a POMDP for an optimal policy (1) that leads the agent to the desired goal observation from any model state (while taking environment noise into account in a principled fashion). In contrast, PSDE models have to rely on the search-based planning approach detailed in Algorithm 7, Chapter 5.4, which provides no guarantees of optimality under uncertainty (even if the PSDE model is a perfect representation of the environment). We discuss how the same PSDE and sPOMDP models could easily accommodate changes in goal observation due to their task-independent nature.

(1) As a technical note, the optimality of this policy is relative to the sPOMDP model, which is itself an approximation.

A final set of experimental results demonstrates the scalability of PSBL as applied to both PSDE and sPOMDP models. These scalability analyses are performed on random color world environments (which, as was mentioned above, are not α-ε POMDPs) with different noise levels and different levels of environment hiddenness. By environment hiddenness, we mean the ratio of the number of observations to the number of underlying environment states. The closer this ratio is to 1, the less hidden the environment is to the agent. These scalability results further demonstrate the superiority of PSDE models in terms of runtime and model error, while they also demonstrate the accuracy of sPOMDP modeling in approximately inferring the true number of underlying environment states with many fewer model parameters.

8.1 Environments Adapted from the Literature

In this chapter, the performance of PSDE models and sPOMDP models is compared against fixed-history (k-order Markov) and LSTM models on a variety of common and important POMDP environments from the literature. We have modified these environments to be rewardless α-ε POMDP environments (see Definition 8, Chapter 6.1). In this section, we illustrate and define these environments. In the illustrations, to save space and reduce clutter, only the most likely transitions between states are shown.
Importantly, all of these environments have significant numbers of hidden states (i.e., |O| < |Q|), multiple possible agent actions (|A| > 1), and multiple possible agent observations (|O| > 1), making them good candidate environments for evaluating PSBL.

Figure 8.1: The α-ε Shape environment.

8.1.1 The Shape Environment

The α-ε Shape environment (Figure 8.1) was first introduced in Chapter 4. This environment is originally due to Rivest and Schapire [119] and was later adapted by Shen [142]. In this 4-state environment, the agent can observe □ in states I and II and ◇ in states III and IV. It can execute the actions x and y, which have the effects illustrated in the figure. In this chapter, we also consider randomizations of the α-ε Shape environment in which the arrows and observations at each state may be changed (though we require that each observation be most likely in at least one of the four states and that there is at least one path between any pair of environment states). More formally, the environment states, actions, and observations in the α-ε Shape environment are Q = {I, II, III, IV}, A = {x, y}, and O = {□, ◇}, respectively.

8.1.2 The Little Prince Environment

The α-ε Little Prince environment (Figure 8.2, originally from [120]) is another 4-state environment. In this environment, the little prince can go forward (f), backward (b), or turn around (t) at each state. The agent sees nothing in states S2 and S3. In state S0, it sees a rose. In state S1, it sees a volcano. More formally, the environment states, actions, and observations in the α-ε Little Prince environment are Q = {S0, S1, S2, S3}, A = {f, b, t}, and O = {rose, volcano, nothing}, respectively.

Figure 8.2: The α-ε Little Prince environment.
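The α-ε noise structure of such environments can be made concrete in code. The sketch below builds transition and observation matrices for a 4-state Shape-like environment; the deterministic skeleton and the "sq"/"di" observation labels are illustrative stand-ins, not the exact arrows and symbols of Figure 8.1:

```python
import numpy as np

ALPHA, EPS = 0.9, 0.95
Q = ["I", "II", "III", "IV"]
A = ["x", "y"]
O = ["sq", "di"]   # stand-ins for the two Shape observations

# Hypothetical deterministic skeleton: skeleton[s][a] = most likely
# successor state (the figure's actual arrows are not reproduced here).
skeleton = {"I": {"x": "II", "y": "IV"}, "II": {"x": "III", "y": "I"},
            "III": {"x": "IV", "y": "II"}, "IV": {"x": "I", "y": "III"}}
emit = {"I": "sq", "II": "sq", "III": "di", "IV": "di"}

def alpha_epsilon_matrices(skeleton, emit, alpha, eps):
    """Most likely transition gets prob alpha, the other |Q|-1 states
    split (1 - alpha) evenly; the correct observation gets prob eps,
    the other |O|-1 observations split (1 - eps) evenly."""
    n, no = len(Q), len(O)
    T = {a: np.full((n, n), (1 - alpha) / (n - 1)) for a in A}
    for i, s in enumerate(Q):
        for a in A:
            T[a][i, Q.index(skeleton[s][a])] = alpha
    Z = np.full((n, no), (1 - eps) / (no - 1))
    for i, s in enumerate(Q):
        Z[i, O.index(emit[s])] = eps
    return T, Z

T, Z = alpha_epsilon_matrices(skeleton, emit, ALPHA, EPS)
```

Setting ALPHA = EPS = 1 recovers the deterministic Moore machine; each row of T[a] and Z is a proper probability distribution by construction.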
8.1.3 The Circular 1D Maze Environment

Figure 8.3: The α-ε Circular 1D Maze environment.

The α-ε Circular 1D Maze environment (Figure 8.3, originally from [29]) is another 4-state environment in which the agent can move west and east at each state. The agent sees nothing in states Left, Middle, and Right, and it sees goal in state Goal. This environment is made circular (and thus more difficult) by having the action west in state Left lead to the state Goal and having the action east from state Goal lead back to the state Left. More formally, the environment states, actions, and observations in the α-ε Circular 1D Maze environment are Q = {Left, Middle, Right, Goal}, A = {east, west}, and O = {nothing, goal}, respectively.

8.1.4 The Shuttle Environment

Figure 8.4: The α-ε Shuttle environment.

The α-ε Shuttle environment (Figure 8.4, originally from [33]) is an 8-state environment that simulates docking with and transporting materials between two space stations separated by open space. LRV refers to the least recently visited of these stations; MRV refers to the most recently visited one. The agent always faces exactly one of the two space stations and can use the action TurnAround to reverse its orientation. The agent can also move forward or backward using the actions GoForward and Backup, respectively. The agent must transition to empty space before transitioning from one space station to the other. The agent docks at a space station by performing the action Backup while at that station but facing open space.
The agent can see station MRV (seeMRVForward) or station LRV (seeLRVForward) while facing the appropriate station (either from open space or directly in front of that station). When facing open space while at one of these stations (but not docked at that station), the agent will not see anything (seeNothing). The agent can also see whether it is docked at MRV (seeMRVDocked) or docked at LRV (seeLRVDocked). More formally, the environment states, actions, and observations in the α-ε Shuttle environment are Q = {0, 1, 2, 3, 4, 5, 6, 7}, A = {TurnAround, GoForward, Backup}, and O = {seeLRVForward, seeMRVForward, seeMRVDocked, seeNothing, seeLRVDocked}, respectively.

8.1.5 Network

Figure 8.5: The α-ε Network environment.

Figure 8.5 illustrates the 7-state α-ε Network POMDP environment (originally from [29]), which models a network monitoring task. The agent can only see whether the network is up or down and can perform the actions unrestrict, steady, restrict, and reboot. More formally, the environment states, actions, and observations in the α-ε Network environment are Q = {s000, s020, s040, s060, s080, s100, crash}, A = {unrestrict, steady, restrict, reboot}, and O = {up, down}, respectively.

8.1.6 Grid

The α-ε Grid environment (Figure 8.6, originally from [127]) models an agent navigating in a 2D grid environment of size 3x4.
The agent can take the actions up, down, left, and right, and it can see whether a wall exists at the left and right side of each state: ‖ means that the agent sees a wall to its left and to its right, |· means that the agent sees a wall only on its left side, ·| means that the agent sees a wall only on its right side, and · means that the agent sees no walls. The grayed-out square in the middle is an obstacle that blocks agent movement, so there are only 11 states. More formally, the environment states, actions, and observations in the α-ε Grid environment are Q = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3), (3, 4)}, A = {up, down, left, right}, and O = {‖, |·, ·|, ·}, respectively.

Figure 8.6: The α-ε Grid environment.

8.2 Color World Environments

We also evaluate the performance of PSDE models and sPOMDP models against fixed-history (k-order Markov) and LSTM models on random Color World POMDPs of varying sizes (see Figure 8.7). In color world POMDPs, the agent can take the actions ↑, ↓, ←, and →. Attempting to move off of the grid keeps the agent in place. Each grid cell is associated with a certain color, which the agent can perceive (albeit noisily). The agent senses the correct color at a grid cell state with probability ε. With probability 1 − ε, it observes one of the incorrect colors (with each incorrect color being equally likely). However, these color worlds are not α-ε POMDPs, because environment noise can only cause the agent to end up in neighboring grid cells by mistake. The correct transition under each action happens with probability α, but the agent will mistakenly take one of the other 3 possible actions (each being equally likely) with probability 1 − α (see Figure 8.7(a) for an illustration of this transition model for the intended action ↑). These color
These color 181 α (1-α)/3 (1-α)/3 (1-α)/3 (a) Color world transition model (b) 2x2 color world with 2 colors (c) 2x3 color world with 3 colors (d) 3x3 color world with 3 colors (e) 3x3 color world with 5 colors (f) 3x4 color world with 6 colors (g) 4x4 color world with 8 colors (h) 4x5 color world with 10 colors (i) 5x5 color world with 13 colors Figure 8.7: Example color world POMDP environments from sizes 2x2 to 5x5. worlds can be made arbitrarily large, and we can change the number of observable colors in order to make the environments more or less hidden to the agent. We can also endow these color world environments with interesting structure by beginning with an mn color world grid and blocking o some cells as obstacles. This results in environ- ments whose structures are reminiscent of the oor plans of oce buildings. For example, the 182 environment in Figure 8.8 (created from a 6x5 color world grid) is inspired by the map of one of the oors of the Information Sciences Institute at the University of Southern California. We call this environment the ISI Floor environment. Note that this can result in relatively narrow corridors that make learning challenging. Figure 8.8: The ISI Floor environment 8.3 Model Surprise, Error, and Accuracy Recall that the key idea of Probabilistic Surprise-based Learning (PSBL, Chapter 4.5) is to min- imize model surprise (Chapter 4.3, Equation 4.4) as a proxy for model error (Chapter 4.1, Equa- tion 4.1). Model error is the quantity that we truly wish to minimize but are unable to compute, because we do not know the underlying environment state space or ground truth transition and observation probabilities. In this section, we present the results of extensive experimentation 183 demonstrating a strong positive correlation between model surprise and model error. 
Reducing model surprise typically results in a corresponding reduction in model error, while increasing model surprise typically results in a corresponding increase in model error. Additionally, these results demonstrate that model surprise and model predictive accuracy are strongly negatively correlated: as model surprise decreases, model predictive accuracy tends to increase. Model predictive accuracy is simply the number of correct predictions the agent makes about its next observation divided by the total number of predictions made during a purely random walk of T time steps through its environment (starting from a random initial state). The agent is forced to make a prediction before seeing each observation (i.e., it can never refuse to make a prediction), and we force the agent to take random actions, so that it cannot exploit its model to predict only in the areas of its environment in which it is most confident. These results are remarkably consistent across different environments and noise levels. The results in this section focus on the use of PSBL to learn sPOMDP models, but similar correlation results can be obtained with PSBL applied to PSDE models. We study these correlations both in α-ε POMDP environments adapted from the literature (Section 8.3.1) and in random color world POMDP environments (Section 8.3.2) of varying sizes.

8.3.1 Environments from the Literature

Figures 8.9-8.14 plot the trajectories of model surprise (top row, subfigures (a)-(e)), model error (middle row, subfigures (f)-(j)), and model predictive accuracy (bottom row, subfigures (k)-(o)) that result from using PSBL to learn sPOMDP models (Algorithm 10, Chapter 6.3.4) of the rewardless α-ε POMDP environments from the literature introduced in Section 8.1. The hyperparameters used for learning were explore = 0.5, numActions = 5000, and patience = 3. For all experiments in this and future sections, minGain was left at 0.0 (see Chapter 6.3.4.5).
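The predictive-accuracy measurement defined above can be sketched as follows. This is a simplified illustration; `predict`, `env_step`, and `env_observe` are hypothetical interfaces standing in for the model and environment, not the dissertation's actual code.

```python
import random

def predictive_accuracy(predict, env_step, env_observe, actions, T, rng=random):
    """Fraction of correct next-observation predictions over a T-step random walk.

    The agent must commit to a prediction before every observation, and its
    actions are forced to be random, so it cannot exploit its model to predict
    only where it is most confident.
    """
    history = []
    correct = 0
    for _ in range(T):
        a = rng.choice(actions)       # forced random action
        guess = predict(history, a)   # prediction made before observing
        env_step(a)
        o = env_observe()
        if guess == o:
            correct += 1
        history.append((a, o))
    return correct / T
```

A perfect model of a deterministic environment scores 1.0 under this measure, while any model's score degrades toward the observation noise floor as ε decreases.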
Please also see Chapter 6.3.4 for more details on these hyperparameters. Each subfigure in each row represents a separate setting of the environment noise parameters α and ε. Other noise levels were tested as well, namely (α = 0.85, ε = 1.0), (α = 0.8, ε = 1.0), and (α = 0.99, ε = 0.99), but they are not shown in these figures due to space. The horizontal axis in each subfigure represents the number of times the sPOMDP model has been split by the PSBL learning procedure (see Section 6.3.4.5, Algorithm 18). Each line represents a separate run of PSBL until termination. There are 50 such runs (each a different line) represented in each subfigure of Figures 8.9-8.14.

Figure 8.9: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Shape environment (Section 8.1.1) under varying environment noise levels. The horizontal axis of each subfigure is the number of PSBL model splits. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95.

Figure 8.9 plots these surprise, error, and accuracy trajectories for the α-ε Shape environment (Section 8.1.1). Recall that, in PSBL learning of sPOMDPs, we begin with a model state space that has |O| model states, each one representing a separate agent observation. |O| = 2 in the α-ε Shape environment, so, initially, the agent has two model states (one for each of the two observations). Since the α-ε Shape environment has four environment states, we should expect model surprise to decrease the first two times the model is split (assuming quality splits are being performed) and then increase indefinitely after two splits, as the agent begins inferring non-existent states. This is precisely what we see in subfigures (a)-(e) of Figure 8.9. Regardless of the noise level, minimal model surprise is achieved after two splits.
Subsequent splits only serve to increase model surprise.

Figure 8.10: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Little Prince environment (Section 8.1.2) under varying environment noise levels. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95.

Note how closely the model error curves (Figure 8.9, subfigures (f)-(j)) match the curves of model surprise (subfigures (a)-(e)). The model error curves also exhibit a clear minimum value after 2 model splits. The closer the environment is to fully deterministic, the more closely the model surprise and model error curves match one another. Interestingly (though not unexpectedly), the curves of model accuracy (subfigures (k)-(o)) are very close to mirror images (about the horizontal axis) of the model surprise and model error curves. We should expect that reducing model error will, in general, result in a model with higher predictive accuracy, and this is exactly what we observe. Again, the closer the environment is to fully deterministic, the more strongly this relationship seems to hold.
The results in Figures 8.9-8.14 also demonstrate that PSBL for sPOMDPs has, in every case tested so far, converged to a solution in a reasonable amount of time without requiring any explicit bounds on the maximum number of model states that can be inferred or the maximum allowable lengths of model state outcome sequences.

Figure 8.11: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Circular 1D Maze environment (Section 8.1.3) under varying environment noise levels. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.99.

Though we have not yet proved the convergence of PSBL for sPOMDPs to a solution in finite time as we have for PSDE models (see Chapter 7.1.3), we strongly suspect that such a proof is possible using similar arguments (even without requiring a patience hyperparameter). At the moment, the convergence of PSBL for sPOMDP models is an important open theoretical question that we leave for future work.

Figures 8.10-8.14 demonstrate, to a large degree, the same basic trends we see in Figure 8.9 (the α-ε Shape environment). There are a few exceptions and special situations that are worth mentioning. First, in some environments, including the α-ε Little Prince environment (Section 8.1.2 and Figure 8.10) and the α-ε Shuttle environment (Section 8.1.4, Figure 8.12), surprise decreases (roughly) to a minimum value after one or more model splits and then largely plateaus, with several upward spikes shooting up from this plateau. This happens more frequently as environment noise increases.
In these environments, some possible model splits are benign (though not particularly useful) and have little effect on model surprise (or error), whereas others cause a sharp increase in both model surprise and model error.

Figure 8.12: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Shuttle environment (Section 8.1.4) under varying environment noise levels. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95.

The agent has found a very good model of the environment in both cases, but sometimes unnecessary splits are performed before the agent realizes this. The important point is that the particular choice of which model state and action pair to split next (when multiple valid options exist) can make a dramatic difference in these types of environments in terms of runtime and the number of states the agent infers. Currently, when multiple model states are eligible for splitting, we split the one with the shortest outcome sequence (with ties broken arbitrarily). However, it is possible that a more sophisticated ordering strategy for splitting may further improve sPOMDP modeling performance (particularly for noisier environments).

The α-ε Circular 1D Maze environment (Section 8.1.3) exhibits some very interesting surprise, error, and accuracy trajectories (see Figure 8.11). Unlike many of the other environments tested, there appear to be multiple very distinct local minima in surprise space in this environment. Sometimes the agent makes good splitting choices, which leads it to a very low value of surprise.
In other cases, the agent makes poor splitting choices and the algorithm terminates with high surprise (and high model error).

Figure 8.13: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Network environment (Section 8.1.5) under varying environment noise levels. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95.

In both cases, though, model surprise, model error, and model accuracy remain highly correlated (as they are in the other environments). We believe that this is due largely to the narrow, corridor-like nature of the environment and the fact that introducing noise in an α-ε POMDP fashion teleports the agent around this hallway randomly, making repeated and structured experimentation more difficult. Thus, as noise increases, this environment quickly loses its structure, making it very difficult for the agent to distinguish between the states in which it sees nothing. Based on our experimentation, this convergence to a poor local optimum in surprise space seems to largely vanish if the transition model of the agent is changed such that the agent can only "slip" into neighboring states (as in the color world environments we discuss in future sections). Interestingly, as we show below, sPOMDP models of the α-ε Circular 1D Maze environment support higher-quality decision making on average than PSDE models of this environment (which actually predict future observations more accurately), even with these undesired local minima in surprise space.
Figure 8.14: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for the PSBL algorithm applied to the α-ε Grid environment (Section 8.1.6) under varying environment noise levels. The noise settings of the five columns are, from left to right: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95.

The α-ε Network environment (Section 8.1.5, Figure 8.13) and the α-ε Grid environment (Section 8.1.6, Figure 8.14) exhibit fairly smooth surprise, error, and accuracy trajectories. What is interesting about these environments is that the clear corresponding minima in surprise and model error space occur several splits before the agent has created a model state for each environment state. In other words, the agent consistently infers fewer states in these environments than actually exist. This occurs because almost every observation in these environments represents several (up to 4) environment states, and there are few other observations the agent can use to help disambiguate these environment states. In general, in environments where most (or all) of the agent's observations cover several model states, sPOMDP modeling can be very challenging, because most of the environment's structure is hidden, and there are few unique observations to aid in localization. As we show in future sections, however, the sPOMDP models of these environments are still capable of supporting high-quality decision-making, even with fewer states than the ground truth POMDP environments. From a decision-making perspective, the α-ε Grid environment is also interesting, because it is the only one of the literature environments in which Predictive SDE models outperformed sPOMDP models on average.
Further study of this environment may very well lead to important discoveries that improve sPOMDP modeling.

8.3.2 Color World POMDP Environments

Figures 8.15-8.21 plot the trajectories of model surprise (top row, subfigures (a)-(e)), model error (middle row, subfigures (f)-(j)), and model predictive accuracy (bottom row, subfigures (k)-(o)) that result from using PSBL to learn sPOMDP models (Algorithm 10, Chapter 6.3.4) of random color world POMDPs (Section 8.2, Figure 8.7) of varying sizes. The hyperparameters used for learning were explore = 0.5, numActions = 15000, patience = 3, and minGain = 0.0. Each subfigure in each row represents a separate setting of the environment noise parameters α and ε; from left to right, the five columns correspond to α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; and α = ε = 0.95. Recall, however, that color world POMDPs are not α-ε POMDPs, due to their transition model (Figure 8.7(a)). Other noise levels were tested as well, namely (α = 0.85, ε = 1.0), (α = 0.8, ε = 1.0), and (α = 0.99, ε = 0.99), but they are not shown in these figures due to space. The horizontal axis in each subfigure again represents the number of times the sPOMDP model has been split by the PSBL learning procedure (see Section 6.3.4.5, Algorithm 18). Each line represents a separate run of PSBL until termination on a random color world POMDP of the given size and number of color observations. There are 50 such runs (each a different line) represented in each subfigure of Figures 8.15-8.21.

Figure 8.15: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 2x2 color world environments with 2 colors (Section 8.2, Figure 8.7(b)) for a variety of environment noise levels.
The randomization of these color worlds is performed by randomly assigning each color observation to exactly two environment states (unless there is an odd number of states, in which case one environment state will have a unique color observation). These environments are therefore approximately 50% hidden to the agent.

Figure 8.16: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 2x3 color world environments with 3 colors (Section 8.2, Figure 8.7(c)) for a variety of environment noise levels.

This randomization is important in order to ensure that the learning procedures presented in this dissertation apply to a broad range of environment structures, not merely hand-engineered environment structures (such as the environments from the literature in the previous section). This randomization is also what causes the initial model surprise and error values in, for example, Figures 8.15-8.17 to be more spread out than in the environments from the literature.

Figure 8.17: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 3x3 color world environments with 5 colors (Section 8.2, Figure 8.7(e)) for a variety of environment noise levels.
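The randomization scheme just described can be sketched as follows. This is an illustrative reconstruction, not the dissertation's actual code; the function name is hypothetical.

```python
import random

def random_color_assignment(states, rng=random):
    # Assign each color observation to exactly two environment states; if the
    # number of states is odd, one leftover state receives a unique color.
    # The number of colors is therefore ceil(|Q| / 2), so each environment is
    # roughly 50% hidden to the agent.
    states = list(states)
    rng.shuffle(states)
    color_of = {}
    for color, i in enumerate(range(0, len(states), 2)):
        for s in states[i:i + 2]:
            color_of[s] = color
    return color_of
```

For example, a 3x3 grid (9 states) receives 5 colors under this scheme: four colors cover two states each, and one color covers a single state.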
In general, we see the same basic trends in the surprise, error, and accuracy curves for the random color worlds (Figures 8.15-8.21) that we see in the environments from the literature (Figures 8.9-8.14). In particular, model surprise and model error consistently exhibit remarkably similar trajectories, indicating a strong positive correlation. Model predictive accuracy also consistently looks like a mirrored image (about the horizontal axis) of the surprise and model error trajectories, indicating a strong negative correlation. In Figures 8.17-8.21 (the larger color world POMDPs tested), we see a clear bend in the curve at the expected number of model splits (|Q| − |O|, where |Q| is the number of underlying environment states and |O| is the number of agent observations), past which surprise and error plateau or begin to climb back up. For example, we see such bends very clearly in Figure 8.17 (the 3x3 color world environment with 5 color observations) at 4 model splits. The agent begins with a model state space of 5 model states (one per unique color observation), and it takes 4 model splits to get to 9 model states, which is equal to the value of |Q| in this environment. Thus, this is exactly where we would expect (and wish) such a bend in the curve to occur, because further splitting results in overfitting (inferring non-existent states). As another example, we see clear bends in these curves at (or near) 12 model splits in the 5x5 color world environment with 13 color observations (Figure 8.21), as expected (for analogous reasons). Model predictive accuracy tends to reach its highest value and plateau at these bends as well, as expected and desired.
Figure 8.18: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 3x4 color world environments with 6 colors (Section 8.2, Figure 8.7(f)) for a variety of environment noise levels.

It is also particularly interesting how well-behaved these trajectories are (particularly when compared to those in Figures 8.9-8.14), even though the environments are randomly generated. Surprise and model error decrease almost completely monotonically with the number of model splits (up to the point at which they plateau). Conversely, model predictive accuracy increases almost completely monotonically with the number of model splits (again, up to the point at which the plateau occurs). As was the case with the environments from the literature in the previous section, these curves are more well-behaved (and the highlighted trends more readily apparent) the more deterministic the environment is.

Figure 8.19: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 4x4 color world environments with 8 colors (Section 8.2, Figure 8.7(g)) for a variety of environment noise levels.

There are a few other interesting trends worth highlighting in these color world figures (Figures 8.15-8.21).
In the environments from the literature (Figures 8.9-8.14), most of the surprise and error curves exhibited a "U"- or "V"-like shape: surprise (or error) decreased for some number of model splits (typically almost completely monotonically) and then increased very sharply. In the color world environments, we do not see this behavior nearly as often. We see it to some degree in the smallest color world POMDPs tested (2x2 with 2 colors in Figure 8.15, and 2x3 with 3 colors in Figure 8.16). More typically, however, surprise and error both decrease gradually in color world environments to their lowest values and then plateau, which is interesting. We are actively investigating whether this should be expected to be a general trend in larger environments, or whether it is the particular nature of these color world environments that causes such consistent behavior in surprise and model error space.

Figure 8.20: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 4x5 color world environments with 10 colors (Section 8.2, Figure 8.7(h)) for a variety of environment noise levels.
Figure 8.21: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to random 5x5 color world environments with 13 colors (Section 8.2, Figure 8.7(i)) for a variety of environment noise levels.

Finally, particularly as the color world environments become larger and more noise is introduced, we notice some signs that PSBL sometimes stops too early (i.e., before it has found all the latent structure in the environment). See Figures 8.19(e),(j),(o) and 8.20(e),(j),(o) for characteristic examples. In contrast to the early stopping we saw in the α-ε Network and α-ε Grid environments (Figures 8.13 and 8.14), this early stopping does not correspond to a dramatic increase in either model surprise or model error. Most likely, the agent is simply not able to find sufficient evidence that further splitting its model will be useful. This can be remedied (to some extent) either by performing more actions in the environment (though the number of actions needed is difficult to know a priori) or by lowering the initial pseudocount values of the Dirichlet prior distributions defining the model transition and one-step functions (see Chapters 6.3.4.3 and 6.3.4.4) such that data are very heavily weighted compared with the agent's prior (uninformed) belief that all observations are equally likely under all possible actions (for all possible agent histories).
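The effect of these initial pseudocounts can be illustrated with a small sketch. This is a generic Dirichlet-smoothed estimate under a uniform prior, not the dissertation's exact update rule.

```python
def posterior_predictive(counts, pseudocount):
    # Dirichlet-smoothed estimate of the distribution over next observations
    # in some context. Every observation starts with the same pseudocount, so
    # the prior is uniform (all observations equally likely); smaller
    # pseudocounts weight the observed data more heavily relative to that
    # uniform prior.
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]
```

For example, with observed counts [8, 2], a pseudocount of 1.0 yields [0.75, 0.25], while a very large pseudocount pulls the estimate back toward the uniform [0.5, 0.5]; lowering the pseudocount lets the same data dominate the prior more quickly.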
Figure 8.22: Model surprise (subfigures (a)-(e)), model error (subfigures (f)-(j)), and model predictive accuracy (subfigures (k)-(o)) trajectories for PSBL applied to the ISI Floor environment (Section 8.2, Figure 8.8) for a variety of environment noise levels.

Figure 8.22 illustrates the model surprise, error, and accuracy trajectories corresponding to PSBL learning of sPOMDP models in the ISI Floor environment (Section 8.2, Figure 8.8), which is a specific instance of a 6x5 color world POMDP in which several grid cells have been designated as obstacles that block the agent's path. This environment is particularly interesting because, when α = ε = 1 (Figure 8.22(a)) and when α = 0.99 and ε = 1.0 (Figure 8.22(b)), there is a very shallow local minimum in surprise space (which does not seem to correlate well with model error) that occasionally causes learning to halt almost immediately with a very poor model. In other cases, the agent's surprise increases sharply after one or two model splits and then finds its way down gradually to a low surprise value before terminating with a high-quality model. Fortunately, most runs of PSBL find their way down to a low value of surprise eventually. Additionally, introducing environment noise actually helps the agent learn a better model here, which is not always the case (as we saw with the α-ε Circular 1D Maze environment of Figure 8.11). It is possible that this behavior is the result of making particularly bad initial model splits, which might be remedied in future work with a more sophisticated split ordering strategy.
8.3.3 The Correlation Between Surprise, Error, and Accuracy

In the previous sections, we analyzed a number of illustrations from runs of PSBL in different rewardless POMDP environments that seemed to indicate that model surprise and model error are, in general, positively correlated, whereas model surprise and model predictive accuracy are, in general, negatively correlated. Both of these correlations are desirable, and the more tightly these quantities are correlated, the better we can expect PSBL to perform, because it means that minimizing surprise is very effectively minimizing model error (and increasing model predictive accuracy). In this section, we experimentally demonstrate and quantify the strength of this correlation.

The Pearson correlation coefficient [164] of a sample of paired data consisting of the values of two variables is a measure of the linear correlation between the two variables. A value of +1 indicates total positive linear correlation, while a value of −1 indicates total negative linear correlation. A value of 0 indicates that the two variables are not linearly correlated at all. Recall that each subfigure in Figures 8.9-8.22 represents 50 runs of the PSBL learning algorithm on the given rewardless POMDP environment under a certain level of environment noise. For each such run, we can compute the pairwise Pearson correlation between the curves for model surprise and model error and between the curves for model surprise and model predictive accuracy. Averaging these correlations over the 50 runs of PSBL gives us a good indication of the expected correlation of these quantities for each environment and noise level tested.

Figure 8.23: An illustration of the average Pearson correlation between model surprise and model predictive accuracy (subfigures (a) and (b)) and the average Pearson correlation between model surprise and model error (subfigures (c) and (d)) in a number of rewardless POMDP environments under varying noise levels. (a) Pearson correlation between model surprise and accuracy in environments from the literature. (b) Pearson correlation between model surprise and accuracy in color world environments. (c) Pearson correlation between model surprise and error in environments from the literature. (d) Pearson correlation between model surprise and error in color world environments. (e) Legend for subfigures (a)-(d): the bars within each group correspond to the noise settings (α = 1.0, ε = 1.0), (α = 0.99, ε = 1.0), (α = 0.95, ε = 1.0), (α = 0.9, ε = 1.0), (α = 0.85, ε = 1.0), (α = 0.8, ε = 1.0), (α = 0.99, ε = 0.99), and (α = 0.95, ε = 0.95).

Figure 8.23 visualizes these average Pearson correlations. Subfigure (a) shows the average Pearson correlation between model surprise and model predictive accuracy in the environments from the literature (Section 8.1). Subfigure (b) shows the average Pearson correlation between model surprise and model predictive accuracy in the color world environments (Section 8.2). Subfigure (c) shows the average Pearson correlation between model surprise and model error in the environments from the literature. Subfigure (d) shows the average Pearson correlation between model surprise and model error in the color world environments. Each group of bars represents a different environment type, and each bar within that group represents a different noise level tested in that environment. Recall that each bar is itself an average over 50 runs of the PSBL algorithm. Subfigure (e) provides a legend for subfigures (a)-(d).

Figures 8.23(a) and 8.23(b) clearly indicate a strong negative correlation between model surprise and model predictive accuracy, which is exactly in line with our expectations and desires. The idea behind PSBL is to minimize model surprise in order to make a model that is a better representation of its environment.
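Because correlation coefficients are not additive, the per-run Pearson correlations are averaged in Fisher z-space, a point addressed in the technical footnote in this section. A minimal sketch of that aggregation, assuming plain Python lists of per-run coefficients:

```python
import math

def average_correlation(rs):
    # Pearson correlation coefficients are not additive, so map each
    # coefficient to its Fisher z-value (z = atanh(r)), average the z-values
    # (which are additive), and map the mean back with the inverse
    # transformation (r = tanh(z)).
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))
```

When all runs agree, the result matches the common value; when they disagree, the z-space average generally differs from the naive arithmetic mean of the coefficients.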
As the model more closely conforms to its environment, we would expect the predictive accuracy of this model to increase. These correlations are strong across all environments and noise levels tested, though they sometimes become slightly weaker as more noise is introduced, which is not surprising. Figures 8.23(c) and 8.23(d) clearly indicate a strong positive correlation between model surprise and model error across all environments and noise levels tested. This is a very powerful empirical demonstration that reducing model surprise does, in fact, effectively result in a reduction in model error, as we had desired. These results are extremely consistent across all environments and noise levels tested.

(As a technical point, it is well known in statistics that correlation coefficients, such as Pearson's, are not additive. This means, in particular, that averaging correlation coefficients directly does not provide a measure of the average correlation between two variables. We overcome this by first performing a Fisher transformation [51] of each correlation coefficient to its corresponding z-value, averaging these z-values (which are additive), and transforming the result back into a Pearson correlation coefficient using the inverse Fisher transformation. This results in a quantity that can be usefully interpreted as representing the average or expected correlation.)

8.4 A Comparative Analysis of Learning Models

In this section, we compare the predictive accuracies (normalized relative to an oracle's performance), runtimes, model sizes, and model errors of learned PSDE and sPOMDP models in a number of rewardless α-ε POMDP environments (adapted from environments in the literature; see Section 8.1) against fixed-length history-based (k-order Markov [19]) models and Long Short-Term Memory (LSTM) [60] neural network models. We perform this analysis under a number of different environment noise levels (i.e., settings of α and ε).
We also perform a similar comparison in the color world environments of Section 8.2 (see Figure 8.7). More specifically, in Figures 8.24-8.39, we compare PSDE (orange) and sPOMDP (cyan) models learned by PSBL against first- (blue), second- (red), and third-order (yellow) Markov models that consider one, two, and three (respectively) action-observation pairs of history before making a prediction about the next observation the agent will see. Note that k-order Markov models are purely predictive representations that do not attempt to model latent environment structure, as is the case with PSDE models. In contrast to PSDE models, however, the history window of k-order Markov models is fixed manually before learning and is the same for every history context. We also compare PSDE and sPOMDP models against LSTM models (green) trained to make predictions about the agent's next observation given sequences consisting of its previous k actions and observations. In contrast to k-order Markov models, LSTM models can be seen as maintaining a representation of latent state encoded in the weights of the neural network, making them somewhat more similar to sPOMDP models in that regard.

One of the most important things to realize about these comparisons is that k-order Markov and LSTM models are parametric models, as opposed to the nonparametric representations offered by PSDEs and sPOMDPs. Additionally, we train the k-order Markov and LSTM models offline on large, fixed-size datasets ranging from 200,000 to 500,000 random agent trajectories (depending on the size of the environment and the models) in order to maximize their performance. This is in contrast to our sPOMDP and PSDE models, which are trained in an online, active, and incremental fashion directly from a single stream of experience. For this reason, we show only the runtimes associated with learning PSDE and sPOMDP models.
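A k-order Markov baseline of the kind described above can be sketched as a simple count-based predictor. This is an illustrative reconstruction, not the exact baseline implementation used in the experiments.

```python
from collections import Counter, defaultdict

class KOrderMarkov:
    """Predicts the next observation from a fixed window of the last k
    action-observation pairs plus the next action."""

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(Counter)

    def update(self, history, action, observation):
        # history is the full list of (action, observation) pairs so far;
        # only the last k pairs form the context, regardless of history.
        context = (tuple(history[-self.k:]), action)
        self.counts[context][observation] += 1

    def predict(self, history, action):
        # Most frequent observation seen in this fixed-length context,
        # or None if the context has never been observed.
        dist = self.counts[(tuple(history[-self.k:]), action)]
        return dist.most_common(1)[0][0] if dist else None
```

Unlike PSDE models, the window k here is chosen manually before learning and is identical for every context, which is precisely the limitation the variable-length PSDE representation is designed to avoid.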
PSBL is also responsible for deciding how much experience the agent needs when learning sPOMDP and PSDE models (i.e., the agent is not given a fixed-size training set to randomly shuffle). This combination of factors makes the learning problem significantly more difficult for PSBL than it is for learning k-order Markov models and LSTMs. As a final note, the LSTMs against which we compare our PSDE and sPOMDP models in this section had a single layer, and their various hyperparameters were (extensively) manually tuned in order to maximize performance. We found that setting the number of hidden units to between 2 and 3 times the number of underlying environment states provided optimal performance (note that this is a vital piece of manually-specified prior knowledge about the environment). We also found that setting the length of each random agent trajectory fed into these LSTMs as input to 4 action-observation pairs resulted in the best performance in the environments tested. We trained these LSTMs for 20 epochs (with early stopping [166]) to output a probability distribution over possible agent observations at the next time step for any given input sequence of agent actions and observations. This means that we can use our generalized model error metric (Section 4.1, Equation 4.1) to evaluate the average model error of these LSTM models (as well as the k-order Markov models) in the same way we can for PSDE and sPOMDP models. This is one of the primary benefits of adopting such an error metric.
[Figure 8.24: A comparison of different learning algorithms and models in the αε-Shape environment (Figure 8.1.1). Subfigures: (a) accuracy relative to oracle; (b) model error; (c) runtime (seconds); (d) number of parameters; (e) model states (sPOMDP); (f) legend for subfigures (a)-(d). Bars are grouped by environment noise level (α, ε).]

8.4.1 Environments from the Literature

Figures 8.24-8.30 illustrate the key results of this performance comparison in the αε-POMDP environments adapted from important environments in the literature (Section 8.1, Figures 8.1.1-8.1.6) for a variety of noise levels. The hyperparameters used for PSBL learning of sPOMDPs in these experiments were explore = 0.5, numActions = 5000, patience = 3, and minGain = 0.0 for all environments and noise levels tested. The hyperparameters used for PSBL learning of PSDE models in these environments were explore = 0.5 and numActions = 250. minGain was set to 0.1 times the length of the PSDE being evaluated for splitting. This is not strictly required, but it can help reduce the runtime of PSBL for PSDEs by requiring that longer PSDEs provide stronger evidence (in the form of larger gain) in order to justify splitting them. Compound PSDEs were not used in these experiments, and neither were the operations of refinement and merging (see Sections 5.3.1.3 and 5.3.1.4). This enabled us to focus on the performance of model splitting in PSBL, which is common to both PSDE and sPOMDP models. Results associated with compound PSDEs, refinement, and merging can be found in [38].
[Figure 8.25: A comparison of different learning algorithms and models in the αε-Little Prince environment (Figure 8.1.2). Subfigures as in Figure 8.24.]

In each of the Figures 8.24-8.30, subfigure (a) provides the predictive accuracy of each model type for a given environment type and noise level relative to the performance of an oracle with complete knowledge of the environment. The raw predictive accuracy of each model in each environment is computed by dividing the number of correct predictions that model makes about the next observation under 10,000 random actions in the environment (beginning at a random initial environment state) by the total number of predictions made. The agent is forced to make a prediction at every time step (i.e., it cannot refuse to make predictions when it is unsure). The predictive accuracy of each model relative to the oracle is computed by simultaneously having the oracle make predictions in the same environment and dividing the raw predictive accuracy of each model by the predictive accuracy of the oracle in the same environment (and under the same random environment experience). This results in a relative predictive accuracy between 0 and 1 that measures the strength of the predictive accuracy of each model type relative to the performance of an optimal predictor in the same environment.
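The relative-accuracy computation described above amounts to a simple ratio of hit rates over the same random experience; a minimal sketch (function and argument names are illustrative, not from the original code):

```python
def relative_predictive_accuracy(model_correct, oracle_correct, total_steps):
    """Predictive accuracy of a learned model relative to an oracle.

    Both predictors are forced to predict the next observation at every one
    of `total_steps` random actions in the same environment; each count
    argument records how many of those predictions were correct.
    """
    model_acc = model_correct / total_steps    # raw accuracy of the model
    oracle_acc = oracle_correct / total_steps  # raw accuracy of the oracle
    # In [0, 1] whenever the oracle predicts at least as well as the model.
    return model_acc / oracle_acc
```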
[Figure 8.26: A comparison of different learning algorithms and models in the αε-Circular 1D Maze environment (Figure 8.1.3). Subfigures as in Figure 8.24.]

As we discussed in Section 5.3.2.1, it is possible to have a model that predicts future observations with high accuracy but which represents the distributions defining its underlying environment very poorly. The model error metric developed in Section 4.1 (Equation 4.1) is designed to empirically measure the (average) deviation of the history-dependent observation probability distributions induced by a model at each time step from ground-truth history-dependent observation distributions via a simulation of T random actions, and it can be applied to all of the models tested in this section. T was set to 10,000 for all experiments. In Figures 8.24-8.30, subfigure (b) illustrates this model error for each model type. In each of the Figures 8.24-8.30, subfigure (c) provides the runtimes (in seconds) of PSBL for learning PSDE models (orange) and PSBL for learning sPOMDP models (cyan).
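Equation 4.1 itself is not restated in this section; purely as an illustrative sketch of the averaging it describes (assuming, for concreteness, an L1 distance between per-step distributions, which may differ from the distance actually used in Equation 4.1), the model error over T simulated random actions could be computed as:

```python
def average_model_error(model_dists, true_dists):
    """Average per-step deviation between the model-induced and ground-truth
    next-observation distributions over a simulated trajectory.

    model_dists, true_dists: length-T lists of dicts mapping each
    observation to its probability at that time step. The per-step
    deviation here is the L1 (total) distance between the two dicts.
    """
    assert len(model_dists) == len(true_dists)
    total = 0.0
    for p, q in zip(model_dists, true_dists):
        observations = set(p) | set(q)  # union of the two supports
        total += sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in observations)
    return total / len(model_dists)
```

Because the metric only needs a predicted next-observation distribution at each step, it applies uniformly to PSDE, sPOMDP, k-order Markov, and LSTM models.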
Recall that the runtimes of the other models are not particularly informative, as they are trained offline on large, fixed-size datasets.

[Figure 8.27: A comparison of different learning algorithms and models in random variations of the αε-Shape environment (Figure 8.1.1). Subfigures as in Figure 8.24.]

Subfigure (d) compares the total number of model parameters in each model type. For the PSDE and k-order Markov models, the number of parameters is the number of PSDEs (or history contexts, in the case of k-order Markov models) multiplied by |O| - 1. The minus 1 owes to the sum-to-one constraint of each of these separate observation probability distributions. In sPOMDP models, the total number of parameters is |M| · |A| · (|M| - 1), where |M| is the number of learned model states. This is because, in an sPOMDP model, we have a distribution over the next model state for each possible model state and action pair. Again, the minus 1 owes to the sum-to-one constraint of each of these transition distributions defining the model. Subfigure (e) compares the number of model states inferred by PSBL for sPOMDP learning (blue) in each environment to the ground-truth number of environment states (red). Subfigure (f) provides a legend for subfigures (a)-(d). Each group of bars represents a different environment noise level (setting of α, ε) tested. Each bar is an average over 50 runs of each learning algorithm.
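The parameter-count formulas above can be expressed directly (a trivial sketch with illustrative function names):

```python
def psde_parameter_count(num_contexts, num_observations):
    """Parameters in a PSDE or k-order Markov model: one observation
    distribution with (|O| - 1) free parameters per PSDE / history context
    (the last probability is fixed by the sum-to-one constraint)."""
    return num_contexts * (num_observations - 1)

def spomdp_parameter_count(num_model_states, num_actions):
    """Parameters in an sPOMDP: one next-model-state distribution with
    (|M| - 1) free parameters per (model state, action) pair, giving
    |M| * |A| * (|M| - 1) in total."""
    return num_model_states * num_actions * (num_model_states - 1)
```

For example, a first-order Markov model over |A| · |O| history contexts in an environment with 3 observations has 2 free parameters per context, while an sPOMDP with 4 model states and 2 actions has 4 · 2 · 3 = 24 parameters.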
[Figure 8.28: A comparison of different learning algorithms and models in the αε-Shuttle environment (Figure 8.1.4). Subfigures as in Figure 8.24.]

In Figures 8.24(a)-8.30(a), we see that PSDE modeling (orange) performs as well as or outperforms k-order Markov models (blue, red, and yellow) and LSTMs (green) in terms of relative predictive accuracy in every αε-POMDP environment at almost every noise level tested. PSDE modeling requires fewer parameters (in every case) than 3rd-order Markov models and LSTMs, and even requires fewer parameters than 2nd-order Markov models in many cases (see Figures 8.24(d)-8.30(d)). PSDE modeling also results in consistently lower model error than the other modeling approaches (see Figures 8.24(b)-8.30(b)). PSDE modeling rather significantly outperforms the other models in terms of both model error and relative predictive accuracy in the αε-Circular 1D Maze environment (see Figure 8.26) and the αε-Network environment of Figure 8.29. Both of these environments have observations that cover several (up to 4, in the case of the αε-Network environment) environment states. The adaptive nature of PSDE modeling allows it to dig deeply along ambiguous history contexts in a way that the fixed-length history-based (k-order Markov) and LSTM models are not able to do (because of their fixed parametric form and fixed history window size).
[Figure 8.29: A comparison of different learning algorithms and models in the αε-Network environment (Figure 8.1.5). Subfigures as in Figure 8.24.]

In some environments, including the αε-Shape environment (Figure 8.24), the αε-Little Prince environment (Figure 8.25), and the αε-Circular 1D Maze environment (Figure 8.26), sPOMDP modeling meets or exceeds the performance of other modeling approaches (including PSDE modeling) in terms of relative predictive accuracy when the environment is fully or almost fully deterministic. Given the nature of the approximate Bayesian inference procedures utilized when learning sPOMDP models with PSBL (e.g., SDE-based belief smoothing, see Section 6.3.4.2), this is not unexpected. The assumptions made by these approximate procedures become increasingly justified the closer the environment is to fully deterministic, so we should expect sPOMDP modeling to be most effective in such cases.
[Figure 8.30: A comparison of different learning algorithms and models in the αε-Grid environment (Figure 8.1.6). Subfigures as in Figure 8.24.]

Nevertheless, as we show later, even though sPOMDP models often don't predict as well as PSDE models, they typically perform better on decision-making tasks, because they can be solved like regular POMDPs for optimal policies (where, again, the optimality of the policy is relative to the given sPOMDP model, which is only an approximation of the true underlying POMDP environment). It is also interesting to note that the runtimes of PSBL for learning sPOMDP models (Figures 8.24(c)-8.30(c)) typically fluctuate less with changes to environment noise levels than the runtimes of PSBL for learning PSDE models. The runtimes of PSBL for PSDEs vary quite a bit depending on the environment noise level (particularly when noise is introduced into agent observations). Finally, we note that, for many environments and noise levels, PSBL for sPOMDPs learns an sPOMDP whose model state space size |M| is a high-quality approximation of the number of true underlying environment states |Q| (see Figures 8.24(e)-8.30(e)). In the αε-Shape environment, for example (Figure 8.24), the agent infers almost exactly 4 model states in all but the noisiest versions of this environment, in which it infers slightly fewer. In general, of course, increasing the environment noise can cause the agent to infer too many or too few model states. In the
αε-Little Prince environment (Figure 8.25(e)), the agent infers between 4 and 5 model states. In the random variations of the αε-Shape environment (Figure 8.27(e)), the agent is actually able to provide good performance while inferring fewer than 4 model states on average. This is because a number of these environments can be represented well with 2 or 3 model states, depending on the particular configuration of squares and diamonds (and the way in which they are connected by most likely transitions). As we discussed in Section 8.3.1, the αε-Network (Figure 8.29) and αε-Grid (Figure 8.30) environments are interesting cases in which PSBL learns an sPOMDP with significantly fewer model states than there are underlying environment states. The relative predictive accuracies of the sPOMDPs learned in both environments are quite low, though they are still capable of supporting high-quality decision-making, as we demonstrate in later sections. To reiterate, the poor performance of sPOMDP modeling in these cases can be attributed to the fact that these environments have very deeply hidden structure, with almost all agent observations representing many environment states. This makes it difficult for the agent to identify hidden structure while simultaneously localizing itself, because there are few (if any) distinctive observations that help to orient the agent during the learning process. This is a current limitation of sPOMDP modeling that requires further investigation.

8.4.2 Color World POMDP Environments

Figures 8.31-8.39 illustrate the key results of an analogous performance comparison on random color world environments of increasingly large sizes (Section 8.2, Figure 8.7) and the ISI Floor environment (Figure 8.8) for a variety of environment noise levels. The hyperparameters used for PSBL learning of sPOMDPs in these experiments were explore = 0.5, numActions = 15000, patience = 3, and minGain = 0.0 for all environments and noise levels tested.
The hyperparameters used for PSBL learning of PSDE models in these environments were (as before) explore = 0.5 and numActions = 250. minGain was again set to 0.1 times the length of the PSDE being evaluated for splitting.

[Figure 8.31: A comparison of different learning algorithms and models in random 2x2 color world environments with 2 color observations (Figure 8.7(b)). Subfigures as in Figure 8.24.]

As in the previous section, in each of the Figures 8.31-8.39: subfigure (a) illustrates the predictive accuracy of each model type relative to the performance of an oracle in the same environment; subfigure (b) illustrates the model error of each model type; subfigure (c) illustrates the runtimes (in seconds) of PSBL for learning PSDE models (orange) and learning sPOMDP models (cyan); subfigure (d) illustrates the total number of parameters in each model type; subfigure (e) compares the number of sPOMDP model states learned by PSBL (blue) to the ground-truth number of environment states (red); subfigure (f) provides a legend for subfigures (a)-(d). As a reminder, each group of bars represents a different environment noise level (setting of α, ε) tested, but color world environments are not αε-POMDPs (as discussed in Section 8.2). Each bar is an average over 50 runs of each learning algorithm. Recall that these are random color world environments of the given sizes and numbers of color observations. This randomization is performed as described in Section 8.3.2.
[Figure 8.32: A comparison of different learning algorithms and models in random 2x3 color world environments with 3 color observations (Figure 8.7(c)). Subfigures as in Figure 8.24.]

We notice many of the same trends in Figures 8.31-8.38 that we saw in Figures 8.24-8.30 in Section 8.4.1. For the vast majority of color world environments and noise levels tested, PSDE modeling meets or exceeds the performance of k-order Markov models and LSTMs in terms of both relative predictive accuracy and model error, while requiring many fewer parameters than 3rd-order Markov models (and typically fewer parameters than the LSTMs). PSDE modeling outperforms all other model types quite substantially in terms of model error in almost every environment and noise level tested. In many of these color world environments, sPOMDP modeling was able to meet or exceed the performance of all other model types (including PSDEs) in terms of relative predictive accuracy when the environment was fully (or very highly) deterministic, as we saw in some of the environments from the literature in Section 8.4.1.
[Figure 8.33: A comparison of different learning algorithms and models in random 3x3 color world environments with 3 color observations (Figure 8.7(d)). Subfigures as in Figure 8.24.]

[Figure 8.34: A comparison of different learning algorithms and models in random 3x3 color world environments with 5 color observations (Figure 8.7(e)). Subfigures as in Figure 8.24.]

[Figure 8.35: A comparison of different learning algorithms and models in random 3x4 color world environments with 6 color observations (Figure 8.7(f)). Subfigures as in Figure 8.24.]

[Figure 8.36: A comparison of different learning algorithms and models in random 4x4 color world environments with 8 color observations (Figure 8.7(g)). Subfigures as in Figure 8.24.]

[Figure 8.37: A comparison of different learning algorithms and models in random 4x5 color world environments with 10 color observations (Figure 8.7(h)). Subfigures as in Figure 8.24.]

[Figure 8.38: A comparison of different learning algorithms and models in random 5x5 color world environments with 13 color observations (Figure 8.7(i)). Subfigures as in Figure 8.24.]

Overall, in terms of relative predictive accuracy, sPOMDPs perform significantly better in these color world environments than in the environments from the literature (Figures 8.24-8.30). Again, we see that the runtimes for learning sPOMDPs using PSBL vary much less with environment noise changes than do the runtimes for learning PSDE models using PSBL. In Figures 8.31(e)-8.38(e), we see that the sizes of the learned sPOMDP model state spaces |M| are close approximations to the true underlying number of environment states |Q| in almost every environment tested (regardless of environment noise or environment size). There are two main exceptions. The first is the random 2x2 color world environments with 2 color observations (Figure 8.31), in which the agent could consistently model many of these random environments with high accuracy using only 2 or 3 model states (resulting in an average number of model states of less than 4 for every noise level tested). The second main exception was the 3x3 color world environments with 3 color observations (Figure 8.33), in which PSBL for sPOMDPs struggled to learn a good model for the same reasons that it struggled when learning in the αε-Network (Figure 8.29) and αε-Grid (Figure 8.30) environments. This was discussed at length in Sections 8.3.1 and 8.4.1.
In these environments, PSBL consistently learns sPOMDPs with significantly fewer model states than there are underlying environment states.

[Figure 8.39: A comparison of different learning algorithms and models in the ISI Floor environment (Figure 8.8). Subfigures as in Figure 8.24.]

The ISI Floor environment (Figure 8.39) was unique in that it was the only environment in which sPOMDP modeling rather substantially outperformed the other model types across a variety of different noise levels in terms of relative predictive accuracy (with the primary exception, interestingly, being α = ε = 1). sPOMDP modeling achieved this while requiring significantly fewer parameters than any other model type (with the exception of first-order Markov models) and with a much smaller runtime than PSBL for PSDE modeling in this environment. The learned sPOMDPs have slightly larger state spaces, on average, than the underlying environment, but they still provide a high-quality approximation of the number of underlying environment states, |Q|, for most noise levels tested. The relatively poor performance of sPOMDP modeling on the ISI Floor environment when α = ε = 1 (i.e., when the environment is fully deterministic) is due to a shallow local minimum that we observe in surprise space (see Section 8.3.2, Figure 8.22(a)) that occasionally causes PSBL to terminate too early with a poor sPOMDP model.
In most runs in the deterministic ISI Floor environment, PSBL learns a very high-accuracy sPOMDP model, but, in a handful of runs, it terminates early with a low-accuracy model (owing to this local minimum in surprise space), which brings the average relative predictive accuracy down when α = ε = 1.

8.4.3 Summary

We can usefully visualize and summarize the key takeaways of the comparative analyses in Sections 8.4.1 and 8.4.2 by averaging the performance of each modeling approach across the different environment noise levels tested, which were: α = ε = 1; α = 0.99, ε = 1; α = 0.95, ε = 1; α = 0.9, ε = 1; α = 0.85, ε = 1; α = 0.8, ε = 1; α = ε = 0.99; and α = ε = 0.95. This gives us an idea of the average overall performance of each modeling approach in each environment (or each class of environments, in the case of random color world environments).

[Figure 8.40: A comparison of the average relative predictive accuracies and model errors of different modeling approaches in the environments from the literature (subfigures (a) and (c), see Section 8.1) and random color world environments (subfigures (b) and (d), see Section 8.2). Subfigures: (a) average accuracy relative to the oracle in the literature environments; (b) average accuracy relative to the oracle in the color world environments; (c) average model error in the literature environments; (d) average model error in the color world environments; (e) legend for subfigures (a)-(d).]
Figure 8.40 compares the average performance (in terms of relative predictive accuracy and model error) of first-order Markov models (blue), second-order Markov models (red), third-order Markov models (yellow), and LSTMs (green) against PSDE models (orange) and sPOMDP models (cyan), where the latter two models (our models) are learned using PSBL. Subfigure (a) plots the average predictive accuracy of each model type relative to an oracle (where the average is taken over all the environment noise levels tested) in each literature environment tested (see Section 8.1). Subfigure (b) plots the average predictive accuracy of each model type relative to an oracle (again averaged over all noise levels tested) in the random color world environments of Section 8.2. Subfigure (c) compares the average model error of each model type in each literature environment (where the average is, again, taken over all environment noise levels tested). Finally, subfigure (d) compares the average model error of each model type in each color world environment type tested. Each group of bars represents a different environment, whereas each individual bar represents a different modeling technique. Each bar is an average over 50 runs of each learning algorithm for each of the 8 environment noise levels enumerated above.

These results suggest that PSDE modeling does, on average, almost always outperform the other modeling techniques in terms of both relative predictive accuracy (Figure 8.40(a),(b)) and model error (Figure 8.40(c),(d)), though there are a few exceptions. sPOMDP modeling actually outperforms PSDE modeling on average in the αε-Little Prince environment (Figure 8.1.2) and the ISI Floor environment (Figure 8.8). We discussed the exceptional performance of sPOMDP modeling in the ISI Floor environment at length in the previous section.
In the αε-Shuttle environment (Figure 8.1.4) and the random 2x2 color world environments with 2 color observations (Figure 8.7(b)), both of which require very short distinguishing outcome sequences to effectively model, 2nd-order Markov models match the performance of PSDE models in terms of relative predictive accuracy. It is also only in the 2x2 color world environments that PSDE models do not lead to minimal model error (though the model error of PSDE models is very close to the minimal value, which is achieved by 3rd-order Markov models). In virtually all other environment types, PSDE models result in substantially lower model error than the other modeling approaches. Note that 2nd-order Markov models and PSDE models perform almost equivalently in terms of model error in the αε-Shuttle environment.

[Figure 8.41: A comparison of the average numbers of model parameters for each model type in different environments (subfigures (a) and (b)) and the average runtime (in seconds) of PSBL as applied to learning both PSDE models and sPOMDP models in different environments. Subfigures: (a) average number of parameters in the literature environments of Section 8.1; (b) average number of parameters in the color world environments of Section 8.2; (c) average runtime (seconds) in the literature environments; (d) average runtime (seconds) in the color world environments; (e) legend for subfigures (a)-(d).]
Figure 8.41 performs an analogous comparison of the average number of model parameters in each model type (subfigures (a) and (b)) for each environment tested and the average runtimes (in seconds) of PSBL for PSDE learning and PSBL for sPOMDP learning (subfigures (c) and (d)) for each environment tested. Note that the vertical axes in Figures 8.41(a) and 8.41(b) are on a log scale due to the huge range in the number of parameters for the different modeling techniques. As in Figure 8.40, each group of bars represents a different environment type, and each individual bar represents a different modeling technique. As before, each bar represents an average over 50 runs of each learning algorithm in each of the 8 different environment noise levels tested (enumerated above).

In Figures 8.41(a),(b), we see that PSDE models are, on average, more compact than both 3rd-order Markov models and the LSTMs tested in each environment (with no exceptions). PSDE models are, on average, substantially smaller than 3rd-order Markov models. As the environments become larger, sPOMDP models become substantially smaller than PSDE models, which is one of the benefits of modeling latent structure in the agent's environment (as opposed to a purely predictive modeling approach). In a number of environments, including the αε-Shuttle environment (Figure 8.1.4), the ISI Floor environment (Figure 8.8), and the 5x5 random color world environment with 13 color observations (Figure 8.7(i)), PSDE models were also more compact on average than 2nd-order Markov models.

In Figures 8.41(c),(d), we see that, for the majority of the environments tested, sPOMDP modeling requires less runtime (on average) than PSDE modeling, particularly as the environment size increases.
Two prominent exceptions to this trend are the αϵ-Shuttle environment (Figure 8.1.4) and the 3x3 color world environment with 5 color observations (Figure 8.7(e)), both of which can be modeled very accurately with very short agent trajectories. As we see in the scalability analyses in Section 8.6, there is a crossover point at which sPOMDP modeling actually begins to take substantially longer than PSDE modeling, owing to the O(|M|^3) time required to update model one-step transition probabilities and the O(|M|^2) time required to update model transition probabilities after each agent action. Recall that |M| is the number of model states currently in the agent's sPOMDP model of its environment. In Figure 8.42, we perform an analogous comparison of the average number of model states learned by PSBL for sPOMDPs (|M|, blue) against the ground truth number of environment states (|Q|, red) in the environments from the literature (subfigure (a), see Section 8.1) and the random color world environments of Section 8.2 (subfigure (b)).

(a) Average number of sPOMDP model states in the environments from the literature (Section 8.1). (b) Average number of sPOMDP model states in the color world environments of Section 8.2.

Figure 8.42: A comparison of the average number of model states, |M|, in the sPOMDPs learned using PSBL against the ground truth number of environment states, |Q|, in the environments from the literature (subfigure (a), see Section 8.1) and random color world environments (subfigure (b), see Section 8.2).

As with the previous average analyses in Figures 8.40 and 8.41, the averages are performed over 50 runs of PSBL in each of the 8 noise levels tested, and each group of bars represents a different test environment (or environment type, in the case of the random color world environments).
In all but 4 environment types, the average number of model states in the sPOMDPs learned by PSBL is within 1 model state of the true number of underlying environment states, which is remarkable, considering that the agent had no a priori knowledge about the size or nature of |Q| during learning. One of the 4 exceptions to this trend was the ISI Floor environment (Figure 8.8), in which the agent's average |M| value differed from |Q| by 1.2 model states. As we discussed above, this was the environment in which sPOMDP modeling performed the best in terms of relative predictive accuracy, so these extra model states did not hamper the agent's ability to make useful predictions. The other 3 exceptions to this trend were the αϵ-Network environment (Figure 8.1.5), the αϵ-Grid environment (Figure 8.1.6), and the 3x3 color world environment with 3 color observations (Figure 8.7(d)). These environments were extensively discussed above, because they are particularly difficult cases for sPOMDP modeling. The highly hidden nature of these environments results in PSBL learning significantly fewer model states than there are underlying environment states. Many of these average |M| values are remarkably accurate. In the literature environments of Section 8.1 (subfigure (a)), the average |M| values for the αϵ-Shape environment (Figure 8.1.1), the αϵ-Circular 1D Maze environment (Figure 8.1.3), and the random variations of the αϵ-Shape environment are all within 0.3 model states of the ground truth value of |Q|. In Figure 8.42(b), the average value of |M| is within 0.2 model states of the ground truth value of |Q| in the 2x3 color world with 3 colors (Figure 8.7(c)), the 3x3 color world with 5 colors (Figure 8.7(e)), the 3x4 color world with 6 colors (Figure 8.7(f)), the 4x4 color world with 8 colors (Figure 8.7(g)), and the 4x5 color world with 10 colors (Figure 8.7(h)).
8.5 Decision Making

In this section, we present the results of extensive experimentation which demonstrates that learned PSDE and sPOMDP models can be used effectively to make high-quality (often near-optimal) decisions across a number of different rewardless POMDP environments. In Figures 8.43-8.47, we compare the decision-making performance of PSDE models (red) and sPOMDP models (green) against an oracle agent that makes optimal decisions according to the ground truth POMDP environment (yellow). A random policy (blue) is also given to establish a baseline. We note that Figures 8.43-8.47 illustrate this decision-making performance only on some example literature environments from Section 8.1 and some sample color world environments from Section 8.2. The remaining illustrations do not add significantly to the understanding of the decision-making performance of our models, and our average analysis of decision-making performance in Figure 8.48 includes almost all of the environments from Sections 8.1 and 8.2. In these experiments, one of the observations in each environment tested was designated as a goal observation, and the agent was tasked with observing that goal observation as often as possible over a fixed number of time steps (10,000), beginning from a random initial state.

(a) α = ϵ = 1. (b) α = 0.99, ϵ = 1. (c) α = 0.95, ϵ = 1. (d) α = 0.9, ϵ = 1. (e) α = 0.8, ϵ = 1. (f) α = ϵ = 0.99. (g) α = ϵ = 0.95. (h) Legend for (a)-(g): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.43: Decision making in the αϵ-Shape environment (Figure 8.1.1) with the designated goal observation.

(a) α = ϵ = 1. (b) α = 0.99, ϵ = 1. (c) α = 0.95, ϵ = 1. (d) α = 0.9, ϵ = 1. (e) α = 0.8, ϵ = 1. (f) α = ϵ = 0.99. (g) α = ϵ = 0.95. (h) Legend for (a)-(g): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.44: Decision making in the αϵ-Circular 1D Maze environment (Figure 8.1.3) with goal as the goal observation.

The agent received a reward of 1 when it observed the goal observation and received 0 reward otherwise. Decision-making in PSDE models was performed using the search-based decision-making procedure for PSDEs that was developed in Chapter 5.4.
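The reward-accrual protocol just described (reward 1 on the goal observation, 0 otherwise, accumulated over 10,000 steps) can be sketched as a simple evaluation loop. This is an illustrative reconstruction, not the dissertation's implementation; `step_fn`, `policy_fn`, and the deterministic two-state toy environment below are hypothetical stand-ins for the learned-model policies being compared.

```python
def accrue_reward(step_fn, policy_fn, goal_obs, horizon=10_000):
    """Run one reward-accrual trial: reward 1 whenever the goal observation
    is seen, 0 otherwise. step_fn(state, action) -> (next_state, observation)
    simulates the environment; policy_fn(observation) -> action is the agent
    under test (a stand-in for the PSDE, sPOMDP, oracle, or random policies)."""
    state, obs = 0, None
    total, curve = 0, []
    for _ in range(horizon):
        action = policy_fn(obs)
        state, obs = step_fn(state, action)
        total += 1 if obs == goal_obs else 0
        curve.append(total)  # the accrued-reward curve plotted in the figures
    return curve

def toggle_step(state, action):
    # Deterministic two-state toy environment: action 1 toggles the state,
    # action 0 stays put; each state emits its own index as the observation.
    nxt = 1 - state if action == 1 else state
    return nxt, nxt

curve = accrue_reward(toggle_step, lambda obs: 1, goal_obs=1)
# Toggling from state 0 visits the goal-emitting state on every other step,
# so exactly half of the 10,000 steps accrue reward in this noiseless toy.
```

The figures in this section plot exactly such cumulative curves, one per policy, averaged over learned models.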
(a) α = ϵ = 1. (b) α = 0.99, ϵ = 1. (c) α = 0.95, ϵ = 1. (d) α = 0.9, ϵ = 1. (e) α = 0.8, ϵ = 1. (f) α = ϵ = 0.99. (g) α = ϵ = 0.95. (h) Legend for (a)-(g): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.45: Decision making in the αϵ-Network environment (Figure 8.1.5) with up as the goal observation.

The ground truth POMDPs (for the oracle agent) and the sPOMDP models were solved for optimal policies using incremental pruning [29]. Both ground truth POMDPs and sPOMDPs were solved in an identical fashion, which demonstrates that sPOMDPs can, in fact, be used in place of traditional POMDPs in standard POMDP solution software. This is one of the other major benefits of the sPOMDP modeling approach. PSDE modeling has to rely on a search-based planning methodology that does not take environment noise into account in a principled fashion, whereas sPOMDPs can make use of the wealth of existing literature on solving POMDPs for optimal (or approximately optimal, in this case, since the sPOMDP model itself is an approximation) policies. There is some work in the Predictive State Representation (PSR) literature (e.g.
[65]) that extends some traditional POMDP solution methods to PSRs. An important area of future work is to consider whether these methods (or similar ideas) can facilitate optimal decision making under uncertainty in PSDE models. In each of the Figures 8.43-8.47, subfigures (a)-(g) illustrate the reward accrued over time using each modeling approach for a specific environment and environment noise level. Subfigure (h) provides a legend for the colors in subfigures (a)-(g). In each subfigure (a)-(g), the horizontal axis represents the number of time steps the agent has tried to accumulate reward. Note that this reward accrual test is performed on PSDE and sPOMDP models that have already been learned. In PSDE models, changing the goal observation is trivial. The agent would simply need to re-run the search-based decision-making procedure developed in Chapter 5.4. Changing the goal observation for sPOMDP models is also simple. The agent would simply need to modify the reward function and re-solve the sPOMDP for a new optimal policy associated with the new goal observation. (We note that we also developed a search-based decision-making procedure for sPOMDPs in Chapter 6.4.2 that can trivially handle changes to goal observations.) Importantly, the PSDE or sPOMDP model itself would not have to be re-learned in order to support such a goal observation change.

4 We utilized the implementation of incremental pruning in Cassandra's well-known POMDP solution software pomdp-solve: http://www.pomdp.org/code/.

(a) α = ϵ = 1. (b) α = 0.99, ϵ = 1. (c) α = 0.95, ϵ = 1. (d) α = 0.9, ϵ = 1. (e) α = 0.85, ϵ = 1. (f) α = ϵ = 0.99. (g) α = ϵ = 0.95. (h) Legend for (a)-(g): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.46: Decision making in random 4x4 color world environments with 8 color observations (Figure 8.7(g)) with red as the goal observation.

(a) α = ϵ = 1. (b) α = 0.99, ϵ = 1. (c) α = 0.95, ϵ = 1. (d) α = 0.9, ϵ = 1. (e) α = 0.8, ϵ = 1. (f) α = ϵ = 0.99. (g) α = ϵ = 0.95. (h) Legend for (a)-(g): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.47: Decision making in random 5x5 color world environments with 13 color observations (Figure 8.7(i)) with white as the goal observation.
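As a concrete illustration of why a goal change requires no re-learning: for a "reach the goal observation" task, only the reward term depends on the goal, while the learned transition and observation parameters stay fixed. The sketch below assumes the learned sPOMDP exposes an |M| x |O| observation-probability matrix; the names `reward_vector` and `O` are illustrative, not from the dissertation.

```python
import numpy as np

def reward_vector(obs_matrix, goal_idx):
    """Expected immediate reward per model state for a goal-observation task:
    the probability that each model state emits the goal observation.
    obs_matrix[m, o] = P(o | m). Only this vector changes when the goal
    observation changes; the learned model parameters are untouched."""
    return obs_matrix[:, goal_idx].copy()

# Hypothetical 3-state learned model with 2 observations.
O = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])

r_goal0 = reward_vector(O, 0)  # task: observe o0  -> rewards [0.9, 0.2, 0.5]
r_goal1 = reward_vector(O, 1)  # retargeted task: observe o1 -> [0.1, 0.8, 0.5]
# Re-solving for a policy uses the same O and transition table with the new r.
```

A POMDP solver would then be re-run with the new reward term, exactly as described above for the goal-observation change.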
This is an important distinction between this type of modeling and much work in the reinforcement learning literature (e.g., Q-learning [162]), in which the tables or models learned are tied completely to a specific task. Each subfigure in Figures 8.43-8.47 is an average over 30 sPOMDP and PSDE models. There are a couple of very interesting things that we notice about these figures. First, we notice that both PSDE models and sPOMDP models support decision making that is typically very nearly optimal for almost all environments and noise levels tested (and well above the random policy baseline). In Figure 8.44, we see that sPOMDP models (green) support leading the agent to the observation goal in the αϵ-Circular 1D Maze environment (Figure 8.1.3) almost perfectly optimally (relative to the quality of decisions made with the ground truth POMDP). This is interesting, given that sPOMDP modeling sometimes struggled in terms of relative predictive accuracy in this environment (Figure 8.26). The sPOMDP models actually outperform the PSDE models in terms of decision-making in this environment. Figure 8.45 demonstrates something similar in the αϵ-Network environment (Figure 8.1.5). Again, though sPOMDP models struggle in terms of relative predictive accuracy in the αϵ-Network environment (see Figure 8.29), they are able to keep the network up almost optimally (and consistently better than the PSDE models, which predicted with much greater accuracy in these environments). In these examples, we see that capturing the latent structure of the agent's environment sometimes has tremendous benefits in terms of making optimal decisions. Even if the entire model is not perfect, it may be good enough to support high-quality decision-making for certain tasks.
Figure 8.48 performs an average analysis of the decision-making performance of PSDE and sPOMDP models (in terms of reward accrued) over 30 learned models of each environment type in each of the 8 environment noise levels tested (enumerated in Section 8.4.3), analogously to how the average analyses in Figures 8.40, 8.41, and 8.42 were performed.

(a) Average decision making performance (reward accrued) in the literature environments (Section 8.1). (b) Average decision making performance (reward accrued) in the random color world environments of Section 8.2. (c) Legend for (a)-(b): Random, PSDE, Ground Truth, sPOMDP.

Figure 8.48: A comparison of the average decision making performance (in terms of reward accrued over time) by PSDE (red) and sPOMDP (green) models against an optimal agent making decisions with access to the ground truth POMDP environment (yellow). A random policy (blue) is provided as a baseline.

Subfigure (a) illustrates this comparison in the αϵ-POMDP environments adapted from the literature (Section 8.1). Subfigure (b) illustrates this comparison in the random color world POMDP environments of Section 8.2. Each group of bars represents a different environment, whereas each individual bar represents the average amount of reward accrued over simulations of 10,000 time steps for each model type. In Figure 8.48, we see that both PSDE models and sPOMDP models support very high-quality decisions that are typically very nearly optimal (yellow) and well above the random policy baseline (blue). We also see that, for almost all environments tested, sPOMDP modeling (green) outperforms PSDE modeling (red) in terms of reward accrued over time.
This is very likely due to the fact that the sPOMDP representation allows us to make use of powerful existing POMDP solvers in order to derive (approximately) optimal policies that take environment noise into account in a principled fashion, whereas we can only currently make use of search-based decision-making in PSDE models (Chapter 5.4). Thus, we find that the true benefit of modeling latent structure in the environment (at least with the current approaches laid out in this dissertation) seems to be in making high-quality decisions under uncertainty. The only environment in which PSDE modeling substantially outperforms sPOMDP modeling in terms of reward accrued is the αϵ-Grid environment (Figure 8.1.6). As with the αϵ-Network environment (see Figure 8.29), sPOMDP modeling also struggles in making high-quality predictions in the αϵ-Grid environment (see Figure 8.30). The primary difference between the αϵ-Network environment and the αϵ-Grid environment in terms of these decision-making tasks is that, in the αϵ-Grid environment, the goal observation k covers only one environment state. In contrast, the goal observation up in the αϵ-Network environment covers three environment states. It is likely that having multiple environment states emitting the same goal observation makes the decision-making task somewhat easier in the αϵ-Network environment than it is in the αϵ-Grid environment.

8.6 Scalability Analyses

Finally, in this section, we present the results of experimental analyses demonstrating the scalability of PSBL as applied to learning both PSDE models and sPOMDP models under different levels of environment hiddenness and different levels of environment noise. We demonstrate this scalability on random nxn color world POMDP environments, such as those described in Section 8.2 (only much larger). Recall that these color world POMDPs are not αϵ-POMDPs, meaning that these scalability analyses are not limited in scope or applicability to αϵ-POMDPs.
The hyperparameters used for PSBL learning in all the scalability analyses performed in this section were identical to those used in Section 8.4.2. As an initial validation of the scalability of PSBL, we consider random nxn color world POMDP environments in which |O| (the number of distinct color observations) is equal to |Q| (the number of underlying environment states, which is n^2 in this case), and each grid cell (state) has a unique color observation associated with it. In other words, these environments are fully observable (though the agent is not given knowledge of this), and the agent's observations are, in fact, a sufficient statistic of its history in these environments. Figures 8.49 and 8.50 illustrate this initial scalability analysis. Figure 8.49(a) illustrates the relative predictive accuracies of the PSDE models learned by PSBL on random, fully-observable color world environments ranging in size from 2x2 to 16x16 under a variety of levels of environment noise. Figure 8.49(b) illustrates the relative predictive accuracies of the sPOMDP models learned by PSBL on random, fully-observable color world environments ranging in size from 2x2 to 11x11 (again, under a variety of noise levels). Figure 8.49(c) illustrates the associated runtimes (in seconds) of learning these PSDE models using PSBL. Likewise, Figure 8.49(d) illustrates the associated runtimes (in seconds) of learning these sPOMDP models using PSBL. Each data point in subfigures (a)-(d) of Figure 8.49 is an average over 20 random, fully-observable color world POMDP environments of the given sizes (where the randomization is performed by randomly assigning color observations to grid cells). In subfigures (a) and (b) of Figure 8.49, we see that, for all environment sizes, PSBL learns PSDE and sPOMDP models with relative predictive accuracies of exactly 1 when ϵ = 1 (meaning that the agent has a perfect "state" sensor).
When ϵ < 1, the agent can only hope to predict at a relative accuracy of ϵ as the environment increases in size, because it will be confused about which environment state it is in with probability 1 - ϵ and make wrong predictions when its sensor gives it incorrect information about its current state. As the number of environment states (and thus observations) increases, the agent becomes less and less likely to accidentally guess the correct observation when its sensor fails, and its relative predictive performance asymptotes toward ϵ. We observe precisely this in the purple (α = ϵ = 0.99) and gold (α = ϵ = 0.95) bars of Figures 8.49(a) and 8.49(b).

(a) Predictive accuracy of PSDE models relative to an oracle in fully-observable color world environments. (b) Predictive accuracy of sPOMDP models relative to an oracle in fully-observable color world environments. (c) Runtime (in seconds) of PSBL for learning PSDE models in fully-observable color world environments. (d) Runtime (in seconds) of PSBL for learning sPOMDP models in fully-observable color world environments. (e) Legend for (a)-(d): α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; α = 0.95, ϵ = 1.0; α = 0.9, ϵ = 1.0; α = 0.85, ϵ = 1.0; α = 0.8, ϵ = 1.0; α = 0.99, ϵ = 0.99; α = 0.95, ϵ = 0.95.

Figure 8.49: An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when the environment is fully observable (i.e., |O| = |Q|).
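One back-of-the-envelope way to see this asymptote: suppose that when the sensor fails (probability 1 - ϵ) it reports a uniformly random observation, which matches the agent's prediction only 1/|O| of the time. Under that simplifying assumption (ours, for illustration, not the dissertation's exact noise model), the expected relative accuracy is ϵ + (1 - ϵ)/|O|, which approaches ϵ as |O| grows:

```python
def expected_relative_accuracy(eps, num_obs):
    """Toy accuracy model: with probability eps the sensor reports the true
    state's observation (prediction correct); with probability 1 - eps it
    reports a uniformly random observation, which coincides with the
    prediction only 1/num_obs of the time."""
    return eps + (1.0 - eps) / num_obs

# Accuracy shrinks toward eps = 0.95 as the observation set grows.
for n in (4, 16, 256):
    print(n, round(expected_relative_accuracy(0.95, n), 4))
```

For |O| = 4 this gives 0.9625, while for |O| = 256 it is already within 0.001 of ϵ = 0.95, matching the trend seen in the gold bars.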
In Figure 8.49, we also note that the runtimes of PSBL for sPOMDP learning (subfigure (d)) are substantially larger than the runtimes of PSBL for PSDE learning (subfigure (c)), due to the O(|M|^3) time required to update one-step and transition probability tables after each agent action in sPOMDP modeling. In smaller environments (see Section 8.4), we observed that PSBL for learning sPOMDP models typically required substantially less runtime than PSBL for PSDE models (due to the efficiencies of our vectorized implementation, see Chapter 6.3.4.6). However, in larger environments, we see that further scalability will require other approaches. One particularly interesting area of future work is to investigate how function approximators such as neural networks [57] might be effectively used in place of the transition and one-step distribution tables currently used in sPOMDP models to overcome some of these scalability limitations.

(a) Number of PSDEs learned by PSBL in fully-observable color world environments. (b) Number of sPOMDP model states learned by PSBL in fully-observable color world environments. (c) Legend for (a)-(b): α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; α = 0.95, ϵ = 1.0; α = 0.9, ϵ = 1.0; α = 0.85, ϵ = 1.0; α = 0.8, ϵ = 1.0; α = 0.99, ϵ = 0.99; α = 0.95, ϵ = 0.95.

Figure 8.50: An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when the environment is fully observable (i.e., |O| = |Q|).

In Figure 8.50(a), we see that the number of PSDEs learned by PSBL when the environment is fully observable is identical to the number of actions (4, in this case) times the number of observations (which is equal to the number of states, in this case) for almost every environment size and noise level tested.
Recall that PSBL begins with |A| x |O| PSDEs, one per action and observation pair (see Chapter 5.3). We would hope not to see any splitting of PSDEs, since the environment is fully observable and each observation is a sufficient statistic of agent history (i.e., considering previous actions and observations does not help the agent predict future observations any more accurately). We do see a few minor cases of PSDE splitting in Figure 8.50(a) (due to minor overfitting to environment noise), but such splitting is rare and only occurs in the noisier environments tested. In Figure 8.50(b), we see that PSBL always learns an sPOMDP model with exactly the same number of model states as there are environment states, as desired. As was the case with Figure 8.49, each data point in Figures 8.50(a) and 8.50(b) is an average over 20 random, fully-observable color world POMDP environments of the given sizes.

(a) Predictive accuracy of PSDE models relative to an oracle when |O| = 0.95|Q|. (b) Predictive accuracy of sPOMDP models relative to an oracle when |O| = 0.95|Q|. (c) Runtime (in seconds) for PSBL learning of PSDE models when |O| = 0.95|Q|. (d) Runtime (in seconds) for PSBL learning of sPOMDP models when |O| = 0.95|Q|. (e) Legend for (a)-(d): α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; α = 0.95, ϵ = 1.0.

Figure 8.51: An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when 5% of environment states are hidden from the agent (i.e., |O| = 0.95|Q|).

(a) Number of PSDEs learned by PSBL when |O| = 0.95|Q|. (b) Number of sPOMDP model states learned by PSBL when |O| = 0.95|Q|. (c) PSDE model error when |O| = 0.95|Q|. (d) sPOMDP model error when |O| = 0.95|Q|. (e) Legend for (a)-(d): α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; α = 0.95, ϵ = 1.0.

Figure 8.52: An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when |O| = 0.95|Q|, as well as an analysis of the model errors of the learned PSDE models (subfigure (c)) and sPOMDP models (subfigure (d)).

In Figures 8.51 and 8.52, we perform a similar scalability analysis on random nxn color world POMDP environments. This time, however, 5% of the environment states are (randomly) hidden from the agent. More specifically, we set the number of possible color observations in each environment to be floor(0.95|Q|). We randomly assign each possible color observation (in order) to a random grid cell (state) until we run out of color observations. At this point, we return to the front of this list and assign some of the (already-used) color observations (again, in order) to the remaining grid cell states. This results in at least 5% of the environment grid cells (states) looking identical to exactly one other state in the environment (by emitting the same most likely color observation). Thus, by using only its observations, the agent can predict with approximately 95% relative accuracy in these environments. The challenge is to see whether the agent can infer the (sparsely) hidden 5% of the unknown environment. Note that, due to the time required to learn these models, only three environment noise levels are tested this time: α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; and α = 0.95, ϵ = 1.0.
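The color-assignment procedure described above can be sketched as follows. This is a minimal reconstruction under stated assumptions (uniformly random cell ordering; the name `assign_colors` is illustrative, not from the dissertation's code):

```python
import math
import random

def assign_colors(n, hidden_frac=0.05, seed=0):
    """Assign colors to the n*n grid cells so that a hidden_frac fraction of
    states share a color with exactly one other state: walk the color list
    in order over randomly ordered cells, wrapping to the front of the list
    once it is exhausted so leftover cells reuse already-assigned colors."""
    rng = random.Random(seed)
    num_states = n * n                                       # |Q|
    num_colors = math.floor((1.0 - hidden_frac) * num_states)  # |O| = floor(0.95|Q|)
    cells = list(range(num_states))
    rng.shuffle(cells)  # random order of grid cells (states)
    return {cell: i % num_colors for i, cell in enumerate(cells)}

assignment = assign_colors(10)  # 10x10 grid: |Q| = 100, |O| = 95
# Count colors that are emitted by exactly two states (the "hidden" pairs).
shared = sum(1 for c in set(assignment.values())
             if sum(1 for v in assignment.values() if v == c) == 2)
# shared == 5: exactly 5 colors (5% of |Q|) are reused by a second state.
```

Each reused color produces one pair of perceptually identical states, which is precisely the sparse 5% hiddenness the agent must infer from action outcomes rather than from observations alone.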
Based on our experimentation, however, we have reason to suspect that these results would extend to additional noise levels (particularly for PSDE models). As before, all the data points in all the subfigures of Figures 8.51 and 8.52 are averages over 20 random color world POMDP environments. Figure 8.51(a) illustrates the relative predictive accuracies of the PSDE models learned by PSBL in random color world environments of sizes 2x2 to 12x12 in which the number of color observations was 0.95 times the number of environment states (grid cells). Figure 8.51(b) illustrates the analogous relative predictive accuracies of learned sPOMDP models of random color world environments of sizes 2x2 to 10x10. Figure 8.51(a) illustrates that PSBL was able to learn PSDE models with a relative accuracy of almost exactly 1 for every color world environment size and noise level tested in this analysis. This means that, in every case tested, PSDE modeling was able to successfully uncover the sparsely hidden 5% of the environment, which is quite remarkable. Recall that the agent is given no information about |Q| or its relationship to |O|. Figure 8.51(b) demonstrates similar performance in terms of relative predictive accuracy for learned sPOMDP models, with the exception of the 9x9 and 10x10 color world environments when α = 0.95 (yellow). In these cases, PSBL learned models with relative predictive accuracies of 0.99 and 0.97, respectively. Thus, the agent was still able to partially recover the hidden environment structure. We believe that increasing the value of the hyperparameter numActions would likely improve the performance of sPOMDP modeling on these environments and noise levels.

(a) Predictive accuracy of PSDE models relative to an oracle when |O| = 0.9|Q|. (b) Predictive accuracy of sPOMDP models relative to an oracle when |O| = 0.9|Q|. (c) Runtime (in seconds) for PSBL learning of PSDE models when |O| = 0.9|Q|. (d) Runtime (in seconds) for PSBL learning of sPOMDP models when |O| = 0.9|Q|. (e) Legend for (a)-(d): α = 1.0, ϵ = 1.0; α = 0.99, ϵ = 1.0; α = 0.95, ϵ = 1.0.

Figure 8.53: An analysis of the scalability of PSBL for learning PSDE models (subfigures (a), (c)) and for learning sPOMDP models (subfigures (b), (d)) in random color worlds when 10% of environment states are hidden from the agent (i.e., |O| = 0.9|Q|).

Figures 8.51(c) and 8.51(d) illustrate the associated runtimes (in seconds) of PSBL for learning PSDE models and sPOMDP models (respectively) in these random color world environments. Note that the runtimes in these figures are on a log scale. We again see that PSBL requires substantially more time to learn sPOMDPs than to learn PSDE models as the environment increases in size. This is in line with our experiments in fully-observable color world environments (Figure 8.49). Note also that the learning times in Figure 8.51 are larger than the learning times in Figure 8.49, as expected. Figure 8.52(a) illustrates how the number of PSDEs learned by PSBL grows according to increases in environment size for the same set of scalability experiments illustrated in Figure 8.51. Note that this figure is also on a log scale. Figure 8.52(b) illustrates the number of sPOMDP model states learned by PSBL for the same scalability analysis. As we saw in Section 8.4.3, the number of sPOMDP model states learned by PSBL (particularly when averaged across different noise levels) is quite an accurate estimate of the true number of underlying environment states, |Q|, in many cases.
This is a remarkable result because, as was mentioned before, the agent is given no a priori knowledge about the value of |Q| or its relationship to |O|. Figures 8.52(c) and 8.52(d) illustrate the model errors associated with the same learned PSDE and sPOMDP models (respectively). They are extremely small for most environment types and noise levels tested (particularly when compared to the errors of the learned models in Section 8.4). Note that the vertical axes of these figures have been scaled such that these small values can actually be seen and compared. Figures 8.53 and 8.54 illustrate the results of a final scalability analysis (completely analogous to the scalability analysis presented in Figures 8.51 and 8.52) in which |O| = 0.9|Q|. As before, each data point in each subfigure of these figures is an average over 20 random color world POMDPs of the given sizes. In Figure 8.53(a), we again see that PSDE modeling almost completely recovers the hidden 10% structure in the agent's environment, though the relative accuracy achieved in some PSDE models is slightly lower than it was in analogous environments in Figure 8.51(a).

(a) Number of PSDEs learned by PSBL when |O| = 0.9|Q|.
(b) Number of sPOMDP model states learned by PSBL when |O| = 0.9|Q|.
(c) PSDE model error when |O| = 0.9|Q|.
(d) sPOMDP model error when |O| = 0.9|Q|.
(e) Legend for (a)-(d).
Figure 8.54: An analysis of the number of PSDEs (subfigure (a)) and the number of sPOMDP model states (subfigure (b)) learned by PSBL in random color worlds when |O| = 0.9|Q|, as well as an analysis of the model errors of the learned PSDE models (subfigure (c)) and sPOMDP models (subfigure (d)).
This is reasonable, though, considering that twice as many environment states are hidden from the agent in these experiments. In Figure 8.53(b), we again see that sPOMDP modeling also recovers all the missing structure in the agent's environment in all but some of the noisiest and largest color world POMDPs tested, in which it could only recover some of the missing structure (at least half of it). In Figures 8.53(c) and 8.53(d), we again find that sPOMDP learning using PSBL takes substantially more time than PSDE learning using PSBL, as discussed above. As expected, these runtimes are also substantially higher than the analogous runtimes in Figures 8.51(c) and 8.51(d). In Figure 8.54, we again find very low model error for both PSDE and sPOMDP models in subfigures (c) and (d), respectively. We also find, again, that the number of sPOMDP model states typically provides a very accurate estimate of the true number of underlying environment states, particularly when |M| is averaged over the various noise levels tested (see Figure 8.54(b)).

8.7 Conclusions

In this chapter, we presented the results of extensive experimentation which demonstrated the performance and scalability of PSDE modeling, sPOMDP modeling, and the PSBL learning framework. We first demonstrated the strong positive correlation between model surprise and model error and the strong negative correlation between model surprise and model predictive accuracy across a number of diverse and interesting rewardless POMDP environments. This validated the key PSBL idea of minimizing model surprise as a proxy for minimizing model error (while simultaneously maximizing model predictive accuracy). Next, we rigorously compared the performance of PSDE and sPOMDP modeling in a number of different rewardless POMDP environments against fixed-length history-based (k-order Markov [19]) models and Long Short-Term Memory (LSTM) [60] neural network models.
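The surprise-error and surprise-accuracy correlations summarized above can be checked with an ordinary Pearson coefficient computed over per-model (surprise, error, accuracy) measurements. A minimal sketch with purely illustrative numbers (not the actual experimental data):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative measurements for five hypothetical learned models.
surprise = [0.10, 0.25, 0.40, 0.55, 0.80]
error    = [0.02, 0.06, 0.11, 0.14, 0.22]
accuracy = [0.98, 0.93, 0.90, 0.85, 0.79]

# A strongly positive surprise-error correlation together with a strongly
# negative surprise-accuracy correlation is what licenses minimizing
# surprise as a proxy for minimizing error.
r_error = pearson_r(surprise, error)
r_accuracy = pearson_r(surprise, accuracy)
```

The real analysis uses the measured surprise, error, and accuracy values from the experiments; the point here is only the shape of the computation.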
Both the k-order Markov and LSTM models were trained offline to maximize their performance. This analysis demonstrated that, on average, PSDE modeling meets or exceeds the performance of these models in terms of relative predictive accuracy and model error (while requiring fewer parameters than both LSTM and 3rd-order Markov models). Next, we demonstrated that the sPOMDP and PSDE models learned by PSBL were capable of supporting high-quality (typically near-optimal) decision-making in all the environments tested, with sPOMDPs tending to outperform PSDE models on average, owing to the fact that they can be solved in the exact same way as traditional POMDPs for optimal policies using state-of-the-art POMDP solving software. A final set of experimental results demonstrated the scalability of PSBL in learning both sPOMDP and PSDE models in environments with different levels of hiddenness. This analysis (as well as some of the previous analyses) also demonstrated that sPOMDP modeling often results, on average, in model state spaces whose sizes are a close approximation to the true number of underlying environment states, particularly when these estimates are averaged over a number of environment noise levels.

Chapter 9
Conclusions and Future Work

In this chapter, we summarize the key scientific contributions of the work presented in this dissertation and discuss important directions for future work.

9.1 Summary of Key Contributions

In this dissertation, we addressed the very challenging and open problem of Autonomous Learning from the Environment (ALFE, originally formulated by Shen in [140]) in stochastic and partially-observable environments formulated as rewardless POMDPs [68]. This work resulted in four main scientific contributions. The first contribution is a novel family of probabilistic, nonparametric models for representing rewardless deterministic and POMDP environments called Stochastic Distinguishing Experiments (SDEs).
The main idea behind SDE modeling (see Chapter 4) is to hierarchically organize agent experience into key sequences of ordered actions and associated expected observations (i.e., experiments the agent can perform in its environment) that serve as a means for statistically disambiguating identical-looking environment states or history contexts. SDEs are a principled generalization of Shen's Local Distinguishing Experiments [141] to stochastic and partially-observable environments. We developed two distinct variants of SDE modeling:

1. Predictive SDE models (Chapter 5, [38]), which are tree-structured approximations of the history-dependent probability of any future observation sequence given any sequence of agent actions (conditioned on any possible agent history). Predictive SDE models form an implicit predictive representation of state and dynamics without the need to attempt to model latent environment structure. In the previous chapter, we demonstrated that these models outperformed fixed-history (k-order Markov [19]) models and hand-tuned LSTMs [60] in almost all of the diverse environments tested in terms of predictive accuracy and model error.

2. Surprise-based POMDPs (Chapter 6, [39, 35]), which are hybrid latent-predictive representations of state and dynamics that use SDEs to uncover and explicitly model hidden latent structure in the agent's environment. This results in a model with an explicit learned state space that can be used directly in place of traditional human-designed POMDPs [35, 39], as we demonstrated in the previous chapter by solving learned sPOMDPs for optimal policies. The resulting policies outperformed the search-based decision-making of Predictive SDE models on almost every environment tested and were near-optimal (as compared to policies generated using ground-truth POMDPs).
These models overcome some theoretical limitations of purely predictive approaches (e.g., Predictive State Representations [84]) by providing a formal relationship between predictive experiments and latent environment structure. In the previous chapter, we demonstrated experimentally that the number of model states in an sPOMDP learned by PSBL is often an accurate estimate of the number of states in the underlying environment, particularly when averaged over many runs of the learning procedure.

The second main contribution is a novel biologically-inspired algorithmic framework, called Probabilistic Surprise-Based Learning (PSBL, Chapter 4, [38, 39, 35]), for actively and incrementally learning SDE representations of unknown, rewardless deterministic and POMDP environments directly from agent experience. PSBL extends traditional surprise-based learning (SBL) algorithms for learning state and dynamics representations [137, 141, 117] to stochastic and partially-observable environments in a principled way and generalizes the definition of surprise (as used in these works) in a way that is applicable to both deterministic and stochastic environments (Chapter 4). Deriving this learning procedure required that we generalize the key theory and definitions associated with the problem of ALFE (as originally laid out by Shen in [140] for deterministic environments) so that they became well-suited to stochastic and partially-observable environments while remaining applicable to deterministic ones. This enabled us to formally define the ALFE problem as that of minimizing model error and to usefully frame our unified solution to it (PSBL), which applies to both Predictive SDE and Surprise-based POMDP models, as that of minimizing model surprise (Chapter 4). The effectiveness of this learning framework across a broad range of environments was demonstrated in the previous chapter.
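For intuition, a surprise signal of the kind PSBL minimizes can be illustrated, purely schematically, as a divergence between the observation distribution the model predicted and the empirical distribution the agent actually experienced. The KL-divergence form below is one common choice used here only for illustration; the precise definition PSBL uses is the one developed in Chapter 4:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(p[o] * math.log((p[o] + eps) / (q[o] + eps)) for o in p)

def surprise(predicted, observed_empirical):
    """Illustrative surprise: divergence between what the model predicted
    and what the agent actually experienced; zero when they agree exactly."""
    return kl_divergence(observed_empirical, predicted)
```

Under such a measure, a model whose predictions match experience incurs no surprise, while systematic mismatches (i.e., model error) drive the surprise up, which is the correlation the experiments in Chapter 8 exploit.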
The third main contribution consists of formal mathematical proofs that PSBL learning of PSDE models in rewardless POMDP environments (Chapter 5) converges to a solution in finite time with probability 1 (under very mild technical conditions), together with a rigorous analysis of the worst-case computational complexity of this procedure (Chapter 7, [38]). Importantly, these proofs of convergence require no user-defined bound on the maximum length or number of PSDEs allowed in the model. Extensive experimental results have demonstrated that PSBL for both PSDEs and sPOMDPs converges to a solution in finite time with no user-defined bounds on the length or number of SDEs in the model (though we do not yet have a formal proof for this in the case of sPOMDP models).

The fourth and final contribution consists of formal mathematical proofs of the representational capacity of sPOMDP models, demonstrating that they are capable of perfectly representing minimal deterministic finite automata (DFA) and a useful subclass of rewardless POMDP environments (α-ε POMDPs) with equivalent compactness (Chapter 7, [39]). This presents a solution to a significant open theoretical problem in the SBL literature [142]. These constructive proofs also lead to a provably-optimal sPOMDP learning algorithm, which, given infinite sampling capabilities (in the form of an oracle that gives the correct transition probabilities) and perfect localization, can learn a perfect sPOMDP model (provided environment noise levels are within certain bounds, which we formally derived). We extensively validated both these constructive representational capacity proofs and this optimal sPOMDP learning algorithm experimentally (see Chapter 6 and Appendix A.3).

In summary, Probabilistic Surprise-Based Learning (PSBL) and Stochastic Distinguishing Experiments (SDEs) have shown great promise in addressing the problem of ALFE in unknown stochastic and partially-observable environments.
However, there is still much work left to be done to move toward our ultimate goal of end-to-end, lifelong model learning and decision making in unknown environments. The scope of this possible future work is vast. Accordingly, in the next section, we focus on a few areas that we feel are the most fruitful and useful areas of exploration.

9.2 Future Work

There are a great many ways in which this work can be extended and improved. We focus here on a few directions of future work that we believe are the most promising (and the most important).

9.2.1 Factored Representations of Observation Space

As of yet, we have not investigated the use of sPOMDP and PSDE modeling on agents with multiple distinct sensors. In the work laid out in this dissertation, the agent's observations come from a single discrete and finite set of symbols. In most real-world applications, however, agents have multiple distinct sensors, and this raises an interesting modeling challenge. In order to apply the current formulation of sPOMDP and PSDE modeling to such agents, we would have to define a new meta-observation space consisting of the n-ary Cartesian product of the observation spaces associated with each of the agent's sensors. Clearly, such an approach would scale very poorly, and it may also add unnecessary complexity to the modeling process, particularly in situations where many agent actions affect only a subset of the agent's sensors. Clearly, it would be beneficial if we could apply sPOMDP and PSDE modeling approaches to a factored representation of the agent's observation space that treats separate sensors as separate observations when appropriate, while still allowing the agent to learn useful patterns that only appear when combinations of these sensor values are considered together. Ranasinghe's work on SBL [117, 118] considered such factored representations in the context of learning formal logic-based prediction rules.
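The poor scaling of the naive meta-observation space described above is easy to quantify: its size is the product of the sizes of the individual sensor alphabets. A quick illustration with hypothetical sensors (the sensor names and alphabets are invented for this example):

```python
from itertools import product
from math import prod

# Hypothetical sensor alphabets for a multi-sensor agent.
sensors = {
    "color":  ["red", "green", "blue"],
    "bumper": ["hit", "clear"],
    "sonar":  ["near", "mid", "far"],
}

# The naive meta-observation space is the n-ary Cartesian product of the
# per-sensor alphabets, so its size is the product of the alphabet sizes
# (here 3 * 2 * 3 = 18) and grows multiplicatively with each added sensor.
meta_observations = list(product(*sensors.values()))

# A factored representation would instead model each sensor (or small groups
# of correlated sensors) separately, avoiding this multiplicative growth.
```

Adding even one more three-valued sensor to this toy agent triples the meta-observation space, which is why the factored alternative is attractive.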
It would be interesting (and potentially very powerful) to combine some of those ideas with the probabilistic modeling framework laid out in this dissertation.

9.2.2 Integration of Function Approximation Techniques

Another interesting direction for future work is to explore the use of function approximators (e.g., neural networks [57]) in place of the tables and trees currently used when learning SDE models with PSBL, to aid in scalability and generalization to unexplored areas of the action-observation space. For example, the table of model state to model state transition probabilities in an sPOMDP model (which grows quadratically with the number of model states) might be replaced with a neural network whose size is initially proportional to the size of the agent's observation space. As the model state space increases in size after each model split, the agent could periodically increase the complexity of this neural network. Increasing the complexity of this neural network by adding new layers may prevent the agent from having to train the new network from scratch. We could do the same thing for the one-step extension distribution tables. It may also be possible to replace the (exponentially-branching) tree-based implementation of PSBL for PSDEs with neural networks that are associated with each distinct PSDE length in the PSDE model. This would mean, for example, that all length-5 PSDEs would share the same neural network parameters, while all length-3 PSDEs would share the same parameters of a different (less complex) neural network. Length-1 PSDEs would share the parameters of yet another (possibly even less complex) neural network. Our initial experimentation with this latter procedure indicates that it is feasible in general and does sometimes result in PSDE models with significantly fewer parameters, though much work remains to be done to validate this procedure.
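A back-of-the-envelope count illustrates why parameter sharing across all PSDEs of the same length could help. The sketch below is purely illustrative; the counting assumptions (one stored next-observation distribution per action-observation path for the tree, and a single small one-hidden-layer network shared by all PSDEs of a given length) are ours, not the dissertation's:

```python
def tree_parameters(num_actions, num_obs, length):
    """Values stored by a full history tree of the given depth: one
    next-observation distribution per action-observation path."""
    branching = num_actions * num_obs
    return branching ** length * num_obs

def shared_net_parameters(num_actions, num_obs, length, hidden=32):
    """Rough weight count for one small network shared by all PSDEs of a
    given length (input: one-hot encoding of the length-L history)."""
    input_dim = length * (num_actions + num_obs)
    return input_dim * hidden + hidden * num_obs

# With 2 actions and 3 observations, the tree grows exponentially in the
# PSDE length, while the shared network grows only linearly.
tree_cost = tree_parameters(2, 3, 5)        # (2*3)^5 * 3 = 23328
net_cost = shared_net_parameters(2, 3, 5)   # 5*(2+3)*32 + 32*3 = 896
```

Even for this small alphabet, the shared network is over an order of magnitude cheaper at length 5, and the gap widens rapidly with length, which matches the parameter savings the preliminary experimentation suggests.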
We believe that this is a very important area of future work that is likely to dramatically increase the scalability of PSDE and sPOMDP modeling by striking a balance between parametric and nonparametric modeling approaches in a way that allows for distributed representations and higher-level abstractions.

9.2.3 Continuous Observations

One final important direction for future research is to extend this work to support continuous observation spaces. One approach might be to apply an automated discretization of sensor values such as Ranasinghe uses in his formulation of SBL [118]. It remains to be seen whether such an approach (which is based on logic-based rules) can be extended to apply to the probabilistic models presented in this dissertation. It may be the case that such an on-the-fly discretization, coupled with a factored representation of the agent's observation space (and judicious use of function approximators), is sufficient to apply PSDE and sPOMDP modeling techniques successfully to problems with high-dimensional continuous observation spaces.

Reference List

[1] Douglas Aberdeen, Olivier Buffet, and Owen Thomas. Policy-gradients for PSRs and POMDPs. In Artificial Intelligence and Statistics, pages 3–10, 2007.
[2] Joshua Achiam and Shankar Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017.
[3] Swati Aggarwal, Kushagra Sharma, and Manisha Priyadarshini. Robot navigation: Review of techniques and research challenges. In Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on, pages 3660–3665. IEEE, 2016.
[4] Houssam Albitar, Kinan Dandan, Anani Ananiev, and Ivan Kalaykov. Underwater robotics: surface cleaning technics, adhesion and locomotion systems. International Journal of Advanced Robotic Systems, 13(1):7, 2016.
[5] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning.
Machine Learning, 50(1):5–43, 2003.
[6] Dana Angluin. A note on the number of queries needed to identify regular languages. Information and Control, 51(1):76–87, 1981.
[7] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[8] Michael A Arbib and H Paul Zeiger. On the relevance of abstract algebra to control theory. Automatica, 5(5):589–606, 1969.
[9] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[10] Minoru Asada, Koh Hosoda, Yasuo Kuniyoshi, Hiroshi Ishiguro, Toshio Inui, Yuichiro Yoshikawa, Masaki Ogino, and Chisato Yoshida. Cognitive developmental robotics: A survey. IEEE Transactions on Autonomous Mental Development, 1(1):12–34, 2009.
[11] Pierre Baldi. A computational theory of surprise. In Information, Coding and Mathematics, pages 1–25. Springer, 2002.
[12] Pierre Baldi and Laurent Itti. Of bits and wows: a Bayesian theory of surprise with applications to attention. Neural Networks, 23(5):649–666, 2010.
[13] Luenin Barrios, Thomas Collins, Robert Kovac, and Wei-Min Shen. Autonomous 6D-docking and manipulation with non-stationary-base using self-reconfigurable modular robots. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 2913–2919. IEEE, 2016.
[14] Andrew G Barto, Satinder Singh, and Nuttapong Chentanez. Intrinsically motivated learning of hierarchical collections of skills. In Proceedings of the 3rd International Conference on Development and Learning, pages 112–19, 2004.
[15] Frédérique Bassino, Julien David, and Cyril Nicaud. On the average complexity of Moore's state minimization algorithm. In 26th International Symposium on Theoretical Aspects of Computer Science STACS 2009, pages 123–134. IBFI Schloss Dagstuhl, 2009.
[16] Leonard E Baum and John Alonzo Eagon.
An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73(3):360–363, 1967.
[17] Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[18] Jonathan Baxter et al. A model of inductive bias learning. J. Artif. Intell. Res. (JAIR), 12(149-198):3, 2000.
[19] Ron Begleiter, Ran El-Yaniv, and Golan Yona. On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421, 2004.
[20] Richard Bellman. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.
[21] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.
[22] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[23] Jean Berstel, Luc Boasson, Olivier Carton, and Isabelle Fagnot. Minimization of automata. arXiv preprint arXiv:1010.5318, 2010.
[24] Armando Blanco, Miguel Delgado, and MC Pegalajar. A genetic algorithm to obtain the optimal recurrent neural network. International Journal of Approximate Reasoning, 23(1):67–83, 2000.
[25] Larry C Boles and Kenneth J Lohmann. True navigation and magnetic maps in spiny lobsters. Nature, 421(6918):60, 2003.
[26] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
[27] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, Jose Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age.
IEEE Transactions on Robotics, 32(6):1309–1332, 2016.
[28] Francesco Paolo Cantelli. Sulla probabilità come limite della frequenza. Atti Accad. Naz. Lincei, 26(1):39–45, 1917.
[29] Anthony Cassandra, Michael L Littman, and Nevin L Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 54–61. Morgan Kaufmann Publishers Inc., 1997.
[30] Anthony Rocco Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. 1998.
[31] Bin Chen and Yongmiao Hong. Testing for the Markov property in time series. Econometric Theory, 28(1):130–178, 2012.
[32] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
[33] Lonnie Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI, volume 1992, pages 183–188, 1992.
[34] Thomas Collins and Wei-Min Shen. PaSO: an integrated, scalable PSO-based optimization framework for hyper-redundant manipulator path planning and inverse kinematics.
[35] Thomas Collins and Wei-Min Shen. Autonomous learning of POMDP state representations from surprises. In 2018 4th International Conference on Control, Automation and Robotics (ICCAR), pages 359–367. IEEE, 2018.
[36] Thomas Joseph Collins and Wei-Min Shen. Integrated and adaptive locomotion and manipulation. In 18th Toward Autonomous Robotic Systems (TAROS) Conference, to appear, 2017.
[37] Thomas Joseph Collins and Wei-Min Shen. Particle swarm optimization for high-DOF inverse kinematics. In Control, Automation and Robotics (ICCAR), 2017 3rd International Conference on, pages 1–6. IEEE, 2017.
[38] Thomas Joseph Collins and Wei-Min Shen. A robust cognitive architecture for learning from surprises.
Biologically Inspired Cognitive Architectures, 21:1–12, 2017.
[39] Thomas Joseph Collins and Wei-Min Shen. Surprise-based learning of state representations. Biologically Inspired Cognitive Architectures, 24:1–20, 2018.
[40] Gabriella Contardo, Ludovic Denoyer, Thierry Artieres, and Patrick Gallinari. Learning states representations in POMDP. In International Conference on Learning Representations (poster) ICLR 2014, pages 120–122, 2014.
[41] Paul Dagum, Adam Galper, and Eric Horvitz. Dynamic network models for forecasting. In Proceedings of the Eighth International Conference on Uncertainty in Artificial Intelligence, pages 41–48. Morgan Kaufmann Publishers Inc., 1992.
[42] Paul Dagum, Adam Galper, Eric Horvitz, Adam Seiver, et al. Uncertain reasoning and forecasting. International Journal of Forecasting, 11(1):73–87, 1995.
[43] Christos Dimitrakakis. Bayesian variable order Markov models. In AISTATS, pages 161–168, 2010.
[44] Christos Dimitrakakis. Variable order Markov decision processes: Exact Bayesian inference with an application to POMDPs. 2010.
[45] Finale Doshi, David Wingate, Joshua B. Tenenbaum, and Nicholas Roy. Infinite dynamic Bayesian networks. In Proceedings of the 28th International Conference on Machine Learning, pages 913–920, 2011.
[46] Finale Doshi-Velez. The infinite partially observable Markov decision process. In Advances in Neural Information Processing Systems, pages 477–485, 2009.
[47] Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):394–407, 2015.
[48] Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
[49] M Émile Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo (1884-1940), 27(1):247–271, 1909.
[50] Mohammadjavad Faraji, Kerstin Preuschoff, and Wulfram Gerstner. Balancing new against old information: The role of puzzlement surprise in learning. Neural Computation, 30(1):34–83, 2018.
[51] Ronald A Fisher. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4):507–521, 1915.
[52] Jonathan Folmsbee, Xulei Liu, Margaret Brandwein-Weber, and Scott Doyle. Active deep learning: Improved training efficiency of convolutional neural networks for tissue classification in oral cavity cancer. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 770–773. IEEE, 2018.
[53] Mikhail Frank, Jürgen Leitner, Marijn Stollenga, Alexander Förster, and Jürgen Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics, 7, 2013.
[54] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Elsevier, 2013.
[55] E Mark Gold. System identification via state characterization. Automatica, 8(5):621–636, 1972.
[56] E Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
[57] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[58] William L Hamilton, Mahdi Milani Fard, and Joelle Pineau. Efficient learning and planning with compressed predictive states. Journal of Machine Learning Research, 15(1):3395–3439, 2014.
[59] Paul Hebert, Max Bajracharya, Jeremy Ma, Nicolas Hudson, Alper Aydemir, Jason Reid, Charles Bergh, James Borders, Matthew Frost, Michael Hagman, et al. Mobile manipulation and mobility as manipulation-design and algorithms of RoboSimian. Journal of Field Robotics, 32(2):255–274, 2015.
[60] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[61] Kaijen Hsiao, Leslie Pack Kaelbling, and Tomas Lozano-Perez. Grasping POMDPs.
In Robotics and Automation, 2007 IEEE International Conference on, pages 4685–4692. IEEE, 2007.
[62] Pao-Lu Hsu and Herbert Robbins. Complete convergence and the law of large numbers. Proceedings of the National Academy of Sciences, 33(2):25–31, 1947.
[63] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer-Verlag Berlin Heidelberg, 2005.
[64] Laurent Itti and Pierre F Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, pages 547–554, 2006.
[65] M. R. James, S. Singh, and M. L. Littman. Planning with predictive state representations. In 2004 International Conference on Machine Learning and Applications, 2004. Proceedings., pages 304–311, Dec 2004.
[66] Michael R James, Satinder Singh, and Michael L Littman. Planning with predictive state representations. In Machine Learning and Applications, 2004. Proceedings. 2004 International Conference on, pages 304–311. IEEE, 2004.
[67] Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
[68] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
[69] Sisir Karumanchi, Kyle Edelberg, Ian Baldwin, Jeremy Nash, Jason Reid, Charles Bergh, John Leichty, Kalind Carpenter, Matthew Shekels, Matthew Gildner, et al. Team RoboSimian: semi-autonomous mobile manipulation at the 2015 DARPA Robotics Challenge finals. Journal of Field Robotics, 34(2):305–332, 2017.
[70] Sammie Katt, Frans A Oliehoek, and Christopher Amato. Learning in POMDPs with Monte Carlo tree search. In International Conference on Machine Learning, pages 1819–1827, 2017.
[71] James Kennedy. Particle swarm optimization. In Encyclopedia of Machine Learning, pages 760–766. Springer, 2011.
[72] Jens Kober, J Andrew Bagnell, and Jan Peters.
Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[73] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
[74] John R Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection, volume 1. MIT Press, 1992.
[75] Solomon Kullback. Information Theory and Statistics. Courier Corporation, 1997.
[76] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[77] Shane Legg. Machine Super Intelligence. PhD thesis, University of Lugano, 2008.
[78] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
[79] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[80] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
[81] Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 163–168. IEEE, 2011.
[82] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
[83] Daniel Ying-Jeh Little and Friedrich Tobias Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7:37, 2013.
[84] Michael L Littman, Richard S Sutton, Satinder Singh, et al. Predictive representations of state.
Advances in Neural Information Processing Systems, 2:1555–1562, 2002.
[85] F. Liu, X. Jin, and Y. She. No-fringe U-tree: An optimized algorithm for reinforcement learning. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pages 278–282, Nov 2016.
[86] Yunlong Liu, Hexing Zhu, Yifeng Zeng, and Zongxiong Dai. Learning predictive state representations via Monte-Carlo tree search. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 3192–3198. AAAI Press, 2016.
[87] M Mahmud. Constructing states for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 727–734, 2010.
[88] Andrei Andreevich Markov. The theory of algorithms. Trudy Matematicheskogo Instituta Imeni VA Steklova, 42:3–375, 1954.
[89] Andrew Kachites McCallum. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1996.
[90] R Andrew McCallum. Instance-based utile distinctions for reinforcement learning with hidden state. In ICML, pages 387–395, 1995.
[91] R Andrew McCallum, G Tesauro, D Touretzky, and T Leen. Instance-based state identification for reinforcement learning. Advances in Neural Information Processing Systems, pages 377–384, 1995.
[92] Peter McCracken and Michael Bowling. Online discovery and learning of predictive state representations. Advances in Neural Information Processing Systems, 18:875, 2006.
[93] George H Mealy. A method for synthesizing sequential circuits. Bell System Technical Journal, 34(5):1045–1079, 1955.
[94] Melanie Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1998.
[95] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[96] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[97] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2125–2133, 2015.
[98] Michael Montemerlo, Sebastian Thrun, Daphne Koller, Ben Wegbreit, et al. FastSLAM: A factored solution to the simultaneous localization and mapping problem. 2002.
[99] Andrew William Moore. Efficient memory-based learning for robot control. 1990.
[100] Edward F Moore. Gedanken-experiments on sequential machines. Automata Studies, 34:129–153, 1956.
[101] Richard Morris. Developments of a water-maze procedure for studying spatial learning in the rat. Journal of Neuroscience Methods, 11(1):47–60, 1984.
[102] Kevin P Murphy. A survey of POMDP solution techniques. environment, 2:X3, 2000.
[103] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[104] Daniel Nikovski. State-aggregation algorithms for learning probabilistic models for robot control. PhD thesis, Carnegie Mellon University, 2002.
[105] Stefano Nolfi and Dario Floreano. Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. MIT Press, 2000.
[106] Emmanuelle Normand and Christophe Boesch. Sophisticated Euclidean maps in forest chimpanzees. Animal Behaviour, 77(5):1195–1201, 2009.
[107] Travis E Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3), 2007.
[108] Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
[109] Pierre-Yves Oudeyer and Frederic Kaplan.
What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1, 2007.
[110] Günther Palm. Novelty, Information and Surprise. Springer Science & Business Media, 2012.
[111] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv:1705.05363, 2017.
[112] Charles Sanders Peirce and Andreas Hetzel. How to make our ideas clear. 1878.
[113] George Philipp and Jaime G Carbonell. Nonparametric neural networks. arXiv preprint arXiv:1712.05440, 2017.
[114] Jean Piaget. The origin of intelligence in the child, 1953.
[115] Joelle Pineau, Geoffrey Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.
[116] Leonard Pitt and Manfred K Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. In Structure in Complexity Theory Conference, 1989. Proceedings., Fourth Annual, page 230. IEEE, 1989.
[117] Nadeesha Ranasinghe and Wei-Min Shen. Surprise-based learning for developmental robotics. In Learning and Adaptive Behaviors for Robotic Systems, 2008. LAB-RS'08. ECSIS Symposium on, pages 65–70. IEEE, 2008.
[118] Nadeesha Oliver Ranasinghe. Learning to detect and adapt to unpredicted changes. PhD thesis, University of Southern California, 2012.
[119] Ronald L Rivest and Robert E Schapire. Inference of finite automata using homing sequences. In Machine Learning: From Theory to Applications, pages 51–73. Springer, 1993.
[120] Ronald L Rivest and Robert E Schapire. Diversity-based inference of finite automata. Journal of the ACM (JACM), 41(3):555–589, 1994.
[121] Paul S. Rosenbloom, Abram Demski, and Volkan Ustun. Toward a neural-symbolic Sigma: Introducing neural network learning. In Proceedings of the 15th Annual Meeting of the International Conference on Cognitive Modeling, 2017.
[122] Paul S Rosenbloom, Abram Demski, and Volkan Ustun.
Rethinking Sigma's graphical architecture: An extension to neural networks. In International Conference on Artificial General Intelligence, pages 84–94. Springer, 2016.
[123] Matthew Rosencrantz, Geoffrey Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the Twenty-First International Conference on Machine Learning, page 88. ACM, 2004.
[124] Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, and Pierre Kreitmann. A Bayesian approach for learning and planning in partially observable Markov decision processes. Journal of Machine Learning Research, 12(May):1729–1770, 2011.
[125] Nicholas Roy, Geoffrey J Gordon, and Sebastian Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research (JAIR), 23:1–40, 2005.
[126] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. 1994.
[127] Stuart Jonathan Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd edition), 2009.
[128] Régis Sabbadin, Jérôme Lang, and Nasolo Ravoanjanahry. Purely epistemic Markov decision processes. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1057. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2007.
[129] Doyen Sahoo, Quang Pham, Jing Lu, and Steven C. H. Hoi. Online deep learning: Learning deep neural networks on the fly. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2660–2666. International Joint Conferences on Artificial Intelligence Organization, 7 2018.
[130] Behnam Salemi, Mark Moll, and Wei-Min Shen. SUPERBOT: A deployable, multi-functional, and modular self-reconfigurable robotic system. Beijing, China, October 2006.
[131] Sven Sandberg. Homing and synchronizing sequences. In Model-Based Testing of Reactive Systems, pages 5–33. Springer, 2005.
[132] Simo Särkkä. Bayesian Filtering and Smoothing, volume 3.
Cambridge University Press, 2013.
[133] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. 2018.
[134] Guy Shani. Learning and solving partially observable Markov decision processes. PhD thesis, Ben-Gurion University of the Negev, 2007.
[135] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[136] Wei-Min Shen. Learning from the environment based on percepts and actions. PhD thesis, Carnegie Mellon University, 1989.
[137] Wei-Min Shen. Complementary discrimination learning: A duality between generalization and discrimination. In Proceedings of the Eighth National Conference on Artificial Intelligence, 1990.
[138] Wei-Min Shen. Complementary discrimination learning with decision lists. In AAAI, pages 153–158, 1992.
[139] Wei-Min Shen. Discovering regularities from knowledge bases. International Journal of Intelligent Systems, 7(7):623–635, 1992.
[140] Wei-Min Shen. Discovery as autonomous learning from the environment. Machine Learning, 12(1):143–165, 1993.
[141] Wei-Min Shen. Learning finite automata using local distinguishing experiments. In IJCAI, pages 1088–1093, 1993.
[142] Wei-Min Shen. Autonomous Learning from the Environment (Foreword by Herbert A. Simon). Computer Science Press, WH Freeman, 1994.
[143] Wei-Min Shen and Herbert A Simon. Rule creation and rule learning through environmental exploration. In Proceedings of Eleventh International Joint Conference on Artificial Intelligence, 1989.
[144] Wei-Min Shen and Herbert A Simon. Fitness requirements for scientific theories containing recursive theoretical terms. The British Journal for the Philosophy of Science, 44(4):641–652, 1993.
[145] Karl Sigman. Lecture notes on Borel-Cantelli lemmas. http://www.columbia.edu/~ks20/stochastic-I/stochastic-I-BC.pdf, 2009.
[146] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[147] H Simon and G Lea. Problem solving and rule induction: A unified view. Knowledge and Cognition, pages 105–128, 1974.
[148] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[149] Satinder Singh, Michael L Littman, Nicholas K Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In ICML 2003, pages 712–719.
[150] R.J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 1964.
[151] R.J. Solomonoff. A formal theory of inductive inference. Part II. Information and Control, 7(2):224–254, 1964.
[152] Jan Storck, Sepp Hochreiter, and Jürgen Schmidhuber. Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the International Conference on Artificial Neural Networks, Paris, volume 2, pages 159–164, 1995.
[153] Yi Sun, Faustino Gomez, and Jürgen Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In International Conference on Artificial General Intelligence, pages 41–51. Springer, 2011.
[154] Richard S Sutton, Andrew G Barto, et al. Reinforcement Learning: An Introduction. MIT Press, 1998.
[155] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, 2005.
[156] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
[157] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012.
[158] Edward Chace Tolman and Charles H Honzik. Introduction and removal of reward, and maze performance in rats.
University of California Publications in Psychology, 1930.
[159] Paul E Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161–186, 1989.
[160] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[161] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
[162] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.
[163] Britton Wolfe, Michael R James, and Satinder Singh. Learning predictive state representations in dynamical systems without reset. In Proceedings of the 22nd International Conference on Machine Learning, pages 980–987. ACM, 2005.
[164] Patricia J. Wozniak. Applied nonparametric statistics (2nd ed.). Technometrics, 33(3):364–365, 1991.
[165] Liping Yang, Alan M MacEachren, Prasenjit Mitra, and Teresa Onorati. Visually-enabled active deep learning for (geo) text and image classification: A review. ISPRS International Journal of Geo-Information, 7(2):65, 2018.
[166] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
[167] Emaad Mohamed H Zahugi, Mohamed M Shanta, and TV Prasad. Oil spill cleaning up using swarm of robots. Advances in Computing and Information Technology, pages 215–224, 2013.
[168] Aonan Zhang and John Paisley. Deep Bayesian nonparametric tracking. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5833–5841. PMLR, 10–15 Jul 2018.
[169] Lei Zheng, Siu-Yeung Cho, and Chai Quek. Reinforcement based U-Tree: A novel approach for solving POMDP. In Handbook on Decision Making, pages 205–232. Springer, 2010.
Appendix A

Proofs of Theoretical Results

A.1 Proofs of Predictive SDE Theoretical Results

Lemma 1 (Any simple experiment longer than some finite random length fails with probability 1). If, for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$, then, for some finite random integer $D$, the $k$-action simple experiment $e_k$ will fail for all $k > D$ with probability 1 (almost surely).

Proof. Recall that $E_k = 1$ is the event that experiment $e_k$ succeeds. We calculate $P(E_k = 1)$ in terms of the underlying POMDP $\mathcal{E}$ as follows:

$P(E_k = 1) = P(o_{t+1}, \ldots, o_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.1)
$= \sum_{s_t, \ldots, s_{t+k}} P(o_{t+1}, \ldots, o_{t+k}, s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.2)
$= \sum_{s_t, \ldots, s_{t+k}} P(o_{t+1}, \ldots, o_{t+k} \mid s_t, \ldots, s_{t+k}, a_t, \ldots, a_{t+k-1})\, P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.3)
$= \sum_{s_t, \ldots, s_{t+k}} P(o_{t+1}, \ldots, o_{t+k} \mid s_{t+1}, \ldots, s_{t+k})\, P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.4)
$= \sum_{s_t, \ldots, s_{t+k}} \Big( \prod_{i=t+1}^{t+k} P(o_i \mid s_i) \Big)\, P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.5)

where the final two steps (Equations A.4 and A.5) utilize the sensor Markov assumption of the underlying POMDP state space. Since we do not know the particular values $o_{t+1}, \ldots, o_{t+k}$, we proceed by providing an upper bound on the probability of any $k$-length simple experiment and showing that summing this upper bound from $k = 0$ to $\infty$ converges to a finite value. Define $\beta \triangleq \max_{o,s} \theta_s^o$, i.e., the maximum probability of observing any observation in any state, and substitute this maximum value in for each $P(o_i \mid s_i)$:

$P(E_k = 1) = \sum_{s_t, \ldots, s_{t+k}} \Big( \prod_{i=t+1}^{t+k} P(o_i \mid s_i) \Big)\, P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.6)
$\le \beta^k \sum_{s_t, \ldots, s_{t+k}} P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})$  (A.7)
$= \beta^k$  (A.8)

where the final step (Equation A.8) utilizes the fact that $\sum_{s_t, \ldots, s_{t+k}} P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1}) = 1$, since it is a valid probability distribution.
Thus:

$P(E_k = 1) \le \beta^k$  (A.9)

where $\beta = \max_{o,s} \theta_s^o$. Recall that the Borel-Cantelli lemma states that if $G_1, G_2, \ldots$ is a sequence of events in a probability space and the sum of their probabilities is finite (i.e., $\sum_{k=1}^{\infty} P(G_k) < \infty$), then the probability that infinitely many of these events occur is 0. Consider the sequence of events $\{E_1 = 1\}, \{E_2 = 1\}, \ldots$, where, in general, $G_k$ is the event $\{E_k = 1\}$. Since, by assumption, for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$, we have $0 < \beta < 1$, and:

$\sum_{k=1}^{\infty} P(G_k) \le \sum_{k=0}^{\infty} \beta^k = \frac{1}{1-\beta} < \infty$  (A.10)

Thus, $E_k = 1$ occurs only finitely many times and, consequently, for $k > D$ for some finite random integer $D$, $E_k$ always takes on the value 0 (i.e., fails) with probability 1 (almost surely), completing the proof.

Lemma 2 (Any compound experiment longer than some finite random length fails with probability 1). If, for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$, then, for some finite random integer $D$, the $k$-action compound experiment $e_{c,k}$ will fail for all $k > D$ with probability 1 (almost surely).

Proof. We begin by noting the following: the more trajectories covered by compound experiment $e_{c,k}$, the higher its probability of succeeding must necessarily be. We will again proceed by finding an upper bound on the probability of any compound experiment of length $k$ by assuming that, at each time step $i \in \{t, \ldots, t+k-1\}$, all $|A|$ possible actions are allowed to be selected, and that, for exactly one action allowed at each time step $i$, only $|O|-1$ observations are allowed at time step $i+1$. For all other actions at time step $i$, any observation is allowed at time step $i+1$. Without loss of generality, assume that, at each time step $i$, actions $a_i^1, \ldots, a_i^{|A|-1}$ are the $|A|-1$ actions for which any observation is allowed at time step $i+1$, while $a_i^{|A|}$ is the action for which only $|O|-1$ observations are allowed at time step $i+1$. Let $\mu$ be the maximum, over all states, of the probability of observing any one of the most likely set of $|O|-1$ observations.
Since all observations are mutually exclusive, this is simply the largest sum of $|O|-1$ observation probabilities possible in any state. Note that, since for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$ by assumption, $0 < \mu < 1$. We now write $P(E_{c,k} = 1)$ in terms of the underlying POMDP:

$P(E_{c,k} = 1) = \sum_{a_t, \ldots, a_{t+k-1}} P(\{o_{t+1}^r\}, \ldots, \{o_{t+k}^r\} \mid a_t, \ldots, a_{t+k-1})\, P(a_t, \ldots, a_{t+k-1})$  (A.11)
$= \sum_{s_t, \ldots, s_{t+k}} \sum_{a_t, \ldots, a_{t+k-1}} P(\{o_{t+1}^r\}, \ldots, \{o_{t+k}^r\}, s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})\, P(a_t, \ldots, a_{t+k-1})$  (A.12)
$= \sum_{s_t, \ldots, s_{t+k}} \sum_{a_t, \ldots, a_{t+k-1}} P(\{o_{t+1}^r\}, \ldots, \{o_{t+k}^r\} \mid s_t, \ldots, s_{t+k}, a_t, \ldots, a_{t+k-1})\, P(s_t, \ldots, s_{t+k} \mid a_t, \ldots, a_{t+k-1})\, P(a_t, \ldots, a_{t+k-1})$
$= \sum_{s_t, \ldots, s_{t+k}} \sum_{a_t, \ldots, a_{t+k-1}} \Big( \prod_{i=t+1}^{t+k} P(\{o_i^r\} \mid s_i, a_{i-1})\, P(s_i \mid s_{i-1}, a_{i-1})\, P(a_{i-1}) \Big)\, P(s_t)$  (A.13)

where the final step (Equation A.13) is due to the sensor and transition Markov properties of the underlying POMDP state space, and we have also used the fact that, since the action random variables at all time steps $i-1$, $A_{i-1}$, are independent, $P(a_t, \ldots, a_{t+k-1}) = P(a_t)\, P(a_{t+1}) \cdots P(a_{t+k-1})$. $P(a_{i-1}) = \frac{1}{|A|}$ for all time steps $i$ because any action can be chosen at any time step and actions are chosen uniformly at random. We know that $|\{o_i^r\}| = |O|$ when $a_{i-1} \in \{a_{i-1}^1, \ldots, a_{i-1}^{|A|-1}\}$, and, thus, $P(\{o_i^r\} \mid s_i, a_{i-1}^j) = 1$ when $j \in \{1, \ldots, |A|-1\}$. $P(\{o_i^r\} \mid s_i, a_{i-1}^{|A|}) < 1$ when $a_{i-1} = a_{i-1}^{|A|}$. We see that, if we marginalize out over $a_{i-1}$ and $s_{i-1}$ at each time step $i$, Equation A.13 can be defined as a recursive procedure in which the (in general, unnormalized) joint distribution over the current state and observations $P(S_i, \{o_{t+1}^r\}, \ldots, \{o_i^r\})$ is computed from the (in general, unnormalized) joint distribution over the state and observations up to the previous time step $P(S_{i-1}, \{o_{t+1}^r\}, \ldots, \{o_{i-1}^r\})$. At the final time step, $S_{t+k}$ is marginalized out to find $P(\{o_{t+1}^r\}, \ldots, \{o_{t+k}^r\})$.
At the first time step, the normalized distribution $P(S_t)$ is used to compute $P(S_{t+1}, \{o_{t+1}^r\})$, where $P(\{o_{t+1}^r\}) = \sum_{s_{t+1}} P(s_{t+1}, \{o_{t+1}^r\})$. We proceed by finding the following upper bound: $P(\{o_{t+1}^r\}) \le \gamma$, which is analogous to $\beta$ in the proof of Lemma 1. Clearly, due to the recursive nature of this computation, if $P(\{o_{t+1}^r\}) \le \gamma$ and $0 < \gamma < 1$, then $P(\{o_{t+1}^r\}, \{o_{t+2}^r\}) \le \gamma^2$ and, in general, $P(\{o_{t+1}^r\}, \ldots, \{o_{t+k}^r\}) \le \gamma^k$. Thus:

$P(\{o_{t+1}^r\}) = P(E_{c,1} = 1)$  (A.14)
$= \sum_{a_t} \sum_{s_{t+1}} P(\{o_{t+1}^r\} \mid s_{t+1}, a_t) \sum_{s_t} P(s_{t+1} \mid s_t, a_t)\, P(a_t)\, P(s_t)$  (A.15)
$= \sum_{a_t} \sum_{s_{t+1}} P(\{o_{t+1}^r\} \mid s_{t+1}, a_t) \sum_{s_t} P(s_{t+1}, s_t, a_t)$  (A.16)
$= \sum_{a_t} \sum_{s_{t+1}} P(\{o_{t+1}^r\} \mid s_{t+1}, a_t)\, P(s_{t+1}, a_t)$  (A.17)
$= \sum_{j=1}^{|A|-1} \sum_{s_{t+1}} P(\{o_{t+1}^r\} \mid s_{t+1}, a_t^j)\, P(s_{t+1}, a_t^j) + \sum_{s_{t+1}} P(\{o_{t+1}^r\} \mid s_{t+1}, a_t^{|A|})\, P(s_{t+1}, a_t^{|A|})$  (A.18)
$\le \sum_{j=1}^{|A|-1} \sum_{s_{t+1}} P(s_{t+1}, a_t^j) + \mu \sum_{s_{t+1}} P(s_{t+1}, a_t^{|A|})$  (A.19)
$= \sum_{j=1}^{|A|-1} P(a_t^j) + \mu\, P(a_t^{|A|})$  (A.20)
$= \sum_{j=1}^{|A|-1} \frac{1}{|A|} + \frac{\mu}{|A|} = \frac{|A|-1+\mu}{|A|} < 1$  (A.21)

Thus, $\gamma = (|A|-1+\mu)/|A|$. Equation A.18 separates out the value $a_t^{|A|}$ from the summation over the possible values of $A_t$ such that all remaining terms in the summation from $j = 1$ to $|A|-1$ have the same upper bound in observation probability at time step $t+1$. Equation A.19 substitutes in these upper bounds for the observation probabilities, recalling that $P(\{o_{t+1}^r\} \mid s_{t+1}, a_t^{|A|}) \le \mu$ and $P(\{o_{t+1}^r\} \mid s_{t+1}, a_t^j) = 1$ for $j \in \{1, \ldots, |A|-1\}$. Equation A.21 utilizes the fact that $P(a_t) = 1/|A|$ for all possible values of $A_t$ and the fact that $0 < \mu < 1$.

Thus, we have that $P(E_{c,k} = 1) \le \gamma^k$ for a value $\gamma$ where $0 < \gamma < 1$. Analogously to Lemma 1, we define the sequence of events $G_{c,k} = \{E_{c,k} = 1\}$, and:

$\sum_{k=1}^{\infty} P(G_{c,k}) \le \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma} < \infty$  (A.22)

Thus, $E_{c,k} = 1$ occurs only finitely many times and, consequently, for $k > D$ for some finite random integer $D$, $E_{c,k}$ always takes on the value 0 (i.e., fails) with probability 1 (almost surely), completing the proof.

Theorem 1 (All PSDEs have a finite length).
Assume that, for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$. $\mathcal{M}$ is a PSDE model of rewardless POMDP environment $\mathcal{E}$ learned by either Max Predict or PSBL for PSDEs (Algorithms 2 and 3, respectively, in Chapter 5). Let $V$ denote the set of PSDEs in $\mathcal{M}$. $|v_i|$ denotes the length of $v_i \in V$'s experiment $e_{v_i}$. With probability 1 (almost surely), there exists a finite random integer $D$ such that $\forall_{v_i \in V}\ |v_i| \le D$.

Proof. Each $v_i \in V$ is associated with a $|v_i|$-action simple or compound experiment, $e_{v_i}$. $|v_i|$ increases by 1 every time PSDE $v_i$ is split (only splitting operations increase $|v_i|$), but splitting only occurs if all of $v_i$'s one-step extension experiments succeed at least one time during some finite number of experiments. Assuming that for all $o \in O$ and $s \in S$, $0 < \theta_s^o < 1$, we can use Lemmas 1 and 2 to assert that, for each $v_i$, there exists a finite random integer $D_i$ past which at least one of $e_{v_i}$'s one-step extension experiments will fail with probability 1 (almost surely). Define $D \triangleq \max_i D_i$. No PSDE of length $D$ will be split, and thus, no PSDE has length longer than $D$ (i.e., $\forall_{v_i \in V}\ |v_i| \le D$), and the result is proved.

Theorem 2 (Computational Complexity). The worst-case computational complexity of the Max Predict and PSBL for PSDEs algorithms (Algorithms 2 and 3 in Chapter 5) is $O(D|A|^{2D+1}|O|^{2D+2})$, where $D$ is the maximum length of any PSDE in the agent's environment model $\mathcal{M}$.

Proof. In this proof, the term PSDE learning algorithm refers to both Max Predict (Algorithm 2) and PSBL for PSDEs (Algorithm 3) in Chapter 5. We first note that, as discussed in Section 5.3.1.4, the PSDE learning algorithm saves the pairs of PSDE experiments involved in all split, refine, and merge operations in a hash map, which is checked before each operation to ensure that the same operation is never performed more than once. This prevents any possible merge-split or merge-refine infinite loops, ensuring that the algorithm terminates in finite time.
The complexity of each split, refine, and merge operation upper bounds the amount of time required for this check (and for the hashing of experiments), so this does not increase the overall complexity of the algorithm proved below.

We begin by showing three important bounds. First, Theorem 1 ensures the existence of a finite integer $D$ bounding the number of actions and observations (and time steps represented) in any PSDE's experiment. This immediately implies that the learning agent's model can contain no more than $O((|A||O|)^D)$ PSDEs, one per simple $D$-length experiment. Second, in a compound experiment, the learning agent can choose from, at most, $|A||O|-1$ choices of action and associated observation at each time step. This implies that processing all the elements of a PSDE's experiment takes $O(D|A||O|)$ time. Third, a compound PSDE can be associated with, at most, $|A||O|-1$ probability distributions over the set of possible observations. Thus, processing all the elements of a PSDE's probability distributions takes $O(|A||O|^2)$ time.

Each attempted or successful splitting operation takes $O(D|A|^2|O|^3)$ time, requiring a pass over each of the $O(D|A||O|)$ elements in the experiment and each of the $O(|A||O|^2)$ elements of the probability distributions of each of the $O(|A||O|)$ one-step extension PSDEs of the PSDE being split. Each successful or attempted refinement and merging operation requires $O(D|A||O|^2)$ time to process the elements in the experiments and probability distributions of at most two PSDEs.

We will now bound the worst-case runtime of the PSDE learning algorithm by bounding the number of attempted splits, refines, and merges, which obviously upper bounds the number of successful splits, refines, and merges. Since the learning agent's model can contain no more than $O((|A||O|)^D)$ PSDEs and no identical operations are ever repeated, the number of successful and attempted refines and the number of successful merges are both $O((|A||O|)^D)$.
The number of successful splits is $O((|A||O|)^{D-1})$. We can bound the number of while loop iterations by noting that, after all successful splits, refines, and merges have completed in $O((|A||O|)^D)$ time, at each iteration, the algorithm performs the experimentation procedure (Section 5.3.1.1, which requires $O(D(|A||O|)^{D+1})$ time to process sub-trajectories) and attempts to perform a split. The learning agent's model has no more than $O((|A||O|)^D)$ PSDEs, so the algorithm will perform experimentation and attempt to split $O((|A||O|)^D)$ additional times before every PSDE has a successCount >= convergeTol, and the while loop terminates after $O((|A||O|)^D)$ iterations. In the worst case, $O((|A||O|)^D)$ merges are attempted at each iteration, since we check every other PSDE for merging eligibility with the selected PSDE. The $O((|A||O|)^D \cdot (|A||O|)^D \cdot D|A||O|^2) = O(D|A|^{2D+1}|O|^{2D+2})$ time required by merging over the course of all iterations provides an upper bound on the time required for all experimentation, splits, refines, and merges throughout the course of the algorithm, and the result is proved.

A.2 Proofs of sPOMDP Theoretical Results

Theorem 3. For any Moore machine environment $\mathcal{E}$, there exists a perfect sPOMDP model $\mathcal{M}$ where $|M| \le |Q|$, $|S| \le |Q|-1$, and each SDE $s \in S$ has, at most, $|Q|-1$ actions, where $|Q|$ is the number of environment states in the minimal representation of $\mathcal{E}$.

Proof. We assume that $\mathcal{E}$ is the minimal representation (Definition 5) of the given Moore machine environment. This assumption is justified by the fact that polynomial-time algorithms exist that convert non-minimal Moore machines into minimal ones [15, 23] and would not increase the worst-case runtime of the procedure we describe to prove this theorem. For the purposes of this proof, we call an sPOMDP with deterministic transition and observation functions an SDE Moore Machine.
We prove this theorem constructively by providing a polynomial-time algorithm, Convert (Algorithm 21), that converts any minimal Moore machine environment into a perfect SDE Moore Machine model. We say that a model state $m \in M$ with a $(k+1)$-length outcome sequence $g_m = (o_1, o_2, \ldots, o_{k+1})$ generated using $k$-action SDE $s = (a_1, \ldots, a_k)$ can cover environment state $q$ if $\varphi(q, s) = g_m$, where $\varphi(q, s)$ represents the observation sequence generated by applying action sequence (SDE) $s$ to environment $\mathcal{E}$, starting from state $q$.

This algorithm begins by initializing one model state per observation $o \in O$ (lines 2-8). The outcome sequence of each of these model states is simply the observation with which it is initialized. $S = [\,]$ initially, because a null SDE can distinguish environment states that are not visibly equivalent (i.e., states emitting distinct observations). Each environment state $q \in Q$ is initially covered by the model state $m \in M$ to which it is visibly equivalent (lines 7-8). By construction, each environment state remains covered by exactly one model state throughout this algorithm, though a single model state may cover multiple environment states. Let $\{q_m\}$ represent the set of environment states covered by model state $m$.

For each model state $m \in M$, we execute each action $a \in A$ from each $q_m^i \in \{q_m\}$ (the three foreach loops in lines 12-40), resulting in new most likely environment states covered by model states $\{m_1', \ldots, m_{|\{q_m\}|}'\}$ (line 16). Let $f$ be the first observation of $m$'s outcome sequence (line 17). Let $uM$ be the distinct (non-repeated) outcome sequences in the set $\{f + a + m_1', \ldots, f + a + m_{|\{q_m\}|}'\}$ (guaranteed by lines 19-20).

If the size of $uM$ is greater than 1 (line 23), the current model state space $M$ is insufficient to fully capture the dynamics of state space $Q$, because at least two environment states covered by the same model state transition to different model states under the same action $a$.
This indicates that we need to split model state $m$ into at least two model states representing the various possibilities of model state transitions from $m$ to $m' \in uM$ under the action $a$ by underlying environment states covered by model state $m$. To perform this splitting procedure, we first set change $\leftarrow$ true to indicate that the model will be modified (line 24). We then find a pair of model states $u_1 \ne u_2 \in uM$ (line 25) whose outcome sequences are identical up to a first difference in observation. The recursive construction of these outcome sequences guarantees that such a pair exists whenever the model must be split. Let $s$ represent the actions shared by $u_1$ and $u_2$ up to their first difference in observation (line 26). We add $u_1$ and $u_2$ to the set of model states $M$ in line 28. For each $u \in \{u_1, u_2\}$ and for each environment state $q$ that caused the transition to model state $u$ under action $a$, we add $q$ to the list of environment states covered by $u$ (line 30), erase $q$ from the list of environment states covered by $m$ (line 31), and map $q$ to its new covering model state $u$ (line 32). If $m$ no longer covers any environment states (line 33), we erase $m$ from mToQ and from the model state set $M$ (lines 34-35) to maintain the invariant that all model states cover at least one environment state. Finally, we add $s$ to the set of SDEs $S$ (line 36). Note that, by construction, SDE $s$ is a separating sequence (Definition 6) for at least two environment states that consists of an action $a$ prepended to an SDE with one fewer action (which itself distinguishes at least two environment states). We then break out of all the foreach loops (lines 37-40), returning to the main while loop of the algorithm.

After the termination of the main while loop, line 41 computes the transition and emission functions of the SDE Moore Machine model. This can be done very simply.
At the end of the main while loop, by construction (guaranteed by line 23), each underlying environment state covered by each model state transitions to the same model state upon executing each action. So, for each model state-action pair $(m, a)$, we can choose any environment state $q_m^i \in \{q_m\}$ and set $\tau(m, a)$ equal to the model state covering $\delta(q_m^i, a)$, where $\tau$ and $\delta$ denote the model and environment transition functions, respectively. The emission function is simply $\omega(m) = m$.firstObservation(), where $m$.firstObservation() is the first observation in $m$'s outcome sequence.

We now state and prove three lemmas that establish the correctness and polynomial complexity of the Convert procedure (Algorithm 21), completing the proof of this theorem.

Lemma 3. The Convert algorithm (Algorithm 21) terminates in $O(|Q|^4|A|)$ time with a model $\mathcal{M}$ with no more than $|Q|$ model states ($|M| \le |Q|$) and no more than $|Q|-1$ SDEs ($|S| \le |Q|-1$), with each SDE having no more than $|Q|-1$ actions and each outcome sequence associated with each model state having no more than $|Q|$ observations.

Proof. First, we show that the main while loop of the Convert algorithm (lines 10-40, Algorithm 21) terminates in, at most, $|Q|-|O|$ iterations. Recall that we have assumed $|O| \le |Q|$. In each while loop iteration, $\mathcal{M}$ is either changed, increasing the number of SDEs $|S|$ and model states $|M|$, or it is not modified and the algorithm terminates. In the worst case, each of the $|Q|$ environment states has unique dynamics and a perfect sPOMDP model $\mathcal{M}$ will require $|M| = |Q|$ model states. Recall that $\mathcal{M}$ is initialized with $|O|$ model states (lines 2-8), each of which covers at least one environment state (recall that we assume that each $o \in O$ is emitted by at least one environment state). Furthermore, by construction (lines 27-35), each time $\mathcal{M}$ is modified, each model state $m$ still covers at least one environment state. Also, by construction (lines 8, 32), each environment state is covered by a unique model state at all times throughout the algorithm (the qToM map). We call this the unique model state invariant.
Assume that the Convert procedure terminates in more than $|Q|-|O|$ while loop iterations, each of which must modify $\mathcal{M}$ by adding at least one SDE and model state. Then $\mathcal{M}$ would contain at least $|Q|+1$ model states. There must, then, exist a model state that does not cover any environment state, because there are more model states than environment states and each environment state is uniquely mapped to exactly one model state, resulting in a contradiction of the unique model state invariant. Therefore, the algorithm must finish in, at most, $|Q|-|O|$ while loop iterations with, at most, $|Q|$ model states. Since $|O| \le |Q|$, we can conclude the while loop runs for $O(|Q|)$ iterations.

We have shown that the Convert procedure (Algorithm 21) results in a model $\mathcal{M}$ with no more than $|Q|$ model states. We now prove that each model state has no more than $|Q|$ observations in its outcome sequence and that this procedure results in no more than $|Q|-1$ SDEs, each of which has no more than $|Q|-1$ actions. The main while loop (lines 10-40) runs for, at most, $|Q|-|O|$ iterations, and $|O| \ge 2$ (by assumption). In the worst case, the while loop runs for $|Q|-2$ iterations, adding at least one new model state (whose outcome sequences are, at most, one action-observation pair longer than the longest outcome sequence in the previous while loop iteration) and one new SDE (whose number of actions is, at most, one larger than the number of actions in the longest SDE in the previous while loop iteration) at each iteration. Since each model state begins with an outcome sequence of length one (a single observation), when the algorithm terminates, no model state can have an outcome sequence with more than $|Q|-2+1 = |Q|-1 < |Q|$ observations. Since each while loop iteration adds, at most, one SDE to $\mathcal{M}$ and increases the length of the longest SDE by, at most, one action (recall that the model is initialized with no SDEs), the number of SDEs and the number of actions in each SDE must both be upper bounded by $|Q|-1$, as required.
The initialization of model states (lines 2-8) requires O(|Q|) time to create |O| initial model states and associate each environment state q ∈ Q with the appropriate initial model state (as |O| ≤ |Q|). Each of the O(|Q|) iterations of the main while loop in the Convert procedure (lines 10-40) requires a pass over each of the |Q||A| possible environment state and action pairs (spread across all model states, due to the unique mapping). For each such pair, we spend O(|Q|) time copying outcome sequences (each upper bounded in length by |Q|). For one such pair (the last one, in the worst case), we spend O(|Q|^3) time in the splitting procedure (lines 24-36), which is dominated by the search for a pair of model states whose outcome sequences are identical up to a first difference in observation (line 25). This directly leads to a worst-case runtime of O(|Q|(|Q|^2 |A| + |Q|^3)) = O(|Q|^3 |A| + |Q|^4) = O(|Q|^4 |A|), polynomial in the size of the state and action space of the given environment E, as required.

Lemma 4. Let {q_m} denote the set of environment states covered by model state m. If, for every triple (m, a, m') ∈ M × A × M in model M, q' = δ(q, a) is covered by m' = δ(m, a) for all q ∈ {q_m} and all q' ∈ {q_{m'}}, then M is a perfect model of environment E.

Proof. Note that, by construction, the Convert procedure of Algorithm 21 guarantees that the model returned by this procedure satisfies the property that, for every triple (m, a, m') ∈ M × A × M in model M, q' = δ(q, a) is covered by m' = δ(m, a) for all q ∈ {q_m} and all q' ∈ {q_{m'}}. We call this the unique transition property. By construction, it also clearly satisfies the property that for all m ∈ M, every q ∈ {q_m} is visibly equivalent to m, meaning that φ(q) matches the first observation in the outcome sequence of m.
The unique transition property ensures that, whenever the agent transitions from model state m ∈ M to model state m' ∈ M under action a for any (m, a, m') triple in M × A × M, the corresponding transition between environment states q and q' under action a will be such that m covers q and m' covers q'. Thus, q' will be visibly equivalent to m', and m' will also be guaranteed to uphold the unique transition property. Using this argument inductively, we see that, regardless of the actions the agent executes, the model state resulting from any transition it makes will necessarily be visibly equivalent to the underlying environment state resulting from that transition. Therefore, M is a perfect model of environment E.

Lemma 5. The Convert algorithm (Algorithm 21) terminates in polynomial time with an SDE Moore Machine model M that is a perfect model of input Moore machine environment E.

Proof. Lemma 4 proves that if SDE Moore Machine model M has the unique transition property, then M is a perfect model of environment E. The Convert procedure (Algorithm 21) explicitly tests for this property (line 23) and does not terminate until it is met. Lemma 3 proves that the Convert procedure terminates in a finite amount of time (polynomial in the size of the underlying environment state space and the number of actions). Together, these lemmas prove that the Convert procedure terminates in polynomial time with an SDE Moore Machine model M that is a perfect model of Moore machine environment E.

Together, Lemmas 3, 4, and 5 complete the proof of Theorem 3. For any Moore machine environment E, there exists a perfect SDE Moore Machine model M whose model states are represented by no more than |Q| outcome sequences (model states) generated by no more than |Q| − 1 SDEs, each with no more than |Q| − 1 actions.
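The splitting loop at the core of Algorithm 21 can be sketched as a plain partition refinement over environment states. The sketch below is a compressed, hypothetical rendering (function and variable names are mine): it recovers only the final qToM partition, not the SDE set S or the outcome sequences, and it assumes the environment is given as explicit dictionaries.

```python
def refine(states, actions, obs, step):
    """Partition-refinement sketch of Convert's main loop: start with one
    model state per observation, then split a model state whenever two of
    the environment states it covers land (under some action) in different
    model states.  This collapses Algorithm 21's bookkeeping (uM, uMToQ,
    SDE extraction) into the bare splitting logic."""
    q_to_m = {q: obs[q] for q in states}      # initial partition by observation
    changed = True
    while changed:
        changed = False
        for m in set(q_to_m.values()):
            covered = [q for q in states if q_to_m[q] == m]
            for a in actions:
                # group covered states by the model state they transition into
                groups = {}
                for q in covered:
                    groups.setdefault(q_to_m[step(q, a)], []).append(q)
                if len(groups) > 1:           # Algorithm 21's line-23 test
                    for target, qs in groups.items():
                        for q in qs:          # refine: split m on (a, target)
                            q_to_m[q] = (m, a, target)
                    changed = True
                    break
            if changed:
                break
    return q_to_m
```

Each split strictly grows the partition, which is bounded above by |Q| blocks, so the loop terminates, mirroring the iteration bound in Lemma 3.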
This theorem implies that any Moore machine environment E can be represented by a perfect SDE Moore machine model M (an sPOMDP with deterministic transition and observation functions) no larger than the minimal representation of E. As Shen proves in [142], concatenating these SDEs (the set S) together, ordered by their creation time, forms a homing sequence in the environment, since each SDE is a separating sequence between at least two visibly equivalent states. We now proceed to extend these results to rewardless αε-POMDPs and sPOMDP models.

Theorem 4. For any rewardless αε-POMDP environment E in which α > 1/|Q| and ε > 1/|O|, there exists a perfect sPOMDP model M where |M| ≤ |Q|, |S| ≤ |Q| − 1, and each SDE s ∈ S has, at most, |Q| − 1 actions, where |Q| is the number of environment states in the minimal representation of E.

In Section 6.1, we discussed how an αε-POMDP environment could be created by softening the transition and observation functions of a deterministic Moore machine environment. If the Moore machine that was softened into an αε-POMDP is minimal, then we say that the resulting αε-POMDP is minimal. Theorem 4 implies that any rewardless αε-POMDP E can be represented by a perfect sPOMDP model M (in the sense described in Definition 9 of Section 7.2.2) such that |M| ≤ |Q|, |S| ≤ |Q| − 1, and each SDE s ∈ S has no more than |Q| − 1 actions (which results in model states with outcome sequences with no more than |Q| observations), where |Q| is the number of environment states in the minimal representation of αε-POMDP environment E.

We again proceed by construction. Algorithm 21 detailed the Convert procedure. By changing a few lines, this algorithm, along with its computational complexity analysis, can be extended to rewardless αε-POMDP environments and sPOMDP models. First, we must change E.resultingState(q, a) in line 16 of Algorithm 21 to E.mostLikelyState(q, a), which returns the environment state q' for which P(q' | q, a) = α.
Second, we change E.observation(q) in line 3 of Algorithm 21 to E.mostLikelyObservation(q), which now returns the observation most likely to be emitted in environment state q (i.e., the o ∈ O for which P(o | q) = ε). Third, we replace line 41 in Algorithm 21 with the procedure in Algorithm 22 for computing T and φ, which we return instead of δ and φ, respectively. The remainder of these algorithms can be left unchanged. The following result clearly follows directly from Lemma 3:

Corollary 1. The Convert algorithm (Algorithm 21) extended to rewardless αε-POMDP environments E terminates in O(|Q|^4 |A|) time with an sPOMDP model M with no more than |Q| model states (|M| ≤ |Q|) and no more than |Q| − 1 SDEs (|S| ≤ |Q| − 1), with each SDE having no more than |Q| − 1 actions and each outcome sequence associated with each model state having no more than |Q| observations.

We now proceed to prove that M is a perfect model of E.

Lemma 6. Let {q_m} denote the set of environment states covered by model state m. If transition probabilities T and emission probabilities φ are calculated according to Algorithm 22 and, for every pair (m, a) ∈ M × A in model M and for all q ∈ {q_m}, q_a^max = argmax_{q' ∈ Q} P(q' | q, a) is covered by m_a^max = argmax_{m' ∈ M} P(m' | m, a), then M is a perfect model of environment E.

Proof. Note that, by construction, the Convert procedure (Algorithm 21) explicitly ensures that the model returned satisfies the unique most likely transition property: for every pair (m, a) ∈ M × A in model M and for all q ∈ {q_m}, q_a^max = argmax_{q' ∈ Q} P(q' | q, a) is covered by m_a^max = argmax_{m' ∈ M} P(m' | m, a).
We assume that E ≡ M (i.e., the model and environment are stochastically visibly equivalent, Definition 8, Section 7.2.2) and prove that, each time the agent executes action a at time t and observes o at time t + 1 in M and E for any a ∈ A and o ∈ O, the updated beliefs b_{E,t+1} and b_{M,t+1} will maintain the property that E ≡ M at time t + 1, if transition probabilities are assigned to M according to the quantities calculated in lines 5 and 10 of Algorithm 22. The easiest way to ensure that E ≡ M is by setting all model states m ∈ M to be equally likely (with probability 1/|M|) and then partitioning the probability 1/|M| amongst the underlying environment states q ∈ {q_m} uniformly, for each m ∈ M. This will prove that the unique most likely transition property will continue to hold regardless of where the agent transitions and what it sees. This ensures that marginalizing out the environment state variable Q or model state variable M always results in the same probability for o given the agent's history, and we can use this argument inductively to assert that E and M will remain stochastically visibly equivalent regardless of the actions and observations encountered by the agent.

We now apply the traditional Bayesian belief update procedure to incorporate action a and observation o. In the model, M, it is:

    b_{M,t+1}(m') = P(o | m') Σ_m P(m' | m, a) b_{M,t}(m)    (A.23)

Equation A.23 can be expressed equivalently in terms of the underlying environment states as follows:

    b_{E,t+1}({q_{m'}}) = P(o | {q_{m'}}) Σ_m P({q_{m'}} | {q_m}, a) b_{E,t}({q_m})    (A.24)

By the construction of the sPOMDP model, for all m' ∈ M, P(o | {q_{m'}}) is identical for all q ∈ {q_{m'}} and all o ∈ O. It will be ε if o is visibly equivalent to the first observation in the outcome sequence of m', and (1 − ε)/(|O| − 1) otherwise. This defines φ for all m' ∈ M and o ∈ O. Thus, P(o | {q_{m'}}) = P(o | m') for all m' ∈ M and o ∈ O.
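Equation A.23 is the standard Bayesian filtering step over model states. The sketch below adds the normalization that A.23 leaves implicit, so the result is a proper distribution; the dictionary layouts for T and φ are my own assumptions, not the thesis code.

```python
def belief_update(b, a, o, T, phi, model_states):
    """One Bayesian filtering step in the model M (cf. Eq. A.23):
    fold action a and observation o into the belief b over model states."""
    b_new = {}
    for m2 in model_states:
        pred = sum(T[(m, a, m2)] * b[m] for m in model_states)  # prediction step
        b_new[m2] = phi[(m2, o)] * pred                         # correction step
    z = sum(b_new.values())   # P(o | history, a); normalizing constant
    return {m: v / z for m, v in b_new.items()}
```

The same function, fed the environment's transition and emission probabilities instead, performs the update of Equation A.24, which is what makes the side-by-side comparison in the proof possible.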
Define

    α_{m_a^max} ≜ α + ((1 − α)/(|Q| − 1)) · (|{q_{m_a^max}}| − 1)    (A.25)

This is clearly the probability of transitioning from one of the environment states covered by m to one of the environment states covered by m_a^max, the most likely model state to be transitioned into from m under action a. Since q_a^max (the most likely environment state to be transitioned into from state q under a) is guaranteed to be covered by m_a^max regardless of which q ∈ {q_m} the agent begins at (due to the unique most likely transition property), the probability that it makes the transition to some q' ∈ {q_{m_a^max}} is the probability of making the most likely environment-state-to-environment-state transition q → q_a^max, which is α, plus the probability that it transitions into any other environment state covered by m_a^max that is not q_a^max. The probability of transitioning into any q' ≠ q_a^max is (1 − α)/(|Q| − 1), and |{q_{m_a^max}}| − 1 of them are covered by m_a^max. Now, define:

    α_{m'} ≜ ((1 − α)/(|Q| − 1)) · |{q_{m'}}|    (A.26)

This is clearly the probability of transitioning from one of the environment states covered by m to one of the environment states covered by some m' ≠ m_a^max under action a. The probability of transitioning to any q not covered by m_a^max is equally likely, so the probability of transitioning to some m' ≠ m_a^max is proportional to the number of environment states it covers. Notice that the definitions of α_{m_a^max} and α_{m'} use both the fact that E is a rewardless αε-POMDP and that M satisfies the unique most likely transition property. α_{m_a^max} and α_{m'} may be distinct for distinct model states because they may cover different numbers of states. Thus, P({q_{m'}} | {q_m}, a) = P(m' | m, a) = α_{m_a^max} if m' = m_a^max and α_{m'} otherwise, defining T. This is true for all triples (m, a, m') ∈ M × A × M. Thus, for all m ∈ M and all q ∈ {q_m}, b_{M,t}(m) and b_{E,t}(q) will be multiplied by the same transition and observation probabilities.
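As a sanity check (this short derivation is mine, not in the original text), the outgoing probabilities defined by A.25 and A.26 sum to one. Since the covering sets {q_{m'}} partition Q, we have Σ_{m' ≠ m_a^max} |{q_{m'}}| = |Q| − |{q_{m_a^max}}|, and therefore:

```latex
\begin{aligned}
\alpha_{m_a^{\max}} + \sum_{m' \neq m_a^{\max}} \alpha_{m'}
  &= \alpha + \frac{1-\alpha}{|Q|-1}\Bigl(\bigl|\{q_{m_a^{\max}}\}\bigr| - 1\Bigr)
     + \frac{1-\alpha}{|Q|-1}\Bigl(|Q| - \bigl|\{q_{m_a^{\max}}\}\bigr|\Bigr) \\
  &= \alpha + \frac{1-\alpha}{|Q|-1}\,\bigl(|Q| - 1\bigr) \;=\; 1.
\end{aligned}
```

So each row T(m, a, ·) is a valid probability distribution, as the proof requires.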
Since b_{M,t}(m) = Σ_{q ∈ {q_m}} b_{E,t}(q) (they represent mutually exclusive possibilities), it follows that b_{M,t+1}(m) = Σ_{q ∈ {q_m}} b_{E,t+1}(q) for all m ∈ M at time t + 1. We also have that, for all o ∈ O and m ∈ M, P(o | m) = P(o | q) for all q ∈ {q_m}. This implies that, for all o ∈ O, Σ_q P(o | q) b_{E,t+1}(q) = Σ_m P(o | m) b_{M,t+1}(m). In other words, E ≡ M at time t + 1. Since E ≡ M at time t + 1 and b_{E,t+1} and b_{M,t+1} uphold the unique most likely transition property regardless of the action a the agent took and the observation o that it saw, we can use this argument inductively to assert that, regardless of the agent's actions and observations, E and M will remain stochastically visibly equivalent for all future time steps. Thus, M is a perfect model of E and the result is proved.

Lemma 7. The Convert algorithm (Algorithm 21) extended to rewardless αε-POMDP environments terminates in polynomial time with an sPOMDP model M that is a perfect model of input rewardless αε-POMDP environment E.

Proof. We again assume that E is the minimal representation of the given αε-POMDP environment, using the same justification as we did in the proof of Theorem 3. Lemma 6 proves that if sPOMDP model M has the unique most likely transition property, it is a perfect model of rewardless αε-POMDP environment E. The Convert procedure (Algorithm 21) explicitly tests for this property (line 23) and does not terminate until it is met. Corollary 1 proves that the Convert procedure applied to αε-POMDP environments terminates in a finite amount of time (polynomial in the size of the underlying environment state space and the number of actions). Together, these lemmas prove that the Convert procedure terminates in polynomial time with an sPOMDP model M such that |M| ≤ |Q|, |S| ≤ |Q| − 1, and each SDE s ∈ S has no more than |Q| − 1 actions (which results in model states with outcome sequences with no more than |Q| observations), where |Q| is the number of environment states in the minimal representation of αε-POMDP environment E.
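The aggregation identity at the heart of this argument, b_{M,t}(m) = Σ_{q ∈ {q_m}} b_{E,t}(q), can be checked numerically on a small hand-built instance. All numbers below are a toy example of my own (α = 0.8, ε = 0.9, three environment states, two model states), not taken from the thesis; beliefs are kept unnormalized, since both filters are scaled by the same constant.

```python
def update(b, P_next, P_obs, o):
    """Unnormalized Bayes step: b'(s) = P(o|s) * sum_{s0} P(s|s0) b(s0)."""
    return {s: P_obs[s][o] * sum(P_next[s0][s] * b[s0] for s0 in b)
            for s in P_next}

# Toy environment: Q = {q0,q1,q2}, one action, alpha = 0.8, epsilon = 0.9.
# Most likely transitions: q0->q1, q1->q0, q2->q2; q0,q1 emit 'x', q2 emits 'y'.
alpha, eps = 0.8, 0.9
ml = {'q0': 'q1', 'q1': 'q0', 'q2': 'q2'}
P_next_E = {q: {q2: (alpha if q2 == ml[q] else (1 - alpha) / 2) for q2 in ml}
            for q in ml}
P_obs_E = {'q0': {'x': eps, 'y': 1 - eps}, 'q1': {'x': eps, 'y': 1 - eps},
           'q2': {'x': 1 - eps, 'y': eps}}

# Model: mA covers {q0,q1}, mB covers {q2}; T follows Eqs. A.25 and A.26.
P_next_M = {'mA': {'mA': 0.9, 'mB': 0.1}, 'mB': {'mA': 0.2, 'mB': 0.8}}
P_obs_M = {'mA': {'x': eps, 'y': 1 - eps}, 'mB': {'x': 1 - eps, 'y': eps}}

bE = {'q0': 0.25, 'q1': 0.25, 'q2': 0.5}   # uniform over model states, split
bM = {'mA': 0.5, 'mB': 0.5}                # uniformly inside each covering set
for o in ['x', 'y', 'x']:                  # any observation sequence works
    bE = update(bE, P_next_E, P_obs_E, o)
    bM = update(bM, P_next_M, P_obs_M, o)
    assert abs(bM['mA'] - (bE['q0'] + bE['q1'])) < 1e-12
    assert abs(bM['mB'] - bE['q2']) < 1e-12
```

The assertions hold at every step, exactly as Lemma 6 predicts for a covering that satisfies the unique most likely transition property.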
This completes the proofs of Lemma 7 and Theorem 4.

A.3 Evaluation of the Convert procedure

In this section, we present our experimental results validating the Convert algorithm (Algorithm 21) and its extensions to rewardless αε-POMDP environments (Algorithm 22). Please see Appendix A.2 for more details regarding these algorithms. Figure A.1 provides the results of using Algorithms 21 and 22 on increasingly large, random (minimal) αε-POMDP environments with varying numbers of hidden states. In Figure A.1 (a), we illustrate how the runtime of the Convert procedure increases as the size of the environment increases and as the number of hidden states increases. Each line represents a different ratio of observations |O| to environment states |Q|. For example, the green line represents random αε-POMDP environments in which the number of observations was 1/16th the number of environment states. Each observation was most likely in a random number of environment states, with the only restriction being that each observation had to be the most likely one in at least one environment state. Each data point is an average over 100 random αε-POMDP environments. α and ε were chosen uniformly at random from their allowable ranges of (1/|Q|, 1] and (1/|O|, 1], respectively.

[Figure A.1: Experimental results for the Convert procedure (Algorithm 21) and its extensions to αε-POMDPs in Algorithm 22 on increasingly large, random (minimal), rewardless αε-POMDP environments with varying numbers of hidden states. (a) Runtime (seconds). (b) Longest outcome sequence.]

Figure A.1 (b) illustrates the average length of the longest model state outcome sequence for the same random αε-POMDP environments as in Figure A.1 (a). Since we require that |O| > 1, environments in which |O| = |Q|/4 (red) could only be constructed with |Q| ≥ 8. Likewise, environments with |O| = |Q|/8 (yellow) required that |Q| ≥ 16, and environments with |O| = |Q|/16 (green) required that |Q| ≥ 32.
This is why some elements appear to be missing in the bar chart of Figure A.1 (b). |A| was set to 2 in all of these experiments in an attempt to force the outcome sequences of the learned model states to be as large as possible (thereby more directly testing our theoretical bounds). The most important takeaways from these results are: 1) average runtime is clearly O(|Q|^4 |A|), which is consistent with our theoretical analysis (see Appendix A.2); 2) the average longest outcome sequence in the model is always far below the provable upper bound of length |Q| (again see Appendix A.2), and no outcome sequence ever exceeded this theoretical upper bound in any trials; 3) the agent always learns exactly the same number of model states |M| as environment states |Q|, which is expected because the environments are guaranteed to be minimal αε-POMDP environments; 4) as the ratio of environment states to observations increased, the average maximum outcome sequence length also increased, as expected. The reason this is to be expected is that increasing this ratio increases the average number of environment states covered by the same most likely observation, and we should expect that longer outcome sequences will, in turn, be required to statistically distinguish between them. This causes a corresponding increase in runtime, as observed in Figure A.1 (a). Though not shown in the figures, the model error of each learned model (see Equation 6.4) was 0.0 over 10000 timesteps. In other words, empirically, these models induced exactly the same probability distributions over observations at each time step as the ground-truth environment. This is in accordance with our theoretical analysis in Appendix A.2.

In Figure A.2, we present additional experimental validation of the Convert procedure (Algorithm 21) and its extensions to rewardless αε-POMDP environments in Algorithm 22 on increasingly large, random (minimal) deterministic Moore machine and αε-POMDP environments.
|A| was (again) set to 2 and |O| = |Q|/2. Each observation covered a random number of environment states, with the only restriction being (again) that each observation be most likely in at least one environment state. The blue bars represent random deterministic Moore machine environments (that is, rewardless αε-POMDPs with α = ε = 1). The red bars represent random αε-POMDPs in which α and ε are chosen uniformly at random from their allowable ranges of (1/|Q|, 1] and (1/|O|, 1], respectively.

[Figure A.2: Experimental results for the Convert procedure (Algorithm 21) on increasingly large random (minimal) αε-POMDPs in which |O| = |Q|/2 and |A| = 2. In these environments, each observation covered a random number of environment states, with the only restriction being that each observation must be emitted as the most likely one in at least one state. The blue bars represent randomly constructed Moore machine environments (α = ε = 1). The red bars represent random αε-POMDPs in which α and ε were randomly chosen uniformly from their allowable values. Subfigure (a) provides runtime results. Subfigure (b) provides the average length of the longest outcome sequence of any state m ∈ M in the agent's model. Subfigure (c) provides the average number of model states |M| learned by the agent.]

Figure A.2 (a) presents a comparison of average runtime results. Figure A.2 (b) presents a comparison of the average length of the longest outcome sequences in the learned models. Figure A.2 (c) compares the sizes (number of states, |M|) of the learned models. All data points are averages over 100 random environments. In these results, we again see that runtimes are clearly upper bounded according to our worst-case theoretical analyses (Figure A.2 (a)) and that the average length of the longest outcome sequence is far below the theoretical upper bound (Figure A.2 (b)). Additionally, no outcome sequence ever exceeded our theoretical bounds in any trial.
The agent always inferred exactly the same number of model states |M| as there were environment states |Q| (Figure A.2 (c)). We see minimal variation between the performance of the algorithm on deterministic Moore machine environments and αε-POMDP environments (except for the largest environments tested). This is expected behavior, because the algorithm for both is essentially the same (and the extra processing of transition and observation probabilities in Algorithm 22 required in αε-POMDP environments is extremely minimal for all but the largest environment sizes tested). Figures A.1 and A.2 provide us with a strong independent verification of the theoretical analyses detailed in the previous section.

Algorithm 21: Convert
Input: E: minimal Moore machine environment
Output: (M, S, δ, φ): SDE Moore Machine model of E
 1  mToQ ← {}; qToM ← {}; M ← []; S ← []
 2  foreach q ∈ E.states do
 3      o ← E.observation(q)    /* What does the agent see in q? */
 4      m ← new MState([o])
 5      if m ∉ M then
 6          M.add(m)
 7      mToQ[m].add(q)
 8      qToM[q] ← m
 9  change ← true
10  while change do
11      change ← false
12      foreach m ∈ M do
13          foreach a ∈ E.actions do
14              uM ← {}; uMToQ ← {}; uQToM ← {}
15              foreach q ∈ mToQ[m] do
16                  m' ← qToM[E.resultingState(q, a)]
17                  f ← m.firstObservation()
18                  newM ← new MState([f + a + m'])
19                  if newM ∉ uM then
20                      uM.add(newM)
21                  uMToQ[newM].add(q)
22                  uQToM[q] ← newM
23              if uM.size() > 1 then
24                  change ← true
25                  Find a pair u1 ≠ u2 ∈ uM identical up to a first difference in observation
26                  s ← actions shared by u1 and u2 up to the first differing observation
27                  foreach u ∈ {u1, u2} do
28                      M.add(u)
29                      for q ∈ uMToQ[u] do
30                          mToQ[u].add(q)
31                          mToQ[m].erase(q)
32                          qToM[q] ← u
33                  if mToQ[m].empty() then
34                      mToQ.erase(m)
35                      M.erase(m)
36                  S.add(s)
37              if change then
38                  break
39          if change then
40              break
41  δ ← getTransitions(); φ ← getEmissions()
42  return M, S, δ, φ

Algorithm 22: ComputeTransitions
 1  foreach m ∈ M do
 2      foreach a ∈ E.actions do    /* Compute state transitions */
 3          q ← mToQ[m][0]
 4          m_a^max ← qToM[E.mostLikelyState(q, a)]
 5          α_{m_a^max} ← α + ((1 − α)/(|Q| − 1)) · (|mToQ[m_a^max]| − 1)
Algorithm 22 (continued):
 6          foreach m' ∈ M do
 7              if m' is m_a^max then
 8                  T[m][a][m'] ← α_{m_a^max}
 9              else
10                  α_{m'} ← ((1 − α)/(|Q| − 1)) · |mToQ[m']|
11                  T[m][a][m'] ← α_{m'}
12  foreach m ∈ M do
13      foreach o ∈ O do
14          if o == m.firstObservation() then
15              φ[m][o] ← ε
16          else
17              φ[m][o] ← (1 − ε)/(|O| − 1)
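Algorithm 22 maps directly onto code. The sketch below is a hedged rendering of its logic (dictionary layouts and parameter names are my own; the pseudocode above indexes T[m][a][m'] rather than using tuple keys):

```python
def compute_T_phi(model_states, m_to_q, q_to_m, most_likely_state,
                  first_obs, actions, observations, alpha, epsilon, n_states):
    """Sketch of Algorithm 22: fill in the sPOMDP transition probabilities T
    and emission probabilities phi from the covering found by Convert."""
    T, phi = {}, {}
    spread = (1 - alpha) / (n_states - 1)  # prob. of each non-most-likely q'
    for m in model_states:
        for a in actions:
            rep = m_to_q[m][0]             # any covered state works (line 3)
            m_max = q_to_m[most_likely_state(rep, a)]
            for m2 in model_states:
                if m2 == m_max:            # Eq. A.25
                    T[(m, a, m2)] = alpha + spread * (len(m_to_q[m_max]) - 1)
                else:                      # Eq. A.26
                    T[(m, a, m2)] = spread * len(m_to_q[m2])
        for o in observations:             # lines 12-17
            phi[(m, o)] = (epsilon if o == first_obs[m]
                           else (1 - epsilon) / (len(observations) - 1))
    return T, phi
```

By the sum-to-one argument following Equations A.25 and A.26, every row T(m, a, ·) and every emission row φ(m, ·) produced this way is a valid probability distribution.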