UNIVERSITY OF SOUTHERN CALIFORNIA

Emotional Appraisal in Deep Reinforcement Learning

by Tobi Akomolede

A thesis submitted in partial fulfillment for the degree of Master of Science in the Department of Computer Science

August 2019

Declaration of Authorship

I, Tobi Akomolede, declare that this thesis titled, 'Emotional Appraisal in Deep Reinforcement Learning', and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

To my dearest sister Toba and my good friend Victor McElhaney, both of whom departed us long before their time. May your memory continue to inspire us to change the world the way you did mine.

UNIVERSITY OF SOUTHERN CALIFORNIA

Abstract

Department of Computer Science, Master of Science

by Tobi Akomolede

Reinforcement learning algorithms aim to find a policy that gives the best action to perform in each state such that some expected future extrinsic reward is maximized. Existing work combining emotion and reinforcement learning has been concerned with using emotion as either a part of the agent state or as an intrinsic reward signal. Very little work has been concerned with reinforcing the emotional appraisal variables themselves, i.e., predicting the expected future appraisal intensities associated with states. Additionally, much of the existing literature on appraisal dimensions as intrinsic reward signals is not applied in the deep reinforcement learning domain. In this thesis, I propose an approach to deep reinforcement learning that, given an observation, prior appraisal, goal, and action, predicts the expected future appraisals as well as the extrinsic values. The model is evaluated in a Pac-Man environment and it is shown that appraisal information aids generalization to unseen maze layouts.

Contents

Declaration of Authorship
Abstract
1 Introduction
2 Background and Related Work
  2.1 Background: Appraisal Theory of Emotion
    2.1.1 Background: Process Models of Appraisal
    2.1.2 Related Work: Computational Models of Emotion
  2.2 Background: Deep Reinforcement Learning
    2.2.1 Related Work: Predictions on Unstructured Data using Deep Neural Networks
  2.3 Emotions and Appraisal in Reinforcement Learning
3 The SWAGR Framework
  3.1 SWAGR Framework
  3.2 Appraisal Variables
    3.2.1 Surprise
    3.2.2 Relevance
    3.2.3 Desirability
    3.2.4 Coping Potential
  3.3 Environment Model
    3.3.1 Rollout Policy
  3.4 Training
4 Pac-Man Experiment 1
  4.1 Setup
  4.2 Results
    4.2.1 Performance: Layout A
    4.2.2 Performance: Layout B
    4.2.3 Performance: Layout C
  4.3 Analysis of Reinforced Appraisal Values
    4.3.1 Heatmap Analysis: Desirability
    4.3.2 Heatmap Analysis: Relevance
    4.3.3 Heatmap Analysis: Coping Potential
    4.3.4 Heatmap Analysis: Surprise
    4.3.5 Heatmap Analysis: Conclusion
5 Pac-Man Experiment 2
  5.1 Setup
  5.2 Results
  5.3 Qualitative Analysis
6 Discussion
A SWAGR Neural Architecture
B Environment Model Network and Training Details
C Additional t-SNE Visualizations
Bibliography

Chapter 1

Introduction

Appraisal theorists posit that the emotions humans experience are a result of an individual's appraisal of the environment around them in relation to their goals. Different dimensions of appraisal combine to form different emotions, and these emotions help humans adapt to their environment [1, 2]. Researchers studying the intersection of Affective Computing and reinforcement learning have proposed using appraisal as an intrinsic reward signal to boost agent performance. Appraisal-based intrinsic reward signals have been shown to increase sample efficiency [3] and improve adaptation to changing environment dynamics [4, 5], owing to the denser reward information they provide. Commonly in intrinsically motivated reinforcement learning, the extrinsic reward signal from the environment and the intrinsic reward from the internal critic are combined linearly to form a composite reward, which is then used to learn the expected value of each state [3, 4, 6, 7]. However, once the agent is done training, the component appraisal variable intensities that make up the expected values are lost.

There are many reasons why one might be interested in retaining the appraisal variable intensities associated with states. One reason is that visualizing the appraisal intensities associated with states can give better insight into how the agent has learned to perceive the environment. Retaining the appraisal intensities also enables the construction of agents that dynamically change their goals during evaluation based on their most recently experienced appraisal patterns (i.e., emotions). One might also be interested in finding new environments that elicit specific appraisal patterns in the trained agent.
For example, if one has trained an agent to simulate the player of an online video game, the appraisal associations of that agent can be used to find new environments that maximize surprise or coping potential and thus could keep the player interested and engaged.

Another reason for retaining the appraisal intensities associated with states has to do with computing time. Some appraisal variables are computationally expensive to compute. For example, coping potential and causal attribution can require searching over the state space of an environment. In many domains, states are frequently revisited by the agent. Likewise, the agent may visit a new state that, except for some noise, is the same as a previously visited state. It is undesirable to have to recalculate the appraisal variables upon each visit to a state that has already been seen or is very similar to one that has already been seen.

From a theoretical perspective, it is commonly hypothesized that individuals learn to associate states with appraisals [8-13]. Thus the need for a reinforcement learning framework that predicts the expected future appraisal variable intensities associated with states, rather than their composite value, is both theoretically and pragmatically motivated.

Furthermore, while individual motivations are at the core of appraisal theory [2], these intrinsic reward-based approaches to reinforcement learning lack an explicit parameterization of the agent's goals or motivations. Because of this, the agent's goals and resultant behavior are immutable, fixed to the dynamics of a scalar reward function. Finally, much of the work modeling emotion and appraisal in reinforcement learning has been done in domains with well-structured state representations and/or tabular expected state value representations. Deep reinforcement learning [14] often works on unstructured state representations, such as pixel data from video games or camera feeds.

This thesis proposes an approach that addresses these challenges. In this thesis, I describe SWAGR¹, a framework for reinforcement learning agents that integrates the current state, appraisal, goal, and reward to predict expected future appraisals and rewards. The primary focus of this framework is not on maximizing quantitative performance, but on the successful integration of appraisal with sensory stimuli and goal information. The framework is applied in a video game setting on raw pixel data. Experimentation is performed to answer the following research questions: Is it possible to train a deep reinforcement learning model to associate states and goals with appraisals? Can a model-free agent successfully predict appraisal variables derived from models of the environment? If such an agent is successfully trained, how does its task performance compare to that of an agent trained without appraisal information?

¹ The name SWAGR comes from the symbols used to represent its inputs. Appraisal is represented by w, as emotions are commonly associated with wellbeing, and a is used to represent actions.

Chapter 2

Background and Related Work

2.1 Background: Appraisal Theory of Emotion

In psychology, appraisal theory is a theory of emotion that posits that emotions in a person are formed from the evaluation of that person's relationship with the environment with respect to their motivations or goals (Scherer et al. [15]).
The evaluation of the person-environment relationship is known as appraisal, and it occurs over individual events or episodes (Lazarus [1]).

An appraisal is composed of different appraisal variables (also known as appraisal dimensions or appraisal components) (Lazarus [1], Smith et al. [2]). The set of appraisal variables and their definitions vary from researcher to researcher, but commonly include relevance, desirability, and coping potential. For example, if a woman came home to find her dog missing, the appraisal of this event would comprise high motivational relevance and low desirability. If the woman's neighbor later told her that they recently saw the dog down the street, the woman would appraise this event as having high motivational relevance and low desirability, but with a high potential for coping. The appraisal variables in this example are relevance, desirability, and coping potential.

Different combinations of appraisal variables combine to form different emotions (Smith et al. [2], Lazarus [1]). In the scenario above, the first event's appraisal of high relevance and low desirability would result in the woman feeling anxiety. The second event, in which new information is produced by the neighbor, would result in a feeling of hope, due to the high coping potential. All of this assumes the woman has a goal of retaining her dog.

2.1.1 Background: Process Models of Appraisal

Theorists have described a variety of models with which to describe the process of appraisal. Lazarus [1] and Smith et al. [2] state there are two stages of appraisal: primary and secondary appraisal. Primary appraisal variables include relevance and desirability and describe the perceived impact on the person's wellbeing. Secondary appraisal variables constitute an assessment of a person's ability to act in a manner that maintains or improves upon their wellbeing, e.g., problem-focused coping potential.

Scherer's component process model [9, 16] breaks appraisal into four assessments: (1) how relevant is this event to the individual, (2) what are the implications of this event, (3) to what extent can the individual cope with this event, and (4) how does this event relate to the individual's ego and social values. These four assessments are made sequentially in a fixed order, with processing at later phases terminating early depending on the results of prior assessments.

Kirby and Smith [8] suggest a dual process model of appraisal comprising an associative component that forms associations between stimuli and appraisal, and a reasoning component that slowly and deliberately constructs appraisal meanings from stimuli and goals.

Common criticisms of the aforementioned process models are that they are too slow to account for the observed rapid onset of emotion [16], that they unnecessarily complicate the appraisal process [17], and that it is difficult to draw an unequivocal distinction between rule-based and associative mechanisms [18]. Moors [10] disregards the distinction between associative and rule-based appraisal, arguing for a functional view in which appraisal is both automatic and constructive. Automatic in this sense means that it is a process that can occur under suboptimal conditions (e.g., limited time) or unintentionally. Constructive in this sense means that the process integrates both stimuli and goals to produce the appraisal.
2.1.2 Related Work: Computational Models of Emotion

The Affective Computing literature contains a large number of computational models that attempt to describe and simulate the process of appraisal and the resultant emotions in humans. While this thesis will only cover those serving as the basis for the proposed model, confer Marsella et al. [19] for a more complete history of computational models of emotion in the literature.

Gratch and Marsella [20] propose a domain-independent framework for modeling emotions in virtual humans. This framework, EMA, is built on top of the Soar (Newell [21]) cognitive architecture, and makes use of a structure called a causal interpretation to represent the intersection between the person's environment, beliefs, desires, and intentions. EMA evaluates appraisal variables in parallel (using the causal interpretation) and supports both problem-focused and emotion-focused coping strategies in response. Appraisal intensity is described by an expected utility model which has been shown to be supported by human behavior in empirical results (Gratch et al. [22]). The appraisal variables used in this system are perspective, relevance, desirability, likelihood, causal attribution, controllability, and changeability. Appraisals are mapped to emotions using appraisal frames. Emotions in EMA are used to support coping and facial expressions as well as to inform natural language disambiguation, planning, and other systems in virtual human simulations.

Related to EMA, Thespian [23, 24] models social emotions and interactions in partially observable, multi-agent environments. Instead of using a causal interpretation, Thespian computes appraisal for an agent using "mental models" of other agents in the model as well as utility. Thespian utilizes five appraisal variables [23]: relevance, desirability, novelty, control, and accountability. The formulation given for novelty refers to the discrepancy between what the agent expected and what actually occurred; thus, a more accurate name for this appraisal variable would be surprise (cf. discussions in Barto et al. [25]). To assess control, the agent utilizes its set of mental models regarding other agents to look ahead a fixed number of timesteps into the future. The final control value is determined by the sum of expected utilities for each trajectory (i.e., each combination of mental models). Appraisals in Thespian are used to model the emotional experiences of agents in simulated social situations.

PsychSim [26, 27], which serves as the basis for Thespian, models social interactions amongst multiple agents using POMDPs. The reward function is separated into components corresponding to subgoals. Agent $i$'s preferences are represented by the goal vector $\vec{g}_i$. Given agent $i$'s beliefs regarding the state of the world, $\vec{b}_i^t$, the agent selects the action that maximizes the expected reward given its current preferences:

$$V_a(\vec{b}_i^t) = \vec{g}_i \cdot \vec{b}_i^t + \sum_{\vec{b}_i^{t+1}} V(\vec{b}_i^{t+1})\, P(\vec{b}_i^{t+1} \mid \vec{b}_i^t, a, \vec{\pi}_{\neg i}(\vec{b}_i^t)) \quad (2.1)$$

It is important to note that PsychSim does not feature appraisals or emotions. However, the goal parameterization it utilizes is similar to the system proposed in this thesis.

El-Nasr et al. [28] propose a system, FLAME, which utilizes fuzzy logic for event evaluation and subsequent appraisal. The system's learning component utilizes reinforcement learning to predict the expected utilities of events.
The emotion component uses fuzzy rules to describe the desirability of events (in terms of fuzzy set membership) based on their impact on individual goals. The expected utilities and the desirabilities are combined in fuzzy rules to determine the resulting emotions, and the agent's behavior is determined by the set of emotions. FLAME uses emotions to control a virtual pet's expression, and these are graphically communicated in an agent-human interactive environment.

2.2 Background: Deep Reinforcement Learning

Deep reinforcement learning is the application of deep neural networks to reinforcement learning. Traditionally in reinforcement learning, a table is used to predict the expected utility of state-action pairs. Mnih et al. [14], in their foundational work, replace this table with a deep neural network. Training the agent to play Atari games, the network contains convolutional neural networks [29] for feature learning directly from the unstructured input (i.e., pixel data). This differs from many table-based approaches, which often use a structured representation of the state in learning. The authors used experience replay [30] rather than online learning, citing data efficiency gains and a broader behavioral range to learn from. The authors were able to reach and surpass human performance on several Atari games, showing that their network structure is domain-independent and cementing many of the common methodologies used in modern reinforcement learning literature.

2.2.1 Related Work: Predictions on Unstructured Data using Deep Neural Networks

Many works that extend reinforcement learning, particularly for the purpose of supporting appraisal or emotions, can do so because their work meets two criteria: (1) their state representations are well structured, and (2) it is feasible to model the environment dynamics. For example, [27] features well-structured models of other agents in the environment, which facilitates the look-ahead process needed to compute the control appraisal. In [31], the unexpectedness term, used to calculate joy, can be computed directly from a model of state-transition probabilities. In deep reinforcement learning scenarios, states often have continuous rather than discrete representations. This renders any count- or table-based approach infeasible (though progress has been made using pseudo-counts obtained from neural density models, cf. [32]). Additionally, it is unlikely that many domains with unstructured data have pre-existing models available for performing look-ahead or Monte Carlo Tree Search. The works described below address these challenges in a deep reinforcement learning setting.

Dosovitskiy and Koltun [33] describe a framework that generalizes the extrinsic reward signal into a vector-valued measurement set $m_t$. The measurement set could include health, ammunition, and frags in a first-person shooter video game. The framework also defines a goal vector $g$ with the same dimensionality as $m_t$ that describes the importance of each measurement. The model takes as input the state $s_t$, the current measurement $m_t$, the goal vector $g$, and an action $a$ to predict $f$, the future value of each measurement across multiple time intervals. The objective function $u(f; g)$ gives the value of a state in terms of the goal vector and the predicted future measurements:

$$u(f; g) = g^{\top} f = g \cdot f \quad (2.2)$$

This is similar to Equation 2.1, defined in [26].
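To make the measurement-and-goal formulation concrete, the sketch below shows goal-weighted action selection in the style of Equation 2.2. It is a minimal illustration: `predict_f` and the argument names are hypothetical stand-ins for the learned predictor, not part of the code released with [33].

```python
import numpy as np

def utility(f: np.ndarray, g: np.ndarray) -> float:
    """Goal-weighted utility u(f; g) = g . f over predicted future measurements."""
    return float(np.dot(g, f))

def select_action(predict_f, s_t, m_t, g, actions):
    """Pick the action whose predicted future measurements maximize u(f; g).

    predict_f(s_t, m_t, g, a) stands in for the learned predictor and returns
    a vector of predicted future measurement values for action a.
    """
    return max(actions, key=lambda a: utility(predict_f(s_t, m_t, g, a), g))
```

Because the goal vector enters only through the dot product, the same trained predictor can be evaluated under different goals at test time, which is the property the thesis later exploits.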
Thus this work makes predictions regarding a subset of the state using the current state and proposed actions.

Weber et al. [34] give a domain-independent approach to environment modeling in deep reinforcement learning. The motivation for this work was to combine the benefits of model-free and model-based reinforcement learning. The authors train a model-free policy network and additionally pretrain an environment model, which takes as input a state-action pair and predicts the next state and subsequent reward. Rather than perform Monte Carlo Tree Search, which is common in the model-based reinforcement learning literature, the model performs a rollout for each action available to the agent. Each rollout step is produced by the environment model, and a rollout policy network is used to determine the subsequent action. The outputs from the rollouts are fed through an LSTM-based encoder and are concatenated with the output from the model-free policy network. [35] expands on this by creating an environment model that learns compact state representations with which to make predictions, rather than predicting raw pixels.

2.3 Emotions and Appraisal in Reinforcement Learning

The following works directly model emotion and/or appraisal using reinforcement learning. This differs from some of the literature in Section 2.1.2, which utilized reinforcement learning to support emotion and appraisal in other parts of the model. While this thesis will only cover literature serving as the basis for the proposed model, confer Moerland et al. [36] for a more complete survey of works integrating emotion with reinforcement learning.

Table 2.1: An overview of the systems discussed in this chapter, listing the common appraisal variables present in each system (surprise, relevance, desirability, control), whether the system generates emotions, and the function that appraisal and/or emotion serves.

- EMA: relevance, desirability, control; generates emotions; function: agent behavior, expression
- Thespian: surprise, relevance, desirability, control; function: modeling social interactions
- PsychSim: N/A
- FLAME: desirability; generates emotions; function: agent behavior, expression
- Castro-González et al. [37]: generates emotions (fear); function: agent behavior modification
- Moerland et al. [38]: desirability; generates emotions (hope, fear); function: socially communicative agents
- Broekens et al. [31]: surprise, desirability; generates emotions (joy, distress, hope, fear); function: modeling emotion dynamics
- Sequeira et al. [5]: surprise, relevance, control; function: intrinsic reward signal
- Pathak et al. [7]: surprise; function: intrinsic reward signal
- McDuff and Kapoor [3]: function: intrinsic reward signal

Castro-González et al. [37] propose a Q-Learning agent that has multiple dimensions of wellbeing as part of its internal state. One of these dimensions, fear, is associated with states where the agent has been damaged by external actors. A separate table $Q^{obj_i}_{worst}$ is maintained to store the worst experiences associated with actor $obj_i$ for a given state-action pair. This is because the overall Q-value of a state might be high (due to future expected reward) even if the agent can be damaged in it. The expected utility of states with a $Q^{obj_i}_{worst}$ worse than a predefined threshold is decremented. The function of fear in this system is to modify the agent's behavior such that it learns to avoid dangerous states.

Broekens et al. [31] propose a model-based reinforcement learning model of emotions, specifically the emotions of joy, distress, hope, and fear.
Hope and fear are calculated using the value of the state:

$$\mathrm{Hope}(s_t) = \max(V(s_t), 0) \qquad \mathrm{Fear}(s_t) = \max(-V(s_t), 0) \quad (2.3)$$

Joy and distress are modeled using the product of the value prediction error [39] and the transition "unexpectedness":

$$\mathrm{Joy}(s_{t-1}, a_{t-1}, s_t) = \big(r_t + \gamma V(s_t) - V(s_{t-1})\big)\big(1 - P(s_t \mid s_{t-1}, a_{t-1})\big) \quad (2.4)$$

It can be seen that $(r_t + \gamma V(s_t) - V(s_{t-1}))$ is the value prediction error, and $P(s_t \mid s_{t-1}, a_{t-1})$ is the "expectedness" of the transition. As "unexpectedness" measures the expectation-violation of the state transition, a better name for this term would be surprise. The justification for using the value prediction error is that the intensity of joy should decrease as the state is visited more often, i.e., habituation.

Moerland et al. [38] propose a model-based reinforcement learning agent that uses forward sampling to support the anticipatory emotions hope and fear. The authors' argument for the necessity of forward sampling is that it supports finding the specific states that are eliciting the agent's anticipatory emotions. This is in contrast to [31], which derives hope and fear from the current state value, which is accumulated over all possible future states. As the agent interacts with the environment, the environment dynamics model $P$ and the reward function $R$ are learned from sampled data. $P(s' \mid s, a)$ gives the probability of transitioning to state $s'$ from $s$ by performing action $a$. $R(s, a, s')$ gives the expected reward received upon transitioning to state $s'$ from $s$ by performing action $a$. To support planning, the agent makes use of Upper Confidence Bounds for Trees [40], a Monte Carlo Tree Search method [41]. Hope in state $s_0$ is given by the end state $s'$ of the trajectory with the highest expected desirability:

$$H(s_0) = \max_{s'} \big[ b(s' \mid s_0)\, d(s' \mid s_0) \big] \quad (2.5)$$

where $b(s' \mid s_0)$ refers to the likelihood of reaching state $s'$ from $s_0$ and $d(s' \mid s_0)$ refers to the discounted, expected reward received on the trajectory from $s_0$ to $s'$. Emotions in this work do not impact action selection, but are used to communicate to human observers the agent's evaluation of its current state.

In [5] and [4], the authors describe a framework for using appraisal variables as intrinsic reward signals [6, 42, 43] within reinforcement learning:

$$r_{total}(s, a) = g_{extrinsic}\, r_{extrinsic} + \sum_i g_i w_i \quad (2.6)$$

Here $g_i$ is the weight for the $i$th appraisal variable and $w_i$ is the value of the $i$th appraisal variable. The framework includes four appraisal variables: novelty, motivation, relevance, and control. The novelty of a state-action pair is inversely related to the number of times it has been visited. See [44] for a work that similarly uses appraisal as intrinsic reward.

Pathak et al. [7] propose using "curiosity" as an intrinsic reward feature in deep reinforcement learning. The framework has an Intrinsic Curiosity Module (ICM) containing two models: a forward dynamics model and an inverse dynamics model. The forward dynamics model takes as input the action $a_t$ and the encoded state $\phi(s_t)$, and outputs the anticipated next encoded state $\hat{\phi}(s_{t+1})$, where $\phi$ is the feature encoder. The inverse dynamics model takes as input the encoded state $\phi(s_t)$ and the next encoded state $\phi(s_{t+1})$ and predicts the action $\hat{a}_t$. The encoder $\phi$ is trained by jointly optimizing the inverse and forward dynamics models, and the error in the forward model prediction is
The error in the forward model prediction is given by: L F ((s t+1 ); ^ (s t+1 )) = 1 2 k ^ (s t+1 )(s t+1 )k 2 2 (2.7) Because curiosity here is dened as a the dierence between the expected ( ^ ()) state encoding and the actual (()) state encoding, this intrinsic reward signal could also be called surprise. Experiments show successful agent performance even when the agent is trained solely on surprise and not any extrinsic reward features. McDu and Kapoor [3] give a framework for using human aective response data for training agents in deep reinforcement learning, specically self-driving car agents. Volu- metric change in blood in the periphery of the skin is recorded as humans interact with a high delty driving simulation [45]. Changes in blood volume pulse wave form are shown to correspond to when a person is startled, fearful, or anxious [46]. The authors build and train a Reward Network that learns to predict the normalized pulse amplitude from training data collected on human physiological responses while driving. The output of the reward network is then used as an intrinsic reward while training a DQN-based agent to perform driving-related tasks. Using this intrinsic reward feature in conjunction with the extrinsic reward improved sample eciency and reduced the number of catastrophic failures in training. One side eect of the formulation of the total reward r total utilized by [3{5, 7] is that after training, the appraisal signals that constitute the total reward are lost and, if one wishes to inspect them, must be recalculated directly. If one is solely interested in utilizing appraisal to improve training sample eciency, this does not present a problem. However, appraisal variables have a communicative value to both human observers and other agents and can provide a deeper understanding of the agent's evaluation of the current state than a single scalar expected utility value. Additionally, reinforcement learning agents that retain appraisal values could be expanded to support emotion, coping, neuromodulation (confer Doya [47], Shi et al. [48]), and other processes. If one is interested in pursuing these aims within the intrinsically motivated framework of reinforcement learning, this would require recalculating the appraisal every time a state is revisited. Recalculating the appraisal values for the current time step alone is unideal if they are expensive to compute. Recalculating the expected future discounted appraisal values would be just as computationally tractable as nding a state's expected value via brute force, i.e., infeasible in all but the simplest environments. Chapter 3 The SWAGR Framework 3.1 SWAGR Framework In this chapter, I introduce the a framework for integrating states, goals, and appraisals in reinforcement learning, which is called SWAGR. The agent receives at each timestep t a state s t and the reward vector r t received upon transitioning froms t1 tos t . In this work, thes t is two-dimensional pixel input. Exam- ples of reward component sets are gold, corn, and salt in a goods trading environment, or score, health, and experience points in a video game environment. At each timestep, the agent has a goal vector g t . Each appraisal variable and reward component has a corresponding component in g t . 
Given $s_t$, $r_t$, and the appraisal at the previous timestep $w_{t-1}$, the appraisal for the current timestep is given by:

$$w_t = \mathrm{Appraisal}(s_t, r_t, w_{t-1}, g_t) \quad (3.1)$$

The appraisal $w_t$ is a vector with each component corresponding to the value of a different appraisal variable, which is analogous to the affective reward features of Sequeira et al. [5]. The concatenated vector $\langle r_t, w_t \rangle$ is analogous to the measurement vector in Dosovitskiy and Koltun [33]. The expected future appraisal vector $\mathcal{W}$ at timestep $t$ is given by:

$$\mathcal{W}_t = \sum_{d=0}^{\infty} \gamma_W^d\, w_{t+d} \quad (3.2)$$

where $\gamma_W$ is a vector of discount factors, one for each component of $\mathcal{W}$, applied component-wise. The expected future reward vector $\mathcal{R}$ is given in a similar manner:

$$\mathcal{R}_t = \sum_{d=0}^{\infty} \gamma_R^d\, r_{t+d} \quad (3.3)$$

As in Dosovitskiy and Koltun [33], we define the goal the agent is pursuing as the maximization of a utility function $u(\mathcal{R}_t, \mathcal{W}_t, g_t)$. In this work, $u$ is defined as the linear combination of $g_t$ and the concatenation of $\mathcal{R}$ and $\mathcal{W}$:

$$u(\mathcal{R}_t, \mathcal{W}_t, g_t) = g_t \cdot \langle \mathcal{R}_t, \mathcal{W}_t \rangle \quad (3.4)$$

This approach to goal parameterization is similar to the one presented by Si et al. [23]. Note that while $u$ is a linear function here, any parametric function could be used.

In traditional Q-Learning [49], the function $Q(s_t, a_t)$ gives the maximum expected future reward for performing a given action $a_t$ in state $s_t$. Likewise, this work is interested in predicting the expected future values of the reward components $\mathcal{R}_{t+1}$ and the appraisal variables $\mathcal{W}_{t+1}$ given $s_t$, $w_t$, $a_t$, $g_t$, and $r_t$. The function approximator is given by:

$$F(s_t, w_t, a_t, g_t, r_t; \theta) = \langle \hat{r}_{t+1}, \hat{w}_{t+1} \rangle + \langle \gamma_R, \gamma_W \rangle \odot F(s_{t+1}, w_{t+1}, a', g_{t+1}, r_{t+1}; \theta) = \langle \hat{\mathcal{R}}_t, \hat{\mathcal{W}}_t \rangle \quad (3.5)$$

The action $a'$ is given by:

$$a' = \operatorname*{argmax}_{a \in A} u\big(F(s_{t+1}, w_{t+1}, a, g_{t+1}, r_{t+1}; \theta),\, g_{t+1}\big) \quad (3.6)$$

$\theta$ is the learned parameter set of the approximator, a hat ($\hat{\ }$) indicates a predicted value, and $A$ is the set of actions. This can be seen as a variation of the function approximator given by Dosovitskiy and Koltun [33], where the parameterized function approximator has been expanded to include appraisal information. The deep neural architecture trained as the predictor is described in Appendix A.

3.2 Appraisal Variables

The SWAGR framework uses the following appraisal variables: surprise, relevance, desirability, and coping potential.

3.2.1 Surprise

Surprise in this work is defined as the mismatch between the state the agent anticipated entering, $\hat{s}_t$, and the actual state the agent entered, $s_t$. This is given by the mean squared error normalized with respect to $\hat{s}_t$:

$$\mathrm{Surprise}(\hat{s}_t, s_t) = \frac{\mathrm{MSE}(\hat{s}_t, s_t)}{\mathrm{MSE}(\hat{s}_t, 0)} \quad (3.7)$$

The mechanisms that allow for the generation of anticipated states are described in Section 3.3.

3.2.2 Relevance

Goal relevance is described in Lazarus [1] as the degree of impact an event has on the agent's goals:

$$\mathrm{Relevance}(r_t, w_{t-1}, g_t) = \lvert u(r_t, w_{t-1}, g_t) \rvert \quad (3.8)$$

This equation is similar to the one given for motivational relevance in Si et al. [23].

3.2.3 Desirability

$$\mathrm{Desirability}(r_t, w_{t-1}, g_t) = u(r_t, w_{t-1}, g_t) = w_{desirability,t} \quad (3.9)$$

The situation at timestep $t$ is called desirable if $\mathrm{Desirability}(r_t, w_{t-1}, g_t) > 0$ and undesirable if $\mathrm{Desirability}(r_t, w_{t-1}, g_t) < 0$.

3.2.4 Coping Potential

Coping potential here is analogous to problem-focused coping potential in Smith et al. [50] and controllability in Gratch and Marsella [20], and refers to an agent's ability to change the situation such that the resultant desirability is improved or maintained.
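Before formalizing coping potential, the three appraisals defined so far admit a compact sketch. This is a minimal illustration, assuming relevance is the magnitude of the goal-weighted utility (Eq. 3.8) and adding a small epsilon purely for numerical safety; it is not the thesis's actual implementation.

```python
import numpy as np

def utility(r_t, w_prev, g_t):
    """Linear goal-weighted utility u over the concatenated reward and appraisal vectors (Eq. 3.4)."""
    return float(np.dot(g_t, np.concatenate([r_t, w_prev])))

def surprise(s_hat, s_actual, eps=1e-8):
    """Normalized mean squared error between anticipated and actual observations (Eq. 3.7)."""
    mse = np.mean((s_hat - s_actual) ** 2)
    norm = np.mean(s_hat ** 2) + eps   # MSE(s_hat, 0), with eps to avoid division by zero
    return mse / norm

def relevance(r_t, w_prev, g_t):
    """Magnitude of the event's goal-weighted impact (Eq. 3.8)."""
    return abs(utility(r_t, w_prev, g_t))

def desirability(r_t, w_prev, g_t):
    """Signed goal-weighted impact; positive is desirable, negative undesirable (Eq. 3.9)."""
    return utility(r_t, w_prev, g_t)
```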
If the current situation is desirable, then the agent seeks to maintain this desirability. If the current situation is undesirable, then the agent seeks to improve the situation. Let $\Delta_t$ be the desirability of the most rewarding sequence of state transitions anticipated after state $s_t$:

$$\Delta_t = \mathrm{Desirability}\!\left(\max_{\hat{r}} \sum_{d=t+1}^{t+\tau} \hat{r}_d,\; w_{t-1},\; g_t\right) \quad (3.10)$$

Here $\tau$ is a constant that describes the depth of the trajectory. This is similar to the mechanisms for anticipatory emotions described in Moerland et al. [38]. The mechanisms that allow for anticipated rewards are described in Section 3.3.1.

For $w_{desirability,t} > 0$:

$$\mathrm{MaintainingPotential}(w_{desirability,t}, \Delta_t) = \begin{cases} 1 & \text{if } \Delta_t \geq 0 \\ 1 - \min\!\left(1, \dfrac{\lvert \Delta_t \rvert}{w_{desirability,t}}\right) & \text{otherwise} \end{cases}$$
$$\mathrm{ImprovingPotential}(w_{desirability,t}, \Delta_t) = 0 \quad (3.11)$$

For $w_{desirability,t} < 0$:

$$\mathrm{MaintainingPotential}(w_{desirability,t}, \Delta_t) = 0$$
$$\mathrm{ImprovingPotential}(w_{desirability,t}, \Delta_t) = \begin{cases} 0 & \text{if } \Delta_t \leq 0 \\ \min\!\left(1, \dfrac{\Delta_t}{\lvert w_{desirability,t} \rvert}\right) & \text{otherwise} \end{cases} \quad (3.12)$$

Coping potential can then be formulated as:

$$\mathrm{CopingPotential}(w_{desirability,t}, \Delta_t) = \mathrm{MaintainingPotential}(w_{desirability,t}, \Delta_t) - \mathrm{ImprovingPotential}(w_{desirability,t}, \Delta_t) \quad (3.13)$$

Here the sign of the coping potential indicates whether the agent is seeking to maintain (positive) or improve upon (negative) the desirability of the event.

3.3 Environment Model

In order to support anticipatory appraisal, i.e., anticipated states in surprise and anticipated future rewards in coping potential, an action-conditional next-step environment model is used. Given a state $s_t$ and an action $a_t$, the agent anticipates transitioning to state $\hat{s}_t$ and receiving reward $\hat{r}_t$. As in Oh et al. [51] and Weber et al. [34], we train a neural architecture to achieve this aim. The network architecture is described in Appendix B. The anticipated state produced by the environment model is then used to compute surprise.

3.3.1 Rollout Policy

To find an approximation of $\Delta_t$ using the environment model, one could use an exhaustive search or Monte Carlo Tree Search, as in Moerland et al. [38]. This would come at a severe computational cost. Instead, this work uses the rollout strategy described in Weber et al. [34]. For each possible action $\hat{a}^i_t$ in the set of actions $A$, we roll out one step in the environment. However, when choosing the subsequent actions for the $i$th rollout, rather than using a rollout policy network (as in Weber et al. [34]), we directly reuse the policy formed by combining the SWAGR predictor with the utility function:

$$\hat{a}^i_d = \operatorname*{argmax}_{a \in A} u\big(F(\hat{s}^i_d, w_{t-1}, a, g_t, r_t; \theta),\, g_t\big), \quad \text{for } d = t+1, \ldots, t+\tau-1 \quad (3.14)$$

Note that the appraisal, goal, and reward remain fixed throughout the rollout. This gives us, for each initial action $\hat{a}^i_t$, a trajectory of the form:

$$[\, s_t,\ \hat{a}^i_t,\ \hat{r}^i_{t+1},\ \hat{s}^i_{t+1},\ \hat{a}^i_{t+1},\ \hat{r}^i_{t+2},\ \ldots,\ \hat{s}^i_{t+\tau-1},\ \hat{a}^i_{t+\tau-1},\ \hat{r}^i_{t+\tau} \,]$$

Thus we estimate $\Delta_t$ as:

$$\hat{\Delta}_t = \max_i \mathrm{Desirability}\!\left(\sum_{d=t+1}^{t+\tau} \hat{r}^i_d,\; w_{t-1},\; g_t\right) \quad (3.15)$$

3.4 Training

The environment model was trained on experiences collected from a random-policy agent interacting with the environment for $N$ episodes. A set of the $M$ most recent experiences was maintained, with mini-batches of size $k$ sampled via prioritized experience replay (cf. Schaul et al. [52]) for each iteration of the solver.
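For reference, a minimal sketch of proportional prioritized sampling in the style of Schaul et al. [52] is shown below. The hyperparameter names alpha and beta follow that work's conventions; the function is illustrative rather than the thesis's actual implementation.

```python
import numpy as np

def sample_prioritized(priorities, k, alpha=0.6, beta=0.4):
    """Proportional prioritized sampling over a replay buffer.

    priorities: array of |TD-error|-based priorities, one per stored experience.
    Returns sampled indices and importance-sampling weights.
    """
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(probs), size=k, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()   # normalize weights for stability
    return idx, weights
```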
The environment model was optimized with stochastic gradient descent [53] according to the loss:

$$L_{environment}(\theta) = \mathrm{RMSE}(\hat{s}_t, s_t) + \mathrm{RMSE}(\hat{r}_t, r_t) \quad (3.16)$$

Additional details of the environment model training procedure are provided in Appendix B.

The SWAGR model was trained on experiences collected from the agent interacting with the environment for $N$ episodes. Again, a set of the $M$ most recent experiences was maintained, with mini-batches of size $k$ sampled for training with prioritized experience replay for each iteration of the solver. The SWAGR model was optimized with stochastic gradient descent according to the loss:

$$L_{SWAGR}(\theta) = \Big\lVert F(s_t, w_t, a_t, g_t, r_t; \theta) - \big[ \langle r_{t+1}, w_{t+1} \rangle + \langle \gamma_R, \gamma_W \rangle \odot F(s_{t+1}, w_{t+1}, a', g_{t+1}, r_{t+1}; \theta) \big] \Big\rVert^2 \quad (3.17)$$

The parameters of the SWAGR model are updated after each timestep. The agent follows an $\epsilon$-greedy policy, with $\epsilon$ initialized to 1 and decreasing linearly during training, similar to the procedure in Dosovitskiy and Koltun [33]. The appraisal process used by the SWAGR agent depends on the environment model, and thus the environment model is fully trained before training the SWAGR model. Future work could investigate strategies for interleaving the training of the two models.

Chapter 4

Pac-Man Experiment 1

It is possible that an agent that must learn both the environment task and appraisal could suffer in task performance compared to an agent that only has to learn the environment task. Thus the aim of this experiment is to show that the agent that learns appraisal performs no worse than the agent that solely learns the environment task. The evaluation criterion for this experiment is the average extrinsic reward received by the agent per episode.

This work builds on the Pac-Man Python environment developed in van der Ouderaa [54] and UCB [55]. This environment was chosen due to the availability of existing source code and due to its simplicity. All agents trained below interact in a (5 x 5) grid environment, as seen in Figure 4.1. The agent controls Pac-Man as it moves throughout the grid. The set of actions is A = {North, East, South, West}. The Ghost moves without respect to the player and, upon reaching a splitting path, has an equal probability of choosing each path. The score increases by 10 when Pac-Man eats a pill and increases by 200 when Pac-Man eats a scared Ghost. The score decreases by 500 when Pac-Man collides with a Ghost. The game ends when Pac-Man has eaten all the white pills. The performance of the agent (i.e., extrinsic reward) is assessed using the score as reported by the Pac-Man environment.

4.1 Setup

During training, each episode starts with the configuration depicted in Figure 4.1. The set of reward components is {EatsPill, EatsGhost, WinsGame, HitsGhost, TakesStep}. At each timestep $t$ the agent receives a state input $s_t$, a (3 x 120 x 150) array of pixel data,
w CopingPotential;t and w CopingPotential;t+1 do not accumulate as each of those assessments are in respect to the desirabilities of dierent events. At the beginning of each game, the goal vectorg 0 is randomly initialized in the following manner: The values for g EatsPill;0 , g EatsGhost;0 , and g WinsGame;0 are chosen randomly with uniform distribution from the range (0; 1) The values for g HitsGhost;0 and g TakesStep;0 are chosen randomly with uniform dis- tribution from the range (1; 0). Note that this means the Pac-Man receives a penalty for each time it takes a step, even if it moves into a wall. The values for g W i are 0 for allW i in the set of appraisals (i.e., during training, appraisal is not taken into account in action selection) The following modications are made to the game: The score decreases by 1 with action Pac-Man takes (including moving into a wall) The game always ends after 60 steps The game does not end after the Pac-Man collides with a EatsGhost Pac-Man is represented by a simple yellow circle in the pixel data s t Agents are evaluated under the following training regimes: Pac-Man Experiment 1 19 1. No appraisal: The appraisal process is not executed and w t is the zero vector for all t, referred to as NA-Agent 2. With appraisal: The appraisal process is executed as described above, referred to as A-Agent The environment model is trained for 150 episodes as described in Chapter 3. After which, the agents are trained for 110 episodes. Rollout depth for assessing coping potential is set to 6. The batch size k is 16 and the experience replay memory stores M = 10000 experiences. 4.2 Results We evaluate the agent in three layouts. Layout A is the same environment the agent trained in. Layout B is a variation of Layout A where the pills in rst and second rows have been removed. Layout C is a variation where no pills are accessible and the movement options of the Ghost are increased. The agent is tested in each layout for 100 games. All layout congurations are pictured below. The Ghost, after being eaten or hitting the player, always respawns in the position shown. The Pac-Man always starts in the position shown. All of the weight congurations shown below are hand-tuned and were not discovered through any automated process. 4.2.1 Performance: Layout A The results are shown in 4.1 Table 4.1: Performance in Layout A. Conguration Weight Conguration NA-Agent g EatsPill =g EatsGhost =g WinsGame = 0:2, g HitsGhost =g TakesStep =0:2. 323.93 250.23 A-Agent g EatsPill =g EatsGhost =g WinsGame = 0:2, g HitsGhost =g TakesStep =0:2, g W i = 0 for all W i 2W 334.27 250.21 A-Agent g Desirability = 1:0, g W i = 0 for all W i 2WjW i 6=W Desirability , g R i = 0 for all R i 2R. 333.57 250.83 The agent that completely eschews reward information in its action-selection policy and solely relies on reinforce desirability performs roughly the same as the reward-only Pac-Man Experiment 1 20 (a) Layout A (b) Layout B (c) Layout C Figure 4.2: The three layouts used to evaluate performance agents. The learned policy had agents move east from the start state and follow the trail of pills until the game was completed. A more optimal path would be to move west from the start state, collect the power pill, then move south into the middle row and collect the trail of pills into completion. This strategy would avoid being hit by the Ghost, as it would be in the scared state. 
A hypothesis as to why none of the agents learned this strategy is that the low number of training episodes did not facilitate learning the delayed reward of eating the power pill and subsequently the scared Ghost. Nonetheless, the results indicate that the addition of appraisal variables to training and evaluation does not significantly hamper the ability of the agent to perform the task.

Figure 4.3: Histogram of episode scores achieved in Layout A for the three configurations.

4.2.2 Performance: Layout B

This evaluation is primarily concerned with investigating whether appraisal information hampers or aids generalization to unseen layouts. The performance is given in Table 4.2.

Table 4.2: Performance in Layout B.

Configuration | Weight Configuration | Average Score
NA-Agent | g_EatsPill = g_EatsGhost = g_WinsGame = 0.2, g_HitsGhost = g_TakesStep = -0.2 | -1,764.10 ± 930.22
A-Agent | g_EatsPill = g_EatsGhost = g_WinsGame = 0.2, g_HitsGhost = g_TakesStep = -0.2, g_Wi = 0 for all W_i in W | -1,135.01 ± 685.50
A-Agent | g_Desirability = 1.0, g_Wi = 0 for all W_i in W \ {W_Desirability}, g_Ri = 0 for all R_i in R | -217.89 ± 1,145.04

While the agents under all configurations perform poorly, it can be seen that, on average, the agent that weights only appraisal information (specifically desirability) in its action-selection policy performs considerably better than the agent that trains without appraisal information and the agent that does not weight appraisal information. The agents that do not weight appraisal information simply oscillate left and right for the entire episode, colliding with the Ghost as time passes. The appraisal agent also does this, but at certain times will choose to eat the power pill, then eat the Ghost, then follow the trail of pills. This is evidenced by the much greater frequency of episode scores greater than 500, as seen in Figure 4.4.

Figure 4.4: Histogram of episode scores achieved in Layout B for the three configurations.

4.2.3 Performance: Layout C

Taking things a step further from the previous layout, this evaluation is concerned with investigating whether appraisal information hampers or helps generalization to new tasks in unseen layouts. Rather than collecting pills and eating ghosts, the agent is tasked with minimizing the number of times it gets hit by the Ghost before the 60 timesteps of the episode elapse. The results are given in Table 4.3.

Table 4.3: Number of times hit by the Ghost in Layout C.

Configuration | Weight Configuration | Times Hit
NA-Agent | g_HitsGhost = -1.0, g_Ri = 0 for all R_i in R, R_i ≠ R_HitsGhost | 1.60 ± 1.40
A-Agent | g_Wi = 0.0 for all W_i in W, W_i ≠ W_Relevance, g_HitsGhost = -1.0, g_Ri = 0 for all R_i in R, R_i ≠ R_HitsGhost | 1.75 ± 1.72
A-Agent | g_Relevance = -0.1, g_Wi = 0.0 for all W_i in W, W_i ≠ W_Relevance, g_HitsGhost = -1.0, g_Ri = 0 for all R_i in R, R_i ≠ R_HitsGhost | 1.35 ± 0.98

The agent that jointly weights appraisal information (specifically, negatively weights W_Relevance) with goal information performs slightly better than those that do not. While the appraisal-weighted agent gets hit 1-2 times per episode with higher frequency than the other agents, it suffers very few episodes with 4 or more collisions. This suggests that slightly biasing action selection towards states with lower expected future relevance results in more consistent performance.

Figure 4.5: Histogram of episode scores achieved in Layout C for the three configurations.
4.3 Analysis of Reinforced Appraisal Values

In the worst case, the performance results obtained above could be the result of random chance. Most reinforcement learning literature trains and evaluates for thousands or millions of episodes, whereas this work trained in the Pac-Man environment for only 110 episodes. It is not inconceivable that the anticipated appraisal variable values generated by the SWAGR model are entirely random in nature due to the limited amount of time spent reinforcing them. As such, this work now turns its attention to analyzing the anticipated appraisal values. A challenge here is that there is not necessarily a ground truth for what the expected future appraisal values should be. Instead of pursuing that challenge, this work chooses to generate many possible Pac-Man position configurations in different layouts, and to consider whether the appraisal values generated for these configurations are consistent.

4.3.1 Heatmap Analysis: Desirability

The heatmap for desirability shown in Figure 4.6 was generated for each possible position Pac-Man could occupy in Layout A. Each position contains an arrow indicating the action corresponding to the highest anticipated desirability. The most desirable states surround the pills, and the least desirable states are in the bottom row, where there are no pills and where the Ghost respawns. The actions considered most desirable generally point in the direction of pills, though some point into walls. These errant arrows may be brought about by under-exploration in those parts of the layout.

Figure 4.6: Heatmap of highest anticipated desirability for each Pac-Man position with corresponding action in Layout A.

Another thing to note is that the desirability in all of these positions is, at best, negative. This could be due to the penalty received upon performing any action, or simply under-exploration. In order to investigate this, a heatmap was generated for the modified layout shown in Figure 4.7(b).

Figure 4.7: Heatmap of highest anticipated desirability (a) and the layout it was generated for (b), used in the analysis of positive anticipated desirability.

The two positions with the highest desirability are to the left and right of the scared Ghost. The only state with positive desirability is the one directly to the right of the scared Ghost. While the relative desirabilities of these two state-action pairs make intuitive sense, it remains to be seen why the left position is negative while the right position is positive, and why there are errant actions in the bottom row.

4.3.2 Heatmap Analysis: Relevance

The heatmap for relevance shown in Figure 4.8 was generated for each possible position Pac-Man could occupy in Layout A. The states the agent anticipates being most relevant are around the pills and around the Ghost's location. However, there are positions with relatively low relevance that border pills, particularly in the top right corner of the layout. It is possible that the relevance of colliding with the Ghost is outweighing the relevance of eating pills.

Figure 4.8: Heatmap of total anticipated relevance across all actions for each Pac-Man position in Layout A.
Figure 4.9: Heatmap of total anticipated relevance across all actions (a) and the corresponding modified layout (b).

In order to investigate this, a heatmap was generated for the modified layout shown in Figure 4.9(b). The heatmap for this layout shows the relevance concentrated much more tightly around the top right and middle of the layout. The bottom row has relatively low total relevance. This supports the hypothesis that the anticipated relevance is largely swayed by the position of the Ghost.

4.3.3 Heatmap Analysis: Coping Potential

It is worth restating what exactly the anticipated coping potential represents, for clarity's sake. $F_{CopingPotential}(s_t, w_t, a_t, g_t, r_t) = \hat{w}_{CopingPotential,t}$ is a measure of the agent's perceived ability to maintain or improve upon the desirability experienced upon transitioning to state $s_{t+1}$ with action $a_t$. In other words, it is a measure of how well the agent thinks it can recover from a penalized action, or its ability to prevent the reversal of progress made by a rewarding action. This appraisal variable is subject to the most potential error, as it is reinforced using values obtained from an imperfect model of the environment, and these values are then learned by another imperfect model. In addition, the rollout policy described in Section 3.3.1 is not a thorough search of the state space, and it also utilizes the imperfect SWAGR model for action selection.

Coping potential, on the other hand, is the appraisal variable that most justifies the SWAGR model itself. The environment model is significantly larger in terms of number of parameters than the SWAGR model (43,359,539 parameters vs. 13,967,090), and its inference is significantly slower as a result. Assessing coping potential requires $\tau = 6$ rollout steps for each game step. Thus, there is much to be gained if coping potential can be successfully reinforced. To investigate whether this occurred, heatmaps for maintaining potential and improving potential are analyzed separately. The heatmap for maintaining potential in Layout A is shown in Figure 4.10.

Figure 4.10: Heatmap of highest anticipated maintaining potential for each Pac-Man position with corresponding action in Layout A.

Arrows indicate the action that results in the state with the highest anticipated maintaining potential, and are omitted for states with significantly low values. The highest values are at positions where Pac-Man borders the trail of pills, with the exception of the top right position, which has a low value of 0.0557 relative to the values of the other positions (0.1229 for the top left, 0.2523 for the bottom left, and 0.1386 for the bottom right). It can be inferred from the desirability heatmap that the rollout policy leads Pac-Man towards pills without backtracking. This does make some intuitive sense, but raises the question: why does the sequence of actions need to form a continuous path for desirability to be maintained? To investigate this, another layout was developed that modified the location of the Ghost and the pills in the environment (Figure 4.11). There are only two positions with non-negligible maintaining potential: the center and one space to the right of the center. The best action in both of these positions is to move west. From these states start the best trajectories Pac-Man can travel to simultaneously collect pills and avoid the Ghost.
Returning to the question above, it could be that moving through the trail of pills without backtracking gives Pac-Man the best chance of avoiding the Ghost as well as compensating for movement penalties.

Figure 4.11: Heatmap of highest anticipated maintaining potential across all actions (a) and the corresponding modified layout (b).

The heatmap for improving potential in Layout A is shown in Figure 4.12.

Figure 4.12: Heatmap of highest anticipated improving potential for each Pac-Man position with corresponding action in Layout A.

The actions shown in Figure 4.12 are penalized actions that the agent anticipates it can recover from. The highest values are either in states where it is highly likely Pac-Man will be hit by the Ghost or in states close to pills. It is important to note that the actions with the highest anticipated improving potential often send Pac-Man into a wall or towards the Ghost. However, given the desirability heatmap, there seem to be inconsistencies between the improving potential assessed at the bottom left corner and the cumulative reward the rollout policy would achieve in those states. To investigate this further, the modified layout used for analyzing maintaining potential was repurposed for analyzing improving potential.

Figure 4.13: Heatmap of highest anticipated improving potential across all actions (a) and the corresponding modified layout (b).

The heatmap for this layout confirms that the actions with the highest improving potential occur in states near pills. The inconsistencies in the bottom left corner of the map are not present in this heatmap, indicating that this configuration may produce states more in line with the environment model's training set. It is important to note that many of the configurations of Pac-Man and pill positions analyzed in these heatmaps are unreachable during regular gameplay. For example, the state corresponding to the center position on the heatmap would have pills on both Pac-Man's left and right side, which is impossible during training.

4.3.4 Heatmap Analysis: Surprise

As it pertains to Pac-Man itself, actions are deterministic. However, the Ghost's movement throughout the maze is random. Thus, to analyze anticipated surprise, a heatmap is generated for each Ghost position.

Figure 4.14: Heatmap of total anticipated surprise across all actions for each Pac-Man position in Layout A.

The highest anticipated surprise is at positions around junctures in the maze. This is to be expected, as those are where the Ghost chooses to change its direction at random.

4.3.5 Heatmap Analysis: Conclusion

The previous section showed that appraisal information in the worst case does not significantly hamper performance and in the best case can help adaptation to new layouts. Heatmap analysis has shown that the anticipated appraisal values themselves are generally rational within the context of the Pac-Man game. Much more rigorous analysis is needed to determine whether inconsistencies and errant actions are the result of a lack of training or of limits in the model's ability to capture appraisal information. Future analysis should attempt to discover which features in the pixel space are associated with appraisal dimensions directly.
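The heatmaps above follow a simple recipe: enumerate candidate Pac-Man positions, render the corresponding observation, query the trained predictor, and record the per-action appraisal of interest. A minimal sketch of that recipe is shown below; `render_observation` and `swagr_predict` are hypothetical stand-ins for the environment renderer and the trained SWAGR model, not the thesis's actual code.

```python
def desirability_heatmap(positions, actions, render_observation, swagr_predict,
                         w_prev, g, r):
    """For each candidate Pac-Man position, record the highest anticipated
    desirability over actions and the action that achieves it."""
    values, best_actions = {}, {}
    for pos in positions:
        s = render_observation(pos)  # pixel observation for this configuration
        per_action = {a: swagr_predict(s, w_prev, a, g, r)["desirability"]
                      for a in actions}
        best = max(per_action, key=per_action.get)
        values[pos], best_actions[pos] = per_action[best], best
    return values, best_actions
```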
Chapter 5

Pac-Man Experiment 2

The aim of the prior experiment was to show that the agent that learns appraisal performs no worse than the agent that solely learns the environment task. However, the prior experiments trained the agent for only 110 games; it is common to train deep reinforcement learning algorithms with sample sizes several orders of magnitude larger. None of the agent configurations in the prior experiment were able to find an optimal policy in the training layout. It would also be interesting to compare the performance of models trained with random goals, as in the previous section, with the performance of models trained with a fixed goal vector. This chapter introduces a new experiment that addresses these concerns. In this experiment, the questions being pursued are: (1) What impact does a larger sample set have on quantitative performance? Can an optimal policy be found? Does the agent that learns appraisal still perform no worse than the agent that solely learns the environment task? (2) Does the effect of improved generalization due to appraisal hold with increased training?

5.1 Setup

Adapting the training procedure given in [14, 33], the agent was trained for 20,000 minibatch iterations with a batch size of 32, for a grand total of 640,000 training steps. After every 500 training iterations, the agent was evaluated for 200 iterations. The performance for each evaluation period is given by the cumulative score received during evaluation. During training, the agent used an ε-greedy action-selection strategy; ε starts at 1.0 and decreases linearly to 0.01 over the first 15,000 iterations. Models are trained for the following four agent configurations:

No appraisal, random goal: the appraisal process is not executed, and the goal vector is randomized as in Chapter 4.1. Referred to as NA-Agent-RG.
No appraisal, fixed goal: the appraisal process is not executed, and the goal vector is g_fixed for all iterations. Referred to as NA-Agent-FG.
With appraisal, random goal: the appraisal process is executed, and the goal vector is randomized. Referred to as A-Agent-RG.
With appraisal, fixed goal: the appraisal process is executed, and the goal vector is g_fixed for all iterations. Referred to as A-Agent-FG.

Let the vector g'_fixed = <0.01, 0.2, 0.5, -0.5, 0, 0, 0, 0, 0, 0>, where the first five components correspond to the reward components and the remaining five components correspond to the appraisal components. Note that the reward components are ordered: eats pill, eats ghost, wins game, hits ghost, takes step; and that the appraisal components are ordered: surprise, relevance, desirability, (maintaining) coping potential, (improving) coping potential. g_fixed is the normalized g'_fixed, i.e., g_fixed = g'_fixed / |g'_fixed|. The reward weights in g_fixed were derived from the scores given by the Pac-Man environment. Layout A from Chapter 4.1 was used during training, and layouts B and C were used during testing, as in the previous experiment. The agent is tested for 100 games in each layout. During testing, the goal vector is g_fixed for all games, except in Layout C, where it is instead g_C = <0, 0, 0, -1, 0, 0, 0, 0, 0, 0>. All hyperparameters and rules not mentioned here are the same as in the previous experiment. A brief sketch of the goal normalization and the ε schedule follows.
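The following is a minimal, illustrative sketch of the fixed-goal normalization and the linear ε schedule described above, in plain Python/NumPy. The negative sign on the "hits ghost" weight follows the reconstruction of g'_fixed given above, and a Euclidean norm is assumed for |g'_fixed|.

```python
import numpy as np

# Reward components: eats pill, eats ghost, wins game, hits ghost, takes step.
# Appraisal components: surprise, relevance, desirability,
# (maintaining) coping potential, (improving) coping potential.
g_fixed_raw = np.array([0.01, 0.2, 0.5, -0.5, 0.0,   # reward weights (sign of "hits ghost" reconstructed)
                        0.0, 0.0, 0.0, 0.0, 0.0])    # appraisal weights
g_fixed = g_fixed_raw / np.linalg.norm(g_fixed_raw)  # g_fixed = g'_fixed / |g'_fixed| (Euclidean norm assumed)

def epsilon(iteration: int, start: float = 1.0, end: float = 0.01,
            anneal_iters: int = 15_000) -> float:
    """Linearly annealed exploration rate used during training."""
    frac = min(iteration / anneal_iters, 1.0)
    return start + frac * (end - start)
```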
5.2 Results

The performance during training is shown in Figure 5.1. Each point is averaged over three independent runs. Table 5.1 summarizes the results in each layout. (In the agent names, "UA" stands for unweighted appraisal and indicates that at test time the agent was run with g_j = 0 for all appraisal components j ∈ W; "WA" stands for weighted appraisal. The goal vectors used for the A-Agent-WA configurations in each layout are given in the notes to Table 5.1.)

All four of the configurations performed similarly in the training layout. This is a repetition of the result in Chapter 4.2, with one key exception: the appraisal agent trained with randomized goals, when utilizing solely desirability (referred to in Table 5.1 as A-Agent-WA-RG), performs significantly worse than the other configurations. In all other configurations, the Pac-Man agent immediately eats the power pill at the start of each game before heading out to eat either the scared Ghost or the pills. While the appraisal agent trained with the fixed goal is able to capture the utility of that policy in its desirability appraisal, the randomized-goal appraisal agent is not. This may be because, with randomized goals, appraisals are less consistent, and as a result it takes longer to accurately estimate the expected future desirability.

Figure 5.1: Training curves of the average score per evaluation period.

During training, the agent starts executing a sub-optimal policy, similar to the policy learned in Chapter 4.2, after around 6,000 iterations. The agents with random goals are the first to reach the best-performing policy, which occurs after around 13,750 iterations. All other agents reach this policy at around 17,500 iterations.

In Layout B, both appraisal agent configurations outperform the non-appraisal agents. The performances in this layout share the same exception as in Layout A, however: the appraisal agent with randomized goals that solely utilizes desirability again performs worse than the other agents. The appraisal agents with unweighted appraisals and the fixed-goal agent relying on desirability behave in a similar manner: the Pac-Man agent eats the power pill, then heads south to the center row, and then moves east through the path of pills until the game is completed. A-Agent-WA-RG instead takes a clockwise path, often getting hit by the Ghost on its way to collect all the pills.

In Layout C, all the appraisal agents and their weighted-appraisal variations outperform the non-appraisal agents. The appraisal agents adopted a strategy of eating the power pill to put the Ghost into a scared state and then avoiding the Ghost until the time ran out.

Configuration        Layout A          Layout B          Layout C*
NA-Agent-RG          734.00 ± 82.85    537.00 ± 0.00     4.66 ± 2.55
NA-Agent-FG          749.39 ± 69.17    552.65 ± 129.02   2.60 ± 1.66
A-Agent-UA-RG        735.53 ± 81.64    640.78 ± 99.42    1.16 ± 1.04
A-Agent-UA-FG        729.99 ± 85.43    682.00 ± 87.73    1.71 ± 1.59
A-Agent-WA-RG**      408.31 ± 237.53   505.34 ± 267.83   1.67 ± 1.08
A-Agent-WA-FG**      732.00 ± 84.17    670.00 ± 93.30    0.96 ± 1.34

Table 5.1: Comparison of performance for each configuration across layouts. *Layout C performance is measured by the number of times the Pac-Man collided with a ghost; lower is better. **In layouts A and B, A-Agent-WA-RG and A-Agent-WA-FG were given, at evaluation time, the goal vector with g_Desirability = 1 and all other components set to 0. In Layout C, A-Agent-WA-RG and A-Agent-WA-FG were given, at evaluation time, the goal vector with g_Relevance = -0.1 and all other components set to 0.

One conclusion from this experiment is that the SWAGR framework can produce agents that learn near-optimal policies and achieve consistently high performance.
The performance of the appraisal agent in layouts A and B suggests a high fluency in the environment's mechanics.

When specifically comparing the performance of the randomized-goal agents with the fixed-goal agents, it can be seen that performance was roughly the same in layouts A and B, and that the randomized-goal agents did substantially worse in Layout C. This seems to contradict the finding in [33] that randomized goals helped agents generalize to unseen new tasks. However, a key difference between that work and the experiment presented here is that in [33] the agent is trained and tested in the same layout. Future experimentation should test the agent at performing a new task in the layout it was trained in.

It is difficult to directly compare the performance results in this experiment with those in Chapter 4.2. While at first glance it may seem that the increased size of the sample set led to increased performance, the goal vector used during testing in this experiment, g_fixed, differs from the goal vector used during testing in the prior experiment. g_fixed was found through hand-tuning in order to maximize performance in the training environment. This does, however, have the benefit of demonstrating that the SWAGR framework can produce agents that find a near-optimal policy.

In conclusion, Chapter 4.2's finding that appraisal information can facilitate performance in unseen layouts holds in training regimens with a larger sample set. The exception to this finding is that the appraisal agent that utilizes solely desirability or relevance does not adapt well to unseen layouts in comparison with the other agents. A possible explanation for this phenomenon is that, because appraisal varies with the goal, randomizing the goal vector increases the number of samples necessary to accurately estimate the expected future appraisal.

5.3 Qualitative Analysis

As in Chapter 4.3, it is necessary to directly investigate the expected appraisal predicted by the model for given game states. Again, it is possible that the SWAGR neural architecture is solely learning the reward information well, or learning some appraisals well and not others. Additionally, as the size of the training set has been increased in this chapter's experiments, it is necessary to increase the sample set used for qualitative analysis: because the training set is larger, the SWAGR neural architecture's representational capacity must support a broader set of input data, and investigating this capacity requires a sufficiently large set of test states. Also, while the heatmap analysis can show correlations between grid positions and predicted appraisal, it reveals nothing about how the state itself is encoded by the network. Along these lines, this analysis uses an approach that converts representational data output by the network into a human-interpretable format. As in [14], the t-SNE [56] algorithm is used to map the high-dimensional neural state representations to a two-dimensional embedding space. The neural representation of a state is given by the last image-processing hidden layer in the SWAGR neural architecture (see Figure A.3).

In order to collect data for the analysis, the agent trained with appraisal information and the fixed goal vector plays the game for 10,000 steps, choosing a random action with probability 0.5. At each step t, the neural representation for the game state is stored, as well as Ŵ_t, the expected future appraisal values predicted by the model. The t-SNE algorithm is run for 15,000 iterations with perplexity 60, and a plot is generated for each appraisal variable, with the points colored by the intensity of the expected future appraisal on a linear scale (the plots for relevance and surprise can be seen in Appendix C).
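A minimal sketch of this collection-and-embedding procedure is shown below, assuming scikit-learn's TSNE and matplotlib; the `env` and `agent` helpers (`act`, `state_representation`, `expected_appraisal`) are hypothetical stand-ins rather than the thesis's actual API.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

representations, desirability = [], []
state = env.reset()                                    # hypothetical Pac-Man environment handle
for t in range(10_000):
    action = agent.act(state, epsilon=0.5)             # random action with probability 0.5
    representations.append(agent.state_representation(state))        # last image-processing hidden layer
    desirability.append(agent.expected_appraisal(state)["desirability"])
    state, _, done, _ = env.step(action)
    if done:
        state = env.reset()

# 2D embedding of the stored hidden representations (n_iter is max_iter in newer scikit-learn).
embedding = TSNE(n_components=2, perplexity=60, n_iter=15_000).fit_transform(np.array(representations))
plt.scatter(embedding[:, 0], embedding[:, 1], c=desirability, s=3, cmap="viridis")
plt.colorbar(label="expected future desirability")
plt.show()
```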
There is a high degree of spatial autocorrelation and a relatively low amount of salt-and-pepper noise for each appraisal dimension, suggesting that states with similar appraisals are given similar hidden representations. States appear to be organized spatially into clusters, some tight in formation but most loosely structured. With the exception of improving potential, these clusters have uniform or gradated appraisal values. This suggests that the neural architecture is learning representations such that states with distinct appraisal patterns have relatively distinct embeddings.

Figure 5.2: Two-dimensional embedding of the neural state representation. Points are colored according to the expected future desirability predicted by the trained network.

The majority of the clusters in the t-SNE plots for maintaining and improving potential have coping potential values close to 0. As coping potential has a discount factor of 0, the expected future coping potential corresponds solely to the state immediately following the current one. The small number of clusters with high coping potential could indicate that these state transitions are seldom experienced and as a result do not require as much specialized representational capacity. Supporting this explanation is the observation that regions with high maintaining potential and high improving potential overlap heavily with high-desirability clusters and low-desirability clusters, respectively.

The conclusion drawn from this analysis is that the SWAGR neural architecture learns representations that encode information about the expected appraisal. The agent is able to associate states with appraisal patterns in a consistent and coherent manner, and states are clustered and dispersed based on the appraisal values expected in them. This aligns with the qualitative results shown in Chapter 4.3, which showed a rational and consistent assignment of appraisal values to states.

Figure 5.3: Two-dimensional embedding of the neural state representation, colored according to the predicted maintaining potential.

Figure 5.4: Two-dimensional embedding of the neural state representation, colored according to the predicted improving potential.

Chapter 6

Discussion

This thesis has presented SWAGR, a framework for deep reinforcement learning that automatically constructs appraisals associated with raw unstructured data such as pixel images. SWAGR is able to successfully learn both model-free and model-based appraisal variables. Preliminary experimentation shows that reinforcing appraisal variable intensities does not harm performance and in some scenarios can improve generalization in unseen environments. State representations learned by the SWAGR network are distinguished and encoded by their associated appraisals. There are a number of immediate future directions for research suggested by this work.
The results found in Chapter 5.2 indicate the need to investigate how modulating the goal vector during training affects the agent's appraisal associations and overall performance. The appraisal and reward weights in the goals utilized during testing were found in an ad hoc manner; a more systematic approach for finding these weights, such as the approach given in [4], would be ideal. It also remains to be seen how the framework performs in larger, more complex environments. Specifically, the Pac-Man environments presented in this work had dense and frequent rewards, whereas many of the challenging scenarios in deep reinforcement learning involve sparse and delayed rewards. Along these lines, it remains to be seen how utilizing appraisal as an intrinsic reward during training, as in [3-5], would impact the SWAGR agent's performance. Broader future research directions fall along three axes: (1) reconciling SWAGR with the modern reinforcement learning methods that achieve state-of-the-art performance, (2) refining and expanding the set of appraisal variables, and (3) adopting additional concepts from appraisal theory.

Much of today's interest in deep reinforcement learning has been concerned with achieving extremely high or even super-human performance at tasks, especially games [14, 57]. Approaches such as Asynchronous Advantage Actor-Critic [58] significantly outperform standard DQNs, and recent literature has suggested methods for increasing sample efficiency through environment models [59]. It would be interesting to imbue some of these state-of-the-art frameworks with SWAGR's constructive appraisal and see how the resulting agents perform against their non-appraisal counterparts.

The set of appraisal variables used in this work was quite restricted. Computational models such as EMA [20] and Thespian [24] capture a much richer set of appraisals. The formulations given in this thesis for the appraisals of surprise, desirability, relevance, and coping potential are not strongly grounded in the psychological, economic, or neuroscience literature. Future work should seek to center appraisal formulations on known models of human behavior and to include appraisal variables commonly found in the literature, such as novelty and causal attribution.

One glaring omission from the SWAGR framework is that the appraisals do not actually give rise to emotions in the agent. Appraisal frames could be used to map particular appraisal patterns to emotions, as in EMA [20]. Another glaring omission is the lack of emotion-focused coping strategies in the framework; for example, the goal weights could shift over time depending on the emotion experienced by the agent. Future work on SWAGR should explore integrating concepts from appraisal theory and the emotion literature, such as mood, meta-learning through emotion-based neuromodulation [47, 48], action tendencies [1], and counterfactual emotions such as envy and greed [60].

Appendix A

SWAGR Neural Architecture

The components that make up the SWAGR neural architecture are described here. These diagrams are best viewed electronically. (Note: the first dimension of all layers is the batch size; during evaluation this is set to 1. See Chapters 4.1 and 5.1 for the batch sizes used during training.)

Figure A.1: An overview of the modules that make up the SWAGR network architecture. The network receives the state, appraisal, goal, and reward as its input and outputs the anticipated reward and appraisal for each action, for a total of 40 value predictions per input. This architecture is inspired by the network proposed by Dosovitskiy and Koltun [33].
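To make the input/output interface of Figure A.1 concrete, the PyTorch skeleton below sketches a network with the same signature. The layer sizes, module names, and the additive combination of the two streams are illustrative assumptions loosely modeled on [33], not the actual SWAGR implementation; only the input and output shapes follow the description above.

```python
import torch
import torch.nn as nn

N_ACTIONS, N_REWARD, N_APPRAISAL = 4, 5, 5      # Pac-Man actions; reward and appraisal components
N_TARGETS = N_REWARD + N_APPRAISAL              # 10 predicted values per action

class SwagrSketch(nn.Module):
    """Illustrative skeleton with SWAGR's interface: state, prior appraisal, goal and
    reward in; action-conditional reward and appraisal predictions out (4 x 10 = 40)."""

    def __init__(self, state_dim: int = 512, enc_dim: int = 128):
        super().__init__()
        # Stand-in for the convolutional state encoder (Figure A.2); layer sizes are arbitrary.
        self.state_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 5)), nn.Flatten(),
            nn.Linear(64 * 4 * 5, state_dim), nn.ReLU())
        # Stand-ins for the appraisal, goal and reward encoders (Figures A.4-A.6).
        self.appraisal_enc = nn.Sequential(nn.Linear(N_APPRAISAL, enc_dim), nn.ReLU())
        self.goal_enc = nn.Sequential(nn.Linear(N_TARGETS, enc_dim), nn.ReLU())
        self.reward_enc = nn.Sequential(nn.Linear(N_REWARD, enc_dim), nn.ReLU())
        joint = state_dim + 3 * enc_dim
        # Two streams as in Figure A.7: an expectation stream shared across actions and an
        # action-conditional stream; combining them additively is an assumption from [33].
        self.expectation = nn.Sequential(nn.Linear(joint, 256), nn.ReLU(),
                                         nn.Linear(256, N_TARGETS))
        self.action_stream = nn.Sequential(nn.Linear(joint, 256), nn.ReLU(),
                                           nn.Linear(256, N_ACTIONS * N_TARGETS))

    def forward(self, frame, appraisal, goal, reward):
        z = torch.cat([self.state_enc(frame), self.appraisal_enc(appraisal),
                       self.goal_enc(goal), self.reward_enc(reward)], dim=-1)
        expected = self.expectation(z).unsqueeze(1)                        # [B, 1, 10]
        per_action = self.action_stream(z).view(-1, N_ACTIONS, N_TARGETS)  # [B, 4, 10]
        return expected + per_action                                       # 40 values per input

net = SwagrSketch()
out = net(torch.rand(2, 3, 120, 150), torch.rand(2, N_APPRAISAL),
          torch.rand(2, N_TARGETS), torch.rand(2, N_REWARD))
print(out.shape)  # torch.Size([2, 4, 10])
```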
The network is trained end-to-end using stochastic gradient descent [53] with the L1 loss function. The L2 regularization hyperparameter was set to 1e-4. The initial learning rate is 1.0 and is set to 0.1 after 1,500 training iterations. Gradients are clamped to the range [-0.5, 0.5], similar to [14]. The neural network was implemented and trained using PyTorch [62] with CUDA [63].

Figure A.2: The state encoder module takes the [3 x 120 x 150] pixel data as input and produces the state encoding using a convolutional structure. The image is split into [15 x 15] square regions, each of which is flattened into a pixel with 3·15² channels. This [675 x 8 x 10] representation is then sent through size-preserving convolutional layers. See [35] for a detailed description of the ConvStack component.

Figure A.3: The state-to-linear conversion module. This layer produces the neural state representation used to generate the t-SNE visualizations in Chapter 5.3.

Figure A.4: The appraisal encoder module. Makes use of residual feedforward networks [61].

Figure A.5: The goal encoder module. Makes use of residual feedforward networks [61].

Figure A.6: The reward encoder module.

Figure A.7: The final module of the architecture, the encodings-to-predictions module. It takes as input the concatenation of all input encodings, which is processed through two streams. The first stream captures the expected reward and appraisal values for the state-appraisal-goal-reward quadruplet. The second stream captures the action-conditional reward and appraisal values. This is based on the work in [33].

Appendix B

Environment Model Network and Training Details

The environment model was trained over N = 150 games using the M = 10,000 most recent transitions. The minibatch size was set to k = 16. The model was optimized using stochastic gradient descent with the root-mean-square-error loss function. The learning rate was initialized to 1.0 and set to 0.1 after 5,000 training steps. The L2 regularization hyperparameter was set to 1e-7.

An overview of the environment model neural architecture is given in Figure B.1. The network takes as input the current frame and action (broadcast to [1 x 4 x 120 x 150]). The image is split into [15 x 15] square regions, each of which is flattened into a pixel with 3·15² channels (a sketch of this region-flattening step follows this appendix). The resulting tensor is then sent through a series of residual layers. The representation from the residual layers is then passed to a subnetwork that uses a residual linear structure to ultimately predict the reward. The other branch processes this representation with four parallel size-preserving convolutional layers before reshaping the tensor to its original dimensions. The data then passes through a series of residual filters until the predicted next frame is produced. The predicted image frame values are clamped to the range [0, 255] to produce human-readable images. See [35] for a detailed description of the ResConv structure. The neural network was implemented and trained using PyTorch [62] with CUDA [63].

Figure B.1: An overview of the environment model architecture, inspired by the Sokoban environment model presented in [34].

Figure B.2: The states predicted by the environment model for moving east or west. (a) Original state. (b) Move east. (c) Move west.
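As referenced in Appendix B above, the following is a minimal sketch of the [15 x 15] region-flattening (space-to-depth) step used by both the state encoder (Figure A.2) and the environment model. The exact ordering of pixels within the channel dimension is an assumption; the text only specifies the resulting shape.

```python
import torch

def split_into_regions(frames: torch.Tensor, region: int = 15) -> torch.Tensor:
    """Flatten each [region x region] patch of an image batch into channels
    (space-to-depth): [B, C, H, W] -> [B, C * region**2, H / region, W / region]."""
    b, c, h, w = frames.shape
    assert h % region == 0 and w % region == 0, "frame must tile evenly into regions"
    x = frames.reshape(b, c, h // region, region, w // region, region)
    x = x.permute(0, 1, 3, 5, 2, 4)          # [B, C, region, region, H/region, W/region]
    return x.reshape(b, c * region * region, h // region, w // region)

# A [3 x 120 x 150] Pac-Man frame becomes the [675 x 8 x 10] representation of Figure A.2.
frame = torch.rand(1, 3, 120, 150)
print(split_into_regions(frame).shape)  # torch.Size([1, 675, 8, 10])
```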
Appendix C

Additional t-SNE Visualizations

Figure C.1: Two-dimensional embedding of the neural state representation, colored according to the expected future relevance predicted by the network.

Figure C.2: Two-dimensional embedding of the neural state representation, colored according to the expected future surprise predicted by the network.

Bibliography

[1] Richard Lazarus. Emotion and Adaptation. 1991.

[2] Craig A. Smith, Richard S. Lazarus, et al. Emotion and adaptation.

[3] Daniel McDuff and Ashish Kapoor. Visceral machines: Reinforcement learning with intrinsic physiological rewards. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyNvti09KQ.

[4] Pedro Sequeira, Francisco S. Melo, and Ana Paiva. Learning by appraising: An emotion-based approach to intrinsic reward design. Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems, 22(5):330-349, October 2014. ISSN 1059-7123. doi: 10.1177/1059712314543837. URL http://dx.doi.org/10.1177/1059712314543837.

[5] Pedro Sequeira, Francisco Melo, and Ana Paiva. Emotion-based intrinsic motivation for reinforcement learning agents. Pages 326-336, 10 2011. doi: 10.1007/978-3-642-24600-5_36.

[6] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70-82, June 2010. ISSN 1943-0604. doi: 10.1109/TAMD.2010.2051031.

[7] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. CoRR, abs/1705.05363, 2017. URL http://arxiv.org/abs/1705.05363.

[8] Leslie Kirby and Craig Smith. Consequences Require Antecedents: Toward a Process Model of Emotion Elicitation, pages 83-106. 01 2000.

[9] Klaus R. Scherer. The component process model: A blueprint for a comprehensive computational model of emotion. In Klaus R. Scherer, Tanja Bänziger, and Etienne Roesch, editors, A Blueprint for Affective Computing: A Sourcebook and Manual. Oxford University Press, 2010.

[10] Agnes Moors. Automatic constructive appraisal as a candidate cause of emotion. Emotion Review, 2:139-156, 03 2010. doi: 10.1177/1754073909351755.

[11] Gerald Clore and Andrew Ortony. Cognition in emotion: Always, sometimes, or never?, pages 24-61. 01 2000.

[12] Howard Leventhal and Klaus Scherer. The relationship of emotion to cognition: A functional approach to a semantic controversy. Cognition and Emotion, 1(1):3-28, 1987. doi: 10.1080/02699938708408361. URL https://doi.org/10.1080/02699938708408361.

[13] Eliot R. Smith and Roland Neumann. Emotion processes considered from the perspective of dual-process models. In Lisa Feldman Barrett, Paula M. Niedenthal, and Piotr Winkielman, editors, Emotion and Consciousness, pages 287-311. Guilford Press, 2005.

[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529-533, 02 2015. doi: 10.1038/nature14236.

[15] Klaus Scherer, A. Schorr, and Tom Johnstone.
Appraisal Processes in Emotion: Theory, Methods, Research. 01 2001.

[16] Klaus Scherer. Emotions are emergent processes: They require a dynamic computational architecture. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364:3459-3474, 12 2009. doi: 10.1098/rstb.2009.0141.

[17] Stacy C. Marsella and Jonathan Gratch. EMA: A process model of appraisal dynamics. Cognitive Systems Research, 10(1):70-90, 2009. ISSN 1389-0417. doi: https://doi.org/10.1016/j.cogsys.2008.03.005. URL http://www.sciencedirect.com/science/article/pii/S1389041708000314. Modeling the Cognitive Antecedents and Consequences of Emotion.

[18] Ulrike Hahn and Nick Chater. Similarity and rules: distinct? exhaustive? empirically distinguishable? Cognition, 65(2):197-230, 1998. ISSN 0010-0277. doi: https://doi.org/10.1016/S0010-0277(97)00044-9. URL http://www.sciencedirect.com/science/article/pii/S0010027797000449.

[19] Stacy Marsella, Jonathan Gratch, and P. Petta. Computational models of emotion. A Blueprint for Affective Computing - A Sourcebook and Manual, pages 21-46, 01 2010.

[20] Jonathan Gratch and Stacy Marsella. A domain-independent framework for modeling emotion. 2004.

[21] Allen Newell. Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA, 1990. ISBN 0-674-92099-6.

[22] Jonathan Gratch, Stacy Marsella, Ning Wang, and Brooke Stankovic. Assessing the validity of appraisal-based models of emotion. Pages 1-8, 10 2009. doi: 10.1109/ACII.2009.5349443.

[23] Mei Si, Stacy Marsella, and David Pynadath. Modeling appraisal in theory of mind reasoning. 2010.

[24] Mei Si, Stacy Marsella, and David V. Pynadath. Thespian: An architecture for interactive pedagogical drama. In AIED, 2005.

[25] Andrew Barto, Marco Mirolli, and Gianluca Baldassarre. Novelty or surprise? Frontiers in Psychology, 4:907, 12 2013. doi: 10.3389/fpsyg.2013.00907.

[26] David V. Pynadath, Mei Si, and Stacy C. Marsella. Modeling Theory of Mind and Cognitive Appraisal with Decision-Theoretic Agents. In Appraisal, pages 1-30. April 2011. URL http://ict.usc.edu/pubs/Modeling%20Theory%20of%20Mind%20and%20Cognitive%20Appraisal%20with%20Decision-Theoretic%20Agents.pdf.

[27] David V. Pynadath. PsychSim: Agent-based modeling of social interactions and influence. Pages 243-248, 01 2004.

[28] Magy Seif El-Nasr, John Yen, and Thomas R. Ioerger. FLAME - Fuzzy Logic Adaptive Model of Emotions. 2000.

[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

[30] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Pittsburgh, PA, USA, 1992. UMI Order No. GAX93-22750.

[31] Joost Broekens, Elmer Jacobs, and Catholijn Jonker. A reinforcement learning model of joy, distress, hope and fear. Connection Science, 27:1-19, 04 2015. doi: 10.1080/09540091.2015.1031081.

[32] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. CoRR, abs/1703.01310, 2017. URL http://arxiv.org/abs/1703.01310.

[33] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. CoRR, abs/1611.01779, 2016. URL http://arxiv.org/abs/1611.01779.

[34] Théophane Weber, Sébastien Racanière, David P.
Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. CoRR, abs/1707.06203, 2017. URL http://arxiv.org/abs/1707.06203.

[35] Lars Buesing, Théophane Weber, Sébastien Racanière, S. M. Ali Eslami, Danilo Jimenez Rezende, David P. Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, and Daan Wierstra. Learning and querying fast generative models for reinforcement learning. CoRR, abs/1802.03006, 2018. URL http://arxiv.org/abs/1802.03006.

[36] Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. Emotion in reinforcement learning agents and robots: A survey. CoRR, abs/1705.05172, 2017. URL http://arxiv.org/abs/1705.05172.

[37] Álvaro Castro-González, Maria Malfaz, and Miguel Salichs. An autonomous social robot in fear. IEEE Transactions on Autonomous Mental Development, 5:135-151, 06 2013. doi: 10.1109/TAMD.2012.2234120.

[38] Thomas Moerland, Joost Broekens, and Catholijn Jonker. Fear and hope emerge from anticipation in model-based reinforcement learning. 2016.

[39] Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, 1984. AAI8410337.

[40] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282-293, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-45375-X, 978-3-540-45375-8. doi: 10.1007/11871842_29. URL http://dx.doi.org/10.1007/11871842_29.

[41] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, March 2012. ISSN 1943-068X. doi: 10.1109/TCIAIG.2012.2186810.

[42] Gianluca Baldassarre and Marco Mirolli. Intrinsically Motivated Learning Systems: An Overview, pages 1-14. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. ISBN 978-3-642-32375-1. doi: 10.1007/978-3-642-32375-1_1. URL https://doi.org/10.1007/978-3-642-32375-1_1.

[43] Andrew G. Barto. Intrinsically motivated learning of hierarchical collections of skills. Pages 112-119, 2004.

[44] Robert Marinier III and John Laird. Emotion-driven reinforcement learning. 01 2008.

[45] Shital Shah, Ashish Kapoor, Debadeepta Dey, and Chris Lovett. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. Field and Service Robotics, pages 621-635, November 2017. URL https://www.microsoft.com/en-us/research/publication/airsim-high-fidelity-visual-physical-simulation-autonomous-vehicles/.

[46] James J. Gross. Emotion regulation: Affective, cognitive, and social consequences. Psychophysiology, 39:281-291, 06 2002. doi: 10.1017/S0048577201393198.

[47] Kenji Doya. Metalearning and neuromodulation. Neural Networks, 15(4):495-506, 2002. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(02)00044-8. URL http://www.sciencedirect.com/science/article/pii/S0893608002000448.

[48] Xuefei Shi, Zhiliang Wang, and Qiong Zhang. Artificial emotion model based on neuromodulators and Q-learning. In Wei Deng, editor, Future Control and Automation, pages 293-299, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-31006-5.

[49] C. J. Watkins. Learning from delayed rewards. 1989.
[50] Craig Smith, Kelly Haynes, Richard Lazarus, and Lois Pope. In search of the "hot" cognitions: Attributions, appraisals, and their relation to emotion. 1993.

[51] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-conditional video prediction using deep networks in Atari games. CoRR, abs/1507.08750, 2015. URL http://arxiv.org/abs/1507.08750.

[52] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015. URL http://arxiv.org/abs/1511.05952.

[53] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23(3):462-466, 09 1952. doi: 10.1214/aoms/1177729392. URL https://doi.org/10.1214/aoms/1177729392.

[54] Tycho van der Ouderaa. Deep reinforcement learning in Pac-Man. 2016.

[55] Berkeley Pac-Man project. URL http://ai.berkeley.edu/project_overview.html.

[56] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.

[57] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017. URL http://arxiv.org/abs/1712.01815.

[58] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783.

[59] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-based reinforcement learning for Atari. CoRR, abs/1903.00374, 2019. URL http://arxiv.org/abs/1903.00374.

[60] Celso M. de Melo and Jonathan Gratch. People show envy, not guilt, when making decisions with machines. Pages 315-321, 09 2015. doi: 10.1109/ACII.2015.7344589.

[61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

[62] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[63] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, March 2008. ISSN 1542-7730. doi: 10.1145/1365490.1365500. URL http://doi.acm.org/10.1145/1365490.1365500.