IDENTIFYING AND LEVERAGING STRUCTURE IN COMPLEX COOPERATIVE TASKS FOR MULTI-AGENT REINFORCEMENT LEARNING

by Shariq Nadeem Iqbal

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2022

Copyright 2022 Shariq Nadeem Iqbal

Dedication

To my parents, Sabeen and Nadeem Iqbal.

Acknowledgements

I am very lucky to have a number of people to be thankful to, not only those that have supported me over the past five years, but also those who helped me get here to begin with. First, I would like to thank Fei Sha for taking a chance on me and providing me with the opportunity to learn and grow in his lab. Fei saw something in me that I am not sure I even saw myself, and I will always be grateful to him for bringing me in and allowing me to explore my interests. He created a positive lab culture full of supportive peers and thoughtful discussions. His emphasis on clear and focused scientific communication has been foundational to my understanding of what it means to do research. I also want to thank Shimon Whiteson for giving me an opportunity to visit his lab. Though my time at Oxford was cut short, it was one of the most enriching experiences of my PhD. Next, I want to thank Maja Matarić for adopting us into her lab with open arms and making us feel welcome. Her support consistently went above and beyond what we could have possibly expected and helped me tremendously in navigating this last year. Finally, I want to thank my first ever scientific advisor, John Pearson, without whom I may have never discovered my passion for research. His enthusiasm was infectious, and his ability to explain complex topics to a complete novice in both machine learning and neuroscience was unparalleled.

I am very fortunate to have experienced two excellent internships during my PhD. I would like to thank Stan Birchfield for his mentorship during my internship at NVIDIA. He consistently provided a fresh perspective and helped me scope out a successful project in a field that I was new to. I would also like to thank my team members at NVIDIA who provided endless help and troubleshooting while I attempted to learn something about robotics: Jia Cheng, Thang To, and Jonathan Tremblay. I also want to thank Cosmin Paduraru for his mentorship during my internship at DeepMind. He tirelessly provided every resource at his disposal to help ensure my time there was successful, and he encouraged me to seek help and leverage the expertise of those around me. I also want to thank my team members at DeepMind, who made up one of the most supportive and intellectually stimulating teams I have ever been a part of: Daniel Mankowitz, Andrea Michi, and Anton Zhernov.

I want to thank my dissertation committee members, Haipeng Luo and Ketan Savla, as well as those who served on my qualification exam and thesis proposal committees: Aram Galstyan, Sven Koenig, and Gaurav Sukhatme. A huge thank you goes out to Lizsl De Leon for her tireless effort in providing administrative support and guidance and to Nina Shilling for keeping our lab organized and running smoothly.

I have been lucky to work with and alongside an excellent group of labmates and colleagues. Aaron Chan is my basketball buddy and always provided a fresh perspective on everything from research to life philosophies. Bowen Zhang is always open to spontaneous adventures in and around LA, and my PhD would have been far more boring without him.
I'll miss our weekly game nights and random dinners on Sawtelle. Playing soccer with Séb Arnold was one of the highlights of my week every week, second only to the conversations (and dessert) before and after. I came away from every conversation with Michiel de Jong feeling like I gained a new perspective or deeper level of understanding of a research topic (or at least a great book recommendation). Working with Robby Costales re-ignited my passion for research; he is a model co-author (and Rocket League teammate). I would also like to thank Melissa Ailem, Wendelin Böhmer, Soravit (Beer) Changpinyo, Wei-Lun (Harry) Chao, Liyu Chen, Chao-Kai Chiang, Nathan Dennler, Tom Groechel, Jeremy Hsu, Hexiang (Frank) Hu, Mina Kian, Lauren Klein, Zhiyun Lu, Amy O'Connell, Bei Peng, Christian Schroeder de Witt, Zhonghao Shi, Ivy Xiao, Yiming Yan, Daniel Yang, Yury Zemlyanskiy, Ke Zhang, and Bill Zhu for contributing to a welcoming, joyful, and intellectually stimulating research atmosphere.

I have been lucky to have a great support system that has kept me going through the past five years. My cousin Niza and her husband Naveed provided the family away from home that I never knew I needed. My friends Joseph Choi, Connor Gordon, Kevan Hoffman, Kenny Kraynik, Hillary Lee, Sam Lin, Scott Nobbs, Jina Yun, and many others all provided endless laughter and adventures, and I hope it will continue for years to come. Finally, I would like to thank my family: my parents Sabeen and Nadeem, my sister Mishal, and my brother Faiz, for being unwavering sources of love, support, and levity. I wouldn't be here without you.

Table of Contents

Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xi

I Background 1

Chapter 1: Introduction 2
1.1 Imbuing Agents and Models with an Understanding of Structure 3
1.2 Types of Structure in Complex Multi-Agent Tasks 7
1.3 Thesis Organization and Contributions 8
1.4 Relationship to Published Work 9

Chapter 2: Technical Background 11
2.1 Markov Decision Processes and Variants 11
2.2 Reinforcement Learning Methods 14
2.3 Multi-Agent Reinforcement Learning 18

II Interaction Structure 22

Chapter 3: Actor-Attention-Critic for Multi-Agent Reinforcement Learning 23
3.1 Introduction 23
3.2 Related Work 25
3.3 Methods 27
3.4 Experiments 32
3.5 Conclusion 39

Chapter 4: Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning 41
4.1 Introduction 42
4.2 Related Work 43
4.3 Methods 44
4.4 Experiments 50
4.5 Conclusion 56

Chapter 5: Randomized Entity-Wise Factorization for Multi-Agent Reinforcement Learning 57
5.1 Introduction 58
5.2 Methods 60
5.3 Experiments 68
5.4 Related Work 74
5.5 Conclusion 75

III Task Structure 77

Chapter 6: Possibility Before Utility: Learning and Using Hierarchical Affordances 78
6.1 Introduction 79
6.2 Related Work 81
6.3 Methods 83
6.4 Experiments 89
6.5 Conclusion 95

Chapter 7: Learned End-To-End Task Allocation in Multi-Agent Reinforcement Learning 97
7.1 Introduction 98
7.2 Related Work 100
7.3 Methods 102
7.4 Experiments 108
7.5 Conclusion 117

IV Conclusion 119

Chapter 8: Conclusion 120
8.1 Future Work 120

Bibliography 123

List of Tables

3.1 MAAC - Baseline comparison 33
3.2 MAAC - Cooperative navigation results 36
3.3 MAAC - Cooperative treasure collection scalability 37
4.1 MA-Exp - Full results 53
5.1 REFIL - Baseline comparison 71
6.1 HAL - Baseline comparison 92

List of Figures

3.1 MAAC - Model architecture 29
3.2 MAAC - Environments 32
3.3 MAAC - Results 35
3.4 MAAC - Rover-Tower scalability 38
3.5 MAAC - Attention visualization 39
4.1 MA-Exp - Intrinsic reward types 46
4.2 MA-Exp - Model architecture 48
4.3 MA-Exp - Environments 50
4.4 MA-Exp - Results 54
5.1 REFIL - Intro diagram 59
5.2 REFIL - Model architecture 64
5.3 REFIL - Group matching results 66
5.4 REFIL - StarCraft results 70
5.5 REFIL - Environment visualization and generalization results 72
6.1 HAL - Task hierarchy 81
6.2 HAL - Model architecture 86
6.3 HAL - Environments 90
6.4 HAL - Results 91
6.5 HAL - Robustness results 94
6.6 HAL - Task-agnostic results 95
7.1 ALMA - Intro diagram 98
7.2 ALMA - Model architecture 103
7.3 ALMA - VRP results 109
7.4 ALMA - SaveTheCity results 111
7.5 ALMA - StarCraft results 111
7.6 ALMA - StarCraft episode walkthrough 114
7.7 ALMA - StarCraft additional results 116

Abstract

Modern reinforcement learning methods often focus on simple atomic tasks achieved by single agents; however, real-world tasks often consist of complex compositions of interrelated subtasks requiring the cooperation of teams composed of varying sets of agents. In order to successfully learn such tasks, we posit that artificial agents must be imbued with an understanding of both task structure and the structure of their interactions with other agents. This dissertation contributes five methods that incorporate an understanding of structure into multi-agent reinforcement learning through modeling invariance/equivariance, modularity, and discovering structure in a data-driven manner from part-based representations of the task/state. These methods are divided into two main categories: those that address structure in the interactions between agents and entities in their environment and those that address structure in the representation of tasks through subtask decompositions. These contributions improve upon modern deep multi-agent reinforcement learning methods on settings of increasing complexity in terms of scale (the number of agents present), variety (handling multiple related tasks), and hierarchy (tasks consisting of complex compositions of sub-tasks).
Part I: Background

Chapter 1: Introduction

Advances in deep learning have enabled significant improvements in the scalability and performance of models in computer vision [83, 59, 37], natural language processing [142, 152, 20], and reinforcement learning [105, 133, 17, 156]; however, these approaches often focus on a single task achieved by a single specialized agent. Real-world problems, on the other hand, often require (or can be massively accelerated by) the cooperation of several specialized agents on a complex composition of sub-tasks. For example, consider the task of constructing buildings: the construction process involves teams of construction workers independently assembling sub-components which are then combined and integrated into the final building. Construction workers possess the capability to efficiently shift their resources such that they can adapt to new settings. One month a group of workers may build an office building, and the next a movie theater. This efficient transfer of skills to new settings is made possible by the specialization and hierarchy present in the group. The construction workers are not a monolithic group, but rather a set of distinct individuals with specialized skills organized in a fashion which enables their flexibility. Some workers may specialize in operating bulldozers while others operate cranes. These specializations have developed over time as humanity's understanding of the structure of construction problems has evolved. Over the course of history, construction workers have identified the common patterns required in their work and organized themselves such that they can not only accomplish their tasks more efficiently, but also generalize to whatever new construction tasks they may encounter.

In order to build artificial intelligence that performs complex tasks effectively, we must imbue agents with a similar understanding of structure. "Structure" is defined as "the arrangement of and relations between the parts or elements of something complex" [139]. As humans, understanding the structure of a class of problems allows us to note similarities across instances within these classes and transfer our knowledge effectively. Tenenbaum et al. [149] argue that an agent's understanding of structure and capacity for forming abstractions are crucial to its ability to form generalizable knowledge. Multi-agent settings, in particular, introduce several opportunities for exploiting structure, as the presence of multiple agents both introduces a type of structure (the relationships between the agents that form the team) and provides a natural approach to handling structured tasks (i.e., tasks with independent sub-tasks): assigning sub-teams of agents to separate sub-tasks in a way that leverages specializations. Many settings require multi-agent solutions in order to be tractable. Considering the construction example again: a group of construction workers is more likely to be effective when its members specialize in the operation of specific equipment, rather than all workers attempting to learn how to perform every single sub-task involved in the construction of a building; however, this specialization is only made possible by an understanding of the structure of construction tasks and the manner in which they can be decomposed effectively.

1.1 Imbuing Agents and Models with an Understanding of Structure

A large portion of the machine learning literature has focused on developing a variety of techniques for discovering or integrating an understanding of the structure of data and tasks.
This process can be broken into three main steps:

1. Identify the salient parts which comprise the data or task at hand.
2. Specify or learn the relationships between these parts.
3. Integrate an understanding of this structure into our model.

In this dissertation we will focus on methods for accomplishing one or both of the latter two points, assuming that some sort of part-based representation has been provided. Here we will specifically highlight two classes of techniques that integrate knowledge of structure into neural network models: those that incorporate some sort of invariance/equivariance and those that structure models in a modular fashion that allows specialization.

Invariance and Equivariance

Convolutional neural networks (CNNs) [87, 83] learn filters which are applied to images in a manner that is robust to the location of patterns within the image. This design comes from an understanding of how image data are structured; two images of the same object are comprised of similar patterns, but the location of those patterns within the image may vary. This property of CNNs is commonly referred to as "translational equivariance." By modeling with convolutional filters, computer vision models can generalize more efficiently from less data. Take, for example, the task of object recognition, where the goal for the model is to output a prediction of whether a specific class of object exists within a given image. A standard multi-layer perceptron (MLP) model would treat the image as a vector of pixel values and may produce drastically different outputs depending on the location of the object in the image. A CNN, on the other hand, would produce a spatial map of features where translating the location of the object in the input image would simply translate the feature map without changing the values contained within. This natural inductive bias provides an advantage in learning efficiency when modelling image data.

More generally, equivariance is a property of functions defined by their ability to produce outputs transformed in the same way as any transformation of their inputs. In the case of CNNs this transformation is translation, or shift. Formally stated, a function f is equivariant to transformation T when the following statement is true:

f(Tx) = Tf(x) \quad \forall x \in \mathcal{X} \qquad (1.1)

where \mathcal{X} is the domain of the function. A related and equally useful concept is that of function invariance, where the output of a function will be equivalent for any transformation of its inputs:

f(Tx) = f(x) \quad \forall x \in \mathcal{X} \qquad (1.2)

Many more examples of various forms of invariance/equivariance exist in the literature. Most relevant to this dissertation, attention-based models [152] possess permutation equivariance, as described by Lee et al. [88]. Symmetries and invariances are at the heart of "geometric" deep learning, which Bronstein et al. [19] provide a thorough treatment of. They argue that "geometric priors," or methods by which we incorporate invariances into deep learning architectures, can help overcome the "curse of dimensionality" when dealing with high-dimensional data with some sort of physical structure.
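As a concrete illustration of Equations 1.1 and 1.2, consider the case where the transformation T is a permutation of a set of entities, which is the form of equivariance/invariance most relevant to the entity-based models discussed later in this dissertation. The following sketch (not from the dissertation itself; the entity and feature sizes are arbitrary) checks both properties numerically for a shared per-entity linear layer (equivariant) and a mean-pooled encoding (invariant):

```python
import numpy as np

rng = np.random.default_rng(0)

# A set of 5 entities, each described by an 8-dimensional feature vector.
X = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 3))   # shared per-entity linear map
perm = rng.permutation(5)     # the transformation T: a permutation of the entities

def per_entity(X):
    """Apply the same linear map to every entity (an entity-wise layer)."""
    return X @ W

def pooled(X):
    """Collapse the set into a single vector by mean-pooling over entities."""
    return per_entity(X).mean(axis=0)

# Equivariance (Eq. 1.1): permuting the inputs permutes the outputs identically.
assert np.allclose(per_entity(X[perm]), per_entity(X)[perm])

# Invariance (Eq. 1.2): the pooled output ignores entity ordering entirely.
assert np.allclose(pooled(X[perm]), pooled(X))
```

The same kind of check applies to the attention layers introduced in Section 2.3, which are permutation equivariant over their input entities.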
Modularity/Specialization

We can also incorporate structure into agents and models through modularity and specialization. The basic idea behind modularity is to construct a model with several components that address various parts of a task. Solving complex tasks with multi-agent systems can itself be viewed as a useful form of modularity. It may be possible to train a single agent to execute any task typically treated as multi-agent by building an action space that is the cross-product of all agents' action spaces; however, doing so is often computationally intractable, unrealistic in real-world settings (separately embodied agents would require unwieldy communication), and sacrifices the modularity that a multi-agent approach would provide. In our construction example, we may have an agent that specializes in digging, while another focuses on moving heavy objects. In this case, these agents only need to focus on the aspects of the environment relevant to their specific sub-task. On the other hand, if we did not model these agents separately, a single joint agent would have to be able to generalize to every possible unique combination of digging and moving sub-tasks that may be encountered. By modeling the digging and moving sub-tasks separately, the independent agents can generalize combinatorially [4, 3].

Wu, Zhang, and Ré [166] provide a theoretical explanation for positive transfer across similar tasks. They claim that, under the condition that task A and task B are sufficiently similar, a single model can achieve greater performance on task B after training on task A than if it had been trained on task B from scratch. Conversely, they claim that when tasks are not similar, they can transfer negatively, hurting each other's performance. By this logic, we can conclude that knowledge of task structure may allow us to separate components that are sufficiently dissimilar and prevent negative interference across portions of a complex task.

Two forms of modularity are of particular relevance to this dissertation: hierarchical models and value function factoring. Hierarchical models, in the context of reinforcement learning, typically involve a higher-level controller responsible for determining goals for low-level controllers to then execute [34, 145]. Modularity exists in hierarchical agents in two main ways. The first form of modularity is the decomposition of behavior into high-level decision making and low-level control, and the second is the ability to learn separate low-level controllers for unique goals (in the setting where goals are discrete). Learning hierarchical models also enables agents to reason over multiple timescales at various levels of granularity [42]. Value function factoring is a method by which value functions (models which attempt to predict the value of a state and/or action in an MDP) are decomposed into pieces depending on independent components of the state and/or action space [81, 111].

1.2 Types of Structure in Complex Multi-Agent Tasks

In this dissertation we will classify the types of structure present in complex multi-agent tasks into two main categories: interaction structure and task structure.

Interaction

Interaction structure can be thought of as information characterizing the relationships between agents and other entities in the environment. Understanding and leveraging this structure can allow artificial agents to generalize their behavior to various teams of agents. While most work on multi-agent learning focuses on executing a task with a fixed set of agents, complex multi-agent tasks are often agnostic to the size or composition of the team of agents. In fact, when considering real-world settings, multi-agent tasks are almost never accomplished by a fixed team. When a construction worker calls in sick, the team is often able to adapt and continue performing at nearly the same rate.
Teams of artificial agents acting in the real world require the same robustness (e.g., a robot may break). Understanding the interaction structure present among agents and entities may also enable scalability in multi-agent learning, in cases where interactions are sparse. In the most general case, all agents must consider the state and actions of all other agents in the environment; however, an understanding of the sparse interaction structure may allow agents to focus on a smaller subset of their observations and, as a result, scale to larger settings.

Task

Task structure refers to the manner in which a task can be broken into sub-tasks and how those sub-tasks are related. Large, complex tasks are often composed of varied sub-components requiring diverse skills. As such, training large monolithic policies to execute these tasks can be highly sample inefficient. Instead, we can train smaller, more focused policies on sub-tasks and then compose these sub-tasks' solutions into a solution for the global task. In this dissertation, we focus on works that leverage the task representation provided by a given sub-task decomposition. Once a decomposition has been established, these sub-tasks may possess several forms of structure. For example, some sub-tasks may rely on others having been accomplished prior to their completion, may require execution in parallel, or may require specific skills that only a subset of agents possess.

Multi-agent tasks are particularly amenable to methods which can exploit the structure they exhibit, and several streams of modern machine learning research have proposed frameworks for doing so. In this dissertation, we introduce several contributions which exploit both task and interaction structure and demonstrate their utility through improved performance on several complex tasks.

1.3 Thesis Organization and Contributions

This dissertation is organized as follows. Part I provides background information relevant to our contributions. In Chapter 1 we provide the motivation for our contributions. Next, in Chapter 2 we introduce the concepts used throughout the dissertation, including Markov Decision Processes (MDPs), some multi-agent abstractions built on top of MDPs, value-based and policy gradient methods for reinforcement learning, as well as the multi-agent extensions of these methods (value factorization and asymmetric actor-critic). The chapters following this background discussion detail the specific contributions of this dissertation, separated into two main parts organized by the types of structure detailed in Section 1.2 of this chapter.

Part II details our works focusing on discovering and leveraging interaction structure among agents and entities in multi-agent tasks. In Chapter 3 we introduce Multi-Actor Attention-Critic (MAAC). This work introduces agent-wise permutation invariance and the discovery of agent interaction structure into the computation of expected returns via attention-based models in a multi-agent actor-critic setting. Next, in Chapter 4 we introduce a method for coordinating exploration in a multi-agent setting, such that agents take into account which regions of the state space other agents have already explored and select from various exploration modalities using a hierarchical policy. Finally, in Chapter 5 we propose an auxiliary loss for multi-agent reinforcement learning which encourages discovering and leveraging partition invariance in multi-agent factored value functions.
Discovering partition invariance encourages generalization to new compositions of familiar sub-groups of entities.

Part III details our works which attempt to leverage specific types of structure commonly found in complex tasks for more efficient learning. In Chapter 6 we propose a modular hierarchical architecture for learning complex sub-task dependency structures by incorporating the notion of affordances. Finally, in Chapter 7 we introduce a hierarchical multi-agent method for simultaneously learning to allocate agents to sub-tasks and to execute those sub-tasks via low-level actions.

1.4 Relationship to Published Work

Chapter 3: Shariq Iqbal and Fei Sha. "Actor-attention-critic for multi-agent reinforcement learning". In: International Conference on Machine Learning. PMLR. 2019, pp. 2961–2970.
Chapter 4: Shariq Iqbal and Fei Sha. "Coordinated exploration via intrinsic rewards for multi-agent reinforcement learning". In: arXiv preprint arXiv:1905.12127 (2019).
Chapter 5: Shariq Iqbal et al. "Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning". In: International Conference on Machine Learning. PMLR. 2021, pp. 4596–4606.
Chapter 6: Robby Costales, Shariq Iqbal, and Fei Sha. "Possibility Before Utility: Learning And Using Hierarchical Affordances". In: International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=7b4zxUnrO2N.
Chapter 7: This chapter corresponds to work currently under submission.

Other Works

The following works are outside the scope of this dissertation but were published during its preparation.

Shariq Iqbal et al. "Toward sim-to-real directional semantic grasping". In: 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2020, pp. 7247–7253.
Sébastien Arnold, Shariq Iqbal, and Fei Sha. "When MAML can adapt fast and how to assist when it cannot". In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 244–252.

Chapter 2: Technical Background

The contents of this dissertation present methods for discovering or leveraging existing structure for (primarily multi-agent) reinforcement learning. In this chapter we provide an overview of Markov Decision Processes (the modeling paradigm upon which reinforcement learning is based), and then detail common classes of reinforcement learning methods and their multi-agent extensions.

2.1 Markov Decision Processes and Variants

Markov Decision Processes (MDPs) are a form of stochastic process designed to model sequential decision-making problems. An MDP is defined by the state space S, action space U (also represented as A in single-agent settings), reward function R : S \times U \to \mathbb{R}, and state transition distribution P : S \times U \to \Delta(S). Agents act in MDPs by selecting actions based on the state and receiving the next state from the transition function. In the next section (2.2) we describe methods for learning policies (the "decision-making" function mapping from states to actions or a distribution over actions) from experience, i.e., without access to the transition probabilities or the reward function. For now, in the remainder of this section, we introduce several extensions of MDPs for modelling multi-agent systems.
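To make the MDP interface concrete, the sketch below (illustrative only; the transition probabilities, rewards, and policies are hypothetical and not taken from this dissertation) shows an agent interacting with a small tabular MDP and accumulating the discounted return that the learning methods in Section 2.2 aim to maximize, while observing only sampled transitions rather than P and R themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-state, two-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, u, s'] = probability of s' given (s, u)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                   # R[s, u] = reward for taking u in s
              [1.0, 5.0]])
gamma = 0.95

def rollout(policy, s=0, horizon=100):
    """Sample a trajectory and accumulate the discounted return G = sum_t gamma^t r_t."""
    G = 0.0
    for t in range(horizon):
        u = policy(s)
        G += (gamma ** t) * R[s, u]
        s = rng.choice(2, p=P[s, u])        # next state drawn from the transition distribution
        # The learner only ever observes (s, u, r, s'); P and R themselves stay hidden.
    return G

random_policy = lambda s: rng.integers(2)
greedy_policy = lambda s: 1                  # always pick action 1
print(rollout(random_policy), rollout(greedy_policy))
```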
Markov Games

Markov Games [94] are a generic extension of MDPs, capable of representing cooperative, competitive, and mixed tasks with heterogeneous agents. They are defined by a set of states, S; action sets for each of N agents, U_1, \ldots, U_N; a state transition function, P : S \times U_1 \times \ldots \times U_N \to \Delta(S), which defines the probability distribution over possible next states, given the current state and actions for each agent; and a reward function for each agent that also depends on the global state and actions of all agents, R_i : S \times U_1 \times \ldots \times U_N \to \mathbb{R}. Note that agents have unique action spaces and reward functions, which enable heterogeneity and competition respectively.

Decentralized POMDPs

Decentralized Partially Observable Markov Decision Processes [110] are a commonly used modelling paradigm for cooperative tasks with partial observability. A decentralized POMDP (Dec-POMDP) is defined by a tuple (S, U, T, O, O, R, A, \gamma). In this setting we have n = |A| total agents, where A is the set of agents. S is the set of global states in the environment, while O = \times_{a \in A} O_a is the set of joint observations and U = \times_{a \in A} U_a is the set of possible joint actions. A specific joint action at one time step is denoted as u = \{u_1, \ldots, u_n\} \in U and a joint observation as o = \{o_1, \ldots, o_n\} \in O. T is the state transition function, which defines the probability P(s' \mid s, u), and O is the observation function, which defines the probability P(o \mid u, s'). r(s, u) is the reward function that maps the global state and joint actions to a single scalar reward. Importantly, this reward is shared between all agents, so Dec-POMDPs always describe cooperative problems.

Augmenting with Entity Representations

In some cases, our global state may be comprised of a set of entity states, rather than discrete states that describe the totality of the environment. This representation can be useful for encoding or learning the relational structure between entities in the environment. It is also useful for modeling tasks in which the number of agents and entities may be variable (across or within episodes). Schroeder de Witt et al. [126] introduce Dec-POMDPs with entities by making the following extensions to the Dec-POMDP framework described above. E is the set of entities in the environment. Each entity e has a state representation s_e, and the global state is the set s = \{s_e \mid e \in E\} \in S. Some entities can be agents, a \in A \subseteq E. Non-agent entities are parts of the environment that are not controlled by policies (e.g., landmarks, obstacles, agents with fixed behavior). Not all entities may be visible to each agent, so we define a binary observability mask \mu(s_a, s_e) \in \{0, 1\}, where agents can always observe themselves: \mu(s_a, s_a) = 1, \forall a \in A. Thus, an agent's observation is defined as o_a = \{s_e \mid \mu(s_a, s_e) = 1, e \in E\} \in O. In order to model tasks where the number of entities (including agents) varies, we can allow them to become "inactive." Over the course of an episode entities may become inactive (e.g., a unit dying in StarCraft) and no longer affect transitions and rewards. Since s and u are sets, their ordering does not matter, and our modeling construct should account for this (e.g., by modeling with permutation invariant/equivariant attention models [88]).
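The following toy sketch (hypothetical; not code from the dissertation) illustrates the observability mask \mu and the resulting per-agent observation sets o_a for the entity-based formulation above, with entity states reduced to 2-D positions and visibility determined by a view radius:

```python
import numpy as np

rng = np.random.default_rng(1)

# Entities are points in 2-D; the mask mu(s_a, s_e) = 1 whenever entity e lies
# within an agent's view radius (or when e is the agent itself).
n_agents, n_entities, view_radius = 2, 6, 0.5
positions = rng.uniform(size=(n_entities, 2))   # entity state here is just a 2-D position
agent_ids = [0, 1]                              # the first two entities are the agents

def mu(a, e):
    """Binary observability mask; an agent always observes itself."""
    if a == e:
        return 1
    return int(np.linalg.norm(positions[a] - positions[e]) <= view_radius)

# Per-agent observation sets o_a = {s_e | mu(s_a, s_e) = 1} and the |A| x |E| mask matrix.
M = np.array([[mu(a, e) for e in range(n_entities)] for a in agent_ids])
observations = {a: positions[M[a].astype(bool)] for a in agent_ids}
print(M)
```

The same |A| x |E| mask reappears in Section 2.3, where it restricts which entities each agent's attention layer may integrate information from.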
Adding Subtask Structure

In sufficiently complex settings, the global task may be decomposed into sub-tasks. These sub-tasks can be thought of as pieces of the whole task that can be accomplished independently. For example, the task of firefighting across a whole city involves independently putting out fires at individual buildings. By providing a sub-task decomposition, we can design learning methods that are better able to leverage the task structure and generalize more effectively to new compositions of sub-tasks. We extend Dec-POMDPs by adding subtask-specific entities and reward functions. Given this information, we can treat the subtasks independently under certain assumptions (detailed in Section 7.3). Within the environment exists a set of subtasks I, where each is defined by a set of subtask-specific entities E_i and the subtask reward function r_i : S^{E_i} \times S^{A} \times U \to \mathbb{R}. The relevant state (i.e., set of entity states) for subtask i is then denoted as s^i = \{s_e \mid e \in E_i \cup A\}. We define E as the set of all entities (including agents) in the environment: E := \bigcup_{i \in I} E_i \cup A. The global state is the set s := \{s_e \mid e \in E\} \in S. In the simplest case, the global reward function, r, can be a sum of the individual subtask rewards; however, it can optionally encode global objectives (e.g., a scalar reward upon the completion of all subtasks).

2.2 Reinforcement Learning Methods

The goal of reinforcement learning is to learn a policy, \pi : S \to \Delta(A), which selects actions that maximize expected future returns:

G(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, u_t \sim \pi, s_t \sim P\right]

where \gamma is the discount factor. This policy must be learned from experience. In other words, the learning procedure has no access to the underlying dynamics (state transition probabilities and reward function) of the environment. Instead, the agent must take actions in the environment, observe its rewards and transition states, and then learn from these transitions. Reinforcement learning methods can be generally categorized into two types: policy gradient methods and value-based methods. Policy gradient methods attempt to directly learn a policy, while value-based methods attempt to learn a function which predicts expected returns given the current state and possible actions, and then derive a policy from this value function.

Policy Gradients

Policy gradient techniques [144, 165] use the gradient of an agent's expected returns with respect to the parameters of its policy in order to learn. This gradient cannot be computed exactly, so it must be estimated in the following way:

\nabla_\theta J(\pi_\theta) = \nabla_\theta \log(\pi_\theta(u_t \mid s_t)) \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, u_{t'}) \qquad (2.1)

Actor-Critic

The term \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, u_{t'}) in the policy gradient estimator leads to high variance, as these returns can vary drastically between episodes. Actor-critic methods [82] aim to ameliorate this issue by using a function approximation of the expected returns and replacing the original return term in the policy gradient estimator with this function. One specific instance of actor-critic methods learns a function to estimate expected discounted returns, given a state and action, Q_\psi(s_t, u_t) = \mathbb{E}[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, u_{t'})], learned through off-policy (data collected by a policy other than the one being updated, typically older versions of the policy whose data is stored in a replay buffer) temporal-difference learning by minimizing the regression loss:

L_Q(\psi) = \mathbb{E}_{(s,u,r,s') \sim D}\left[(Q_\psi(s, u) - y)^2\right], \quad \text{where } y = r(s, u) + \gamma\, \mathbb{E}_{u' \sim \pi(s')}\left[\bar{Q}_{\bar\psi}(s', u')\right] \qquad (2.2)

where \bar{Q} is the target Q-value function, whose parameters are simply an exponential moving average of the past Q-functions' parameters, and D is a replay buffer that stores past experiences.

To encourage exploration and avoid converging to non-optimal deterministic policies, recent approaches to maximum entropy reinforcement learning learn a soft value function by modifying the policy gradient to incorporate an entropy term [52]:

\nabla_\theta J(\pi_\theta) = \nabla_\theta \log(\pi_\theta(u \mid s))\left(-\alpha \log(\pi_\theta(u \mid s)) + Q_\psi(s, u) - b(s)\right) \qquad (2.3)

where b(s) is a state-dependent baseline (for the Q-value function) and \alpha is a temperature parameter. This is known as Soft Actor-Critic (SAC). Most commonly, the state-value function V(s) is used as this baseline.
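Below is a minimal PyTorch sketch of the entropy-regularized policy-gradient estimate in Equation 2.3 for a discrete action space (an illustration under assumed network sizes and temperature, not the dissertation's implementation); the critic is held fixed while the sampled-action advantage multiplies the policy's log-probability:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, alpha = 8, 4, 0.2   # illustrative sizes and temperature

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))

s = torch.randn(16, obs_dim)                      # a batch of states
dist = torch.distributions.Categorical(logits=policy(s))
u = dist.sample()                                 # u ~ pi(.|s)
log_pi = dist.log_prob(u)

with torch.no_grad():                             # critic treated as fixed for the policy update
    q = critic(s).gather(1, u.unsqueeze(1)).squeeze(1)       # Q(s, u)
    b = (dist.probs * critic(s)).sum(dim=1)                  # baseline b(s) = E_{u'}[Q(s, u')]
    advantage = -alpha * log_pi + q - b

policy_loss = -(log_pi * advantage).mean()        # its gradient matches Eq. 2.3 in expectation
policy_loss.backward()
```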
The loss function for temporal-difference learning of the value function is also revised accordingly, with a new target:

y = r(s, u) + \gamma\, \mathbb{E}_{u' \sim \pi(s')}\left[\bar{Q}_{\bar\psi}(s', u') - \alpha \log(\pi_{\bar\theta}(u' \mid s'))\right] \qquad (2.4)

Q-Learning

Q-learning is specifically concerned with learning an accurate action-value function Q^{tot} (defined below), and using this function to construct a policy by selecting the actions that maximize the Q-value. Unlike policy gradient methods, Q-learning is not explicitly tied to function approximation, as the Q-values can be stored in a matrix/table with entries for all unique state-action pairs. The optimal Q-function for an MDP is defined as:

Q^{tot}(s, u) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, u_t) \,\middle|\, s_0 = s,\, u_0 = u,\, s_{t+1} \sim P(\cdot \mid s_t, u_t),\, u_{t+1} = \arg\max Q^{tot}(s_{t+1}, \cdot)\right]
= r(s, u) + \gamma\, \mathbb{E}\left[\max Q^{tot}(s', \cdot) \,\middle|\, s' \sim P(\cdot \mid s, u)\right]

The final line defines a recursive relationship between the Q-values in a given state and its resulting states, known as the Bellman equation. This equation forms the core of Q-learning. Tabular Q-learning updates Q-values using the right-hand side of the Bellman equation, where the expectation is approximated by sampling from the MDP. For MDPs with massive state spaces, it is computationally intractable to store Q-values for every state-action pair. As such, we must utilize function approximation to estimate Q-values. Partial observability is typically handled by using the history of actions and observations as a proxy for state, often processed by a recurrent neural network [56]: Q^{tot}(\tau_t, u_t) \approx Q^{tot}(s_t, u_t), where the trajectory is \tau^a_t := (o^a_0, u^a_0, \ldots, o^a_t) and \tau_t := \{\tau^a_t\}_{a \in A}. To learn the Q-function, deep Q-learning uses neural networks as function approximators trained to minimize the following loss function, derived from the Bellman equation:

\mathcal{L}_Q(\theta) := \mathbb{E}_{(\tau_t, u_t, r_t, \tau_{t+1}) \sim D}\left[\left(y^{tot}_t - Q^{tot}(\tau_t, u_t; \theta)\right)^2\right] \qquad (2.5)

y^{tot}_t := r_t + \gamma\, \bar{Q}^{tot}\left(\tau_{t+1}, \arg\max_{u} Q^{tot}(\tau_{t+1}, u; \theta); \bar\theta\right) \qquad (2.6)

where \bar\theta are the parameters of a target network that is copied from \theta periodically to improve stability [105], and D is a replay buffer [92] that stores transitions collected by an exploratory policy (typically \epsilon-greedy). Double deep Q-learning [55] mitigates overestimation of the learned values by using actions that maximize Q^{tot} as inputs for the target network \bar{Q}^{tot}.
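For reference, here is a compact sketch of the double-DQN target and loss from Equations 2.5–2.6 (illustrative sizes, with a feedforward network standing in for the recurrent trajectory encoder; not the dissertation's code):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 12, 6, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # theta-bar: periodically copied from theta

# A batch of transitions (tau_t, u_t, r_t, tau_{t+1}) as if sampled from the replay buffer D.
tau = torch.randn(32, obs_dim)
u = torch.randint(n_actions, (32,))
r = torch.randn(32)
tau_next = torch.randn(32, obs_dim)

q_taken = q_net(tau).gather(1, u.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    greedy_next = q_net(tau_next).argmax(dim=1, keepdim=True)               # argmax under Q (double DQN)
    y = r + gamma * target_net(tau_next).gather(1, greedy_next).squeeze(1)  # evaluated under Q-bar

loss = ((y - q_taken) ** 2).mean()   # Equation 2.5
loss.backward()
```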
Hierarchical Q-Learning

In cases where tasks may be broken down or described in terms of sub-goals/sub-tasks, we may want to explicitly model this hierarchical relationship. Sutton, Precup, and Singh [145] introduce the options framework, which flexibly models hierarchical abstractions with minimal modification to the RL paradigm. Each option, o := \langle I_o, \pi_o, \beta_o \rangle, is defined by an initiation set, I_o \subseteq S, indicating where the option can be selected; the corresponding option policy, \pi_o; and the termination condition, \beta_o : S^+ \to [0, 1], indicating the probability of termination in each state. Options turn our typical MDP into a semi-Markov decision process (SMDP), since the state transition distribution is, in general, no longer dependent only on the current state and action, but also on the present option, which was decided in a previous time-step. The design of this framework allows options to be treated similarly to actions, except that they may be executed across multiple time-steps, interrupted, composed, and learned as separate subpolicies. Hierarchical Deep Q-Learning (H-DQN) [84] extends the options framework to the modern deep reinforcement learning setting by learning a temporally abstracted high-level Q-function which estimates the value of goals upon which a low-level Q-function is conditioned. We define the high-level action (i.e., goal b \in B) value function as:

Q(s, b) = \mathbb{E}\left[\sum_{t}^{N_t} r_t + \gamma \max_{b'} Q(s', b') \,\middle|\, s' \sim P, \pi_b\right] \qquad (2.7)

This controller operates over a dilated time scale; one step from its perspective amounts to N_t steps in the environment executed by low-level controllers conditioned on the goal, \pi_b. The low-level controllers are trained to maximize goal-specific rewards, which can be manually defined or learned. These controllers are learned as Q-functions in the same manner as typical DQN methods, only with goal-conditioning.

2.3 Multi-Agent Reinforcement Learning

Naively adapting single-agent RL methods to the multi-agent setting may take two forms: fully centralized learning and fully decentralized learning. In the fully centralized case, agents are treated as a single agent and the action space is a cross product of all individual agents' action spaces. Unfortunately, as the number of agents increases, the tractability of this approach dwindles due to the exponential growth in action space size. Furthermore, it requires communication between all agents in practice, which is not feasible in many settings. Fully decentralized learning, on the other hand, suffers from its own disadvantages. In order to learn in a decentralized fashion, the other agents must be treated as part of the environment. Unfortunately, however, reinforcement learning methods require that the dynamics of the MDP remain stationary, which will not be the case since the other agents are learning. This non-stationarity can result in instability in training.

A number of works in deep multi-agent reinforcement learning (MARL) have followed the paradigm of centralized training with decentralized execution (CTDE) [99, 44, 141, 119, 68] in order to bridge the gap between the naive approaches and benefit from the advantages of both. The MARL work in this dissertation will primarily follow this paradigm. CTDE allows agents to train while sharing information (or incorporating information that is unavailable at test time) but to act using only local information, without requiring communication, which may be costly upon execution. By doing so, CTDE methods can alleviate non-stationarity without requiring centralized execution and an explosion in action space size. Since most reinforcement learning applications use simulation for training, communication between agents during the training phase has a relatively low cost. CTDE typically takes two forms in MARL: value function factorization and centralized critics in an asymmetric-information actor-critic approach.

Value Function Factorization

Some methods achieve CTDE through factoring Q-functions into monotonic combinations of per-agent utilities, each depending only on a single agent's local history of actions and observations, Q_a(\tau^a, u^a). This factorization allows agents to independently maximize their local utility functions in a decentralized manner, with their selected actions combining to form the optimal joint action. While factored value functions can only represent a limited subset of all possible value functions [18], they tend to perform better empirically than those that learn unfactored joint action-value functions [111]. These works implicitly assume a fully cooperative setting, as they are essentially learning a single value function factored into independent pieces. Value decomposition networks (VDN) [141] represent the first work in deep value-based MARL with CTDE, representing the Q-function as a sum of independent agent utilities Q_a.
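A minimal sketch of this additive (VDN-style) factorization is shown below (illustrative, with assumed observation and action sizes; not the code used in this dissertation). Each agent's utility network sees only that agent's own observation, and decentralized execution reduces to each agent greedily maximizing its own utility:

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 3, 10, 5

# One utility network Q_a per agent; the joint value is their sum.
utilities = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    for _ in range(n_agents)
])

obs = torch.randn(32, n_agents, obs_dim)             # batch of per-agent observations
actions = torch.randint(n_actions, (32, n_agents))   # joint actions taken

q_agent = torch.stack(
    [utilities[a](obs[:, a]).gather(1, actions[:, a:a + 1]).squeeze(1) for a in range(n_agents)],
    dim=1,
)                                                    # shape: (batch, n_agents)
q_tot = q_agent.sum(dim=1)                           # VDN: Q_tot = sum_a Q_a(tau_a, u_a)

# Decentralized greedy action selection: each agent argmaxes its own utility.
greedy = torch.stack([utilities[a](obs[:, a]).argmax(dim=1) for a in range(n_agents)], dim=1)
```

The fixed summation here is exactly the component that QMIX, described next, replaces with a state-conditioned monotonic mixing network.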
QMIX [119] improves over VDN by using a more expressive factorization than a summation of factors:

Q^{tot} = g\left(Q_1(\tau^1, u^1; \theta_Q), \ldots, Q_{|A|}(\tau^{|A|}, u^{|A|}; \theta_Q); \theta_g\right)

The parameters of the monotonic mixing function g are generated by a hyper-network [51] conditioning on the global state s: \theta_g = h(s; \theta_h). Every state can therefore have a different mixing function; however, the mixing function's monotonicity maintains decentralizability, as agents can greedily maximize Q^{tot} without communication. All parameters \theta = \{\theta_Q, \theta_h\} are trained with the DQN loss of Equation 2.5.

Centralized Critics

Alternatively, some works [99, 44] propose actor-critic approaches for CTDE, where the actors rely only on local information, while the critic receives information from all agents. These works have the advantage over value factorization methods of being applicable to competitive, cooperative, and mixed settings, since they can learn separate critics for each agent if necessary. The policy gradient (for a specific agent a) with a centralized critic typically looks like:

\nabla_\theta J(\pi_\theta) = \nabla_\theta \log(\pi_\theta(u^a_t \mid o^a_t))\, Q(s_t, u^1_t, \ldots, u^n_t) \qquad (2.8)

Note that the policy only selects actions for agent a based on its local observation, but the critic depends on the global state (which can be replaced by the set of all agents' observations) and the actions of all agents.

Attention Mechanisms for MARL

Attention models have recently generated intense interest due to their ability to incorporate information across large contexts, including in MARL [74, 68, 98]. Importantly for our purposes, they can process variable-sized sets of fixed-length vectors (in our case, entities) and possess permutation equivariance [88]. At the core of these models is a parameterized transformation known as multi-head attention [152] that allows entities to selectively extract information from other entities based on their local context. We define X as a matrix where each row corresponds to the state representation (or its transformation) of an entity. The global state s is represented in matrix form as X_E, where X_{e,\cdot} = s_e. Attention models in MARL consist of entity-wise feedforward layers eFF(X), which apply an identical linear transformation to all input entities, and multi-head attention layers MHA(A, X, M), which integrate information across entities. The latter take three arguments: the set of agents A for which to compute an output vector, the matrix X \in \mathbb{R}^{|E| \times d}, where d is the dimensionality of the input representations, and a mask M \in \mathbb{R}^{|A| \times |E|}. The layer outputs a matrix H \in \mathbb{R}^{|A| \times h}, where h is the hidden dimension of the layer. The row H_{a,\cdot} corresponds to a weighted sum of linearly transformed representations from all entities selected by agent a. Importantly, if the entry of the mask M_{a,e} = 0, then entity e's representation is not included in H_{a,\cdot}. Masking enables decentralized execution by providing the mask M_{a,e} = \mu(s_a, s_e), such that agents can only see entities observable by them in the environment. Entity-wise feedforward layers and multi-head attention enable models to share parameters across tasks where the number of agents and entities is variable.

Part II: Interaction Structure

Chapter 3: Actor-Attention-Critic for Multi-Agent Reinforcement Learning

Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings.
We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also to settings with individualized rewards, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.

3.1 Introduction

Reinforcement learning has recently made exciting progress in many domains, including Atari games [105], the ancient Chinese board game Go [133], and complex continuous control tasks involving locomotion [91, 127, 128, 60]. While most reinforcement learning paradigms focus on single agents acting in a static environment (or against themselves in the case of Go), real-world agents often compete or cooperate with other agents in a dynamically shifting environment. In order to learn effectively in multi-agent environments, agents must not only learn the dynamics of their environment, but also those of the other learning agents present.

To this end, several approaches for multi-agent reinforcement learning have been developed. The simplest approach is to train each agent independently to maximize its individual reward, while treating other agents as part of the environment. However, this approach violates the basic assumption underlying reinforcement learning, that the environment should be stationary and Markovian. Any single agent's environment is dynamic and nonstationary due to other agents' changing policies. As such, standard algorithms developed for stationary Markov decision processes fail. At the other end of the spectrum, all agents can be collectively modeled as a single agent whose action space is the joint action space of all agents [23]. While allowing coordinated behaviors across agents, this approach is not scalable, as the size of the action space increases exponentially with respect to the number of agents. It also demands a high degree of communication during execution, as the central policy must collect observations from and distribute actions to the individual agents. In real-world settings, this demand can be problematic.

Recent work [99, 44] attempts to combine the strengths of these two approaches. In particular, a critic (or a number of critics) is centrally learned with information from all agents. The actors, however, receive information only from their corresponding agents. Thus, during testing, executing the policies does not require knowledge of other agents' actions. This paradigm circumvents the challenge of non-Markovian and non-stationary environments during learning. Despite this progress, however, algorithms for multi-agent reinforcement learning are still far from being scalable (to larger numbers of agents) and generically applicable to environments and tasks that are cooperative (sharing a global reward), competitive, or mixed.

Our approach extends these prior works in several directions. The main idea is to learn a centralized critic with an attention mechanism.
The intuition behind our idea comes from the fact that, in many real-world environments, it is beneficial for an agent to know which other agents it should pay attention to. For example, a soccer defender needs to pay attention to attackers in their vicinity as well as the player with the ball, while she/he rarely needs to pay attention to the opposing team's goalie. The specific attackers that the defender is paying attention to can change at different parts of the game, depending on the formation and strategy of the opponent. A typical centralized approach to multi-agent reinforcement learning does not take these dynamics into account, instead simply considering all agents at all timepoints. Our attention critic is able to dynamically select which agents to attend to at each time point during training, improving performance in multi-agent domains with complex interactions.

Our proposed approach has an input space that increases linearly with respect to the number of agents, as opposed to the quadratic increase in a previous approach [99]. It is also applicable to cooperative, competitive, and mixed environments, exceeding the capability of prior work that focuses only on cooperative environments [44]. We have validated our approach on three simulated environments and tasks. The rest of the chapter is organized as follows. In section 3.2, we discuss related work, followed by a detailed description of our approach in section 3.3. We report experimental studies in section 3.4 and conclude in section 3.5.

3.2 Related Work

Multi-Agent Reinforcement Learning (MARL) is a long-studied problem [23]. Topics within MARL are diverse, ranging from learning communication between cooperative agents [147, 41] to algorithms for optimal play in competitive settings [94], though, until recently, they have been focused on simple gridworld environments with tabular learning methods. As deep-learning-based approaches to reinforcement learning have grown more popular, they have, naturally, been applied to the MARL setting [146, 50], allowing multi-agent learning in high-dimensional or continuous state spaces; however, naive applications of deep RL methods to MARL naturally encounter some limitations, such as nonstationarity of the environment from the perspective of individual agents [45, 99, 44], lack of coordination/communication in cooperative settings [140, 107, 99, 43], credit assignment in cooperative settings with global rewards [119, 141, 44], and the failure to take opponent strategies into account when learning agent policies [58]. Most relevant to this work are recent, non-attention approaches that propose an actor-critic framework consisting of centralized training with decentralized execution [99, 44], as well as some approaches that utilize attention in a fully centralized multi-agent setting [29, 74].

Lowe et al. [99] investigate the challenges of multi-agent learning in mixed reward environments [23]. They propose an actor-critic method that uses separate centralized critics for each agent, which take in all other agents' actions and observations as input, while training policies that are conditioned only on local information. This practice reduces the non-stationarity of multi-agent environments, as considering the actions of other agents to be part of the environment makes the state transition dynamics stable from the perspective of one agent. In practice, these ideas greatly stabilize learning, due to reduced variance in the value function estimates. Similarly, Foerster et al.
[44] introduce a centralized critic for cooperative settings with shared rewards. Their method incorporates a "counterfactual baseline" for calculating the advantage function which is able to marginalize out a single agent's actions while keeping the others fixed. This method allows for complex multi-agent credit assignment, as the advantage function only encourages actions that directly influence an agent's rewards.

Attention models have recently emerged as a successful approach to intelligently selecting contextual information, with applications in computer vision [7, 104], natural language processing [152, 10, 93], and reinforcement learning [109]. In a similar vein, Jiang and Lu [74] proposed an attention-based actor-critic algorithm for MARL. This work follows the alternative paradigm of centralizing policies while keeping the critics decentralized. Their focus is on learning an attention model for sharing information between the policies. As such, this approach is complementary to ours, and a combination of both approaches could yield further performance benefits in cases where centralized policies are desirable.

Our proposed approach is more flexible than the aforementioned approaches for MARL. Our algorithm is able to train policies in environments with any reward setup and different action spaces for each agent, using a variance-reducing baseline that only marginalizes the relevant agent's actions, and with a set of centralized critics that dynamically attend to the relevant information for each agent at each time point. As such, our approach is more scalable to the number of agents and is more broadly applicable to different types of environments.

3.3 Methods

The main idea behind our multi-agent learning approach is to learn the critic for each agent by selectively paying attention to information from other agents. This is the same paradigm of training critics centrally (to overcome the challenge of non-stationary, non-Markovian environments) and executing learned policies in a distributed manner. Figure 3.1 illustrates the main components of our approach.

Attention

The attention mechanism functions in a manner similar to a differentiable key-value memory model [49, 109]. Intuitively, each agent queries the other agents for information about their observations and actions and incorporates that information into the estimate of its value function. This paradigm was chosen, in contrast to other attention-based approaches, as it doesn't make any assumptions about the temporal or spatial locality of the inputs, as opposed to approaches taken in the natural language processing and computer vision fields. To calculate the Q-value function Q_i^\psi(o, a) for agent i, the critic receives the observations, o = (o_1, \ldots, o_N), and actions, a = (a_1, \ldots, a_N), for all agents indexed by i \in \{1, \ldots, N\}. We represent the set of all agents except i as \setminus i, and we index this set with j. Q_i^\psi(o, a) is a function of agent i's observation and action, as well as other agents' contributions:

Q_i^\psi(o, a) = f_i(g_i(o_i, a_i), x_i) \qquad (3.1)

where f_i is a two-layer multi-layer perceptron (MLP), while g_i is a one-layer MLP embedding function. The contribution from other agents, x_i, is a weighted sum of each agent's value:

x_i = \sum_{j \neq i} \alpha_j v_j = \sum_{j \neq i} \alpha_j h(V g_j(o_j, a_j))

where the value, v_j, is a function of agent j's embedding, encoded with an embedding function and then linearly transformed by a shared matrix V, and h is an element-wise nonlinearity (we have used leaky ReLU).
The attention weight j compares the embeddinge j withe i =g i (o i ;a i ), using a bilinear mapping (ie, the query-key system) and passes the similarity value between these two embeddings into a softmax j / exp(e > j W > k W q e i ) (3.2) whereW q transformse i into a “query” andW k transformse j into a “key”. The matching is then scaled by the dimensionality of these two matrices to prevent vanishing gradients [152]. In our experiments, we have used multiple attention heads [152]. In this case, each head, using a separate set of parameters (W k ;W q ;V ), gives rise to an aggregated contribution from all other agents to the agenti and we simply concatenate the contributions from all heads as a single vector. Crucially, each head can focus on a dierent weighted mixture of agents. Note that the weights for extracting selectors, keys, and values are shared across all agents, which encourages a common embedding space. The sharing of critic parameters between agents is possible, 28 Attention Head ! 1 …! $ % & ,( & MLP MLP ) & (%,() = unique to each agent = shared among agents ! & ! & ! , ,, ∈ \i Concatenate heads per agent Scaled Dot Product Softmax Dot Product 0 & 0 & 0 1 …0 $ 1 2 1 3 4 Figure 3.1: CalculatingQ i (o;a) with attention for agenti. Each agent encodes its observations and actions, sends it to the central attention mechanism, and receives a weighted sum of other agents encodings (each tranformed by the matrixV ) even in adversarial settings, because multi-agent value-function approximation is, essentially, a multi-task regression problem. This parameter sharing allows our method to learn eectively in environments where rewards for individual agents are dierent but share common features. This method can easily be extended to include additional information, beyond local observations and actions, at training time, including the global state if it is available, simply by adding additional encoders,e. (We do not consider this case in our experiments, however, as our approach is eective in combining local observations to predict expected returns in environments where the global state may not be available). 29 LearningwithAttentiveCritics All critics are updated together to minimize a joint regression loss function, due to the parameter sharing: L Q ( ) = N X i=1 E (o;a;r;o 0 )D h (Q i (o;a)y i ) 2 i , where y i =r i + E a 0 (o 0 ) [Q i (o 0 ;a 0 ) log( i (a 0 i jo 0 i ))] (3.3) where and are the parameters of the target critics and target policies respectively. Note thatQ i , the action-value estimate for agenti, receives observations and actions for all agents. is the temperature parameter determining the balance between maximizing entropy and rewards. The individual policies are updated by ascent with the following gradient: r i J( ) =E oD;a [r i log( i (a i jo i ))( log( i (a i jo i )) +Q i (o;a)b(o;a ni ))] (3.4) whereb(o;a ni ) is the multi-agent baseline used to calculate the advantage function decribed in the following section. Note that we are sampling all actions,a, from all agents’ current policies in order to calculate the gradient estimate for agenti, unlike in the MADDPG algorithm Lowe et al. [99], where the other agents’ actions are sampled from the replay buer, potentially causing overgeneralization where agents fail to coordinate based on their current policies [164]. Multi-Agent Advantage Function As shown in Foerster et al. 
[44], an advantage function using a baseline that only marginalizes out the actions of the given agent fromQ i (o;a), can help solve the multi- agent credit assignment problem. In other words, by comparing the value of a specic action to the value of the average action for the agent, with all other agents xed, we can learn whether said action will cause an 30 increase in expected return or whether any increase in reward is attributed to the actions of other agents. The form of this advantage function is shown below: A i (o;a) =Q i (o;a)b(o;a ni )), where b(o;a ni )) =E a i i (o i ) h Q i (o; (a i ;a ni )) i (3.5) Using our attention mechanism, we can implement a more general and exible form of a multi-agent baseline that, unlike the advantage function proposed in Foerster et al. [44], doesn’t assume the same action space for each agent, doesn’t require a global reward, and attends dynamically to other agents, as in our Q-function. This is made simple by the natural decomposition of an agents encoding,e i , and the weighted sum of encodings of other agents,x i , in our attention model. Concretely, in the case of discrete policies, we can calculate our baseline in a single forward pass by outputting the expected returnQ i (o; (a i ;a ni )) for every possible action,a i 2A i , that agenti can take. We can then calculate the expectation exactly: E a i i (o i ) h Q i (o; (a i ;a ni )) i = X a 0 i 2A i (a 0 i jo i )Q i (o; (a 0 i ;a ni )) (3.6) In order to do so, we must removea i from the input ofQ i , and output a value for every action. We add an observation-encoder,e i =g o i (o i ), for each agent, using these encodings in place of thee i =g i (o i ;a i ) described above, and modifyf i such that it outputs a value for each possible action, rather than the single input action. In the case of continuous policies, we can either estimate the above expectation by sampling from agenti’s policy, or by learning a separate value head that only takes other agents’ actions as input. 31 (a) Cooperative Treasure Collection. The small grey agents are “hunters” who collect the colored treasure, and deposit them with the correctly colored large “bank” agents. (b) Rover-Tower. Each grey “Tower” is paired with a “Rover” and a destination (color of rover corresponds to its destina- tion). Their goal is to communicate with the "Rover" such that it moves toward the destination. Figure 3.2: Our environments 3.4 Experiments Setup We construct two environments that test various capabilities of our approach (MAAC) and baselines. We investigate in two main directions. First, we study the scalability of dierent methods as the number of agents grows. We hypothesize that the current approach of concatenating all agents’ observations (often used as a global state to be shared among agents) and actions in order to centralize critics does not scale well. To this end, we implement a cooperative environment, Cooperative Treasure Collection, with partially shared rewards where we can vary the total number of agents without signicantly changing the diculty of the task. As such, we can evaluate our approach’s ability to scale. The experimental results in sec 3.4 validate our claim. Secondly, we want to evaluate each method’s ability to attend to information relevant to rewards, especially when the relevance (to rewards) can dynamically change during an episode. 
This scneario is 32 Table 3.1: Comparison of various methods for multi-agent RL Base Algorithm How to incorporate Number Multi-task Multi-Agent other agents of Critics Learning of Critics Advantage MAAC (ours) SAC z Attention N X X MAAC (Uniform) (ours) SAC Uniform Atttention N X X COMA Actor-Critic (On-Policy) Global State + 1 X Action Concatenation MADDPG y DDPG Observation and N Action Concatenation COMA+SAC SAC Global State + 1 X Action Concatenation MADDPG+SAC SAC Observation and N X Action Concatenation HeadingExplanation How to incorporate other agents: method by which the centralized critic(s) incorporates observations and/or actions from other agents (MADDPG: concatenating all information together. COMA: a global state instead of concatenating observations; however, when the global state is not available, all observations must be included.) Number of Critics: number of separate networks used for predictingQ i for allN agents. Multi-task Learning of Critics: all agents’ estimates ofQ i share information in intermediate layers, beneting from multi-task learning. Multi-Agent Advantage: cf. Sec 3.2 for details. Citations: [44], y [99], z [52], [91] analogous to real-life tasks such as the soccer example presented earlier. To this end, we implement a Rover-Tower task environment where randomly paired agents communicate information and coordinate. Finally, we test on the Cooperative Navigation task proposed by Lowe et al. [99] in order to demonstrate the general eectiveness of our method on a benchmark multi-agent task. All environments are implemented in the multi-agent particle environment framework ∗ introduced by Mordatch and Abbeel [107], and extended by Lowe et al. [99]. We found this framework useful for creating environments involving complex interaction between agents, while keeping the control and perception problems simple, as we are primarily interested in addressing agent interaction. To further simplify the control problem, we use discrete action spaces, allowing agents to move up, down, left, right, or stay; however, the agents may not immediately move exactly in the specied direction, as the task framework incorporates a basic physics engine where agents’ momentums are taken into account. Fig. 3.2 illustrates the two environments we introduce. ∗ https://github.com/openai/multiagent-particle-envs 33 CooperativeTreasureCollection The cooperative environment in Figure 3.2a) involves 8 total agents, 6 of which are "treasure hunters" and 2 of which are “treasure banks”, which each correspond to a dierent color of treasure. The role of the hunters is to collect the treasure of any color, which re-spawn randomly upon being collected (with a total of 6), and then “deposit” the treasure into the correctly colored “bank”. The role of each bank is to simply gather as much treasure as possible from the hunters. All agents are able to see each others’ positions with respect to their own. Hunters receive a global reward for the successful collection of treasure and all agents receive a global reward for the depositing of treasure. Hunters are additionally penalized for colliding with each other. As such, the task contains a mixture of shared and individual rewards and requires dierent “modes of attention” which depend on the agent’s state and other agents’ potential for aecting its rewards. Rover-Tower The environment in Figure 3.2b involves 8 total agents, 4 of which are “rovers” and another 4 which are “towers”. At each episode, rovers and towers are randomly paired. 
The pair is negatively rewarded by the distance of the rover to its goal. The task can be thought of as a navigation task on an alien planet with limited infrastructure and low visibility. The rovers are unable to see in their surroundings and must rely on communication from the towers, which are able to locate the rovers as well as their destinations and can send one of ve discrete communication messages to their paired rover. Note that communication is highly restricted and dierent from centralized policy approaches [74], which allow for free transfer of continuous information among policies. In our setup, the communication is integrated into the environment (in the tower’s action space and the rover’s observation space), rather than being explicitly part of the model, and is limited to a few discrete signals. 34 0 10000 20000 30000 40000 Training Episodes 50 25 0 25 50 75 100 125 150 Mean Episode Rewards MAAC MAAC (Uniform) MADDPG (Discrete) MADDPG+SAC COMA COMA+SAC DDPG (Discrete) 0 10000 20000 30000 40000 Training Episodes 50 25 0 25 50 75 100 125 Figure 3.3: (Left) Average Rewards on Cooperative Treasure Collection. (Right) Average Rewards on Rover-Tower. Our model (MAAC) is competitive in both environments. Error bars are a 95% condence interval across 6 runs. Baselines We compare to two recently proposed approaches for centralized training of decentralized policies: MAD- DPG [99] and COMA [44], as well as a single-agent RL approach, DDPG, trained separately for each agent. As both DDPG and MADDPG require dierentiable policies, and the standard parametrization of discrete policies is not dierentiable, we use the Gumbel-Softmax reparametrization trick [72]. We will refer to these modied versions as MADDPG (Discrete) and DDPG (Discrete). Our method uses Soft Actor-Critic to optimize. Thus, we additionally implement MADDPG and COMA with Soft Actor-Critic for the sake of fair comparison, referred to as MADDPG+SAC and COMA+SAC. We also consider an ablated version of our model as a variant of our approach. In this model, we use uniform attention by xing the attention weight j (Eq. 3.2) to be 1=(N 1). This restriction prevents the model from focusing its attention on specic agents. All methods are implemented such that their approximate total number of parameters (across agents) are equal to our method, and each model is trained with 6 random seeds each. Hyperparameters for each underlying algorithm are tuned based on performance and kept constant across all variants of critic architectures for that algorithm. A thorough comparison of all baselines is summarized in Table 3.1. 35 Table 3.2: Average rewards per episode on Cooperative Navigation MAAC MAAC (Uniform) MADDPG+SAC COMA+SAC -1.74 0.05 -1.76 0.05 -2.09 0.12 -1.89 0.07 ResultsandAnalysis Fig. 3.3 illustrates the average rewards per episode attained by various methods on our two environments, and Table 3.2 displays the results on Cooperative Navigation [99]. Our proposed approach (MAAC) is competitive when compared to other methods. We analyze in detail in below. Impact of Rewards and Required Attention Uniform attention is competitive with our approach in the Cooperative Treasure Collection (CTC) and Cooperative Navigation (CN) environments, but not in Rover-Tower. On the other hand, both MADDPG (Discrete) and MADDPG+SAC perform well on Rover-Tower, though they do not on CTC. Both variants of COMA do not fare well in CTC and Rover-Tower, though COMA+SAC does reasonably well in CN. 
DDPG, arguably a weaker baseline, performs surprisingly well in CTC, but does poorly in Rover-Tower. In CTC and CN, the rewards are shared across agents; thus, an agent's critic does not need to focus on information from specific agents in order to calculate its expected rewards. Moreover, each agent's local observation provides enough information to make a decent prediction of its expected rewards. This might explain why MAAC (Uniform), which attends to other agents equally, and DDPG (unaware of other agents) perform above expectations. On the other hand, rewards in the Rover-Tower environment for a specific agent are tied to another single agent's observations. This environment exemplifies a class of scenarios where dynamic attention can be beneficial: when subgroups of agents are interacting and performing coordinated tasks with separate rewards, but the groups do not remain static. This explains why MAAC (Uniform) performs poorly and DDPG completely breaks down, as knowing information from another specific agent is crucial in predicting expected rewards.

Table 3.3: MAAC improvement over MADDPG+SAC in CTC
  # Agents         4     8     12
  % Improvement    17    98    208

COMA uses a single centralized network for predicting Q-values for all agents with separate forward passes. Thus, this approach may perform best in environments with global rewards and agents with similar action spaces, such as Cooperative Navigation, where we see that COMA+SAC performs well. On the other hand, the environments we introduce contain agents with differing roles (and non-global rewards in the case of Rover-Tower). Thus, both variants of COMA do not fare well. MADDPG (and its Soft Actor-Critic variant) perform well on Rover-Tower; however, we suspect their low performance in CTC is due to this environment's relatively large observation spaces for all agents, as the MADDPG critic concatenates the observations of all agents into a single input vector for each agent's critic. Our next experiment confirms this hypothesis.

Scalability In Table 3.3 we compare the average rewards attained by our approach and the next best performing baseline (MADDPG+SAC) on the CTC task (normalized by the range of rewards attained in the environment, as varying the number of agents changes the nature of rewards in this environment). We show that the improvement of our approach over MADDPG+SAC grows with respect to the number of agents. As suspected, MADDPG-like critics use all information non-selectively, while our approach can learn which agents to pay more attention to through the attention mechanism and can compress that information into a constant-sized vector. Thus, our approach scales better as the number of agents increases. In future research we will continue to improve scalability as the number of agents grows further, by sharing policies among agents and performing attention on sub-groups of agents. In Figure 3.4 we compare the average rewards per episode on the Rover-Tower task. We can compare rewards directly on this task since each rover-tower pair can attain the same scale of rewards regardless of how many other agents are present. Even though MADDPG performed well on the 8-agent version of the task (shown in Figure 3.3), we find that this performance does not scale. Meanwhile, the performance of MAAC does not deteriorate as agents are added.

Figure 3.4: Scalability in the Rover-Tower task. Note that the performance of MAAC does not deteriorate as agents are added.
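For a rough, back-of-the-envelope illustration of the scaling argument above, the total critic input size across agents can be compared for a concatenation-based critic and an attention-based one. The observation, action, and attention dimensions below are assumed purely for illustration and are not taken from the experiments.

```python
# Illustrative comparison of total critic input sizes (assumed dimensions).
def concat_total_critic_input(n_agents: int, obs_dim: int, act_dim: int) -> int:
    # Each of the N critics concatenates every agent's observation and action.
    per_critic = n_agents * (obs_dim + act_dim)
    return n_agents * per_critic              # grows quadratically in n_agents

def attention_total_critic_input(n_agents: int, obs_dim: int, act_dim: int,
                                 attend_dim: int = 32) -> int:
    # Each agent's critic sees its own observation-action encoding plus a
    # fixed-size attention summary of the other agents.
    per_critic = (obs_dim + act_dim) + attend_dim
    return n_agents * per_critic              # grows linearly in n_agents

for n in (4, 8, 16):
    print(n, concat_total_critic_input(n, 20, 5),
          attention_total_critic_input(n, 20, 5))
```

For these assumed sizes, the concatenation-based total grows from 400 inputs at 4 agents to 6,400 at 16, while the attention-based total grows only from 228 to 912.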
As a future direction, we are creating more complicated environments where each agent needs to cope with a large group of agents where selective attention is needed. This naturally models real-life scenarios that multiple agents are organized in clusters/sub-societies (school, work, family, etc) where the agent needs to interact with a small number of agents from many groups. We anticipate that in such complicated scenarios, our approach, combined with some advantages exhibited by other approaches will perform well. Visualizing Attention In order to inspect how the attention mechanism is working on a more ne- grained level, we visualize the attention weights for one of the rovers in Rover-Tower (Figure 3.5), while xing the tower that said rover is paired to. In this plot, we ignore the weights over other rovers for 38 Figure 3.5: Attention weights over all Towers for a Rover in Rover-Tower task. As expected, the Rover learns to attend to the correct tower, despite receiving no explicit signal to do so. simplicity since these are always near zero. We nd that the rover learns to strongly attend to the tower that it is paired with, without any explicit supervision signal to do so. The model implicitly learns which agent is most relevant to estimating the rover’s expected future returns, and said agent can change dynamically without aecting the performance of the algorithm. 3.5 Conclusion We propose an algorithm for training decentralized policies in multi-agent settings. The key idea is to utilize attention in order to select relevant information for estimating critics. We analyze the performance of the proposed approach with respect to the number of agents, dierent congurations of rewards, and 39 the span of relevant observational information. Empirical results are promising and we intend to extend to highly complicated and dynamic environments. 40 Chapter4 CoordinatedExplorationviaIntrinsicRewardsforMulti-Agent ReinforcementLearning Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge is addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces; however, applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. Exploration in cooperative multi-agent settings can be accelerated and improved if agents coordinate their exploration. In this chapter we introduce a framework for designing intrinsic rewards which consider what other agents have explored such that the agents can coordinate. Then, we develop an approach for learning how to dynamically select between several exploration modalities to maximize extrinsic rewards. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on diverse intrinsic rewards and the low-level controllers learn the action policies of all agents under these specic rewards. We demonstrate the eectiveness of the proposed approach in cooperative domains with sparse rewards where state-of-the-art methods fail and challenging multi-stage tasks that necessitate changing modes of coordination. 41 4.1 Introduction Solving tasks with sparse rewards is a fundamental challenge of reinforcement learning. This challenge is most commonly addressed by learning with intrinsic rewards that encourage exploration of the state space [114, 64, 21, 112, 148]. 
In the cooperative multi-agent setting, the sparse reward challenge is exacer- bated by the need for agents tocoordinate their exploration. In many cases, the non-coordinated approach – agents exploring independently – is not ecient. For example, consider a search-and-rescue task where multiple agents need to collectively nd all missing persons spread throughout their environment and bring them to a common recovery location. During the search phase, it would be inecient for the agents to explore the same areas redundantly. Instead, it would be much more sensible for agents to “divide-and- conquer” or avoid redundant exploration. Thus, an ideal intrinsic reward for this phase would encourage such behavior; however, the same behavior would not be ideal during the recovery phase where agents must converge at a common location. Cooperative multi-agent reinforcement learning can benet from coordinating exploration across agents; however, the type of coordination should be adaptive to the task at hand. In this work, we introduce a framework for designing multi-agent intrinsic rewards that coordinate with respect to explored regions, then we present a method for learning both low-level policies trained on dierent intrinsic rewards and a meta-policy for selecting the policies which maximize extrinsic rewards on a given task. Importantly, we learn the policies simultaneously using a shared replay buer with o-policy methods, drastically improving sample eciency. This shared replay buer enables us to use all data to train all policies, rather than needing to collect data with the specic policies we want to update. Moreover, the meta-policy is learned in conjunction with those low-level policies, eectively exploring over the space of coordinated low-level exploration types. We show empirically, in both a GridWorld domain as well as in the more complex 3D ViZDoom [75] setting: 1) intrinsic reward functions which coordinate across agents are more eective than independent intrinsically motivated exploration, 2) our approach is able to match or 42 exceed the performance of the best coordinated intrinsic reward function (which diers across tasks) while using no more samples, and 3) on challenging multi-stage tasks requiring varying modes of cooperation, our adaptive approach outperforms all individual reward functions and continues to learn while the other approaches stagnate due to their lack of coordination and/or adaptability. 4.2 RelatedWork Single-AgentExploration In order to solve sparse reward problems, researchers have long worked on improving exploration in reinforcement learning. Prior works commonly propose reward bonuses that encourage agents to reach novel states. In tabular domains, reward bonuses based on the inverse state- action count have been shown to be eective in accelerating learning [138]. In order to scale count-based approaches to large state spaces, many recent works have focused on devising pseudo state counts to use as reward bonuses [13, 112, 148]. Alternatively, some work has focused on dening intrinsic rewards for exploration based on inspiration from psychology [113, 124]. These works use various measures of state novelty as intrinsic rewards motivating exploration [114, 64, 21] Multi-AgentReinforcementLearning(MARL) Multi-agent reinforcement learning introduces several unique challenges that recent work attempts to address. 
These challenges include: multi-agent credit assignment in cooperative tasks with shared rewards [141, 119, 44], non-stationarity of the environment in the presence of other learning agents [99, 44, 68], and learning of communication protocols between cooperative agents [43, 140, 74]. Exploration in MARL Carmel and Markovitch [26] consider exploration with respect to opponent strategies in competitive games, and Verbeeck, Nowé, and Tuyls [153] consider exploration of a large joint action space in a load balancing problem. Jaques et al. [73] dene an intrinsic reward function for multi-agent reinforcement learning that encourages agents to take actions which have the biggest eect on 43 other agents’ behavior, otherwise referred to as “social inuence”. Agogino and Tumer [2] denes metrics for evaluating the ecacy of reward functions in multi-agent domains. These works, while important, do not address the problem of coordinating exploration in a large state space among multiple agents. A recent approach to collaborative evolutionary reinforcement learning [76] shares some similarities with our approach. As in our work, the authors devise a method for learning a population of diverse policies with a shared replay buer and dynamically selecting the best learner; however, their work is focused on single-agent tasks and does not incorporate any notion of intrinsic rewards for exploration. Wang et al. [160] dene inuence-based rewards which encourage agents to visit regions where their actions inuence other agents’ transitions and rewards (e.g. one agent unlocks a door for another). In practice, they combine their approach with state-based exploration, and thus, their work is complementary and orthogonal to our approach where agents do not simply explore regions that are novel to them but take into account how novel all other agents consider these regions. This work is most applicable to settings where agents inuencing each others’ dynamics forms a positive inductive bias towards solving a task. Most recently, Mahajan et al. [102] introduce a mechanism for achieving “committed” exploration, allowing agents to explore temporally extended coordinated strategies. While this approach enables coordinated exploration, it does not encourage exploration of novel states, and as such, may not learn eectively in sparse reward tasks. 4.3 Methods IntrinsicRewardsforMARLExploration In this section we describe intrinsic reward functions for exploration that are specically tailored to multi- agent learning. The main idea is to share whether other agents have explored a region and consider it as novel. 44 We assume that each agent (indexed byi) has a novelty functionf i :O i !R + that determines how novel an observation is to it, based on its past experience. This function can be an inverse state visit count in discrete domains, or, in large/continuous domains, it can be represented by recent approaches for developing novelty-based intrinsic rewards in complex domains, such as random network distillation [21]. We assume that all agents share the same observation space so that each agent’s novelty function can operate on all other agents’ observations. We dene the multi-agent intrinsic reward functiong i () for agenti, which considers how novel all agents consideri’s observation. Concretely, the function maps the vector [f 1 (o i ); ; f n (o i )] to a scalar reward. 
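As a minimal sketch of what such a novelty function might look like in a discrete domain, an inverse visit count with a decay exponent can be implemented as follows, together with the vector of novelties [f_1(o_i), ..., f_n(o_i)] that g_i(·) consumes. The class names, the treatment of unvisited cells, and the default decay value are illustrative assumptions.

```python
# Illustrative count-based novelty function f_i and the novelty vector fed to g_i.
from collections import defaultdict
from typing import Hashable, List

class CountNovelty:
    """f_i: maps an observation (here, a discrete cell) to a novelty score."""

    def __init__(self, decay: float = 0.7):
        self.counts = defaultdict(int)
        self.decay = decay

    def update(self, cell: Hashable) -> None:
        self.counts[cell] += 1

    def __call__(self, cell: Hashable) -> float:
        # Inverse visit count with a decay exponent; an unvisited cell gets the
        # maximum novelty of 1 (one illustrative choice among several).
        return 1.0 / max(self.counts[cell], 1) ** self.decay

def novelty_vector(novelties: List[CountNovelty], obs_i: Hashable) -> List[float]:
    """[f_1(o_i), ..., f_n(o_i)]: how novel every agent finds agent i's observation."""
    return [f(obs_i) for f in novelties]
```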
Desiderataofg i () While in theoryg i () can be in any form, we believe the following two properties are intuitive and naturally applicable to cooperative MARL: • Coordinate-wiseMonotonicity An observation becoming less novel to any individual agent should not increase the intrinsic reward, preventing the agent from exploring a region more as it becomes more known to another agent. Formally,@g i =@f j 0;8i;j. • Inner-directedness If an observation approaches having zero novelty to an agent, then the intrinsic reward should also approach zero, irrespective of other agents’ novelty. This prevents the agent from repetitively exploring the same regions, at the “persuasion” of other agents. Examples Fig. 4.1 visualizes examples of intrinsic rewards that observe the aforementioned desirable properties. Independent rewards are analagous to single-agent approaches to exploration which dene the intrinsic reward for an agent as the novelty of their own observation that occurs as a result of an action. The remainder of intrinsic reward functions that we consider use the novelty functions of other agents, in addition to their own, to further inform their exploration. 45 Burrowing Leader-Follower First agent uses burrowing, other agents use covering Independent Minimum Covering = agent 1 = agent 2 Figure 4.1: Multi-agent intrinsic rewards. Visualized for the 2 agent case. Independent shows the regions that have been explored by each agent. Darker shades mean higher reward values. Minimum rewards consider how novel all agents nd a specic agent’s observation and rewards that agent based on the minimum. This method leads to agents only being rewarded for exploring areas that no other agent has explored, which could be advantageous in scenarios where redundancy in exploration is not useful or even harmful. Covering rewards agents for exploring areas that it considers more novel than the average agent. This reward results in agents shifting around the state space, only exploring regions as long as they are more novel to them than their average teammate. Burrowing rewards do the opposite, only rewarding agents for exploring areas that it considers less novel than the average agent. While seemingly counterintuitive, these rewards encourage agents to further explore areas they have already explored with the hope that they will discover new regions that few or no other agents have seen, which they will then consider less novel than average and continue to explore. As such, these rewards result in agents continuing to explore until they exhaust all possible intrinsic rewards from a given region (i.e. hit a dead end), somewhat akin to a depth-rst search. leader-follower uses burrowing rewards for the rst agent, 46 and covering rewards for the rest of the agents. This leads to an agent exploring a space thoroughly, and the rest of the agents following along and trying to cover that space. Note that these are not meant to be a comprehensive set of intrinsic reward functions applicable to all cooperative multi-agent tasks. In fact, any convex combination of them is consistent with the desiderata. We use the set of rewards introduced in Fig. 4.1 to concurrently learn a set of diverse exploration policies and introduce in the next section a method for dynamically switching between them. This method matches or exceeds the performance of the best individual exploration method in several complex tasks. 
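The following sketch translates the verbal descriptions in Fig. 4.1 into functions operating on the novelty vector [f_1(o_i), ..., f_n(o_i)]. These are one plausible reading of each description, intended only as illustration; the exact functional forms used in the experiments may differ.

```python
# One possible coding of the intrinsic reward variants sketched in Fig. 4.1.
# Here nov[j] = f_j(o_i): how novel agent j considers agent i's observation.
from statistics import mean
from typing import List

def independent(i: int, nov: List[float]) -> float:
    # Reward agent i by how novel it finds its own observation.
    return nov[i]

def minimum(i: int, nov: List[float]) -> float:
    # Reward only regions that no agent has explored: use the least novelty.
    return min(nov)

def covering(i: int, nov: List[float]) -> float:
    # Reward only where agent i finds the observation more novel than average.
    return nov[i] if nov[i] > mean(nov) else 0.0

def burrowing(i: int, nov: List[float]) -> float:
    # Reward only where agent i finds the observation less novel than average,
    # pushing it to keep digging into regions it has already started exploring.
    return nov[i] if nov[i] < mean(nov) else 0.0

def leader_follower(i: int, nov: List[float]) -> float:
    # First agent burrows; the remaining agents cover the space it opens up.
    return burrowing(i, nov) if i == 0 else covering(i, nov)
```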
LearningforMulti-AgentExploration For many tasks, it is impossible to know a priori what type of exploration may provide the right inductive bias. Furthermore, the type that is most helpful could change over the course of training if the task is suciently complex. In this section we present our approach for simultaneously learning policies trained with dierent types of intrinsic rewards and dynamically selecting the best one. Simultaneous Policy Learning In order to learn policies for various types of intrinsic rewards in parallel, we utilize a shared replay buer and o-policy learning to maximize sample eciency. In other words, we learn policies and value functions forall intrinsic reward types fromall collected data, regardless of which policies it was collected by. This parallel learning is made possible by the fact that we can compute our novelty functions o-policy, given the observations for each agent after each environment transition, which are saved in a replay buer. In Figure 4.2 we visualize our model architecture. We share a critic base and split extrinsic and intrinsic return heads as in Burda et al. [21]. We learn separate heads for each agent i2f1:::ng and rewardj2f1:::mg wherem is the total number of intrinsic reward types that we are considering. For policies, each agent learns its own base that is shared across separate heads for all intrinsic reward types. Our specic learning algorithm is adapted from the multi-agent Soft-Actor-Critic method presented in Iqbal and Sha [68]. 47 = shared across agents and reward types = specific to each agent and reward combination = shared across reward types Critics Policies Figure 4.2: Diagram of our model architecture. Colors indicate how parameters are shared.i indexes agents, whilej indexes reward types. The policy for agenti, trained using rewardj (in addition to extrinsic rewards), is represented by j i . The parameters of this policy are j i =f share i ; j i g, where share i is the shared base/input (for agenti) and j i is a head/output specic to this reward type. The extrinsic critic for policy head j i is represented by Q ex i;j . It takes as input the global states and the actions of all other agentsa ni , and it outputs the expected returns under policy j i for each possible action that agenti can take, given all other agents’ actions. The parameters of this critic are ex i;j =f share ; ex i;j g where share is a shared base across all agents and reward types. A critic with similar structure exists for predicting the intrinsic returns of actions taken by j i , represented byQ in i;j , which uses the parameters: in i;j =f share ; in i;j g. Note that the intrinsic critics share the same base parameters share . 48 In our notation, we use the absence of a subscript or superscript to refer to a group. For example j , refers to all agents’ policies trained on intrinsic rewardj. We train our critics with the actor-critic Q-function loss (Equation 2.2), using the following target,y ex/in i;j , for the extrinsic and intrinsic critics: r ex/in + E a 0 j " Q ex/in i;j (s 0 ;a 0 ) log( j i (a 0 i jo 0 i )) # (4.1) The extrinsic target,y ex i;j , uses the Dec-POMDP’s reward functionr ex (s;a), while the intrinsic target,y in i;j uses the intrinsic rewardr in i;j (o 0 i ). Note that the intrinsic rewards depend on the observation resulting from the actions taken,o 0 i . Q ex/in i;j refers to the target Q-function, an exponential weighted average of the past Q-functions, used for stability. j i are similarly updated target policies. 
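Before turning to the policy update, a minimal sketch of this parameter-sharing scheme for the critics may be helpful, assuming discrete actions and a flat critic input built from the global state and the other agents' actions. All names and sizes are illustrative assumptions, not the released architecture.

```python
# Illustrative critic with a shared base and per-(agent, reward-type) heads,
# in the spirit of Figure 4.2.
import torch
import torch.nn as nn

class SharedCritic(nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_actions: int,
                 n_agents: int, n_reward_types: int):
        super().__init__()
        # Base shared across all agents and all reward types (theta^share).
        self.base = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

        def heads() -> nn.ModuleList:
            return nn.ModuleList([
                nn.ModuleList([nn.Linear(hidden, n_actions)
                               for _ in range(n_reward_types)])
                for _ in range(n_agents)])

        self.ex_heads = heads()   # extrinsic heads, one per (agent i, reward j)
        self.in_heads = heads()   # intrinsic heads, one per (agent i, reward j)

    def forward(self, critic_in: torch.Tensor, i: int, j: int):
        """critic_in encodes the global state and the other agents' actions;
        returns Q^ex_{i,j} and Q^in_{i,j} for each of agent i's possible actions."""
        h = self.base(critic_in)
        return self.ex_heads[i][j](h), self.in_heads[i][j](h)
```

Each head is regressed toward its own target y^ex_{i,j} or y^in_{i,j} from Eq. (4.1), while gradients from every head update the shared base.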
We train each policy head with the following gradient: r j i J( j i ) =E (s;o)D;a j h r j i log j i (a i jo i ) log j i (a i jo i ) +A j i (s;a) !# (4.2) A j i (s;a) =Q ex i;j (s;a) +Q in i;j (s;a)V j i (s;a ni ) (4.3) V j i (s;a ni ) = X a 0 i 2A i j i (a 0 i jo i )(Q ex i;j (s;fa 0 i ;a ni g) +Q in i;j (s;fa 0 i ;a ni g)) (4.4) where is a scalar that determines the weight of the intrinsic rewards, relative to extrinsic rewards, and A j i is a multi-agent advantage function [44, 68], used for helping with multi-agent credit assignment. We update all policy and critic heads with each environment transition sample. DynamicPolicySelection Now that we have established a method for simultaneously learning policies using dierent intrinsic reward types, we must devise a means of selecting between these policies when collecting data in the environment (i.e., rollouts). In order to select policies to use for rollouts, we must consider which policies maximize extrinsic returns, while taking into account the fact that there may still be “unknown unknowns,” or regions that the agents have not seen yet where they may be able to further 49 (a) (b) (c) (d) Figure 4.3: (a) Rendering of the xed map in our gridworld domain. (b) Randomly generated map used forFlip-Task task.n doors to new paths open every time eithertask1 ortask2 is completed. (c) Top-Down view of VizDoom “My Way Home” map, modied for multi-agent experiments. (d) Egocentric view in VizDoom used for agents’ observations. increase their extrinsic returns. As such, we must learn a meta-policy that selects from the set of policies trained on dierent intrinsic rewards to maximize extrinsic returns while maintaining some degree of stochasticity. We parameterize the selector policy with a vector,, that contains an entry for every reward type. The probability of sampling headj is: (j)/ exp([j]). We sample from the meta-policy at the start of each rollout. Using policy gradients, we train the policy selector, , to maximize extrinsic returns: r J() =E h h r log (h) log (h) +R ex h b i (4.5) R ex h = T X t=0 t r ex (s t ;a t )ja h ; b = m X h 0 (h 0 ) h 0 (4.6) h is a running mean of the returns received by headh in the past, and is a parameter similar to for the low-level policies, which promotes entropy in the selector policy. Entropy in the policy selector prevents it from collapsing onto a single exploration type that exploits a local optimum. 4.4 Experiments We rst train policies with each intrinsic reward function dened in Fig. 4.1, then compare our approach as well as several baselines and ablations. We refer to the best performing reward type for each setting as 50 the “non-adaptive oracle”. “Non-adaptive” is used to contrast with our approach which can adapt dierent exploration strategies during training, while “oracle” is used since we do not know a priori which type will perform best. In our experiments we validate the following hypotheses: 1) Multi-agent intrinsic reward functions improve performance on tasks requiring coordination, 2) Our approach matches the performance of the non-adaptive oracle without training separate policies, and 3) In tasks requiring changing coordination strategies, our method outperforms the non-adaptive oracle. Tasks Tasks for testing single-agent exploration typically revolve around navigation of an environment with sparsely distributed rewards (e.g. Montezuma’s Revenge [112, 148, 21], VizDoom [114], etc). 
In the multi- agent setting, we dene tasks which similarly consider navigation with very sparse rewards, while requiring varying modalities of coordination across agents. These tasks involve collecting the items spread around a map (displayed in yellow in Figure 4.3): task1: Agents must cooperatively collectall treasure on the map in order to complete the task. Ideally, agents should spread out in order to solve the task eectively. task2: Agents must all collect the same treasure. Thus, agents would ideally explore similar regions concurrently. task3: Agents must all collect the specic treasure assigned to them, requiring no coordination across agents. task1 andtask2 need to solve coordination problems, as an individual agent can repeat the same behavior and receive drastically dierent returns depending on the behavior of other agents. task3 is intended as a sanity check where independent exploration should perform best. All 3 tasks are tested on the maps pictured in Figures 4.3a and 4.3c. Flip-Task is a task where the modality of required coordination changes as agents progress, akin to the search and rescue task mentioned in the introduction. This task is tested on randomly generated maps, an example of which is pictured in Figure 4.3b. InFlip-Task agents begin in a central room withn 51 branching paths available (where treasures are placed at the furthest available point) and must solve either task1 ortask2 with respect to the available treasures. Once this task is complete, the next set ofn paths (blocked by the light brown doors) opens up for which the task will be the opposite of the previous task (1 ! 2, 2! 1), requiring agents to adapt their exploration strategy after they learn to solve the rst task. Agents receive a negative time penalty in their extrinsic rewards at each step, so they are motivated to complete the task as quickly as possible. The only positive extrinsic reward comes from any agent collecting a treasure allowed by the specic task, and rewards are shared between all agents. Domains We rst test our approach using a multi-agent gridworld domain (pictured in Fig. 4.3a and 4.3b). Then, in order to test our method’s ability to scale to more complex 3D environments with visual observations, we test on the VizDoom framework [75]. The novelty function for each agentf i , which is used for calculating the intrinsic rewards in Figure 4.1, is dened as 1 N , whereN is the number of times the agent has visited its current cell and is a decay rate selected as a hyperparameter (we nd = 0:7 works well for our purposes). Since VizDoom is not a discrete domain, we discretize agents’ (x;y) positions into bins and use the counts for these bins. BaselinesandAblations We consider another approach to adapting single-agent intrinsic rewards to MARL in Centralized, where we provide intrinsic rewards to all agents as if they were a single agent. In other words, we use the inverse count of the number of timesall agents have jointly taken up their combined positions. We also evaluate QMIX [119], a state-of-the-art method in cooperative MARL, and MAVEN [102], which builds on QMIX by incorporating committed temporally extended exploration. We use the authors’ open-sourced code for these comparisons. 
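Before describing the remaining ablations, a short sketch of the policy selector from Eqs. (4.5)-(4.6) may be useful, since the ablations introduced below modify it directly: Multi (Uniform Meta-Policy) corresponds to sampling heads uniformly regardless of the learned parameters, and Multi (No Entropy) corresponds to setting the entropy weight to zero. The names, hyperparameter values, and optimizer choice in this sketch are illustrative assumptions.

```python
# Illustrative policy selector: softmax over a learned vector mu, sampled once
# per rollout, trained with a REINFORCE-style update and a running-mean baseline.
import torch

class PolicySelector:
    def __init__(self, n_heads: int, beta: float = 0.01, lr: float = 0.01,
                 ema: float = 0.05):
        self.mu = torch.zeros(n_heads, requires_grad=True)
        self.opt = torch.optim.SGD([self.mu], lr=lr)
        self.beta = beta                      # entropy weight for the selector
        self.ema = ema                        # step size for running-mean returns
        self.eta = torch.zeros(n_heads)       # running mean of returns per head

    def sample_head(self) -> int:
        # Called at the start of each rollout.
        probs = torch.softmax(self.mu, dim=0)
        return int(torch.multinomial(probs, 1).item())

    def update(self, head: int, ep_return: float) -> None:
        # Update the running mean of returns for the head that generated the rollout.
        self.eta[head] += self.ema * (ep_return - self.eta[head])
        probs = torch.softmax(self.mu, dim=0)
        log_p = torch.log(probs[head])
        baseline = (probs.detach() * self.eta).sum()       # b in Eq. (4.6)
        # Score-function gradient with an entropy bonus (Eq. 4.5); minimize the negative.
        loss = -log_p * ((ep_return - baseline) - self.beta * log_p.detach())
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```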
Table 4.1: # of treasures found, with standard deviation across 6 runs. Our method (Multi) matches or outperforms the non-adaptive oracle (NAO) in nearly all settings.

Gridworld
  Task        n    NAO Type       NAO            Multi
  1           2    Burrowing      1.98 ± 0.06    2.00 ± 0.00
  1           3    Burrowing      2.06 ± 1.05    2.23 ± 0.73
  1           4    Burrowing      1.90 ± 0.49    2.04 ± 0.61
  2           2    Independent    2.00 ± 0.00    1.83 ± 0.41
  2           3    Lead-Follow    3.00 ± 0.00    1.80 ± 0.71
  2           4    Lead-Follow    2.66 ± 2.06    2.54 ± 1.21
  3           2    Independent    1.39 ± 0.94    2.00 ± 0.00
  3           3    Independent    1.68 ± 0.70    2.21 ± 0.91
  3           4    Burrowing      2.14 ± 1.49    1.73 ± 0.47
  Flip-Task   2    Independent    3.04 ± 1.33    4.03 ± 0.97
VizDoom
  1           2    Burrowing      1.94 ± 0.10    1.98 ± 0.03
  2           2    Lead-Follow    1.93 ± 0.10    1.23 ± 0.65
  3           2    Minimum        0.64 ± 1.05    1.64 ± 0.63

Finally, we conduct ablation studies to determine the effectiveness of our meta-policy in balancing exploration and exploitation. Multi (Uniform Meta-Policy) samples action policies at random, and Multi (No Entropy) does not incorporate entropy to encourage exploration. We also tested the base multi-agent SAC algorithm without intrinsic rewards (MA-SAC).

Results Results (non-adaptive oracle and our approach) in all settings are summarized in Table 4.1, and several training curves are found in Figure 4.4. We note that the type of the non-adaptive oracle is frequently not Independent, indicating that the multi-agent intrinsic rewards introduced in Section 4.3 are effective in comparison to a naive application of single-agent intrinsic reward methods.

Matching the Oracle Performance with a Meta-Policy We find our approach is competitive with the non-adaptive oracle in nearly all tasks, while only needing the same number of samples as a single run (a fairer comparison would allow our method to train for the sum of samples used to identify the oracle). This performance is exciting, as our method receives no prior information about the optimal type of exploration, while each type carries its own bias. Furthermore, we find that our results on the more complex VizDoom domain mirror those in the gridworld, indicating that our methods are not limited to discrete domains, assuming a reliable way of measuring the novelty of observations exists.

Figure 4.4: (Top Left) Mean number of treasures found per episode on task 1 with 2 agents in the gridworld domain. (Top Right) Ablations and baselines in the same setting. (Bottom Left) Task 3 with 2 agents on VizDoom. (Bottom Right) Flip-Task with 2 agents on gridworld. Shaded regions are a 68% confidence interval across 6 runs of the running mean over the past 100 episodes.

Advantages of Adaptive Strategies In two cases we find our approach is able to surpass the performance of the non-adaptive oracle: Task 3 and Flip-Task. In the case of Task 3, this is interesting as rewards for each agent are assigned independently, so coordination is not strictly necessary. In this case, multi-agent intrinsic rewards introduce useful biases that naturally divide the space into explorable regions for each agent, reducing the chances of "detachment" [38] (which we find independent exploration to suffer from); however, in this task they may divide the space into the wrong regions for each agent to get their rewards.
Our approach succeeds due to its ability to reap the benets of multi-agent intrinsic rewards while being able to switch strategies if one is not working. The success onTask3 suggests that our approach may be benecial in the single-agent setting when potential paths of exloration are numerous (e.g. by training several policies for one task and treating them as separate agents in our reward functions). Our approach is similarly successful onFlip-Task. Due to the adaptability of our meta-policy, the agents are able to switch their exploration type to suit the next task after they learn to solve the rst one. Note that all sets of exploration policies will learn to solve the rst task once any single one does since they share experience and are trained on a combination of extrinsic and intrinsic rewards. As such, the meta-policy can switch to an exploration strategy best suited for the next task, knowing that it will reliably solve the previous task. TheUnexpectedChallengeof task2 We nd our approach is unable to match the performance of the non-adaptive oracle onTask2 in certain cases (gridworld with 3 agents and VizDoom). This lack of success may be an indication that the exploration strategies which perform well in these settings require commitment to a single strategy early on in training, highlighting a limitation of our approach. Our method requires testing out all policies until we nd one that reaches high extrinsic rewards, which can dilute the eectiveness of exploration early on. 55 AblationsandBaselines We test against several baselines in the gridworld setting onTask1 with 2 agents (top right of Fig. 4.4). We nd that those results show the same patterns asTask1. We nd that our approach balances meta-exploration and exploitation by outperforming both Multi (Uniform Meta-Policy) (pure explore) and Multi (No Entropy) (pure exploit). While Centralized will ensure the global state space is thoroughly searched, it lacks the inductive biases toward spatial coordination that our reward functions incorporate. As such, it does not learn as eciently as our method. Finally, all three of MA-SAC, QMIX, and MAVEN fail to learn in our setting. None of these incorporate novelty-seeking exploration, which are crucial in sparse reward domains. 4.5 Conclusion We propose a framework for designing multi-agent intrinsic reward functions with diverse properties, and compare a varied set on several multi-agent exploration tasks in a gridworld domain as well as in VizDoom. Overall, we can see that cooperative multi-agent tasks can, in many cases, benet from intrinsic rewards that take into account what other agents have explored, but there are various ways to incorporate that information, each resulting in dierent coordinated behaviors. We show that our method is capable of matching or surpassing the performance of the non-adaptive oracle on various tasks while using the same number of samples collected from the environment. Furthermore, we show that adaptation of exploration type over the course of training can overcome the limitations of choosing a xed exploration type. 56 Chapter5 RandomizedEntity-WiseFactorizationforMulti-AgentReinforcement Learning Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities; however, common patterns of behavior often emerge among these agents/entities. 
Our method aims to leverage these commonalities by asking the question: “What is the expected utility of each agent when only considering a randomly selected sub-group of its observed entities?” By posing this counterfactual question, we can recognize state-action trajectories within sub-groups of entities that we may have encountered in another task and use what we learned in that task to inform our prediction in the current one. We then reconstruct a prediction of the full returns as a combination of factors considering these disjoint groups of entities and train this “randomly factorized" value function as an auxiliary objective for value-based multi-agent reinforcement learning. By doing so, our model can recognize and leverage similarities across tasks to improve learning eciency in a multi-task setting. Our approach,Randomized Entity-wiseFactorization forImaginedLearning (REFIL), outperforms all strong baselines by a signicant margin in challenging multi-task StarCraft micromanagement settings. 57 5.1 Introduction Multi-agent reinforcement learning techniques often focus on learning in settings with xed groups of agents and entities; however, many real-world multi-agent settings contain tasks across which an agent must deal with varying quantities and types of cooperative agents, antagonists, or other entities. This variability in type and quantity of entities results in a combinatorial growth in the number of possible congurations, aggravating the challenge of learning control policies that generalize. For example, the sport of soccer exists in many forms, from casual 5 vs. 5 to full scale 11 vs. 11 matches, with varying formations within each consisting of dierent quantities of player types (defenders, midelders, forwards, etc.). Within these varied tasks, however, exist common patterns. For instance, a “breakaway” occurs in soccer when an attacker with the ball passes the defense and only needs to beat the goalkeeper in order to score (Figure 5.1). The goalkeeper and attacker can apply what they have learned in a breakaway to the next one, regardless of the task (e.g., 5 vs. 5). If players can disentangle their understanding of common patterns from their surroundings, they should be able to learn more eciently as well as share their experiences acrossall forms of soccer. These repeated patterns within sub-groups of entities can, in fact, be found in a wide variety of multi-agent tasks (e.g., heterogeneous swarm control [118] and StarCraft unit micromanagement [123]). Our work aims to develop a methodology for articial agents to incorporate knowledge of these shared patterns to accelerate learning in a multi-task setting. One way to leverage structural independence among agents, as in our soccer example, is to represent value functions as a combination of factors that depend on disjunct subsets of the state and action spaces [81]. These subsets are typically xed in advance using domain knowledge and thus do not scale to complex domains where dependencies are unknown and may shift over time. Recent approaches (e.g. VDN [141], QMIX [119]) in cooperative deep multi-agent reinforcement learning (MARL) factor value functions into separate components for each agent’s action and observation space in order to enable decentralized execution. These approaches learn a utility function for each agent that depends on the agent’s own action 58 Figure 5.1: Breakaway sub-scenario in soccer. 
Agents in the yellow square can generalize this experience to similar subsequent experiences, regardless of the state of agents outside the square. and observations, resulting in a unique observation space for each task and exacerbating the challenge of learning in a multi-task setting. How can we teach agents to be “situationally aware” of common patterns that are not pre-specied, such that they can share knowledge across tasks? Our main idea is as follows: Given observed trajectories in a real task, we randomly partition entities into sub-groups to “imagine” that agents only observe a (random) subset of the entities they actually observe. Then, in addition to estimating utility of their actions given the full observations, we use the same model to predict utilities in the imagined scenario, providing an opportunity to discover sub-group patterns that appear across tasks. For example, we might sample a breakaway (or a superset of the breakaway entities) in both 5v5 soccer and 11v11, allowing our model to share value function factors across tasks. We can then use these factors to construct a prediction of the full returns. Of course, the possibility of sampling sub-groups that do not contain independent behavior exists. Imagine a sub-group that looks like a breakaway, but in reality a defender is closing in on the attacker’s left. In such cases, we must include factors that account for the eect interactions outside of sampled sub-groups on each agent’s utility. Crucially however, the estimated utility derived from imagining a breakaway often provides at least some information as to the agent’s utility given the full observation (i.e., the agent knows 59 there is value in dribbling toward the goal). Imagined sub-group factors are combined with interaction factors to produce an estimate of the value function that we train as an auxiliary objective on top of a standard value function loss. As such, our approach allows models to exploit shared inter-task patterns via factorization without losing any expressivity. We emphasize that this approach does not rely on sampling an “optimal” sub-group. In other words, there is no requirement to sample sub-groups that are independent from one another (c.f. Section 5.3). In fact, it is useful to learn a utility function for any sub-group state that may appear in another task. Our approach: RandomizedEntity-wiseFactorization forImaginedLearning (REFIL) can be imple- mented easily in practice by using masks in attention-based models. We evaluate our approach on complex StarCraft Multi-Agent Challenge (SMAC) [123] multi-task settings with varying agent teams, ndingREFIL attains improved performance over state-of-the-art methods. 5.2 Methods AttentionLayersandModels Attention models have recently generated intense interest due to their ability to incorporate information across large contexts. Importantly for our purposes, they are able to process variable sized sets of inputs. We now formally dene the building blocks of our attention models. 
Given the inputX, a matrix where the rows correspond to entities, we dene an entity-wise feedforward layer as a standard fully connected layer that operates independently and identically over entities: eFF(X;W;b) =XW +b > ;X2R n x d ;W2R dh ;b2R h (5.1) Now, we specify the operation that denes an attention head, given the additional inputs ofSZ [1;n x ] , a set of indices that selects which rows of the inputX are used to compute queries such thatX S; 2R jSjd , 60 andM, a binary obserability mask specifying which entities each query entity can observe (i.e.M i;j = 1 wheni2S can incorporate information fromj2Z [1;n x ] into its local context): Atten(S;X;M;W Q ;W K ;W V ) = softmax mask QK > p h ;M V 2R jSjh (5.2) Q =X S; W Q ;K =XW K ;V =XW V ; M2f0; 1g jSjn x ;W Q ;W K ;W V 2R dh (5.3) The mask(Y;M) operation takes two equal sized matrices and lls the entries ofY with1 in the indices whereM is equal to 0. After the softmax, these entries become zero, thus preventing the attention mechanism from attending to specic entities. This masking procedure is used in our case to uphold partial observability, as well as to enable “imagining” the utility of actions within sub-groups of entities. Only one attention layer is permitted in the decentralized execution setting; otherwise information from unseen agents can be propagated through agents that are seen.W Q ,W K , andW V are all learnable parameters of this layer. Queries,Q, can be thought of as vectors specifying the type of information that an entity would like to select from others, while keys,K, can be thought of as specifying the type of information that an entity possesses, and nally, values,V , hold the information that is actually shared with other entities. We dene multi-head-attention as the parallel computation of attention heads as such: MHA (S;X;M) = concat Atten S;X;M;W Q j ;W K j ;W V j ;j2 1:::n h (5.4) The size of the parameters of an attention layer does not depend on the number of input entities. Furthermore, we receive an output vector for each query vector. AugmentingQMIXwithAttention The standard QMIX algorithm relies on a xed number of entities in three places: inputs of the agent-specic utility functionsQ a , inputs of the hypernetwork, and the number of utilities entering the mixing network, 61 which must correspond the output of the hypernetwork since it generates the parameters of the mixing network. QMIX uses multi-layer perceptrons for which all these quantities have to be of xed size. In order to adapt QMIX to the variable agent quantity setting, such that we can apply a single model across all tasks, we require components that accept variable sized sets of entities as inputs. By utilizing attention mechanisms, we can design components that are no longer dependent on a xed number of entities taken as input. We dene the following inputs:X E ei := s e i ; 1 i d;e2E;M ae := (s a ;s e );a2A;e2E. The matrixX E is the global states reshaped into a matrix with a row for each entity, andM is a binary observability matrix which enables decentralized execution, determining which entities are visible to each agent. UtilityNetworks While the standard agent utility functions map a at observation, whose size depends on the number of entities in the environment, to a utility for each action, our attention-utility functions can take in a variable sized set of entities and return a utility for each action. 
Augmenting QMIX with Attention

The standard QMIX algorithm relies on a fixed number of entities in three places: inputs of the agent-specific utility functions Q^a, inputs of the hypernetwork, and the number of utilities entering the mixing network, which must correspond to the output of the hypernetwork since it generates the parameters of the mixing network. QMIX uses multi-layer perceptrons for which all these quantities have to be of fixed size. In order to adapt QMIX to the variable agent quantity setting, such that we can apply a single model across all tasks, we require components that accept variable-sized sets of entities as inputs. By utilizing attention mechanisms, we can design components that are no longer dependent on a fixed number of entities taken as input. We define the following inputs: X^E_{e,i} := s^e_i for 1 ≤ i ≤ d and e ∈ E, and M_{a,e} := μ(s^a, s^e) for a ∈ A and e ∈ E. The matrix X^E is the global state s reshaped into a matrix with a row for each entity, and M is a binary observability matrix which enables decentralized execution, determining which entities are visible to each agent.

Utility Networks

While the standard agent utility functions map a flat observation, whose size depends on the number of entities in the environment, to a utility for each action, our attention-utility functions can take in a variable-sized set of entities and return a utility for each action. The attention layer output for agent a is computed as MHA({a}, X̃, M), where X̃ is a row-wise transformation of X^E (e.g., the output of an entity-wise feedforward layer). If agents share parameters, the layer can be computed in parallel for all agents by providing A instead of {a}, which we do in practice.

Generating Dynamic Sized Mixing Networks

Another challenge in devising a QMIX algorithm for variable agent quantities is to adapt the hypernetworks that generate weights for the mixing network. Since the mixing network takes in utilities from each agent, we must generate feedforward mixing network parameters that change in size depending on the number of agents present, while incorporating global state information. Conveniently, the number of output vectors of a MHA layer depends on the cardinality of the input set S, and we can therefore generate mixing parameters of the correct size by using S = A and concatenating the vectors to form a matrix with one dimension depending on the number of agents and the other depending on the number of hidden dimensions. Attention-based QMIX (QMIX (Attention)) trains these models using the standard DQN loss (Eq. 2.5). Our two-layer mixing network requires the following parameters to be generated: W_1 ∈ R_+^{|A| × h_m}, b_1 ∈ R^{h_m}, w_2 ∈ R_+^{h_m}, b_2 ∈ R, where h_m is the hidden dimension of the mixing network and |A| is the number of agents. Note from Eq. (5.2) that the output size of the layer is dependent on the size of the query set. As such, using attention layers, we can generate a matrix of size |A| × h_m by specifying the set of agents, A, as the set of queries S from Eq. (5.2). We do not need observability masking since hypernetworks are only used during training and can be fully centralized. For each of the four components of the mixing network (W_1, b_1, w_2, b_2), we introduce a hypernetwork that generates parameters of the correct size. Thus, for the parameters that are vectors (b_1 and w_2), we average the matrix generated by the attention layer across the |A|-sized dimension, and for b_2, we average all elements. This procedure enables the dynamic generation of mixing networks whose input size varies with the number of agents. Assuming q = [Q^1(τ^1, u^1), …, Q^n(τ^n, u^n)], Q_tot is computed as:

Q_tot(s, τ, u) = σ( q^⊤ W_1 + b_1^⊤ ) w_2 + b_2    (5.5)

where σ is an ELU nonlinearity [32].

Figure 5.2: Schematic for REFIL. Values colored orange or blue are used for computing Q_tot and Q_tot^aux respectively. (left) Agent-specific utility networks. These are decentralizable due to the use of an observability mask (M). We include Gated Recurrent Units [30] to retain information across timesteps in order to handle partial observability. (top center) Hypernetworks used to generate weights for the mixing network. We use a softmax function on the weights across the hidden dimension to enforce non-negativity, which we find empirically to be more stable than the standard absolute value function.
Hypernetworks are not restricted by partial observability since they are only required during training and not execution. (top right) The mixing network used to calculate Q_tot. (bottom right) Procedure for performing randomized entity-wise factorization. For masks M_I and M_O, colored spaces indicate a value of 1 (i.e., the agent designated by the row will be able to see the entity designated by the column), while white spaces indicate a value of 0. The color indicates which group the entity belongs to, so agents in the red group see red entities in M_I and blue entities in M_O. Agents are split into sub-groups and their utilities are calculated both for interactions within their group and to account for the interactions outside of their group, then monotonically mixed to predict Q_tot^aux.

Randomized Entity-wise Factorization for Imagined Learning

We now propose Randomized Entity-wise Factorization for Imagined Learning (REFIL). We observe that common patterns often emerge in sub-groups of entities within complex multi-agent tasks (cf. soccer breakaway example in §5.1) and hypothesize that learning to predict agents’ utilities within sub-groups of entities is a strong inductive bias that allows models to share information more freely across tasks. We instantiate our approach by constructing an estimate of the value function from factors based on randomized sub-groups, sharing parameters with the full value function, and training this factorized version of the value function as an auxiliary objective.

Random Partitioning

Given an episode sampled from a replay buffer, we first randomly partition all entities in E into two disjoint groups, held fixed for the episode. We denote the partition by a random binary vector m ∈ {0, 1}^{|E|}. (We first draw p ∈ (0, 1) uniformly, followed by |E| independent draws from a Bernoulli(p) distribution; partitioning into two groups in this way induces a uniform distribution over all possible sub-groups.) The entry m_e indicates whether entity e is in the first group, and the negation ¬m_e represents whether e is in the second group. The sub-vector of entries corresponding to agents is denoted m^A := [m_a]_{a ∈ A}. With these vectors, we construct binary attention masks M_I and M_O:

M_I := m^A m^⊤ ∨ ¬m^A ¬m^⊤,  M_O := ¬M_I    (5.6)

where M_I[a, e] indicates whether agent a and entity e are in the same group, and M_O[a, e] indicates the opposite. They are further combined with a partial observability mask M, which is provided by the environment, to generate the final attention masks:

M_I := M ∧ M_I,  M_O := M ∧ M_O    (5.7)

These matrices are of size |A| × |E| and will be used by the multi-head attention layers to constrain which entities can be observed by agents.

Counterfactual Reasoning

Given an imagined partition m, an agent a can examine its history of observations and actions and reason counterfactually about what its utility would be had it solely observed the entities in its group. We call this quantity the in-group utility and denote it by Q^a_I(τ^a_I, u^a; θ_Q). In order to account for the potential interactions with entities outside of the agent’s group, we calculate an out-group utility, Q^a_O(τ^a_O, u^a; θ_Q). Note that the real and imagined utilities share the same parameters θ_Q, allowing us to leverage imagined experience to improve utility prediction in real scenarios and vice versa. Breaking the fully observed utilities Q^a into these randomized sub-group factors is akin to breaking an image into cut-outs of the comprising entities. While the “images” (i.e., states) from each task are a unique set, it is likely that the pieces comprising them share similarities.

Figure 5.3: Group Matching Game. (a) Game Visualization. (b) Win Rate over Time. We use the values n_a = 8, n_c = 6, and n_g = 2 in our experiments.
Shaded region is a 95% confidence interval across 24 runs.

Since we do not know the returns within the imagined sub-groups, we must ground our predictions in the observed returns. Just as QMIX learns a value function with n factors (Q^a for each agent), we learn an imagined value function with 2n factors (Q^a_I and Q^a_O for each agent) that estimates the same value:

Q_tot = g( Q^1, …, Q^{|A|}; h(s; θ_h, M) )
Q_tot^aux = g( Q^1_I, …, Q^{|A|}_I, Q^1_O, …, Q^{|A|}_O; h(s; θ_h, M_I), h(s; θ_h, M_O) )    (5.8)

where g(·) are mixing networks whose parameters are generated by hypernetworks h(s; θ_h, M). This network’s first layer typically takes n inputs, one for each agent. Since we have 2n factors, we simply concatenate two generated versions of the input layer (using M_I and M_O). We then apply the network to the concatenated utilities Q^a_I(τ^a_I, u^a) and Q^a_O(τ^a_O, u^a) of all agents a to compute the predicted value Q_tot^aux. This procedure is visualized in Figure 5.2. Importantly, since the mixing network is generated from the full state context, our model can weight factors contextually. For example, if agent a’s sampled sub-group contains all relevant information to compute its utility such that Q^a_I ≈ Q^a, then the mixing networks can weight Q^a_I more heavily than Q^a_O. Otherwise, the networks learn to balance Q^a_I and Q^a_O for each agent in order to estimate Q_tot. In this way, we can share knowledge in similar sub-group states across tasks while accounting for the differences in utility that result from the out-of-group context.

Learning

We now describe the overall learning objective of REFIL. To enforce Equation 5.8, we replace Q_tot in Equation 2.5 with Q_tot^aux, resulting in a new loss L_aux. We combine the standard QMIX loss (Eq. 2.5), L_Q, with this auxiliary loss to form:

L = (1 − λ) L_Q + λ E_m[ L_aux ]    (5.9)

where λ controls the tradeoff between the two losses. Note that we randomly partition in each episode, hence the expectation with respect to the partition vector m. We emphasize that the sub-groups are imagined: while we compute Q_tot^aux and its related quantities, we do not use them to select actions in Equation 2.5. Action selection is performed by each agent maximizing Q^a given their local observations. This greedy local action selection is guaranteed to maximize Q_tot due to the monotonic structure of the mixing network [119]. Moreover, our auxiliary objective is only used in training, and execution in the environment does not use random factorization. Treating random factorization as an auxiliary task, rather than as a representational constraint, allows us to retain the expressivity of QMIX value functions (without sub-group factorization) while exploiting the existence of shared sub-group states across tasks.

Implementation Details

The model architecture is shown in Figure 5.2. “Imagination” can be implemented efficiently using attention masks. Specifically, two additional passes through the network are needed per training step, with M_I and M_O as masks instead of M. These additional passes can be parallelized by computing all necessary quantities in one batch on the GPU. It is feasible to split entities into an arbitrary number i of random sub-groups without using more computation by sampling several disjoint vectors m_i and combining them in the same way as we combine m and ¬m in Equation 5.6 to form M_I and M_O. Doing so could potentially bias agents towards considering patterns within smaller subsets of entities.
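The partitioning and masking procedure (Eqs. 5.6–5.7) and the loss interpolation (Eq. 5.9) are simple to implement. Below is a hedged PyTorch-style sketch; it assumes agents occupy the first n_agents rows of the entity matrix and that M_obs is the environment-provided observability mask, and the loss terms L_q and L_aux are assumed to be computed elsewhere from Q_tot and Q_tot^aux.

```python
import torch

def sample_partition_masks(M_obs, n_agents, n_entities):
    """Sample a random entity partition and build attention masks (Eqs. 5.6-5.7).

    M_obs: (n_agents, n_entities) binary observability mask from the environment.
    Assumes agents correspond to the first n_agents entity rows.
    """
    # Draw p uniformly, then one Bernoulli(p) draw per entity; this induces a
    # uniform distribution over sub-group sizes.
    p = torch.rand(1)
    m = torch.bernoulli(p.expand(n_entities)).bool()   # entity -> group flag
    m_A = m[:n_agents]                                  # agent portion of m

    # M_I[a, e] = 1 iff agent a and entity e fall in the same group; M_O is the complement.
    M_I = (m_A.unsqueeze(1) & m.unsqueeze(0)) | ((~m_A).unsqueeze(1) & (~m).unsqueeze(0))
    M_O = ~M_I
    # Combine with partial observability so imagined views never exceed real ones.
    return M_obs.bool() & M_I, M_obs.bool() & M_O

def refil_loss(L_q, L_aux, lam):
    """Eq. 5.9: interpolate between the standard QMIX loss and the imagined auxiliary loss."""
    return (1 - lam) * L_q + lam * L_aux
```

In practice the masks would be sampled once per episode (held fixed across its timesteps) and the two extra masked passes batched together with the standard pass on GPU, as described above.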
5.3 Experiments

In our experiments, we aim to answer the following questions: 1) Are randomized counterfactuals an efficient means for leveraging common patterns? 2) Does our approach improve generalization in a multi-task setting? 3) Is training as an auxiliary objective justified? We begin with experiments in a simple domain constructed such that agents’ decisions rely only on a known subset of all entities, so we can compare our approach to those that use this domain knowledge. Then, we move on to testing on complex StarCraft micromanagement tasks to demonstrate our method’s ability to scale to complex domains.

Group Matching Game

In order to answer our first question, we construct a group matching game, pictured in Figure 5.3a, where each agent only needs to consider a subset of other agents to act effectively and we know that subset as ground truth (unlike in more complex domains such as StarCraft). Agents (of which there are n_a) are randomly placed in one of n_c cells and assigned to one of n_g groups (represented by the different colors) at the start of each episode. Each unique group assignment corresponds to a task. Agents can choose from three actions: move clockwise, stay, and move counter-clockwise. Their ultimate goal is to be located in the same cell as the rest of their group members, at which point an episode ends. There is no restriction on which cell agents form a group in (e.g., both groups can form in the same cell). All agents share a reward of 2.5 when any group is completed (and an equivalent penalty for a formed group breaking) as well as a penalty of -0.1 for each time step in order to encourage agents to solve the task as quickly as possible. Agents’ entity-state descriptions s^e include the cell that the agent is currently occupying as well as the group it belongs to (both one-hot encoded), and the task is fully observable. Notably, agents can act optimally while only considering a subset of observed entities.
Ground-truth knowledge of relevant entities enables us to disentangle two aspects of our approach: the use of entity-wise factorization in general, and specifically using randomly selected factors. We would like to answer the question: does our method rely on sampling the “right” groups of entities (i.e., those with no interactions between them), or is the randomness of our method a feature that promotes generalization? We construct two approaches that use this knowledge to build factoring masks M_I and M_O that are used in place of randomly sampled groups (otherwise the methods are identical to REFIL). REFIL (Fixed Oracle) directly uses the ground-truth group assignments (different for each task) to build masks. REFIL (Randomized Oracle) randomly samples sub-groups from the ground-truth groups only, rather than from all possible entities. We additionally train REFIL and QMIX (Attention) (i.e., REFIL with no auxiliary loss).
Figure 5.3b shows that using domain knowledge does not significantly improve performance in this domain (QMIX (Attention) vs. REFIL (Fixed Oracle)). In fact, our randomized factorization approach outperforms the use of domain knowledge. The randomization in REFIL therefore appears to be crucial. Our hypothesis is that randomization of sub-group factors enables better knowledge sharing across tasks. For example, the situation where two agents from the same group are located in adjacent cells occurs within all possible group assignments. When sampling randomly, our approach occasionally samples these two agents alone in their own group.
Even if the rest of the context in a given episode has never been seen before, as long as this sub-scenario has been seen, the model has some indication of the value associated with each action. Even when restricting the set of entities used to form sub-groups to those that we know can be relevant to each agent (REFIL (Randomized Oracle)), we find that performance does not significantly improve. These results suggest that randomized sub-group formation for REFIL is a viable strategy, and the main benefit of our approach is to promote generalization across tasks by breaking value function predictions into reusable components, even when the sampled sub-groups are not completely independent.

StarCraft

We next test on the StarCraft Multi-Agent Challenge (SMAC) [123]. The tasks in SMAC involve micromanagement of units in order to defeat a set of enemy units in battle. Specifically, we consider a multi-task setting where we train our models simultaneously on tasks with variable types and quantities of agents. We hypothesize that our approach is especially beneficial in this setting, as it should encourage models to learn utilities for common patterns and generalize to more diverse settings as a result. The dynamic setting involves minor modifications to SMAC, but we change the environment as little as possible to maintain the challenging nature of the tasks. In the standard version of SMAC, both state and action spaces depend on a fixed number of agents and enemies, so our modifications alleviate these problems.

Figure 5.4: Test win rate over time on multi-task StarCraft environments: (a) 3-8sz, (b) 3-8csz, (c) 3-8MMM. Tasks are sampled uniformly at each episode. Shaded region is a 95% confidence interval across 5 runs. (top row) Ablations of our method. (bottom row) Baseline methods.

In our tests we evaluate on three settings we call 3-8sz, 3-8csz, and 3-8MMM. 3-8sz pits symmetrical teams of between 3 and 8 agents against each other, where the agents are a combination of Zealots and Stalkers (inspired by the 2s3z and 3s5z tasks in the original SMAC), resulting in 39 unique tasks. 3-8csz pits symmetrical teams of between 0 and 2 Colossi and 3 to 6 Stalkers/Zealots against each other (inspired by 1c3s5z), resulting in 66 tasks. 3-8MMM pits symmetrical teams of between 0 and 2 Medics and 3 to 6 Marines/Marauders against each other (inspired by MMM and MMM2, again resulting in 66 tasks).

Table 5.1: Comparison of tested methods.

Name               | Imagined | Model     | Base Learning Algorithm
REFIL              | X        | MHA^1     | QMIX^2
QMIX (Attention)   |          | MHA       | QMIX
REFIL (VDN)        | X        | MHA       | VDN^3
VDN (Attention)    |          | MHA       | VDN
QMIX (Max Pooling) |          | Max-Pool  | QMIX
QMIX (EMP)         |          | EMP^4     | QMIX
ROMA (Attention)   |          | MHA       | ROMA^5
Qatten (Attention) |          | MHA       | Qatten^6
QTRAN (Attention)  |          | MHA       | QTRAN^7
REFIL (UPDeT)      | X        | UPDeT^8   | QMIX
QMIX (UPDeT)       |          | UPDeT     | QMIX

1: Vaswani et al. [152]  2: Rashid et al. [119]  3: Sunehag et al. [141]  4: Agarwal, Kumar, and Sycara [1]  5: Wang et al. [158]  6: Yang et al. [168]  7: Son et al. [135]  8: Hu et al. [65]
Ablations and Baselines

We introduce several ablations of our method, as well as adaptations of existing methods to handle variable-sized inputs. These comparisons are summarized in Table 5.1. QMIX (Attention) is our method without the auxiliary loss. REFIL (VDN) is our approach using summation to combine all factors, as in VDN, rather than a non-linear monotonic mixing network. VDN (Attention) does not include the auxiliary loss and uses summation for factor mixing. QMIX (Mean Pooling) is QMIX (Attention) with attention layers replaced by mean pooling. We also test max pooling but find the performance to be marginally worse than mean pooling. Importantly, for pooling layers we add entity-wise linear transformations prior to the pooling operations such that the total number of parameters is comparable to attention layers.
For baselines we consider some follow-up works to QMIX that improve the mixing network’s expressivity: QTRAN [135] and Qatten [168]. We also compare to a method that builds on QMIX by attempting to learn dynamic roles that depend on the context each agent observes: ROMA [158]. We additionally consider an alternative mechanism for aggregating information across variable sets of entities, known as Entity Message Passing (EMP) [1]. We specifically use the restricted communication setting where agents can only communicate with agents they observe, and we set the number of message passing steps to three. Finally, we consider the UPDeT architecture [65], a recent work that also targets the multi-task MARL setting. UPDeT utilizes domain knowledge of the environment to map entities to the specific actions that they correspond to. We train UPDeT with QMIX as well as REFIL. For all approaches designed for the standard single-task SMAC setting, we extend them with the same multi-head attention architecture that our approach uses.

Figure 5.5: (left) Simplified rendering of a common pattern that emerges across tasks in the 3-8csz SMAC setting, highlighted at t = 15. REFIL enables learning from each task to inform behavior in the others. (top right) Task-by-task performance on 3-8sz. REFIL generalizes better across a wider range of tasks. (bottom right) Varying λ for REFIL in 3-8sz.

Ablation Results

Our results on the multi-task StarCraft settings can be found in Figure 5.4. Tasks are sampled uniformly at each episode, so the curves represent average win rate across all tasks. We find that REFIL outperforms all ablations consistently in these settings. REFIL (VDN) performs much worse than both our approach and VDN (Attention), highlighting the importance of the mixing network handling contextual dependencies between entity partitions. Since the trajectory of a subset of entities can play out differently based on the surrounding context, it is important for our factorization approach to recognize and adjust for these situations. The use of mean pooling in place of attention also performs poorly, indicating that attention is valuable for aggregating information from variable-length sets of entities.
Baseline Results

We find that algorithms designed to improve on QMIX for the single-task MARL setting (ROMA, Qatten, QTRAN), when naively applied to the multi-task setting, do not see the same improvements. REFIL, on the other hand, consistently outperforms other methods, highlighting the unique challenge of learning in multi-task settings. In Fig. 5.5 (top right) we investigate the performance of REFIL compared to the two next best methods in the 3-8sz setting on a task-by-task basis. We evaluate each method on each task individually and rank the tasks by performance, plotting from left to right. We find that the performance gain of REFIL comes from generalizing performance across a wider range of tasks, hence the reduced rate of decay in task performance from best to worst. The entity aggregation method of EMP underperforms relative to the MHA module that we use. UPDeT is a related work that focuses on designing an architecture compatible with multiple tasks and variable entities and action spaces by utilizing domain knowledge to map entities to their corresponding actions. Despite adding this domain knowledge, QMIX (UPDeT) surprisingly underperforms in 2 of 3 settings, while performing similarly to REFIL on 3-8MMM; however, since UPDeT is an attention-based architecture, it is amenable to our proposed auxiliary training scheme. We find that applying random factorization to QMIX (UPDeT) improves its performance further in 3-8MMM as well as in 3-8sz. In the case of 3-8MMM, where the asymptotic win rates of REFIL (UPDeT) and QMIX (UPDeT) are similar, we find that REFIL (UPDeT) wins on average in 22% fewer time steps by targeting enemy Medivacs, a unit capable of healing its teammates. Targeting of Medivacs is an example of a common pattern that emerges across tasks which REFIL is able to leverage.

Role of Auxiliary Objective

In order to understand the role of training as an auxiliary objective (rather than entirely replacing the objective), we vary the value of λ to interpolate between two modes: λ = 0 is simply QMIX (Attention), while λ = 1 trains exclusively with random factorization. Our results on 3-8sz (Figure 5.5 (bottom right)) show that, similar to regularization methods such as Dropout [137], there is a sweet spot where performance is maximized before collapsing catastrophically. Training exclusively with random factorization does not learn anything significant. This failure is likely due to the fact that we use the full context in our targets for learning with imagined scenarios, as well as when executing our policies, so we still need to learn with it in training.

Qualitative Example of Common Pattern

Finally, we visualize an example of the sort of common patterns that REFIL is able to leverage (Fig. 5.5 (left)). Zealots (the only melee unit present) are weak to Colossi, so they learn to hang back and let other units engage first. Then, they jump in and intercept the enemy Zealots while all other enemy units are preoccupied, leading to a common pattern of a Zealot vs. Zealot skirmish (highlighted at t = 15). REFIL enables behaviors learned in these types of sub-groups to be applied more effectively across all tasks. By sampling groups from all entities randomly, we will occasionally end up with sub-groups that include only Zealots, and the value function predictions learned in these sub-groups can be applied not only to the task at hand, but to any task where a similar pattern emerges.
5.4 Related Work

Multi-agent reinforcement learning (MARL) is a broad field encompassing cooperative [44, 119, 141], competitive [12, 86], and mixed [99, 68] settings. This chapter focuses on cooperative MARL with centralized training and decentralized execution [110, CTDE]. Our approach utilizes value function factorization, an approach aiming to simultaneously overcome limitations of both the joint [57] and independent learning [31] paradigms. Early attempts at value function factorisation require a priori knowledge of suitable per-agent team reward decompositions or interaction dependencies. These include optimising over local compositions of individual Q-value functions learnt from individual reward functions [125], as well as summing individual Q-functions with individual rewards before greedy joint action selection [122]. Recent approaches from cooperative deep multi-agent RL learn value factorisations from a single team reward function by treating all agents as independent factors, requiring no domain knowledge and enabling decentralized execution. Value-Decomposition Networks (VDN) [141] decompose the joint Q-value function into a sum of local utility functions used for greedy action selection. QMIX [119, 120] extends such additive decompositions to general monotonic functions. Some works extend QMIX to improve the expressivity of mixing functions [135, 168], learn latent embeddings to help exploration [102] or learn dynamic roles [158], and encode knowledge of action semantics into network architectures [161].
Several recent works have addressed the topic of generalization and transfer across related tasks with varying agent quantities, though the learning paradigms considered and assumptions made differ from our approach. Carion et al. [25] devise an approach for assigning agents to tasks, assuming the existence of low-level controllers to carry out the tasks, and show that it can scale to much larger tasks than those seen in training. Burden [22] proposes a transfer learning approach using convolutional neural networks and grid-based state representations to scale to tasks of arbitrary size. Wang et al. [159] introduce a method to decompose action spaces into roles, which they show can transfer to tasks with larger numbers of agents by grouping new actions into existing clusters. They do not propose a model to handle the larger observation sizes, instead using a Euclidean distance heuristic to observe a fixed number of agents. Several approaches devise attention- or graph-neural-network-based models for handling variable-sized inputs and focus on learning curricula to progress on increasingly large/challenging settings [98, 11, 162, 1]. Most recently, Hu et al. [65] introduce a method for handling variable-size inputs and action spaces and evaluate their model on single-task to single-task transfer. In contrast to these curriculum and transfer learning approaches, we focus on training simultaneously on multiple tasks and specifically develop a training paradigm for improving knowledge sharing across tasks.

5.5 Conclusion

In this chapter we considered a multi-task MARL setting where we aim to learn control policies for variable-sized teams of agents. We proposed REFIL, an approach that regularizes value functions to share factors comprised of sub-groups of entities, in turn promoting generalization and knowledge transfer within and
Our results showed that our contributions yield significant average performance improvements across these tasks when training on them concurrently, specifically through improving generalization across a wider variety of tasks.

Part III
Task Structure

Chapter 6
Possibility Before Utility: Learning and Using Hierarchical Affordances

Reinforcement learning algorithms struggle on tasks with complex hierarchical dependency structures. Humans and other intelligent agents do not waste time assessing the utility of every high-level action in existence, but instead only consider ones they deem possible in the first place. By focusing only on what is feasible, or “afforded”, at the present moment, an agent can spend more time both evaluating the utility of and acting on what matters. To this end, we present Hierarchical Affordance Learning (HAL), a method that learns a model of hierarchical affordances in order to prune impossible subtasks for more effective learning. Existing works in hierarchical reinforcement learning provide agents with structural representations of subtasks but are not affordance-aware, and by grounding our definition of hierarchical affordances in the present state, our approach is more flexible than the multitude of approaches that ground their subtask dependencies in a symbolic history. While these logic-based methods often require complete knowledge of the subtask hierarchy, our approach is able to utilize incomplete and varying symbolic specifications. Furthermore, we demonstrate that, relative to non-affordance-aware methods, HAL agents are better able to efficiently learn complex tasks, navigate environment stochasticity, and acquire diverse skills in the absence of extrinsic supervision—all of which are hallmarks of human learning.

6.1 Introduction

Reinforcement learning (RL) methods have recently achieved success in a variety of historically difficult domains [106, 132, 156], but they continue to struggle on complex hierarchical tasks. Human-like intelligent agents are able to succeed in such tasks through an innate understanding of what their environment enables them to do. In other words, they do not waste time attempting the impossible. Gibson [47] coins the term “affordances” to articulate the observation that humans and other animals largely interpret the world around them in terms of which behaviors the environment affords them. While some previous works apply the concept of affordances to the RL setting, none of these methods easily translate to environments with hierarchical tasks. In this work, we introduce Hierarchical Affordance Learning (HAL), a method that addresses the challenges inherent to learning affordances over high-level subtasks, enabling more efficient learning in environments with complex subtask dependency structures.
Many real-world environments have an underlying hierarchical dependency structure (Fig. 6.1a), and successful completion of tasks in these environments requires understanding how to complete individual subtasks and knowing the relationships between them. Consider the task of preparing a simple pasta dish. Some sets of subtasks, like chopping vegetables or filling a pot with water, can be successfully performed in any order. However, there are many cases in which the dependencies between subtasks must be obeyed. For instance, it is inadvisable to chop vegetables after having mixed them with the sauce, or to boil a pot of water before the pot is filled with water in the first place.
Equipped with structural inductive biases that naturally allow for temporally extended reasoning over subtasks, hierarchical reinforcement learning (HRL) methods are well-suited for tasks with complex high-level dependencies. Existing HRL methods fall along a spectrum ranging from flexible approaches that discover useful subtasks automatically, to structured approaches that provide some prior information about subtasks and their interdependencies. The former set of approaches [e.g. 154, 40] have seen limited success, as the automatic identification of hierarchical abstractions is an open problem in deep learning [63]. But approaches that endow the agent with more structure, to make complex tasks feasible, do so at the cost of rigid assumptions. Methods that use finite automata (Fig. 6.1b) to express subtask dependencies [e.g. 66] require the set of symbols, or atomic propositions, provided to the agent to be complete, in that the history of symbols maps deterministically to the current context (i.e. how much progress has been made; which subtasks are available). Importantly, these methods and many others [e.g. 4, 134] consider subtasks to be dependent merely on the completion of others. Unfortunately, these assumptions do not hold in the real world (Fig. 6.1c). For instance, if one completes the subtask cook noodles, but then clumsily spills them all over the floor, is one now ready for the next subtask, mix noodles and sauce? While the subtask cook noodles is somehow necessary for this further subtask, it is not sufficient to have completed it in the past. The only way for automata-based approaches to handle this complexity is to introduce a new symbol that indicates that the subtask has been undone. This is possible, but extraordinarily restrictive, since, unless the set of symbols is complete, none of the subtask completion information can be used to reliably learn and utilize subtask dependencies. Modeling probabilistic transitions allows the symbolic signal to be incomplete, but still requires a complete set of symbols, in addition to predefined contexts. In order to make use of incomplete symbolic information, our approach instead learns a representation of context grounded in the present state to determine which subtasks are possible (Fig. 6.1d), rather than solely relying on symbols.
The contributions of this chapter are as follows. First, we introduce milestones (§6.3), which serve the dual purpose of subgoals for training options [145] and of high-level intents [85] for training our affordance model. Milestones are a flexible alternative to the atomic propositions used in automata-based approaches, and they are easier to specify due to less rigid assumptions. Unlike a dense reward function, the milestone signal does not need to be scaled or balanced carefully to account for competing extrinsic motives. Next, we introduce hierarchical affordances, which can be defined over any arbitrary set of milestones, and describe HAL (§6.3), a method which learns and utilizes a model of hierarchical affordances to prune impossible subtasks.
Finally, we demonstrate HAL’s superior performance on two complex hierarchical tasks in terms of learning speed, robustness, generalizability, and ability to explore complex subtask hierarchies without extrinsic supervision, relative to baselines provided with the same information (§6.4).

Figure 6.1: Many real-world tasks, like making pasta, can be conceptualized as a hierarchy (a) of subtasks. Automata-based approaches (b) map a history of subtask completion symbols to a context that indicates progress in the hierarchy. Approaches that assume symbolic history deterministically defines progress are not robust to stochastic changes in context (c) not provided symbolically. Hierarchical affordances (d) enable us to use incomplete symbolic information in the face of stochasticity by grounding context in the present state.

6.2 Related Work

Multi-task RL methods take advantage of shared task structure in order to generalize to new tasks from the same distribution [4, 130, 35, 134, 100]. Sohn et al. [134] learn subtask preconditions, but use symbol-based contexts and do not learn and use their model of preconditions concurrently; instead they assume a naive policy can sufficiently reach all subtasks. Furthermore, they assume ground-truth affordances are provided at each step. Some works provide the agent with high-level task sketches [4, 130] describing the order in which subtasks must be completed. While these sketches are advertised as “ungrounded” [4], they are in fact grounded by the inclusion of short sketches, which are the first to be introduced to the agent in a curriculum learning scheme [16]. Our approach instead uses a direct signal, which alone need not determine task progress, and can learn without exposure to other tasks with shared structure.
In using a set of discrete symbols to indicate subtask completion, our work is similar to the variety of approaches that apply temporal logic (TL) to the RL setting [169, 53, 54, 90, 89]. These works typically provide the agent with a TL formula, as well as assignments of atomic propositions at each time-step. Some works use reward shaping to encourage satisfaction of the formula [90, 89], whereas others convert the TL formula to some finite state machine, which provides the agent with a structure that roughly expresses subtask dependencies [53, 54, 169]. Icarte et al. [66] bypass this formula-to-automata conversion, and instead directly provide the automata to the agent in the form of a reward machine (RM). While RMs are more expressive than LTL formulas, they are less flexible than HAL, which can deal with incomplete sets of symbols, as well as context stochasticity.
Gibson [47] introduces a theory of affordances, defined roughly as properties of the environment which must be measured relative to the agent. Heft [61] and Chemero [27] clarify affordances as relations between the agent and its environment. Khetarpal et al. [80] formalize this relational definition of affordances in the context of RL, and model which low-level actions, given corresponding intents, are afforded in each state. In this work, milestones represent high-level intents corresponding to each subtask. They also demonstrate that modeling affordances speeds up and improves planning through the pruning of irrelevant actions, and allows for the learning of more accurate and generalizable partial world models. This approach does not directly translate to the hierarchical setting because subtasks, unlike actions, may fail for reasons other than affordances, meaning we do not have access to ground-truth affordance labels with which to train our model. Manoury, Nguyen, and Buche [103] and Khazatsky et al. [79] present approaches that can discover and use affordances to learn new skills, but their definition of affordances (i.e.
a behavior is either afforded or not, with no notion of preconditions) does not translate to the hierarchical setting.

6.3 Methods

Milestones and Hierarchical Affordances

We consider tasks that can be decomposed into subtasks, each represented by a milestone symbol, g ∈ G, where G is the set of symbols relevant to the task, and |G| = K. For each subtask, we introduce a separate option, ⟨I_g, π_g, β_g⟩, and we call π_g a subpolicy. At each time-step, in addition to the extrinsic reward signal provided by the environment to indicate success on the overall task, we have access to a milestone signal, which is a vector b_t where each element b^t_g ∈ {0, 1} indicates whether g ∈ G was completed on time-step t. In our pasta example, we might receive a milestone each time we cut a vegetable, make the sauce, cook the noodles, etc. Milestones serve two main purposes. Firstly, milestones function as option subgoals [145] that are in this work used to train each subpolicy (discussed in Section 6.3). Secondly, each milestone represents the intent of its corresponding subtask—similar to the action intents introduced to learn action-level affordances in the work of Khetarpal et al. [80]—which we use to learn hierarchical affordances (discussed in Section 6.3). In contrast to the standard options framework, primitive actions can only be executed as part of an option’s subpolicy in our method. Generally, policies trained solely over options have no guarantee of optimality [145], but we ensure the existence of an optimal solution by requiring g_K ∈ G, where g_K is the task’s final milestone (indicating task success). When G = {g_K}, our setting is standard, flat RL. Each additional milestone g′ added to G is useful as an intermediate signal so long as g′ corresponds to a unique behavior necessary for achieving g_K.
Hierarchical affordances are defined over G in the following way. The vector f*_s = f*(s) of size K represents which milestones are immediately achievable from the present state s, without requiring the collection of any intermediate milestones, where f is a hierarchical affordance classifier and f* is the optimal one. (Unlike option completion predictions [116], affordances predict the possibility of success.) Formally, f*_{s,g} = 1 if at time t_0 it is possible for a future b^T_g = 1 without any b^t_j = 1, j ≠ g, where t_0 < t < T. In pasta, the milestone mix cooked noodles and sauce is not afforded at the beginning since cook noodles is required first. A successful policy trained within the vanilla options framework will eventually learn to execute options in contexts where they are most useful, regardless of each option’s predefined initiation set. However, hierarchical affordances give us a principled way to directly adjust this set: for subtask g, we can set I_g = {s | s ∈ S, f*_{s,g} = 1}. One can think of hierarchical affordances as using milestones to impose a state-grounded subtask dependency structure on top of the options framework, which we can use to prune impossible subtasks. If G = {g_0, g_K}, an affordance-aware agent with access to the optimal f*(s) will never initiate subtask g_K from the beginning if g_0 is a necessary intermediate behavior.
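As a small illustration of this pruning, the sketch below shows greedy subtask selection restricted to afforded subtasks; the tensors and the optional within-mask exploration probability are placeholders, and the full affordance-aware ε-greedy scheme (with separate ε_a and ε_mc) is described later in §6.3.

```python
import torch

def select_subtask(meta_q, afforded, eps_afforded=0.0):
    """Greedy subtask selection restricted to afforded subtasks.

    meta_q:   (K,) estimated meta-controller Q-values, one per milestone/subtask
    afforded: (K,) binary output of the affordance classifier f(s)
    """
    # Unafforded subtasks are given -inf value so they are never chosen greedily.
    masked_q = meta_q.masked_fill(afforded == 0, float('-inf'))
    if torch.rand(()) < eps_afforded:
        # Occasionally explore uniformly, but only among afforded subtasks.
        choices = torch.nonzero(afforded).flatten()
        return choices[torch.randint(len(choices), (1,))].item()
    return torch.argmax(masked_q).item()
```

With an accurate classifier, this has no effect on an already optimal meta-controller, but it prevents a suboptimal one from ever spending time on subtasks that are currently impossible.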
Some logic-based RL approaches [e.g. 169, 66] use atomic propositions as markers of subtask achievement to transition between contexts in a finite state machine. These approaches, and many other HRL works [e.g. 4, 134], define subtask preconditions in terms of other subtasks. There are two forms of stochasticity that hierarchical affordances, by virtue of being grounded in the present state, can more naturally address than symbolically-defined dependencies. We can conceptualize potential agent trajectories as graphs where nodes represent the attainment of milestones, and edges are the segments between them. Node stochasticity is affordance-affecting randomness that occurs either when milestones are attained (e.g. receiving a varying quantity of an item) or at the beginning of the episode (i.e. starting in different contexts). Edge stochasticity is when affordances change at any time within a segment. We treat edge stochasticity events as infrequent exceptions to the typical subtask dependency rules. For example, after cook noodles is complete, mix cooked noodles and sauce is afforded, even if the agent may eventually spill the noodles on the floor. By grounding these rules in the current state, an affordance-aware agent can detect and adapt to edge anomalies. In Section 6.3, we describe in detail how hierarchical affordances are learned and used in stochastic environments where symbols alone would fail to reliably determine the current context.

Learning Controllers

Like h-DQN [85], we use a meta-controller that selects the current subtask to attempt and a low-level controller which executes the subpolicy relevant to that subtask. The controller, π: S × G → Δ(A), selects low-level actions, a ∈ A, given a state, s ∈ S, and milestone, g ∈ G, and aims to maximize the expected milestone signal rewards, b_g. Q-Learning [163] trains these controllers by learning an estimate of the optimal Q-function:

Q*_c(s, a, g) = max_π E[ Σ_{t=0}^∞ γ^t b^t_g | s_0 = s, a_0 = a, a_t ∼ π, s_t ∼ P ]

and deriving a policy from the Q-function as such: π_g(a | s, g) = 1(a = argmax_{a′} Q_c(s, a′, g)). Deep Q-Learning [106] estimates Q* using deep neural networks. This Q-function is parameterized by θ = {θ_base, …, θ_g, …}, where θ_base is a set of shared base parameters and θ_g is a goal-specific head. It is updated via gradient descent on the following loss function, derived from the original Q-learning update:

L_{Q_c} = E_{(s_t, a_t, r_t, s_{t+1}, g_t) ∼ D_c}[ ( Q_c(s_t, a_t, g_t; θ) − b^t_g − γ max_{a_{t+1}} Q_c(s_{t+1}, a_{t+1}, g_t; θ^−) )^2 ]    (6.1)

where D_c is a replay buffer that stores previously collected transitions, and θ^− are the parameters of a periodically updated target network. Both of these components are included to avoid the instability associated with using function approximation in Q-Learning.
The meta-controller, μ: S → Δ(G), aims to execute subtasks to maximize the extrinsic rewards received from the environment. Again, we estimate a Q-function, this time over a dilated time scale (i.e. we allow the low-level controllers to run for multiple steps before choosing new goals):

Q_mc(s, g) = E[ Σ_{t=0}^N r_t + γ max_{g′} Q_mc(s_N, g′) | s_0 = s, g_0 = g, a_t ∼ π_g, s_t ∼ P ]

where N is the (variable) number of steps the option runs for. When collecting data in the environment, we add transitions (s_t, g_t, Σ_{t′=t}^{t+N} r_{t′}, s_{t+N}) to a separate meta-replay buffer, D_mc, used to train our meta-Q-function (with its own parameters) with a loss similar to Eq. 6.1, but without any goal-conditioning. In Section 6.3 we describe how hierarchical affordances are integrated into this training procedure.
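The goal-conditioned controller update of Eq. 6.1 can be sketched as follows; the Q-network interface (a shared trunk with one output head per milestone, called as q_net(s, g)) and the batch layout are assumptions for illustration, and termination handling when a milestone is collected is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def controller_loss(q_net, target_net, batch, gamma):
    """Goal-conditioned DQN loss for the low-level controller (cf. Eq. 6.1).

    batch holds (s, a, b_g, s_next, g) transitions sampled from the replay
    buffer D_c, where b_g is the milestone signal for the selected goal g.
    """
    s, a, b_g, s_next, g = batch
    # Q_c(s_t, a_t, g_t; theta): value of the executed action under goal head g.
    q_sa = q_net(s, g).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_{a_{t+1}} Q_c(s_{t+1}, a_{t+1}, g_t; theta^-) from the target network.
        q_next = target_net(s_next, g).max(dim=1).values
    target = b_g + gamma * q_next
    return F.mse_loss(q_sa, target)
```

The meta-controller is trained with an analogous loss over dilated transitions (s_t, g_t, cumulative reward, s_{t+N}), without goal conditioning.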
Figure 6.2: Left: Architecture diagram for the complete HAL method. Q-values of the meta-controller are masked by the output of the affordance classifier. One operator denotes the standard ε-greedy action selection procedure used in Q-learning, while a second denotes our affordance-aware version. Right: For an optimal policy (top), the mask will have no effect since Q-values will naturally be low for unafforded subtasks. However, a suboptimal policy (bottom) will benefit from a mask since it can be efficiently learned and used to prune irrelevant subtasks before TD errors can propagate.

Hierarchical Affordance Learning

In typical HRL methods, if the meta-controller is yet to receive extrinsic reward from the environment, there will be no preference for selecting any subtask over the others. However, by restricting the selection of subtasks to ones that have proven merely to be possible, an agent can avoid wasting time attempting the impossible and reach more fruitful subtasks faster. Suppose, from experience gained through random exploration, the agent achieves milestone g (where achievement means b_g = 1) very often from the initial state set I, but never j ∈ G, despite being able to achieve j in later states. With enough experience, the agent should become confident that j is not achievable without completing other milestones first, and should not bother selecting j from any s ∈ I, while g, and any others that are achievable from those states, should instead be considered. If we had access to an oracle function, f*(s), that accurately computes hierarchical affordances for our task, we could prune impossible subtasks by masking the otherwise uninformed policy with the affordance oracle output: p(g|s) ∝ f*_{s,g} μ(g|s). In the following sections, we describe a method that can learn an approximate f(s) ≈ f*(s) from experience and leverage it in real-time for more effective learning.

The false negative problem

Recall that f(s) outputs a vector f_s, where each f_{s,g} indicates the possibility of collecting milestone g from state s without requiring intermediate milestones. To train each binary classification head, f_g(s), we must somehow generate labeled data for each milestone. Suppose an option was initialized at time t_o, and at time T milestone g is received. If no others were received since the start of the option, we may assume that for any t where t_o ≤ t ≤ T, the collection of milestone g was afforded, so we can use the set of states {s_{t_o}, s_{t_o+1}, …, s_T} as positive (i.e. f_g(s) = 1) training examples for the affordance classifier. Even if g was not the intended milestone, we can still generate positive training examples for f_g in this way. If an option has failed to collect the intended milestone g, either through timing out or collecting an unintended milestone, we might be tempted to use the states encountered during that option as negative examples (i.e. f_g(s) = 0). However, this occurrence can either be indicative of the states not affording g, or of the subpolicy corresponding to g being sub-optimal and having failed despite g being afforded. These false negatives are a problem for any approach requiring function approximation via neural networks, which are generally not robust to label noise [136]. It is particularly troublesome in our case since the noise is greater than the true signal when the subpolicies are under-trained.

Context learning

Suppose we had access to an abstract state representation z^a_s = z_a(s), where any states s_i and s_j are mapped to the same value only when f*(s_i) = f*(s_j).
With a representation that could cluster states in this way, we could trivially determine the falsity of a collected negative s ∈ D^-_g by checking if z_a(s) = z_a(s_j) for any s_j ∈ D^+_g, that is, if we have encountered a true positive with the same representation. The classification procedure could be interpreted as “labeling” these contexts with affordance values. This is somewhat of a “chicken and egg” problem, since to learn affordances, we require a representation that maps states to contexts with the same affordance values, which clearly requires some prior knowledge about affordances. Fortunately, from Section 6.3, we know that affordances will only change when either (1) a milestone is collected or (2) edge stochasticity occurs. Since (2) is by definition a rare occurrence, states s_t and s_{t+1} are more likely than not to satisfy f*(s_t) = f*(s_{t+1}), so long as they exist in the same segment between milestones. In this case, we can say that s_t and s_{t+1} share the same achievement context, z_s = z(s). Let z_φ(s) be an achievement context embedding represented by a differentiable function parameterized by φ. We can train z_φ(s) from experience using the following contrastive loss:

L_φ = Σ_j [ ‖z_φ(s^a_j) − z_φ(s^p_j)‖²_2 − ‖z_φ(s^a_j) − z_φ(s^n_j)‖²_2 + α ]_+

where each s^a_j is a randomly chosen anchor, each s^p_j is a positive example chosen within the same segment according to a (truncated) normal distribution, N_T(0, σ²), centered around (and excluding) s^a_j, each s^n_j is chosen randomly among other segments and is treated as a negative example, and α is an arbitrary margin value. (Here, the usage of “positive” and “negative” refers to whether points share the same achievement context.) This loss pushes representations of states from the same achievement context together, and pulls representations of states from different contexts apart. We find that a wide range of σ produce useful representations. For our edge stochasticity experiments (Figure 6.5) we use a low σ = 2.0 to reduce the risk of sampling across affordance changes.

False negative filtering

In the learned representation space, we expect false negative points to be closer to positive points than true negatives. Given a negatively-labeled state s_q for classifier head f_g, we compute the mean distance from z_φ(s_q) to the representations of the k closest positive points in a population uniformly sampled from D^+_g, denoted d^k_q. (This procedure is akin to the particle entropy approach used in [96]; for efficiency, we sample just enough points that we are likely to cover all encountered contexts.) We expect d^k_q to be large for true negatives and small for false ones, and we can determine an effective separating margin in the following way. First we compute distance scores for a random sample of positive points, denoted {d^k_p}, to use as reference. We ensure these points come from segments that are disjoint from the population points’ segments to avoid trivially low scores that might skew the distribution. We then fit a Gaussian distribution to {d^k_p} and compute an upper confidence bound for a given percentile value and confidence level. Any d^k_q below this bound is very similar to positive points in the representation space, so we count s_q as a false negative and exclude it from our training set. Note that we do not train head f_g (and therefore do not reliably prune) until we have access to both positives and negatives for g.
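The two pieces above, the context-embedding contrastive loss and the distance-based filter, can be sketched as follows. Function and argument names are illustrative; in particular, the threshold is assumed to be the upper confidence bound fitted to the positive reference distances as described above.

```python
import torch
import torch.nn.functional as F

def context_triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Contrastive loss over achievement-context embeddings: pull same-segment
    states together, push different-segment states apart.

    z_anchor, z_pos, z_neg: (B, d) embeddings z_phi(s) of anchors, positives
    sampled near each anchor within the same segment, and negatives sampled
    from other segments.
    """
    d_pos = (z_anchor - z_pos).pow(2).sum(dim=1)
    d_neg = (z_anchor - z_neg).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

def filter_false_negatives(z_query, z_positive_pop, threshold, k=5):
    """Drop negatively-labeled states whose mean distance to the k nearest
    positive embeddings falls below the fitted threshold (likely false negatives).

    Returns a boolean mask of which queries to keep as true negatives.
    """
    dists = torch.cdist(z_query, z_positive_pop)               # (Nq, Np)
    d_k = dists.topk(k, dim=1, largest=False).values.mean(1)   # mean of k smallest
    return d_k >= threshold
```

Only the surviving negatives, together with the collected positives, are used to train the per-milestone classifier heads.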
Method overview

The HAL architecture consists of a bi-level policy like h-DQN, a context embedding network, and an affordance classifier (see Figure 6.2), which are all learned concurrently. Intuitively, the affordance classifier is able to generalize to a novel state, s, by first identifying the abstract achievement context, z_φ(s), associated with the state, and then outputting an affordance value based on previous experience in that context. If z_φ(s) has also not been encountered, that context will not be strongly “labeled” either way, so we will not be invariably pruning it. The meta-controller selects a subtask g at the beginning of the episode, and selects a new subtask g′ after collecting any milestone or whenever an option times out after a predefined number of steps. At each step, the current state is fed to the controller, which outputs an action conditioned on the most recently selected subtask. After discretizing the classifier’s output to a binary mask, we perform an affordance-aware version of ε-greedy as follows. Given parameters ε_a and ε_mc, we select a random subtask within the mask with probability ε_a, randomly across all subtasks with probability ε_mc, and otherwise select greedily with respect to the meta-Q-values within the mask.

6.4 Experiments

In our experiments we aim to answer the following questions: (1) Does HAL improve learning in tasks with complex dependencies? (2) Is HAL robust to milestone selection and context stochasticity? (3) Can HAL more effectively learn a diverse set of skills when trained task-agnostically?

Figure 6.3: Screenshots of the crafting (left) and treasure (right) environments. Displayed to the right of the environments are each item’s ground-truth affordance indicator and inventory count.

Environments

We evaluate our method, along with several baselines, on two complex environments with intricate subtask dependency structures: crafting and treasure. Both environments (visualized in Figure 6.3) are extensions of the minigrid framework [28]. Agents receive an egocentric image of the environment, as well as a vector describing their inventory (items picked up from the environment and currently in their possession), as observations. The action spaces are discrete and include actions for turning left/right and moving forward/backward, as well as environment-specific actions detailed below.
crafting is based on Minecraft, a popular open-ended video game in which players collect resources from their environment and use them to craft objects, which can then be used to obtain more resources. As such, the hierarchy of possible subtasks is immensely complex and presents a significant challenge for AI agents to reach subtasks deeper in the hierarchy. We develop an environment that replicates this hierarchical complexity without the commensurate visuomotor complexity, which our method does not aim to address. In addition to movement actions, crafting includes actions to mine the object immediately in front of the agent (which requires an appropriate pickaxe), as well as to craft and smelt the various objects (pickaxes, iron ingots, etc.). The full set of milestones contains items that are either craftable or collectable.
crafting naturally contains node stochasticity since the collection of certain items, due to the random procedural generation, requires slightly different milestone trajectories across episodes (e.g. mining a variable amount of stone to encounter diamond).

Figure 6.4: Success rate over the course of training for the crafting iron task (left) and treasure (center). Sub-policy success for treasure (right). Success rate is the proportion of episodes where the agent receives the target milestone, and sub-policy success is how often sub-policies, on average, receive the correct milestone when called, before timing out or collecting incorrect milestones.

treasure is a navigation task that requires the agent to collect various items and use them to unlock rooms to reach further items. The ultimate goal is to unlock a treasure chest, which requires collecting a sequence of several keys, as well as placing an object on a weight-based sensor, in order to open the requisite doors. Agents can only carry one object at a time, so they must reason about which object to pick up based on what it will afford them (e.g. if the weight-based sensor room is locked, the weight object is not currently useful). Like crafting, treasure contains node stochasticity due to the procedural generation. For example, the central room that the agent is spawned in can contain either the red or yellow key individually, or both together. Unlike crafting, which has a large action space to accommodate the various crafting recipes, this environment only contains actions to move and a single “interaction” action that is used to pick up keys, open doors, etc. While crafting has a more complex hierarchy and greater diversity in the potential ordering of subtasks, treasure has on average more difficult subtasks. The full set of milestones contains each object the agent can successfully interact with in the environment (e.g. opening a door, collecting a key).

Table 6.1: Summary of baselines.

Name             | Hierarchical Agent | Affordance Mask | Hindsight Replay | False Negative Filtering
Oracle           | X                  | Truth           | X                | N/A
HAL (ours)       | X                  | Learned         | X                | X
HAL (–FNF)       | X                  | Learned         | X                |
H-Rainbow        | X                  | N/A             |                  | N/A
H-Rainbow (+HER) | X                  | N/A             | X                | N/A
Rainbow          |                    | N/A             | N/A              | N/A

Baselines

Our set of baselines is summarized in Table 6.1. All methods are based on the Rainbow [62] Deep Q-Learning algorithm, which combines several improvements to vanilla DQNs [106]. To compensate for the lack of milestone signals, non-hierarchical methods use a dense reward function that incorporates milestone signals for the first time each milestone is obtained in the episode. To evaluate the efficacy of our affordance classifier’s online learning procedure, we compare our method to an “oracle” that differs only by using ground-truth affordances for masking subtasks. The node stochasticity inherent to both environments, as well as the edge stochasticity later explored, precludes the use of any methods that reason solely over symbols (i.e. automata-, sketch-, or subtask-dependency-based approaches). We also incorporate a version of Hindsight Experience Replay (HER) [5] adapted for the discrete milestone setting, which involves re-using “failed” trajectories that result in the collection of an unintended milestone (given the selected subpolicy) as successful data for the relevant subpolicy.
Results

Learning efficacy First, we evaluate the ability of HAL and our baselines to learn successful policies for complex tasks in each environment. Learning curves are shown in Figure 6.4 (plots depict the mean and 95% confidence interval over 5 seeds). In both environments, HAL significantly outperforms the strongest baseline, H-Rainbow(+HER) (HR+H), despite both methods receiving the same information, and performs only slightly worse than the oracle, which has access to ground truth. Incorporating HER into H-Rainbow leads to a significant improvement. False negative filtering appears crucial for learning in the treasure environment, but not as much for crafting, though in both cases filtering improves mask accuracy. Removing false negative filtering causes HAL(–FNF) to be pessimistic (i.e. over-pruning subtasks), ultimately leading to its unstable learning. Since masking impossible subpolicies would have no impact on an optimal meta-controller, HAL's success must stem from its ability to learn a useful mask before TD errors are able to propagate through the meta-controller's Q-function. HAL utilizes a more easily learnable function (the affordance classifier) to reduce the amount of unnecessary, expensive learning (TD error propagation) required. Throughout training, HAL's mask has an impact on greedy subtask selection roughly 60% of the time, which is evidence that HAL avoids wasting time learning Q-values that the mask is able to prune. Lastly, because affordance-aware methods are more likely to initiate subtasks in an appropriate context, we see that they achieve a significantly higher average subpolicy success rate (Figure 6.4, right).

Robustness to milestone selection In this section we evaluate HAL's robustness to the selection of milestones. Affordances change when a milestone is removed, since that milestone no longer acts as an intermediate link between others. One downside of some approaches that use symbol-based contexts is that an entirely different automaton or subtask dependency graph must be defined over the new set of symbols. HAL does not use prior information of this kind, so the learning process is the same across all sets. Figure 6.5 shows HAL's success on the crafting environment's iron task when using "incomplete" milestone sets, relative to the full human-designed set. We see that randomly removing 1 milestone makes no significant difference for HAL, and even after removing 4 milestones, HAL still achieves better performance than HR+H using the full set. HAL's performance drops when 5 milestones are removed, likely due to the increased sparsity of the signal (i.e. greater subtask length) and variance in milestone set quality. However, when we double the training time for these same sets, we find that HAL is able to converge to a 97% success rate on at least one set, while HR+H fails to converge on any set and ends with a maximum success rate of around 70%.

Figure 6.5: Comparing the robustness of HAL and HR+H to varying milestone sets (left) and various edge stochasticity frequencies (center and right) in the crafting iron task.
Robustness to stochasticity We modify crafting so that at each environment step there is a certain probability that an item in the inventory will disappear. In order to make the task feasible, rare items are less likely to disappear than common ones. This procedure produces edge stochasticity, since the disappearance of items may alter affordances, and this can occur at any time. We test three different levels of stochasticity and display the learning curves in Figure 6.5. With a disappearance frequency of 1/100, which cuts HR+H's success rate in half, HAL is still able to reach its non-stochastic success rate. With a frequency of 1/50, HAL performs comparably to HR+H with no stochasticity. To put these stochasticity rates into context, the algorithm's average episode length about halfway through training is still over 1000 steps, meaning dozens of items are removed from the agent's inventory over the course of an episode. By learning a model of affordances grounded in the present state, HAL is able to detect and adapt to these stochastic events.

Task-agnostic learning We next test the ability of HAL to learn skills when no task-specific extrinsic rewards (only milestones) are provided by the environment. Since we cannot learn a meta-controller in the absence of rewards, we instead randomly select subtasks with some probability $\epsilon_{mc}$, and random afforded subtasks otherwise (only for HAL). We evaluate both HAL and HR+H.

Figure 6.6: Percentage of episodes where each milestone is achieved in the crafting environment's task-agnostic setting.

In Figure 6.6 we see that by the end of $10^6$ steps, HAL is able to more reliably complete the milestones deeper in the hierarchy in the crafting environment. We note that HR+H is able to marginally outperform HAL on tasks shallower in the hierarchy (e.g. wood pickaxe, stone, furnace), potentially as a result of failing to reach deeper tasks and getting more practice on shallower ones. This result is an indication of the general utility of HAL in environments with complex task hierarchies.

6.5 Conclusion

The present work can be viewed as a first step towards bridging the substantial gap between flexible hierarchical approaches that are currently intractable and methods that impose useful structures but are too rigid to be of practical use. We introduce HAL, a method that is able to utilize incomplete symbolic information in order to learn a more general form of subtask dependency. By learning a model of hierarchical affordances that grounds subtask dependencies in the present state, HAL is able to navigate stochastic environments that approaches relying solely on symbolic history cannot. We demonstrate that HAL learns more effectively than baselines provided with the same information, is more robust to milestone selection and affordance stochasticity, and can more thoroughly explore the environment's subtask hierarchy. Given HAL's flexible formulation and success in the face of incomplete and stochastic symbolic information, we foresee future work integrating HAL with option (or subgoal) discovery methods [e.g. 8, 101, 9] to obtain performance gains in complex tasks without requiring prespecified milestones.
Additionally, future work might extend HAL to continuous goal spaces, but this would require revising the definition of hierarchical affordances provided here, as it currently requires a notion of intermediate subgoal completion.

Chapter 7

Learned End-To-End Task Allocation in Multi-Agent Reinforcement Learning

Despite significant progress on multi-agent reinforcement learning (MARL) in recent years, coordination in complex domains remains a challenge. We observe that a wide range of multi-agent settings can be decomposed into isolated subtasks, which agents can meaningfully focus on to the exclusion of all else in the environment. In these settings, successful policies allocate agents to the most relevant subtasks, and each agent acts productively towards their assigned subtask alone. This decomposition provides a strong structural inductive bias, significantly reduces agent observation spaces, and encourages subtask-specific policies to be reused and composed during training, as opposed to treating each new composition of subtasks as a unique situation. We introduce ALMA, a general learning method for taking advantage of these structured tasks. ALMA simultaneously learns a high-level subtask allocation policy and low-level agent policies. We demonstrate that ALMA learns sophisticated coordination behavior in a number of challenging environments, outperforming strong heuristics and baselines. ALMA's modularity also enables it to better generalize to new environment configurations and incorporate separately trained action-level policies; however, the best performance is obtained only by training all components end-to-end.

Figure 7.1: The task of city-wide firefighting can be decomposed into allocating firefighters to the most urgent fires (top), and firefighters focusing solely on the fire they are assigned to (bottom).

7.1 Introduction

Multi-agent reinforcement learning (MARL) methods have lately achieved success in a wide range of domains [12, 97, 155, 17], but they still struggle to learn sophisticated coordination behavior in a sample-efficient manner. This is no surprise: not only do most tasks require that agents learn complex low-level skills, they must also incorporate coherent high-level strategies into their policies. Integrating both strategies and skills into the same action-level policy makes learning difficult. Fortunately, for many relevant multi-agent tasks, while there may exist many different objectives that the agent population as a whole must attend to, individual agents need only focus on isolated aspects of the environment at any given time. Consider the setting of firefighting in a city given a distributed set of resources (see Figure 7.1). While there may be many fires occurring in the city at once, each firefighter can realistically only fight one at a time. Thus, the optimal behavior in this setting can be expressed in terms of allocating firefighters to the right fires, and each firefighter effectively fighting the fire to which they are assigned. Suitable policies for this and many other real-world problems can be formulated as bi-level decision-making processes [117]: agents are allocated to the most relevant subtasks in the environment, then each agent selects actions with the purpose of completing its assigned subtask. This decomposition allows for more efficient learning for a number of reasons. First, the learning process may benefit from the added structural inductive bias, alleviating the problem of incorporating both skills and strategy into the same low-level policy.
Second, by focusing solely on its assigned subtask and ignoring the rest of the environment, each agent's action-level policy can learn over a significantly reduced state space. Finally, as we assume subtasks are drawn from the same distribution, learning policies for individual subtasks and allowing the high-level controller to compose them may be simpler than learning a single policy over all compositions of these subtasks.

The problem of task allocation is well-studied [46], but most methods operate under a set of assumptions that make them unfit for the complex multi-agent settings we care about. In particular, we have no means of evaluating the utility of any specific task allocation from the start, as we do not assume access to optimal agent policies. Not only will we rarely have access to such policies in practice, but in many cases it is beneficial for these two levels of behavior to be learned simultaneously. For instance, the types of teams to which firefighters expect to be assigned determine which skills are most useful to learn in the first place. Learning both levels simultaneously prevents the immediate application of task allocation methods. Previous approaches that incorporate task allocation into learning-based multi-agent settings do so either by injecting domain knowledge or by assuming access to low-level policies [131, 25].

Our contributions are as follows. First, we introduce a general learning method, ALlocator-Actor Multi-Agent Architecture (ALMA), for taking advantage of decomposable multi-agent tasks (§7.3). ALMA learns a subtask allocation policy and low-level agent policies simultaneously and is designed in a modular manner to promote the reuse of components across different compositions of subtasks. Learning the high-level controller poses challenges due to its large action space: at any time step, the number of unique allocations of $n$ agents to $m$ subtasks is $m^n$. Our solution leverages recent methods [151] for learning value functions over massive action spaces and is designed modularly to handle variable quantities of agents and subtasks drawn from a shared distribution. While action-value functions typically depend on the global state, due to shared transition dynamics and reward functions that depend on all agents, we observe that under reasonable assumptions we can redefine the subtask-specific value functions to depend only on local information, improving their reusability. We next experimentally validate our proposed method on two challenging multi-agent domains (§7.4). Our complete method outperforms both state-of-the-art hierarchical and standard MARL methods as well as strong allocation heuristics. Furthermore, we evaluate the practicality of our simplifying assumptions, demonstrating that learning fails without them while highlighting cases in which they do not hold. Finally, we show the importance of learning action policies in concert with allocation policies.

7.2 Related Work

Multi-Agent Task Allocation Task allocation is a long-studied problem in multi-agent systems and robotics [46, 78, 36]. Given a set of tasks and a set of agents, the goal in this setting is to assign agents to tasks in a manner that maximizes total utility. A utility function which maps agent team-task pairs to their utility is assumed to be given. Within the taxonomy of sub-classes of this problem formalized by Gerkey and Matarić [46], our problem setting can be classified as ST-MR-IA (single-task agents, multi-agent tasks, instantaneous allocation).
In other words, agents cannot multi-task, tasks require the cooperation of several agents, and agents are assigned instantaneously without incorporating any information regarding future task availability. Unlike standard approaches to multi-agent task allocation, our focus is not on identifying a near-optimal team allocation from scratch given a known utility function and existing task execution policies. Instead, we hope to learn an effective task allocation policy in conjunction with task-specific low-level policies from experience. In this way we can obtain artificial agents which act in complex settings where task execution policies cannot be designed by hand and where we possess no prior knowledge of which combinations of agents may form effective teams.

MDP Formulations for Multi-Agent Task Allocation Several works have proposed formulations of task allocation as an MDP, enabling the use of learning- and planning-based approaches. Proper and Tadepalli [117] introduce the decomposition of the simultaneous task setting into task allocation and task execution levels. However, their allocation selection procedure is myopic; it does not consider how allocations may change in the future. Campbell, Johnson, and How [24] address this shortcoming by taking future allocations into account when allocating agents to subtasks; however, they assume the pre-existence of low-level action policies. Notably, neither approach learns a tractable policy over the allocation space, and instead relies on exhaustive search or hand-crafted heuristics in order to select allocations.

Multi-Agent Reinforcement Learning Several recent works in MARL have addressed problems related to ours. Most relevantly, Carion et al. [25] introduce a learning-based approach for task allocation that scales to complex tasks. Unlike our proposed setting, their work assumes pre-existing subtask policies, as well as domain knowledge regarding how much an agent "contributes" to subtasks, which are used as constraints in a linear/quadratic program to prevent assigning more agents than necessary to subtasks. Our method, on the other hand, learns low-level policies, assumes less prior knowledge, and solves the problem of finding the best allocation by learning an allocation controller from experience rather than using an off-the-shelf solver with inputs generated by a learned function and constraints derived from domain knowledge. Shu and Tian [131] devise a method whereby a central "manager" agent provides incentives to self-interested "worker" agents to perform specific subtasks such that the manager's reward is maximized and the incentives provided to workers are minimized. In this case, agents work on subtasks independently and are not cooperating towards a shared goal. Yang et al. [167] also consider a setting where multiple simultaneous tasks (described as goals) exist in a shared environment; however, they assume that goals are pre-assigned to agents and therefore do not tackle the task allocation problem. Liu et al. [95] introduce COPA, a hierarchical MARL method which learns a centralized controller in order to coordinate decentralized agents. Unlike our method, which utilizes task allocation for the purpose of decomposing complex settings to simplify learning, COPA's motivation is primarily to alleviate the problem of partial observability during decentralized execution.
We evaluate our method against COPA in order to assess the effectiveness of our hierarchical decomposition in comparison to a more generic hierarchical MARL method with similar assumptions (i.e. no pre-trained policies, cooperative tasks, and no pre-existing task allocation).

7.3 Methods

In the last section we introduced a generic framework for value-based hierarchical MARL. Now we will present the details specific to our hierarchical MARL framework using subtask allocation, ALMA. In this case we define the action space of the high-level controller (whose value function is defined in Eqn. 2.7) as the set of all possible allocations of agents to subtasks. A subtask allocation refers to a specific set of agent-subtask assignments, where each agent can only be assigned to one subtask at a time, and it is denoted by $b = \{b^a \mid a \in \mathcal{A}\} \in B$, where $b^a \in \mathcal{I}$ represents the subtask that agent $a$ is assigned to. We then slightly abuse notation and denote the set of agents assigned to subtask $i$ as $b_i \subseteq \mathcal{A}$ and the set of all agents not assigned to task $i$ as $b_{\neg i}$, where $b_i \cup b_{\neg i} = \mathcal{A}$. We then specify the joint action for agents assigned to subtask $i$ as $u_{b_i} = \{u^a \mid a \in b_i\}$. Recall that the subtask-specific state $s_i$ includes the state of all agents. We denote the subtask-specific state including only the assigned agents as $s_{b_i} = \{s^e \mid e \in \mathcal{E}_i \cup b_i\}$.

Figure 7.2: ALMA computing agent actions (specifically agent $a_5$) given the current state. Subtask allocations $b$ are updated by a centralized controller every $N_t$ steps, and then agent policies select low-level actions $u$ in a decentralized fashion given their local state.

Subtask Allocation Controllers One of the main challenges of learning a Q-function for the high-level allocation controller is the massive action space. Our formulation of subtask allocation can be seen as a set partitioning problem, which is known to be NP-hard [129, 46]. Q-Learning with the allocation action space requires finding the allocation with the highest Q-value at each step (Eqn. 2.7), as does deriving an action policy from this Q-function. At each step, we have a choice of $|\mathcal{I}|^{|\mathcal{A}|}$ possible unique allocations of agents to subtasks, and it is prohibitively expensive to evaluate each one and select the best. Van de Wiele et al. [151] introduce "Amortized Q-Learning", which addresses the problem of massive action spaces in Deep Q-Learning by defining a "proposal distribution" that is trained to maximize the density of high-value actions given a state. We adapt this idea for our allocation controller, where $f(b \mid s; \phi)$ is our proposal distribution over allocations. We can then sample from this distribution and select the allocation with the highest value, effectively approximating the maximization procedure required for Q-Learning. Our proposal distribution is learned with the following loss:

$\mathcal{L}(\phi; s) = -\log f(b^*(s) \mid s; \phi) - \lambda_{\text{AQL}}\, \mathcal{H}\big(f(\cdot \mid s; \phi)\big), \quad (7.1)$

where $b^*(s)$ is the highest-valued allocation from a set of $N_p$ samples from the proposal distribution. Formally, $b^*(s) := \arg\max_{b \in B_{samp}(s)} Q(s, b)$, where $B_{samp}(s) := \{b_1, \ldots, b_{N_p} \sim f(\cdot \mid s; \phi)\}$. We then learn an approximation of Eqn. 2.7 with the following loss, adapted from Eqn. 2.5:

$\mathcal{L}_Q(\theta) := \mathbb{E}\Big[\big(y_t - Q_\theta(s_t, b_t)\big)^2\Big], \qquad y_t := \sum_{n}^{N_t} r_{t+n} + \gamma\, Q\big(s_{t+N_t}, b^*(s_{t+N_t})\big), \quad (7.2)$

where transitions $(s_t, b_t, \sum_{n}^{N_t} r_{t+n}, s_{t+N_t})$ are sampled from a replay buffer. Learning a proposal distribution and value function over a combinatorial space closely parallels recent work in learned combinatorial optimization [14, 77, 15].

We construct our proposal distribution modularly so that each module can leverage the fact that subtasks are drawn from a shared distribution. These modules consist of an agent embedding function $f_a : S^a \to \mathbb{R}^d$, a subtask embedding function $f_s : S_{\mathcal{E}_i} \to \mathbb{R}^d$, and a subtask embedding update function $f_u : \mathbb{R}^{2d} \to \mathbb{R}^d$, which is used to update the embedding of the subtask that an agent was assigned to. We first construct $f$ in a factorized, auto-regressive form,

$f(b \mid s) = \prod_{a \in \mathcal{A}} f(b^a \mid s, b^{<a}), \quad (7.4)$

where each conditional factor is computed by scoring the agent embedding $h^a$ against the (updated) subtask embeddings $g_i$. We then sample $b^a \sim f$ and update the selected subtask's embedding as $g'_{b^a} = g_{b^a} + f_u(g_{b^a}, h^a)$, such that other agents' allocations can take existing ones into account.
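To make the allocation controller concrete, the following is a minimal sketch of the auto-regressive sampling and amortized maximization described above. It is illustrative only: the exact scoring of agent embeddings against subtask embeddings, and the interfaces of f_a, f_s, f_u, sample_fn, and q_net, are our own assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn.functional as F

def sample_allocation(agent_states, subtask_states, f_a, f_s, f_u):
    """Sample one allocation b agent-by-agent from the factorized proposal.

    f_a, f_s map agent/subtask states to d-dim embeddings; f_u maps the
    concatenation [g_i, h^a] (2d-dim) to an updated subtask embedding (d-dim).
    """
    h = [f_a(s_a) for s_a in agent_states]            # agent embeddings h^a
    g = [f_s(s_i) for s_i in subtask_states]          # subtask embeddings g_i
    allocation, log_prob = [], 0.0
    for h_a in h:
        scores = torch.stack([torch.dot(g_i, h_a) for g_i in g])
        probs = F.softmax(scores, dim=0)              # f(b^a | s, b^{<a})
        b_a = torch.multinomial(probs, 1).item()
        allocation.append(b_a)
        log_prob = log_prob + torch.log(probs[b_a])
        # Update the chosen subtask's embedding so later agents see this assignment.
        g[b_a] = g[b_a] + f_u(torch.cat([g[b_a], h_a]))
    return allocation, log_prob

def best_sampled_allocation(state, sample_fn, q_net, n_samples):
    """Approximate argmax_b Q(s, b) with N_p proposal samples (the b*(s) of Eqn. 7.2).

    sample_fn(state) -> (allocation, log_prob), e.g. a closure around sample_allocation;
    q_net(state, allocation) -> scalar allocation value.
    """
    with torch.no_grad():
        candidates = [sample_fn(state)[0] for _ in range(n_samples)]   # B_samp(s)
        values = torch.stack([q_net(state, b) for b in candidates])    # Q(s, b)
    return candidates[values.argmax().item()]
```

The allocation returned by best_sampled_allocation plays both roles described above: it is the greedy high-level action and the target $b^*(s)$ whose density the proposal loss of Eqn. 7.1 increases.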
Subtask Execution Controllers Given a subtask allocation $b$, we must learn low-level action-value functions as described in Section 2.2. Specifically, for each team $b_i$ we learn a value function based on the subtask rewards $r_i$. Recall that the subtask reward function depends on all agents' actions and states, since any agent can potentially contribute to any subtask. From the perspective of the team assigned to subtask $i$, we treat the other subtasks' agents as part of the environment by taking the expectation over actions sampled from their optimal policies:

$Q^{tot}_i(s, u_{b_i}, b) = \mathbb{E}_{u_{b_{\neg i}} \sim \pi^*_{b_{\neg i}},\; s' \sim P(\cdot \mid s, u)}\Big[\, r_i(s_i, u) + \gamma \max Q^{tot}_i(s', \cdot\,, b) \,\Big] \quad (7.5)$

While this function depends on the full global state, as all entities can feasibly influence all other entities' state transitions and any agent can potentially contribute to the rewards of any subtask, the set of truly relevant information may be much smaller for optimal policies. Reconsider our firefighting example from Sec. 7.1, where firefighters are agents and buildings are subtasks. While it is possible for any firefighter to put out any fire, given optimal policies we expect firefighters to only put out the fire they are assigned to. Moreover, in the optimal setting, firefighters will not be directly interacting with firefighters at other buildings. These observations motivate the following assumptions:

Assumption 7.3.1 (Subtask Transition Independence). Given subtask allocation $b$, denote the set of global states visited by the optimal subtask policies, $\pi^*_{b_i}$, as $S_b$. Subtask transition independence assumes the state transition distribution can be written in the following factored form: $P(s' \mid s, u) = \prod_{i \in \mathcal{I}} P_{b_i}(s'_{b_i} \mid s_{b_i}, u_{b_i})\;\; \forall s, s' \in S_b$.

Assumption 7.3.2 (Subtask Reward Invariance). Given optimal subtask policies, we assume that the states and actions of agents not assigned to subtask $i$ will have no impact on the rewards attained, $r_i$. In other words, there exists a function dependent only on the subtask entities and the agents allocated to that subtask which can predict the subtask rewards. Formally stated: $\exists f : S_{\mathcal{E}_i} \times S_{b_i} \times U_{b_i} \to \mathbb{R}$ s.t. $f(s_{b_i}, u_{b_i}) = r_i(s_i, u)\;\; \forall s \in S_b, u \in U$. We denote this "invariant" reward function as $r_{b_i} := f$.

With these assumptions, we can essentially decompose the environment into a set of independent sub-environments and rewrite Eqn. 7.5 as follows:

$Q^{tot}_i(s_{b_i}, u_{b_i}, b) = r_{b_i}(s_{b_i}, u_{b_i}) + \gamma\, \mathbb{E}_{s'_{b_i} \sim P_{b_i}(\cdot \mid s_{b_i}, u_{b_i})}\Big[\, \max Q^{tot}_i(s'_{b_i}, \cdot\,, b) \,\Big] \quad (7.6)$

Note that the Q-function now only depends on a "local" state and, as such, can be approximated more easily due to the smaller function input space. Moreover, in the case where subtasks are drawn from the same distribution, Q-functions defined over the local state will more easily generalize to other instantiations of the same subtask, since they no longer depend on the context outside of that subtask's state.
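In an attention-based architecture, one way to realize the restricted input $s_{b_i}$ of Eqn. 7.6 is to mask out all entities that do not belong to an agent's assigned subtask before attention is computed. The sketch below is illustrative only and assumes an entity-list state representation; the tensor names and shapes are our own.

```python
import torch

def subtask_attention_mask(entity_subtasks, agent_allocation):
    """Build a per-agent attention mask hiding entities outside each agent's subtask.

    entity_subtasks:  LongTensor [n_entities], subtask index each entity belongs to
                      (for agent entities, this is their allocated subtask under b)
    agent_allocation: LongTensor [n_agents], subtask index each agent is assigned to
    Returns a bool mask of shape [n_agents, n_entities]; True = may attend.
    """
    # Agent a may attend to entity e iff e is part of a's assigned subtask,
    # i.e. e is one of the subtask's entities (E_i) or an agent assigned to it (b_i).
    return agent_allocation.unsqueeze(1) == entity_subtasks.unsqueeze(0)
```

Supplying such a mask to the attention layers restricts each agent's utility computation to $s_{b_i}$ alone.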
While Assumptions 7.3.1 and 7.3.2 may not always hold (e.g. firefighters heading to different buildings may bump into each other while on their way, violating transition independence), we validate their usefulness empirically in Section 7.4.

Learning a function that approximates Eqn. 7.6 requires predicting Q-values for teams of varying sizes, as the quantity of agents assigned to a subtask is not fixed. In fact, each unique combination of agents assigned to subtask $i$ can be seen as a unique Dec-POMDP. We rely on recent work [67] which learns factorized multi-task Q-functions for teams of varying sizes by sharing parameters via attention mechanisms. This work falls under the category of factorized value function methods for cooperative MARL described in §2.3, and as such, we represent each subtask-specific Q-function as a monotonic mixture of agent utility functions computed in a decentralized fashion, such that agents can act independently without communication.

Overview We provide an overview of the complete subtask allocation and execution procedure in Figure 7.2. Subtask allocations $b$ are selected in a centralized fashion every $N_t$ steps, which then determines the set of low-level policies to execute. We define the high-level policy $\pi(b \mid s) := \mathbb{1}(b = b^*(s))$, where $b^*(s)$ is the highest-valued allocation sampled from the proposal distribution, as defined in Section 7.3. The per-agent low-level policy for agent $a$ assigned to subtask $i$ is defined as $\pi^a(u^a \mid s_{b_i}, b) := \mathbb{1}\big(u^a = \arg\max_{u^{a\prime}} Q^a_i(s_{b_i}, u^{a\prime}, b)\big)$, where $Q^a_i$ is the agent's utility function, which is monotonically mixed with those of all $a \in b_i$ to form $Q^{tot}_i$. When collecting data during training, we select random low-level actions with probability $\epsilon$ in order to promote exploration. We use two types of exploration in the allocation controller. With probability $\epsilon_p$ we select a full allocation sampled from the proposal distribution, rather than taking the highest-valued one. Then, with probability $\epsilon_r$, we randomly select subtask allocations independently on a per-agent basis. All exploration probabilities are annealed over the course of training. To handle state spaces of variable size (i.e. a variable quantity of entities), we use attention models [152] as in [67] for all components. In order to achieve the partial views over the global state required by our redefined low-level Q-function in Eqn. 7.6, we use masking in attention models to prevent agents from seeing certain entities.

7.4 Experiments

High-Level Learning Our high-level controller, at each step, attempts to solve a problem with underlying combinatorial structure. Many pure task allocation problems can be formulated in a manner which admits the use of traditional methods for solving or approximating solutions to combinatorial optimization problems (e.g. mixed integer programs). While the problems we are ultimately interested in are not amenable to such formulations, we can still evaluate our approach for learning sub-task allocation on specific instances of task allocation problems which admit optimal (as well as highly tuned heuristic) solvers. By doing so, we can evaluate our high-level learning procedure's applicability to settings where an underlying combinatorial structure exists but is not known. Vehicle routing [150] is an example of a well-studied allocation problem with strong solution approaches built from domain knowledge.
The Vehicle Routing Problem (VRP) is a generalization of the Travelling Salesman Problem, expanding it to cases where we want to compute a set of tours of minimum total length with multiple agents that together visit all customers. This can be thought of as a sequential task allocation problem where the customers are tasks. In particular, we consider the Capacitated Vehicle Routing Problem (CVRP), where each vehicle has limited capacity and each customer possesses a certain load. We train and evaluate the high-level controller portion of ALMA on randomized instances of CVRP with 4 vehicles and between 5 and 15 customers. While these sizes are small by VRP standards, we note that our ultimate task involves low-level control on complex tasks, and MARL has yet to scale significantly beyond this quantity of agents on non-trivial tasks with heterogeneous agents. As such, demonstrating the efficacy of our high-level learning procedure on this task in comparison to existing solvers can, at the least, show that the high-level learning procedure is not a bottleneck.

Figure 7.3: Total solution distance on 1000 instances of CVRP, separated by the number of customers. Left: normalized by the optimal distance. Right: normalized by the solution distance found by OR-Tools.

We consider several solution methods for comparison. Optimal solutions are computed by formulating the CVRP as a mixed integer program [48] and using the solver provided by Google's OR-Tools library [115]. We also use the approximate VRP solver provided by OR-Tools, which trades accuracy for computational efficiency. Finally, in order to get a relative sense of solution quality, we implement a naive greedy heuristic which operates by sending each vehicle to the nearest customer whose load it can carry given its currently held load and capacity. We visualize our results in Figure 7.3. Specifically, we plot the solutions' total distance normalized by the optimal total distance, as well as by the total distance of the OR-Tools approximate solution, for 1000 instances of the CVRP separated by the number of customers on the x-axis. We only compute optimal solutions for sizes up to 12, as computation times become infeasible since the MIP formulation has a number of constraints which scales exponentially with the number of customers. We find that OR-Tools' approximate solver is typically optimal or very close for problems of this size. Overall, we find that our learned solver typically finds solutions that are optimal or close to optimal, though it is not competitive with the domain-specific OR-Tools solver. This result is promising, as it indicates that our learning-based approach is capable of near-optimality in the cases we are interested in, where handcrafted solvers are not available. For a more thorough treatment of RL-based approaches for vehicle routing, see Nazari et al. [108].
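For reference, one possible reading of the naive greedy heuristic described above is sketched below; it is illustrative only, and the exact handling of vehicles and tie-breaking is our own assumption.

```python
import math

def greedy_cvrp(depot, customers, loads, n_vehicles, capacity):
    """Greedy CVRP baseline: each vehicle repeatedly visits the nearest customer
    whose load still fits within its remaining capacity, then returns to the depot.

    depot:     (x, y) coordinates of the depot
    customers: list of (x, y) coordinates
    loads:     list of customer loads (same order as customers)
    Returns total tour distance and the set of customers left unserved (if any).
    """
    unserved = set(range(len(customers)))
    total_dist = 0.0
    for _ in range(n_vehicles):
        pos, remaining = depot, capacity
        while True:
            feasible = [c for c in unserved if loads[c] <= remaining]
            if not feasible:
                break
            nearest = min(feasible, key=lambda c: math.dist(pos, customers[c]))
            total_dist += math.dist(pos, customers[nearest])
            pos = customers[nearest]
            remaining -= loads[nearest]
            unserved.remove(nearest)
        total_dist += math.dist(pos, depot)  # vehicle returns to the depot
    return total_dist, unserved
```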
End-To-End Learning We now evaluate our full method with high-level sub-task allocation and low-level control (ALMA) on two challenging environments, described below.

Environments

SaveTheCity This environment is inspired by the classical Dec-POMDP task "Factored Firefighting" [111], where several firefighters must cooperate to put out independent fires; however, we introduce several additional degrees of complexity. First, agents are capable of contributing to any subtask, such that the task is amenable to subtask allocation and we cannot use a fixed value function factorization. Second, agents are embodied and must physically move themselves to buildings through low-level actions, rather than being fixed and only having a high-level action space for selecting buildings to fight fires at. Finally, we introduce several types of agents with differing capabilities (all of which are crucial for success), such that the subtask allocation function must learn which subtasks require which capabilities and how to balance these. The task also shares similarities with the Search-and-Rescue task from [95]; however, we do not consider partial observability, and we introduce agents with diverse capabilities. In each episode, there are $N \in [2, 5]$ agents (circles) and $N + 1$ buildings (squares) (see Figure 7.4). Each building regularly catches fire (red bar), which reduces the building's "health" (black bar). The agents must learn to put out the fires and then fully repair the damage, at which point the building will no longer burn. The episode ends when all buildings are fully repaired or burned down, and an episode is considered successful if no buildings burn down. The firefighter (red) and builder (blue) agents are most effective at extinguishing fires (a) and repairing damaged buildings (b), respectively, while generalist (green) agents, though unable to make progress on their own, can move twice as fast, prevent further damage to a blazing building (c), and increase the effectiveness of other agents at their weak ability if at the same building.

Figure 7.4: SaveTheCity example task and training curves.

Figure 7.5: StarCraft II training curves, left to right: S&Z (a) disadvantage and (b) symmetric; MMM (c) disadvantage and (d) symmetric.

The full map is a 16x16 grid and buildings are randomly spawned across the map. Agents always start episodes in a cluster in the center. Buildings begin episodes on fire at a 40% rate. Agents are rewarded for increasing a building's health, completing a building, putting out a fire, and completing all buildings (global reward only). Agents are penalized for a building burning down or for its health decreasing due to fire.

StarCraft The StarCraft multi-agent challenge (SMAC) [123] involves a set of agents engaging in a battle with enemy units in the StarCraft II video game. We consider the multi-task setting presented by Iqbal et al. [67], where each task presents a unique combination and quantity of agent and enemy types.
As we do for SaveTheCity, we train simultaneously on tasks with variable types and quantities of agents. We consider four settings based on those presented in [67], consisting of unique combinations of unit types which require varying strategies in order to succeed. In one set of settings we give agents and enemies a symmetric (i.e. matching types) set of units, while in the other agents have one fewer unit than the enemies. Within each set we consider two different pools of unit types: "Stalkers and Zealots" (S&Z) and "Marines, Marauders, and Medivacs" (MMM). The former includes a mixture of melee and ranged units, while the latter includes units that are capable of healing. In each setting, there can be between 3 and 8 agents and 2 or 3 enemy armies. Each enemy army consists of up to 4 units. Stalkers are units that are capable of shooting enemies from afar and are useful for causing damage without taking as much damage, as they can run immediately after shooting. Zealots are melee units (i.e. they must walk up to their enemies to damage them), and they are especially strong against Stalkers. Marines are ranged units that are relatively weak with respect to damage output and health. Marauders are also ranged units and have more health and damage output. Medivacs are ships that float above the battlefield and are able to heal their friendly units. Notably, multiple Medivacs cannot heal the same unit simultaneously, so agents can overpower a Medivac's healing by targeting the same unit. Agents are rewarded for damaging enemy agents' health, defeating enemy agents, and defeating enemy armies. The global reward also includes an additional reward for defeating all armies.

Results In our experimental validation we aim to answer the following questions: 1) Learning Efficacy: Is ALMA effective in improving learning efficiency and asymptotic performance in comparison to state-of-the-art hierarchical and non-hierarchical MARL methods? 2) Allocation Strategic Complexity: Are the learned allocation policies non-trivial, or are any benefits of ALMA purely gained from the subtask decomposition? 3) Assumption Validity: What, if anything, do we gain from the subtask independence assumptions, and how does ALMA fare when these assumptions are broken? 4) End-to-End Training: Do we receive any benefits from training allocation and execution controllers jointly?

Learning Efficacy First, we aim to evaluate the effectiveness of our hierarchical abstraction in comparison to state-of-the-art cooperative MARL methods in variable-entity multi-task settings. We compare to two non-hierarchical multi-agent baselines: QMIX [119] augmented with self-attention [152] for the purpose of extracting information from variable quantities of entities (A-QMIX), and REFIL [67], a state-of-the-art approach built on top of A-QMIX for generalizing across tasks with different compositions of agent and entity types. Both methods are referred to as "Flat" in our figures and differ depending on the environment. REFIL is used for StarCraft tasks, but we find that it does not significantly improve on A-QMIX in SaveTheCity, so we use A-QMIX in that setting. These methods also serve as the learning algorithms for the low-level controllers in our hierarchical methods. Next, we compare to a state-of-the-art approach in hierarchical MARL for varying-entity settings: COPA [95]. COPA makes the assumption, similar to ALMA, of a centralized agent that is able to communicate with decentralized agents on a periodic basis.
We ensure that the communication frequency of COPA matches that of our method and that it is provided subtask labels for entities, so it receives a similar amount of information, even though it does not explicitly leverage the subtask decomposition.

Figure 7.6: Walk-through of a learned ALMA policy on a sample StarCraft task. Agents start (t = 0) at the center, and two enemy armies immediately begin to attack (a). ALMA learns there is an advantage in numbers (e.g. "focus firing" on single enemies) and allocates most agents to Subtask 1 (b). One Zealot is additionally assigned to each attacking army to prevent the base from being stormed (c-d). This strategy was determined from the very first time-step (a). Around t = 40, a Stalker is reallocated to Subtask 3 (e), as the Zealot previously assigned was defeated, leaving a vulnerability, and Subtask 1 has almost been completed by the other agents. Once Subtask 1 is complete (t = 45), most agents are allocated to Subtask 3, the largest army, but one Stalker (f) is allocated to assist the Zealot at Subtask 2 (g), since it is low on health. Once Subtask 2 is complete, the remaining agents are allocated to Subtask 3 to complete the task.

In both SaveTheCity and StarCraft, we find that ALMA is able to outperform all baselines in most settings. Both A-QMIX (Flat) and COPA are unable to converge to a reasonable policy within the allotted timesteps in SaveTheCity (Fig. 7.4), and REFIL (Flat) and COPA are only competitive with ALMA, in terms of both sample efficiency and asymptotic performance, in one setting: MMM Symmetric (Fig. 7.5d). These results highlight the importance of ALMA's hierarchical abstraction, as well as of the assumptions made in order to accelerate learning of low-level policies. Interestingly, ALMA appears to supply the greatest performance gains on the environments that are most difficult, as judged by the absolute success rates. ALMA stands out in S&Z disadvantage (Fig. 7.5), where the next best methods achieve 20% success, while it only roughly matches the performance of the top-performing methods in MMM symmetric, where the highest success rates are around 75%. We hypothesize that when agents are faced with such a disadvantage, coordinated strategy becomes more crucial.

Allocation Strategic Complexity Next, we hope to learn whether our allocation controller is learning complex, non-trivial strategies. As such, we implement allocation heuristics derived from domain knowledge in each setting and only learn the low-level controllers. These methods serve as a strong baseline for learned allocation, as they simplify the learning problem for low-level controllers by leveraging the subtask decomposition assumptions (i.e. masking subtask-irrelevant entities) and having a fixed high-level allocation strategy. For SaveTheCity, the heuristic allocates each agent to the nearest building at which it is most useful according to its capabilities. While relatively simple, devising a more sophisticated allocation strategy is nontrivial, as there are many factors to weigh. For StarCraft, we devise two heuristics: the first (matching) allocates agents to enemy armies by matching unit types such that each individual battle is fair. The other (dynamic) only considers enemy armies that are currently attacking and also attempts to match the unit composition.
In SaveTheCity (Fig. 7.4) we find the heuristic is able to learn quickly at the beginning as the action-level policies improve, but it converges lower than ALMA. In StarCraft (Fig. 7.5) we find that at least one of the heuristics can be competitive in some cases (e.g. S&Z Symmetric), although which heuristic performs best is not consistent across settings, and ALMA always outperforms it. With these results we can conclude that decomposing the task into subtasks to be solved by simpler task execution controllers is not sufficient for superior performance in our environments, and sophisticated high-level strategy must be learned. In Fig. 7.6 we qualitatively demonstrate some of the strategies that ALMA is able to learn. We find that ALMA is able to allocate agents to subtasks in a manner that considers long-term consequences and balances priorities gracefully (e.g. defending the base vs. defeating enemies).

Assumption Validity In order to validate Assumptions 7.3.1 and 7.3.2, we ablate our approach to exclude the task-specific masking performed on each agent's observation. This method is referred to as "No mask" in Figures 7.4 and 7.5, where we find that our method experiences a significant deterioration in performance when not masking irrelevant information.

Figure 7.7: Further analysis in the StarCraft S&Z disadvantage setting. Left (a): varying spread of enemy armies. Right (b): zero-shot evaluation with alternate action policies.

To test the boundaries of these assumptions, we evaluate on settings in StarCraft which violate them. While we train the agents only in cases where enemy armies are maximally spread out from each other along the boundary of a circle surrounding the agents' base, in our evaluation setting we allow armies to spawn arbitrarily close to one another. In this case, for example, agents assigned to one army may be attacked by other armies or bump into agents attacking other armies, violating subtask transition independence (Asm. 7.3.1). We plot the performance of ALMA alongside REFIL (Flat) in Fig. 7.7a, where the distance between armies varies along the x-axis. The further right, the more similar the tasks are to what is seen during training. While REFIL's performance deteriorates smoothly as the composition of subtasks becomes more dissimilar to those seen during training, ALMA maintains steady performance and even sees a slight uptick in performance towards the middle. While surprising, this difference in generalization capability can be attributed to the modularity of our approach. Whereas REFIL treats each unique composition of subtasks as novel, the modules comprising our approach (specifically the subtask-execution controllers and allocation proposal network) only see information relevant to their subtask, and these subtasks are individually drawn from the same distribution seen during training. ALMA's increase in performance in the middle can be attributed to the fact that armies being closer together makes it easier for agents to switch between tasks. Ultimately, as armies become closer together, our assumptions begin to fail, subtasks blur together, and we finally see ALMA's performance drop off.

End-to-End Training Although ALMA is able to learn all components (subtask allocation and execution/action controllers) jointly, we evaluate its ability to utilize agent policies trained by a different method in a zero-shot manner.
In particular, we examine what happens when we replace ALMA's action-level policies with those trained with a heuristic. We note that the procedure for training the low-level controllers is identical between these methods, and the only difference is the allocation strategy used throughout training. We see in Figure 7.7b that, without further training, ALMA's superior allocation strategy is able to quadruple the effectiveness of the policies learned with the heuristic. Ultimately, though, the best performance is only attainable through training allocation and execution controllers jointly, as the execution controllers can learn to succeed in the settings the allocation controllers put them in, and vice versa, creating a virtuous feedback loop.

7.5 Conclusion

In this work we have introduced ALMA, a general learning method for tackling a variety of decomposable multi-agent settings in which each subtask's rewards and transitions can be assumed to be independent. By simultaneously learning a high-level allocation policy and action-level agent policies, ALMA is able to succeed in settings which are difficult for flat methods and in which well-performing heuristics are difficult to come by. One promising direction for future work is to equip ALMA's allocation policy with more sophisticated exploration methods. Although the stochasticity resulting from the $\epsilon$-greedy procedure and the proposal distribution sampling yields surprisingly good results, a more intelligent exploration procedure could better alleviate the challenges inherent to learning over such a large combinatorial action space. We expect such an improvement would, for example, prevent some of the variance we observe in SaveTheCity (Figure 7.4). Another direction for future work is to discover subtasks and/or observation masks from the environment automatically. We believe this work has immediate practical implications in real-world settings where our subtask independence assumptions often hold, such as warehouse robotics.

Part IV

Conclusion

Chapter 8

Conclusion

In this thesis, we presented several approaches for discovering and leveraging the structure of complex real-world settings, both in terms of task structure and in terms of the interaction structure between agents. In Chapters 3 and 5 we proposed methods that learn value functions for multi-agent reinforcement learning with a greater structural understanding of relationships between agents and entities in a data-driven manner. These methods promote increased scalability and generalization to tasks with similar agent compositions. In Chapter 4 we proposed a method for MARL which allows agents to explore their state space in a structured manner that considers what other agents have explored. Then, in Chapters 6 and 7 we proposed methods for leveraging sub-task decompositions: by learning affordances to prune impossible sub-tasks, and by learning how to allocate agents to sub-tasks to maximize performance on the global task, respectively.

8.1 Future Work

Now we will discuss some promising directions in which the contributions of this dissertation may be built upon. The main thrust of these directions is to decrease reliance on human knowledge. For example, the methods proposed in Part III rely on manually defined subtask decompositions. While human knowledge can help increase sample efficiency in compute-constrained settings, Sutton [143] argues that, given enough compute and data, learning and search methods will overcome those that depend on human knowledge.
This effect has been borne out as increasingly large neural architectures consistently achieve greater performance on challenging vision [37] and NLP [20] tasks. As such, we propose several directions which reduce reliance on human-provided information but may be combined with the approaches we present in this dissertation.

Entity Decomposition from Unstructured Data In Chapters 5 and 7 our methods rely on an entity-based representation of the global state; however, this decomposition may not always be available. The state may be represented in a more unstructured manner. For example, we may receive a collection of sensor readouts, images, or other raw data from which semantically meaningful entities are not immediately identifiable. Work in object detection (e.g. Redmon et al. [121]) provides a strong foundation to build upon in the image domain; however, these approaches are fully supervised and only recognize a discrete set of object types defined by the dataset. Ideally, an approach to automated entity decomposition would work in an unsupervised manner and recognize important discrete entities in raw data. Their importance should be grounded in the rewards and dynamics of the underlying MDP. Efroni et al. [39] provide an approach for filtering out exogenous information (i.e. information irrelevant to the underlying dynamics) from an unstructured state; however, they do not provide an entity decomposition. Automated entity decomposition should similarly filter out exogenous information but also decompose the state into discrete, salient entities.

Automated Task Structure Discovery Chapters 6 and 7 rely on pre-existing sub-task decompositions defined by human knowledge. These decompositions take the form of multiple reward functions which indicate progress on independent objectives crucial for the completion of the global task. In order to discover these reward functions in an automated manner, we must possess a causal understanding of task dynamics and some notion of the degree of granularity and abstraction required for the decomposition. A possible data-driven approach could be to collect a large quantity of related tasks (e.g. cooking different recipes), train agents to near optimality on these tasks, and then identify common sub-patterns of behavior in these optimal agents as a way to discover meaningful sub-tasks (e.g. peeling a potato). We could then distill the optimal agents' behavior into modular sub-policies that we could ideally recombine into policies for novel tasks not seen in the set used to identify the sub-tasks.

Unifying Task Hierarchy and Allocation Finally, we propose the direction of unifying the ideas presented in Chapters 6 and 7. Many real-world tasks exhibit both sub-task dependency structures and the opportunity for agents to "divide and conquer" based on their individual specialties. Consider, once again, the construction example presented in Chapter 1. Construction of a building is a long-horizon task which requires the assembly of components which are then combined to form larger components until the final building is completed. Completing a building requires an understanding of the dependency structure of the sub-tasks that must be completed (e.g. the foundation must be completed before support beams are installed). This knowledge can accelerate the rate at which we learn how agents should be allocated to sub-tasks.
The key challenge of combining these approaches would be to do so in a manner that is not unwieldy and requires complicated implementation and integration of disparate components. 122 Bibliography [1] Akshat Agarwal, Sumit Kumar, and Katia Sycara. “Learning transferable cooperative behavior in multi-agent teams”. In: arXiv preprint arXiv:1906.01202 (2019). [2] Adrian K Agogino and Kagan Tumer. “Analyzing and visualizing multiagent rewards in dynamic and stochastic domains”. In: Autonomous Agents and Multi-Agent Systems 17.2 (2008), pp. 320–338. [3] Ferran Alet, Tomás Lozano-Pérez, and Leslie P Kaelbling. “Modular meta-learning”. In: Conference on Robot Learning. PMLR. 2018, pp. 856–868. [4] Jacob Andreas, Dan Klein, and Sergey Levine. “Modular multitask reinforcement learning with policy sketches”. In: International Conference on Machine Learning. PMLR. 2017, pp. 166–175. [5] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. “Hindsight experience replay”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pp. 5055–5065. [6] Sébastien Arnold, Shariq Iqbal, and Fei Sha. “When maml can adapt fast and how to assist when it cannot”. In: International Conference on Articial Intelligence and Statistics. PMLR. 2021, pp. 244–252. [7] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. “Multiple object recognition with visual attention”. In: International Conference on Learning Representations. 2015. [8] Pierre-Luc Bacon, Jean Harb, and Doina Precup. “The option-critic architecture”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 31. 1. 2017. [9] Akhil Bagaria and George Konidaris. “Option discovery using deep skill chaining”. In: International Conference on Learning Representations. 2019. [10] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: International Conference on Learning Representations. 2015. [11] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. “Emergent Tool Use From Multi-Agent Autocurricula”. In: International Conference on Learning Representations. 2019. 123 [12] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. “Emergent Complexity via Multi-Agent Competition”. In: International Conference on Learning Representations. 2018. [13] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. “Unifying count-based exploration and intrinsic motivation”. In: Advances in Neural Information Processing Systems. 2016, pp. 1471–1479. [14] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. “Neural combinatorial optimization with reinforcement learning”. In: arXiv preprint arXiv:1611.09940 (2016). [15] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. “Machine learning for combinatorial optimization: a methodological tour d’horizon”. In: European Journal of Operational Research (2020). [16] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning”. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09. Montreal, Quebec, Canada: Association for Computing Machinery, June 2009, pp. 41–48. [17] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. 
Asset Metadata
Creator: Iqbal, Shariq Nadeem (author)
Core Title: Identifying and leveraging structure in complex cooperative tasks for multi-agent reinforcement learning
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-08
Publication Date: 05/23/2022
Defense Date: 05/06/2022
Publisher: University of Southern California. Libraries (digital)
Tag: cooperative, hierarchical, MARL, multi-agent reinforcement learning, OAI-PMH Harvest, reinforcement learning, RL
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Matarić, Maja (committee chair), Luo, Haipeng (committee member), Savla, Ketan (committee member), Sha, Fei (committee member)
Creator Email: shariqiq@usc.edu, shariqiqbal2810@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111336167
Unique Identifier: UC111336167
Identifier: etd-IqbalShari-10725.pdf (filename)
Legacy Identifier: etd-IqbalShari-10725
Document Type: Dissertation
Rights: Iqbal, Shariq Nadeem
Internet Media Type: application/pdf
Type: texts
Source: 20220527-usctheses-batch-944 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu