ALGORITHMS AND SYSTEMS FOR CONTINUAL ROBOT LEARNING

by

Ryan Christopher Julian

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2021

Copyright 2021 Ryan Christopher Julian

Dedication

To Louise, my best friend, steadfast supporter, and the love of my life.

Acknowledgements

To my advisor Gaurav Sukhatme, thank you for seeing potential in me when others did not, and for reaffirming your faith in me during our every interaction. I am so grateful for your ever-present guidance, support, and advice throughout my PhD, and for providing me seemingly-boundless freedom and resources to pursue my ideas, even though many of them were unpopular, or required tremendous patience to bear fruit. I am forever indebted to Karol Hausman, who functionally served as my co-advisor, for adopting me as a collaborator and then mentee before the ink was even dry on his thesis, and for supporting me day-in and day-out, through ups and downs, abandoned ideas and bad drafts, deadlines met and missed, and rejections and triumphs for these past 4 years. Thank you to Stefan Schaal, who co-advised me for my first year, for telling me that all of my initial ideas were good but too easy, and for convincing me to instead focus on the much harder challenges of continual learning. Thank you to my committee members Satyandra Gupta, Joseph Lim, Heather Culbertson, and Stefanos Nikolaidis for their support, guidance, and helpful conversations. Additionally, I am eternally grateful to my undergraduate advisor Ron Fearing for taking a chance on an enthusiastic and overly-ambitious 3rd-year student at Berkeley, who just wanted to build some robots, and supporting me for years while I learned just how hard that really is.

I had the great privilege of spending the latter two years of my PhD both at USC and as a Student Researcher at Google Brain's Robotics Team, where I conducted much of the research presented in Chapters 3 and 5. Thank you to Karol Hausman for giving me such an amazing opportunity, to Chelsea Finn and Sergey Levine for your daily guidance, advice, feedback, and support on my projects at Google, and Vincent Vanhoucke for believing in the unique applied PhD I wanted to write, and giving me the resources and freedom to attempt it inside his research group. I would also like to thank Dmitry Kalashnikov, Jake Varley, Alex Irpan and Eric Jang for their help with QT-Opt and other Google infrastructure, Benjamin Swanson, Noah Brown, and Ivonne Fajardo for supporting the extensive experiments my projects required, and Ted Xiao, Yevgen Chebotar, Kanishka Rao, and Alexander Herzog for being generous collaborators who are always fun to work with.

I had the privilege of mentoring many amazing students at USC during my PhD. This thesis would not exist without the wonderful efforts of Zhanpeng He, Hejia Zhang, Chang Su, Angel Gonzalez Garcia, Jonathon Shen, Gitanshu Sardana, Kevin Cheng, Utkarsh Patel, Anson Wong, Keren Zhu, Zequn Yu, Wei Cheng, Nam Thai Hoang, Nick Combs, Avnish Narayan, Roufu Wang, Iris Liu, Ziyi Wu, Yong Cho, Nisanth Hedge, Linda Wong, Mishari Aliesa, Nicole Ng, Hayden Shively, Adithya Bellathur, Ujjwal Puri, and Yulun Zhang. I would like to thank you all for your enthusiasm, hard work, and loyalty during our projects together, and your patience as you helped me learn how to be a better mentor and advisor.
I would especially like to thank Zhanpeng He for his devotion to my earliest and most speculative research projects, Chang Su for lending her considerable talents to my earliest attempts at Garage, Angel Garcia and Gitanshu Sardana for doing the often-thankless work of making Garage a fully-qualified open source project, and Avnish Narayan for adopting Meta-World to be his own, and sticking with it through more than a year of toil to make it a truly world-class benchmark.

In addition to the students I mentored, I had the great fortune of working with amazing collaborators at USC. Thank you to Eric Heiden for working through the kinks of hatching our first projects together, and to my dear old friend K.R. Zentner for her work on Garage, Meta-World, and Skill Builder, and for always being a willing and enthusiastic sounding board for my crazy research ideas. In addition to K.R., I would like to thank David Millard and Aravind Kumaraguru for joining me in Los Angeles from our past lives, and for sharing some of your graduate school journey with me. Thank you to the other members of the Robotic Embedded Systems Lab (RESL) and the Robotics and Autonomous Systems Center, including Chris Denniston, Aleksei Petrenko, James Preiss, Isabel Rayas, Gautam Salhotra, Peter Englert, Artem Molchanov, Giovanni Sutanto, Stephanie Kemna, Shao-Hua Sun, Max Pflueger, Arnout Devos, Franzi Meier, and Sean Mason, for providing a supportive and friendly environment which allowed me to grow, learn, and make new friends.

I enjoyed advice and support from friends, colleagues, and mentors throughout my PhD, for which I am truly grateful. I would like to thank Kate Rakelly for always being a source of friendship and great grad school advice, Abhishek Gupta, Vitchyr Pong, Suraj Nair, and Benjamin Eysenbach for exchanging thoughts on our research ideas, and Peter Pastor and Mrinal Ramakrishnan for giving me early advice which set me up for success in my PhD. I'd especially like to thank Torsten Kroeger for being a wonderful mentor to me throughout my career in robotics.

No man succeeds alone, and I have my family and friends to thank most of all for the opportunity to pursue a PhD. Firstly, my parents Paul and Barbara Julian and my big brother Matthew for raising me, and continuing to support me, with unbounded love and patience, including and especially when I was ungrateful, and for always finding ways to nurture my interests even though they are quite different from your own. I have many primary and secondary school teachers to thank, who gave many second chances to a sometimes-troubled kid, including Matthew DeNote, Julie Harris, Manfred Dreilich, and Steven Kessel. Thank you to Marian Passmore and Andy Bradley, for giving me my first experiences in robotics and engineering leadership in FIRST Robotics. Thank you to the many friends who have supported me along this journey, including Robert Luan, Devyn Russell, Leo Keselman, Vanathi Ganesh, and my partners in crime Humphrey Hu, Xiao-Yu Fu, and John Wang. My apologies to Dennis Wai and Joey Knightbrook, for not spending more time together during my PhD, even though I moved to Los Angeles; hopefully this thesis is some consolation, and a window into what I was up to. Thank you to Ping Wang, Nina Mao, and Mao Mao for providing me with a family home on the West Coast, and for treating me as if I were your own 儿子 (son) or 哥哥 (older brother).
Finally, this thesis is dedicated to my partner Louise Mao, who stood behind me at every step of the way, from my admission just days after we met, through years of a relationship in superposition, to my defense more than four years later, and without whose constant love, support, and patience none of this would have been possible.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction

Chapter 2: Scaling Simulation-to-Real Transfer by Learning a Latent Space of Robot Skills
  2.1 Introduction
  2.2 Related Work
  2.3 Technical Approach
    2.3.1 Preliminaries
    2.3.2 Overview
    2.3.3 Skill Embedding Learning Algorithm
      2.3.3.1 Skill Embedding Criterion
    2.3.4 Simulation-to-Real Training and Transfer Method
      2.3.4.1 Stage 1: Pre-Training in Simulation while Learning Skill Embeddings
      2.3.4.2 Stage 2: Learning Hierarchical Policies
      2.3.4.3 Stage 3: Transfer and Execution
      2.3.4.4 Post-smoothing Trajectories
  2.4 Characterizing the Behavior of the Latent Skill Space
    2.4.1 Experiments
      2.4.1.1 Point Environment
      2.4.1.2 Sawyer Experiment: Reaching
      2.4.1.3 Sawyer Experiment: Box Pushing
    2.4.2 Analysis
  2.5 Using Model Predictive Control for Zero-Shot Sequencing in Skill Latent Space
    2.5.1 Method
      2.5.1.1 Reusing the Simulation for Foresight
      2.5.1.2 Adaptation Algorithm
    2.5.2 Experiments
      2.5.2.1 Sawyer: Drawing a Sequence of Points
      2.5.2.2 Sawyer: Pushing the Box through a Sequence of Waypoints
    2.5.3 Results
      2.5.3.1 Sawyer: Drawing a Sequence of Points
      2.5.3.2 Sawyer: Pushing a Box along a Sequence of Waypoints
    2.5.4 Analysis
  2.6 Conclusion

Chapter 3: Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
  3.1 Introduction
  3.2 Related Work
  3.3 The Multi-Task and Meta-RL Problem Statements
    3.3.1 Multi-task RL problem statement
    3.3.2 Meta-RL problem statement
  3.4 Meta-World
    3.4.1 The Space of Manipulation Tasks: Parametric and Non-Parametric Variability
    3.4.2 Actions, Observations, and Rewards
    3.4.3 Evaluation Protocol
  3.5 Experimental Results and Analysis
  3.6 Conclusion and Directions for Future Work
  3.7 Task Descriptions
  3.8 Task Rewards and Success Metrics
  3.9 Benchmark Verification with Single-Task Learning
  3.10 Learning Curves
  3.11 Hyperparameter Details
    3.11.1 Single Task SAC
    3.11.2 Single Task PPO
    3.11.3 Multi-Task SAC
    3.11.4 Multi-Task Multi-Headed SAC
    3.11.5 Multi-Task PPO
    3.11.6 Multi-Task TRPO
    3.11.7 Task Embeddings
    3.11.8 PEARL
    3.11.9 RL^2
    3.11.10 MAML

Chapter 4: Garage: Reproducibility, Stability, and Scale for Experimental Meta- and Multi-Task Reinforcement Learning
  4.1 Introduction
  4.2 Algorithms and Approach
  4.3 Garage: An Expressive Software Library for Reproducible RL Research
    4.3.1 Motivation, Audience, and Scope
    4.3.2 Design Principles
    4.3.3 Design Overview
    4.3.4 Bringing Garage to the World: Community-Building and Documentation
  4.4 Experiments
    4.4.1 Random Seed Sensitivity of Meta- and Multi-Task RL
    4.4.2 Implementation Sensitivity in Meta- and Multi-Task RL: Small Details Make a Big Difference
      4.4.2.1 Reward Normalization
      4.4.2.2 Time limits in off-policy meta- and multi-task RL algorithms
      4.4.2.3 Maximum Entropy RL
      4.4.2.4 Effect of the "Subroutine Algorithm"
  4.5 Analysis
    4.5.1 Importance of Implementation Decisions in Meta- and MTRL
      4.5.1.1 Transfer and Adaptation Sample Efficiency
    4.5.2 Does MAML-TRPO actually adapt to shared structure tasks?
    4.5.3 Improving meta-RL and multi-task RL benchmarks
  4.6 Background and Related Work
  4.7 Additional Experiments
    4.7.1 Seed Sensitivity on ML10 and MT10
    4.7.2 Task Sampling Policies
    4.7.3 Returns in RL^2
  4.8 Comparing Garage Reference Implementations to Popular Baselines
    4.8.1 MT-PPO
    4.8.2 MT-TRPO
    4.8.3 MT-SAC
    4.8.4 MAML
    4.8.5 RL^2
    4.8.6 PEARL-SAC
  4.9 Hyperparameters
    4.9.1 MT-PPO
    4.9.2 MT-TRPO
    4.9.3 MT-SAC
    4.9.4 TE-PPO
    4.9.5 MAML
    4.9.6 RL^2
    4.9.7 PEARL
  4.10 Experiment Procedure Details
    4.10.1 Compute Infrastructure
    4.10.2 Data Processing
  4.11 Defining Consistent Metrics and Hyperparameters for Meta- and Multi-Task Reinforcement Learning
    4.11.1 Preliminaries
    4.11.2 Definitions and Hyperparameters for Meta- and Multi-Task RL
    4.11.3 Evaluation Procedure for Meta-RL
    4.11.4 Evaluation Metrics for Meta- and Multi-Task RL
    4.11.5 Statistical Comparisons for Meta- and MTRL Algorithms
    4.11.6 Training Procedure for Meta- and Multi-Task RL
  4.12 Conclusion

Chapter 5: Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning
  5.1 Introduction
  5.2 Related Work
  5.3 Identifying Adaptation Challenges
    5.3.1 Pre-training process
    5.3.2 Robustness of the pre-trained policy
  5.4 Large-Scale Experimental Evaluation
    5.4.1 A very simple fine-tuning method
    5.4.2 Evaluating fine-tuning in simulation
    5.4.3 Evaluating offline fine-tuning for real-world grasping
  5.5 Evaluating Offline Fine-Tuning for Continual Learning
  5.6 Empirical Analysis
    5.6.1 Performance and sample efficiency of fine-tuning
    5.6.2 Limitation: the early stopping problem
    5.6.3 Comparing initializing with RL to initializing with supervised learning
  5.7 Conclusion and Future Work
  5.8 Project Website and Experiment Videos
  5.9 Additional Experiments on Forgetting
  5.10 Assessing Fine-Tuning Techniques for End-to-End Robotic Reinforcement Learning
    5.10.1 Fine-Tuning: Conceptual Framework
    5.10.2 Experiments in simulation
      5.10.2.1 Adding a new head and other selective initialization techniques
    5.10.3 Training with a mix of data from the base and target tasks
  5.11 Additional Experiment Details
    5.11.1 Pre-Training Procedure
    5.11.2 Robustness Experiments with the Pre-Trained Policy
    5.11.3 Comparison Methods from Section 5.4
      5.11.3.1 "Scratch"
      5.11.3.2 "ResNet 50 + ImageNet"
    5.11.4 Data Collection and Performance Evaluation
    5.11.5 Continual Learning Experiment

Chapter 6: Skill Builder: A Simple Approach to Continual Learning for Manipulation
  6.1 Introduction
  6.2 Related Work
  6.3 Understanding the Role of Parameters and Behaviors in Skill Policy Reuse
    6.3.1 Transferring Skill Parameters
    6.3.2 Transferring Skill Behaviors
  6.4 Skill Builder: Simple Continual Learning with Skill-Skill Transfer
    6.4.1 Problem Setting
    6.4.2 Simple Continual Learning with Skill-Skill Transfer
    6.4.3 Skill Builder Learns Continually without Forgetting
  6.5 Efficient Continual Learning with Skill Curriculums
    6.5.1 Measuring Skill-Skill Transfer Cost
    6.5.2 Curriculum Selection Algorithm
    6.5.3 Measuring the Effectiveness of Curriculum Selection
  6.6 Towards a Skill Policy Model for Active Curriculum Selection
    6.6.1 Policy Model and Pre-Training Procedure
    6.6.2 Early Results for Continual Learning
  6.7 Conclusion

Chapter 7: Conclusions
  7.1 Future Directions
  7.2 Recommendations

Bibliography

List of Tables

3.1 Summary of success rates for 7 meta-RL and MTRL algorithms on Meta-World ML10, MT10, ML45, and MT50 benchmarks
3.2 Meta-World tasks and their descriptions
3.3 Reward functions used by each of the Meta-World tasks
3.4 Success metric used by each of the Meta-World tasks
3.5 Hyperparameters for Single Task SAC experiments on Meta-World
3.6 Hyperparameters for Single Task PPO experiments on Meta-World
3.7 Hyperparameters for Single Task PPO experiments on Meta-World
3.8 Hyperparameters for Single Task PPO experiments on Meta-World
3.9 Hyperparameters for Multi-Task PPO experiments on Meta-World
3.10 Hyperparameters for Multi-Task TRPO experiments on Meta-World
3.11 Hyperparameters for Task Embeddings experiments on Meta-World
3.12 Hyperparameters for PEARL experiments on Meta-World
3.13 Hyperparameters for RL^2 experiments on Meta-World
3.14 Hyperparameters for MAML experiments on Meta-World
4.1 Hyperparameters used for Garage experiments with Multi-Task PPO
4.2 Hyperparameters used for Garage experiments with Multi-Task TRPO
4.3 Hyperparameters used for Garage experiments with Multi-Task SAC
4.4 Hyperparameters used for Garage experiments with Task Embeddings PPO
4.5 Hyperparameters used for Garage experiments with MAML
4.6 Hyperparameters used for Garage experiments with RL^2
4.7 Hyperparameters used for Garage experiments with PEARL
4.8 Definitions of key hyperparameters in meta-RL and MTRL
5.1 Summary of the NSL adaptation challenges and their effect on the performance of the base policy
5.2 Summary of grasp success rates for NSL fine-tuning experiments
5.3 Summary of grasp success rates for the NSL continual learning experiment

List of Figures

1.1 Schematic diagram of thesis organization and our theory of progress in robot learning
2.1 Block diagram of Task Embeddings architecture for pre-training
2.2 Block diagram of proposed architecture for transfer learning with Task Embeddings
2.3 Plots of learned skill embedding distributions, and corresponding policy trajectories, from the Point Mass multi-task environment
2.4 Images of the matching simulation and real Sawyer Reach multi-task environments
2.5 Plots of gripper trajectories from pre-trained policies in the Sawyer Reach environment
2.6 Plot of gripper trajectory for the latent skill interpolation experiments in the Sawyer Reach environment
2.7 Plot of gripper trajectory for the DDPG latent skill composition experiment in the Sawyer Reach environment
2.8 Plot of gripper trajectory for the search-based planning in latent skill space experiment in the Sawyer Reach environment
2.9 Images of the matching simulation and real Sawyer Box Pushing multi-task environments
2.10 Plots of the gripper and object trajectories from pre-trained policies in the Sawyer Box Pushing environment
2.11 Plots of gripper and object trajectories for the latent skill interpolation experiment in the Sawyer Box Pushing environment
2.12 Plots of the gripper and object trajectories for the search-based planning in latent skill space experiment in the Sawyer Box Pushing environment
2.13 Plot comparing training performance of DDPG on latent skill action space parameterizations
2.14 Block diagram of proposed Simulator-Predictive Control architecture for transfer learning with skill embeddings
2.15 Plots of gripper trajectories for the SPC rectangle-drawing experiment in simulation
2.16 Plots of gripper trajectories for the SPC rectangle-drawing experiment on the real robot
2.17 Plots of gripper trajectories for the SPC triangle-drawing experiment in simulation
2.18 Plots of gripper trajectories for the SPC triangle-drawing experiment on the real robot
2.19 Plots of gripper and object trajectories for the SPC block-pushing experiment in simulation
2.20 Plots of gripper and object trajectories for the SPC block-pushing experiment on the real robot
2.21 Plots showing PPO's failure to learn a long-horizon waypoint-reaching policy despite a highly-shaped reward function
3.1 Image mosaic overview of the 50 manipulation tasks in Meta-World
3.2 Diagram illustrating the differences between parametric and non-parametric task variation
3.3 Image mosaic illustrating train and test tasks used in Meta-World ML1, ML10, and MT10
3.4 Plot comparing Meta-World ML1 generalization performance of RL^2, MAML, and PEARL
3.5 Plots comparing train and test performance of 7 algorithms on Meta-World ML10, MT10, ML45, and MT50
3.6 Plot of single-task performance of SAC and PPO on each of 50 Meta-World environments
3.7 Plots of learning curves for 7 meta-RL and MTRL algorithms on Meta-World ML10, MT10, ML45, and MT50
3.8 Plots of learning curves for RL^2, MAML, and PEARL on Meta-World ML1
4.1 Architecture diagram from the Garage reinforcement learning library
4.2 GitHub "stars" engagement over time for the Garage reinforcement learning library
4.3 Screenshot of Garage documentation website
4.4 Plot of learning curves with seed sensitivity of garage meta-RL and MTRL algorithms on Meta-World ML45/MT50
4.5 Plot of learning curves from the reward normalization ablation study for RL^2-PPO on Meta-World ML10
4.6 Plot of learning curves from the reward normalization ablation study for MT-SAC on Meta-World MT10
4.7 Plot of learning curves from the time limits ablation study of PEARL and MT-SAC
4.8 Plot of t-SNE projection of latent task vectors learned by PEARL on Meta-World ML10
4.9 Plot of learning curves from the maximum entropy ablation study on RL^2-PPO and MT-PPO
4.10 Plot of learning curves from the subroutine algorithm ablation study of MAML and RL^2
4.11 Plot of adaptation sample efficiency of MAML-TRPO, PEARL-SAC, and RL^2-PPO on Meta-World ML45
4.12 Plot showing gains to adaptation for MAML-TRPO for Meta-World ML10 and ML45
4.13 Plot of learning curves with seed sensitivity of garage meta-RL and MTRL algorithms on Meta-World ML10/MT10
4.14 Plot of learning curves from the task sampling policy study for MT-PPO
4.15 Plot of learning curves from the return processing ablation study for RL^2-PPO
4.16 Plot comparing performance of garage and OpenAI Baselines implementations of MT-PPO
4.17 Plot comparing performance of garage and OpenAI Baselines implementations of MT-TRPO
4.18 Plot comparing performance of garage and rlkit implementations of MT-SAC on Meta-World ML1-push
4.19 Plot comparing performance of garage and rlkit implementations of MT-SAC on Meta-World ML1-reach
4.20 Plot comparing performance of garage and ProMP implementations of MAML-TRPO on Meta-World ML1-push
4.21 Plot comparing performance of garage and ProMP implementations of MAML-TRPO on Meta-World ML1-reach
4.22 Plot comparing performance of garage and ProMP implementations of RL^2-PPO on Meta-World ML1-push
4.23 Plot comparing performance of garage and ProMP implementations of RL^2-PPO on Meta-World ML1-reach
4.24 Plot comparing performance of garage and the original authors' implementations of PEARL-SAC on Meta-World ML1-push
4.25 Plot comparing performance of garage and the original authors' implementations of PEARL-SAC on Meta-World ML1-reach
4.26 Plot comparing performance of garage and the original authors' implementations of PEARL-SAC on Meta-World ML1-pick-place
5.1 Diagram and images summarizing the performance of NSL on 5 adaptation challenges
5.2 Image mosaic of views from the robot camera for the base grasping task and each of five NSL adaptation challenges
5.3 Block diagram summarizing the NSL fine-tuning algorithm
5.4 Plots demonstrating the performance of NSL compared to relevant baselines, and images demonstrating the simulated adaptation challenge
5.5 Block diagram illustrating the continual learning experiment with the NSL algorithm
5.6 Plot of adaptation sample efficiency of NSL on three adaptation challenges
5.7 Plot demonstrating the early stopping problem of choosing a number of gradient steps for NSL fine-tuning
5.8 Plot and block diagram illustrating the parameter change magnitudes induced in a Q-function network by NSL fine-tuning for 5 adaptation challenges
5.9 Plots of grasp success rate during training of NSL and relevant baselines, for both the base and target tasks
5.10 Plot comparing grasp success rates during training for two Q-function fine-tuning architectures
5.11 Plot comparing grasp success rates during fine-tuning for several ratios between base and target task data
6.1 Performance of different target policy classes used to understand transferring skill parameters, and illustration of the CarGoal environment
6.2 Performance of skill switching on the CarGoal environment
6.3 Overview diagram of Skill Builder
6.4 Images of Meta-World ML10 and MT10 environments used for experiments with Skill Builder
6.5 Performance of Skill Builder on Meta-World ML10 and ML45 using Dueling Fine-Tuning selection rules
6.6 Matrix of skill-skill transfer costs for Meta-World MT10, using the AUC cost metric
6.7 Predicted optimal and pessimal curriculum MSTs for Meta-World MT10
6.8 Comparison of the performance-sample efficiency frontier for several curriculums for Meta-World MT10 with Skill Builder
6.9 Diagram of the Skill Mixer policy model class for active skill curriculum selection
6.10 Examples of successful and failed policy executions for adapting to the Meta-World window-open task, using Skill Mixer
6.11 Skill Mixer policy performance during fine-tuning

Abstract

The last decade has seen the rapid evolution of machine learning (ML) from an academic curiosity to an essential capability of modern computing systems, and with it has come an explosion of ML applications which help humans in the real world every day. Whether machine learning can effect such an evolution in the nearby field of robotics—concerned with automating tasks in the physical, rather than the digital, world—remains to be seen. Unlike general-purpose computers, most robots today are still laboriously hand-programmed, single-task machines.

For robots, as embodied agents, continual learning is a uniquely salient problem setting. Robots are physical intelligent agents, which necessarily inhabit a dynamic and quickly-changing world, all the variations of which cannot possibly be anticipated, collected, and trained for prior to deployment. A key way machine learning could enable practical, everyday, real-world robots is by allowing robots to efficiently learn to perform many new skills, rather than only allowing them to generalize or perfect the existing skills which the robot has been laboriously hand-designed, programmed, and possibly trained with machine learning to perform. Though the last decade has seen an explosion of new methods, remarkably few of these exhibit properties which make them good candidates for building an efficient, continual, multi-task robot skill learning system. This discussion of course raises the question: given a novel method, how should we assess its suitability for building a continually-learning multi-task robot? In other words, how can we systematically ensure progress towards new real-world capabilities, and avoid getting trapped, running in circles studying novel trivialities?
This thesis will argue that three elements—benchmarks, baselines, and novel methods—together form the three-legged stool of research in artificial intelligence. Robotics can only make progress towards the goal of real-world, general-purpose, continually-learning robots by periodically advancing each of these legs, hopefully creating a virtuous cycle of new challenges, new systems, new solutions, and ultimately new knowledge. This thesis follows one such cycle for a small slice of the field, namely continual learning for robotic manipulation.

This thesis illustrates the effectiveness of this cycle by example. In Chapter 2, we will kick-start the cycle with a new method from machine learning, and study how it can be best used to allow real robots to generalize using manipulation skills they already have. We take what we learn about the challenge of skill generalization in Chapter 2, and in Chapter 3 use it to develop a new benchmark. This benchmark measures generalization in learning for robotic manipulation. It is grounded by real applications such as cooking, assembly, and logistics, and imposes realistic limitations, such as limited control bandwidth, and limits on how many variations of a task a robot can experience during training. The scale and diversity of this new benchmark spawns new challenges in designing systems that can actually solve it, which we address in Chapter 4 by evolving a new baseline library for implementing, testing, and disseminating new methods in adaptive and continual RL. We share this library with the research community, and use it to better our own understanding of why current multi-task and meta-RL algorithms cannot solve our benchmark. In Chapters 5 and 6 we use this empirical, systems-driven approach to study important problems in continual learning for robotics, namely continual intra-skill and inter-skill generalization, and use benchmarks and baselines to introduce new methods for solving these problems. We conclude in Chapter 7 with a review of what we have learned, and some thoughts on how we should begin the next cycle of innovating benchmarks, baselines, and methods for continual robot learning.

Chapter 1: Introduction

The last decade has seen the rapid evolution of machine learning (ML) from an academic curiosity to an essential capability of modern computing systems, and with it has come an explosion of ML applications which help humans in the real world every day. Better ML has enabled economically-viable automation of many tasks previously off-limits to computing, such as those involving natural language, computer vision, and predicting the behavior of the natural world, to name just a few. Whether machine learning can effect such an evolution in the nearby field of robotics—concerned with automating tasks in the physical, rather than the digital, world—remains to be seen. Unlike general-purpose computers, most robots today are still laboriously hand-programmed, single-task machines.

For robots, as embodied agents, continual learning is a uniquely salient problem setting. Purely digital intelligent systems perform inference on relatively stationary facts about the world. These facts are satisfactorily modeled using static datasets as a proxy for the real world, such as labeled photographs of objects for image classification, or parallel corpora of documents for machine translation.
Robots are physical intelligent agents, which necessarily inhabit a dynamic and quickly-changing world, all the variations of which cannot possibly be anticipated, collected, and trained for prior to deployment. That is, all robot learning problems are necessarily non-stationary learning problems. Additionally, as they are ultimately real machines which cost money and consume space, the utility of a given robot to its user is significantly enhanced by, if not entirely predicated on, the notion of the robot as a general-purpose machine, i.e. one which is useful for many tasks, including those which its designers did not anticipate. While not the only way for a robot to justify its expense, endowing a single machine with the ability to serve many real-world uses was essential to the widespread deployment of computers in everyday life, and there's scant reason to believe that the same should not be true for robots.

So a key way in which machine learning could enable practical, everyday, real-world robots is by allowing robots to efficiently learn to perform many new skills, rather than only allowing them to generalize or perfect the existing skills which they have been laboriously hand-designed, programmed, and possibly trained with machine learning to perform. As we do not have complete information on all skills a robot might be asked to perform at design time, in the parlance of machine learning, this setting is necessarily a continual learning (CL) problem. Specifically, we seek robots which can efficiently and continually learn new manipulation skills which allow them to do work on the world, while presumably making the best use of skills they have already learned.

The continual robot learning problem is related to, but not identical to, few-shot learning, meta-learning, and multi-task learning for robotics. While, as we will show in Chapter 6, one can certainly reduce the CL problem to one of iterated few-shot learning of multiple skills, our experiments in Chapter 5 show that the properties of few-shot learning algorithms that are well-suited for one-time adaptation are different from those that are suited for use as CL building blocks. Likewise, the properties of a good multi-task learning algorithm are desirable, but not sufficient, for continual multi-task learning. Indeed, we will show in Chapter 3 that, even among few-shot reinforcement learning (FSRL) algorithms and multi-task reinforcement learning (MTRL) algorithms, the properties of algorithms well-suited for adapting a single manipulation skill to a new context (intra-skill adaptation) are fundamentally different from those well-suited for adapting a library of independent manipulation skills to acquire a new manipulation skill (inter-skill adaptation). So though the last decade has seen an explosion of new methods in ML, RL, few-shot RL, and multi-task RL, remarkably few of these exhibit properties which make them good candidates for building an efficient, continual, multi-task robot skill learning system.

This discussion of course raises the question: given a novel method, how should we assess its suitability for building a continually-learning multi-task robot? In other words, how can we systematically ensure progress towards new real-world capabilities, and avoid getting trapped, running in circles studying novel trivialities? This thesis argues that the answer to this question is three-fold.
First, we need to unambiguously encode and then disseminate our desiderata in the form of benchmarks, each of which represents a worthy model problem on the path to real-world deployment. Benchmarks provide a yardstick against which we can measure our progress towards real-world robots, and a tool for comparing the relative performance of new methods to old ones.

Second, we need to develop and disseminate standard baseline implementations of methods old and new, using common abstractions fitted to the problems at hand. These systems provide a common language for sharing not just ideas, but their concrete, unambiguous manifestations in the form of source code and designs, allowing us to succinctly communicate the differences and commonalities between methods.

Third, we need novel methods to push the field to new capabilities by solving the real-world problems summarized by benchmarks. These capabilities not only demonstrate new possibilities for robotic applications, they inform our understanding of the science and systems of the robot learning problem. This allows us to evolve the aforementioned benchmarks and baselines, creating a positive feedback cycle of progress.

Figure 1.1: Schematic diagram of our theory of progress in robot learning, and how this thesis is organized. The first of many in this document.

This thesis illustrates the effectiveness of this cycle by example. In Chapter 2, we will kick-start the cycle with a new method from machine learning, and study how it can be best used to allow real robots to generalize using manipulation skills they already have. We take what we learn about the challenge of skill generalization in Chapter 2, and in Chapter 3 use it to develop a new benchmark. This benchmark measures generalization in learning for robotic manipulation. It is grounded by real applications such as cooking, assembly, and logistics, and imposes realistic limitations, such as limited control bandwidth, and limits on how many variations of a task a robot can experience during training. The scale and diversity of this new benchmark spawns new challenges in designing systems that can actually solve it, which we address in Chapter 4 by evolving a new baseline library for implementing, testing, and disseminating new methods in adaptive and continual RL. We share this library with the research community, and use it to better our own understanding of why current multi-task and meta-RL algorithms cannot solve our benchmark. In Chapters 5 and 6 we use this empirical, systems-driven approach to study important problems in continual learning for robotics, namely continual intra-skill and inter-skill generalization, and use benchmarks and baselines to introduce new methods for solving these problems. We conclude in Chapter 7 with a review of what we have learned, and some thoughts on how we should begin the next cycle of innovating benchmarks, baselines, and methods for continual robot learning.

In the next chapter, we begin our tour around this cycle by asking the question "How can we use new methods from machine learning to generalize robot manipulation policies quickly and easily?"

Chapter 2: Scaling Simulation-to-Real Transfer by Learning a Latent Space of Robot Skills

We present a strategy for simulation-to-real transfer, which builds on recent advances in robot skill decomposition.
Rather than focusing on minimizing the simulation-reality gap, we propose a method for increasing the sample efficiency and robustness of existing simulation-to-real approaches which exploits hierarchy and online adaptation. Instead of learning a unique policy for each desired robotic task, we learn a diverse set of skills and their variations, and embed those skill variations in a continuously-parameterized space. We then interpolate, search, and plan in this space to find a transferable policy which solves more complex, high-level tasks by combining low-level skills and their variations. In this chapter, we first characterize the behavior of this learned skill space, by experimenting with several techniques for composing pre-learned latent skills. We then discuss an algorithm which allows our method to perform long-horizon tasks never seen in simulation, by intelligently sequencing short-horizon latent skills. Our algorithm adapts to unseen tasks online by repeatedly choosing new skills from the latent space, using live sensor data and simulation to predict which latent skill will perform best next in the real world. Importantly, our method learns to control a real robot in joint-space to achieve these high-level tasks with little or no on-robot time, despite the fact that the low-level policies may not be perfectly transferable from simulation to real, and that the low-level skills were not trained on any examples of high-level tasks. In addition to our results indicating a lower sample complexity for families of tasks, our method provides a promising template for combining learning-based methods with proven classical robotics algorithms such as model-predictive control (MPC).

2.1 Introduction

The Constructivist hypothesis proposes that humans learn to perform new behaviors by using what they already know [39]. To learn new behaviors, it proposes that humans leverage their prior experiences across behaviors, and that they also generalize and compose previously-learned behaviors into new ones, rather than learning them from scratch [54]. Whether we can make robots learn so efficiently is an open question.

Instead of drawing on prior experience when faced with a new task, the agents considered in most work on "deep" reinforcement learning (RL) acquire new behavior from scratch. They achieve superhuman performance in various continuous control [151] and game play domains [176], but can do so only for a single particular task. This in turn hampers their sample efficiency by requiring millions of samples to converge [55] in a new environment. While recent attempts in deep RL for robotics are encouraging [144, 33, 80], data efficiency and generality remain the main challenges of such approaches. Learning from scratch using these algorithms on a real robot is therefore a resource-intensive endeavor, e.g. by requiring multiple robots to learn in parallel for weeks [145].

One promising approach to overcome the sample inefficiency in deep RL for robotics is to train agents entirely in faster-than-real-time simulation, and transfer the learned policies to a real robot (i.e. sim2real). Sim2real approaches suffer from the "reality gap" problem: it is practically impossible to implement a perfect-fidelity simulation which exhibits all of the physical intricacies of the real world.
A learner trained in simulation and then deployed in the real world will eventually encounter a mismatch between its training-time and test-time experiences, and must either generalize across the reality gap, or adapt itself to the new real-world environment. Many sim2real methods focus directly on closing the reality gap or transferring across it, at the expense of significant up-front engineering effort and often additional sample complexity. Instead, we propose a method for increasing the sample efficiency of almost any sim2real method, which allows this effort and sample complexity to be amortized among many tasks.

Our contribution is a method for increasing the sample efficiency and robustness of the sim2real framework which exploits hierarchy, while retaining the flexibility and expressiveness of end-to-end RL approaches. As we will show, our method's hierarchical formulation also provides a template for mixing reinforcement learning with proven robotics algorithms, such as search-based planning and model-predictive control (MPC).

Consider the illustrative example of block stacking. One approach is to learn a single monolithic policy which, given any arrangement of blocks on a table, grasps, moves, and stacks each block to form a tower. This formulation is succinct, but requires learning a single sophisticated policy in simulation and then transferring that policy to the real world. We observe that block stacking, like many other practical robotics tasks, is easily decomposed into a few reusable primitive skills (e.g. locate and grasp a block, move a grasped block over the stack location, place a grasped block on top of a stack). We therefore divide the problem into two parts: learning to perform and mix the skills in general in simulation, and learning to combine these skills into particular policies which achieve high-level tasks in the real world.
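To make this two-part decomposition concrete, the sketch below shows one way the pieces could fit together at execution time: a frozen low-level skill policy trained in simulation exposes a latent skill input, and a separate high-level module (a learned composer policy, a search-based planner, or hand-written task logic) chooses which latent skill to run for each short horizon. This is a minimal illustration with hypothetical names and a gym-style environment interface, not the interface of our open-source implementation.

import numpy as np


class LatentSkillPolicy:
    """Frozen low-level skill policy pi(a | s, z), pre-trained in simulation."""

    def __init__(self, model):
        # `model` is any callable mapping a feature vector to a joint-space
        # action, e.g. a network restored from a simulation-training checkpoint.
        self.model = model

    def act(self, state, z):
        # Condition the skill policy by concatenating the observation with the
        # latent skill vector chosen by the high-level module.
        return self.model(np.concatenate([state, z]))


def run_high_level_task(env, skill_policy, choose_next_skill,
                        n_segments=10, horizon=20):
    """Achieve a high-level task by sequencing short-horizon latent skills.

    `choose_next_skill(state)` is the high-level composer: it could be a
    learned policy over the latent space, a planner, or hand-written task logic.
    """
    state = env.reset()
    for _ in range(n_segments):
        z = choose_next_skill(state)       # pick a skill variation in latent space
        for _ in range(horizon):           # execute it for a short, fixed horizon
            action = skill_policy.act(state, z)
            state, reward, done, info = env.step(action)
            if done:
                return state
    return state

The important property is that only the composer needs to be learned or planned per task; the low-level skill policy and its latent parameterization are reused unchanged.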
This chapter draws on earlier work published in a conference paper [117]. Here, we provide more detailed explanations and discussions of our sim2real framework. Additionally, this chapter newly introduces an algorithm and supporting experiments, based on the template of model-predictive control, which allows our method to be used for zero-shot transfer learning with real robots. Our algorithm is simple, and leverages both latent skill spaces and online simulation to minimize the number of real-robot samples used to acquire a new task. Our open-source implementation of the Task Embedding algorithm can be found at https://github.com/ryanjulian/embed2learn.

2.2 Related Work

Learning skill representations. Learning skill representations to aid in generalization has been proposed in works old and new. Previous works proposed frameworks such as Associative Skill Memories [193] and probabilistic movement primitives [217] to acquire a set of reusable skills.

Multi-task learning and hierarchical reinforcement learning. Our approach builds on the work of Hausman et al. [95], which learns a latent space that parameterizes a set of motion skills, and shows them to be temporally composable using interpolation between coordinates in the latent space. Importantly, this work shows that this idea, which was previously only evaluated in simulation using large amounts of training data, can be applied to real robotics problems with practical amounts of training data using a sim2real framework. In addition to learning reusable skills as in Hausman et al. [95], we characterize the behavior of these latent spaces, evaluate several methods for learning to compose latent skills to achieve high-level tasks with real robots, and propose a simple algorithm which allows these latent skill spaces to be used for zero-shot adaptation to new tasks on real robots. Similar latent-space methods have recently been used for hierarchical RL in simulation [100, 89, 211]. Other approaches introduce particular model architectures for multitask learning, such as Progressive Neural Networks [221], Attention Networks [206], and ensembles of multiple expert models [207, 141].

Using latent skill representations for exploration. Exploration is a key problem in robot learning, and our method uses latent skill representations to address this problem. Using learned latent spaces to make exploration more tractable is also studied by Gu et al. [80], Eysenbach et al. [61], and Gupta et al. [85]. Our method exploits a latent space for task-oriented exploration: it uses model-predictive control and simulation to choose latent skills which are locally-optimal for completing unseen tasks, then executes those latent skills on the real robot.

Transfer learning. Common approaches to simulation-to-reality (sim2real) transfer learning include randomizing the dynamics parameters of the simulation [196], and varying the visual appearance of the environment [223, 263, 113], both of which scale the amount of computation needed to learn a transfer policy linearly or quadratically. Another approach is explicit alignment: given a mapping of common features between the source and target domains, domain-invariant state representations [268], or priors on the relevance of input features [137], can further improve transferability.

The method we introduce in this chapter is orthogonal to these existing sim2real approaches. Instead of varying or adapting the simulation parameters, we approach the transfer learning problem from a control perspective. We aim to learn a policy that can accomplish a diverse set of composable skills in simulation, so that it can generalize to many tasks when applied in the real world. Our goal is to reduce the training effort needed on the real system by leveraging the library of skills learned previously in the simulator, and subsequently only learning a high-level composing policy that leverages this embedding space to accomplish complex behaviors. As we show in our experiments, this framework leads to increased sample efficiency compared to typical reinforcement learning approaches, which need to retrain the entire policy to accomplish a new composition of skills. Ultimately, our approach can be combined with other sim2real methods, such as domain randomization or domain adaptation of the simulator, to further improve the robustness of the transfer learning process.

Other strategies, such as that of Barrett, Taylor, and Stone, reuse models trained in simulation to make sim-to-real transfer more efficient [10]. In contrast to our approach, however, this work requires an explicit pre-defined mapping between seen and unseen tasks. Yu et al. [294] learn control policies in a physical simulator and additionally implement online system identification to obtain a dynamics model that is used by the policy to adapt in the real world. Similarly, Preiss, Hausman, and Sukhatme [202] use a latent space for system identification to help specialize a policy to different domains. Sæmundsson, Hofmann, and Deisenroth [225] and Nagabandi et al.
[178] use meta-learning and learned representations to generalize 9 from pre-trained seen tasks to unseen tasks. These approaches, however, require that the unseen tasks to be very similar to the pre-trained tasks, and is few-shot rather than zero-shot. Our method is zero-shot with respect to real environment samples, and can be used to learn unseen tasks that are significantly out-of- distribution. In addition, our method can be used for composing learned skills in the time domain to achieve unseen tasks that are more complex than the underlying pre-trained task set. Reinforcement learning with MPC Using reinforcement learning with model-predictive control (MPC) has been explored in the past [235]. Kamthe and Deisenroth [120] proposed using MPC to increase the data efficiency of reinforcement algorithms by training probabilistic transition models for planning. Zhang et al. [298] employ MPC to generate training data for a neural network policy. This work takes a different approach, by exploiting our learned latent space and simulation directly to find policies for novel tasks online. Relationship to meta-learning This work is related to meta-learning methods [66, 215, 208], which seek to learn a single shared policy that can easily generalize to all skills in a task distribution. In this chapter, we learn an explicit latent space of robot skills that allows us to interpolate and sequence skills. Similarly, unlike recurrent meta-learning methods [56, 174], which implicitly address sequencing of a family of sub-skills to achieve goals, our method addresses generalization of single skills while providing an explicit representation of the relationship between skills. We show that this explicit representation allows us to combine our method with many algorithms for robot autonomy, such as optimal control, search-based planning, and manual programming, in addition to learning-based methods. Furthermore, our method can be used to augment most existing reinforcement learning algorithms, rather than requiring the formulation of an entirely new family of algorithms to achieve its goals. 10 Simultaneous work The MPC-based experiments in this work are closely related to simultaneous work performed by Co-Reyes et al. [211]. Whereas our method learns an explicit skill representations using pre- chosen skills identified by a known ID, Co-Reyes et al. learn an implicit skill representation by clustering trajectories of states and rewards in a latent space. Furthermore, we focus on MPC-based planning in the latent space to achieve robotic tasks learned online with a real robot, while their analysis focuses on the machine learning behind this family of methods and uses simulation experiments. 2.3 Technical Approach This work synthesizes two recent methods in deep RL–pre-training in simulation and learning composable motion policies–to make deep reinforcement learning more practical for real robots. Our strategy is to split the learning process into a two-level hierarchy (Fig. 2.2), with low-level skill policies and their latent space parameterization learned in simulation, and high-level task policies learned or planned either in simulation or on the real robot, using the imperfectly-transferred low-level skills. 2.3.1 Preliminaries Similarly to single-task reinforcement learning, our objective is to learn a policy ; , a parameteric func- tion approximator parameterized by and, which maximizes the expected total -discounted reward in a Markov Decision Process (MDP) with state spaces2S and action spacea2A. 
In our multi-task RL setting, we characterize a set of low-level tasks (primitive skills) with task IDs $\mathcal{T} = \{1, \ldots, N\}$ and accompanying per-task reward functions $r_t(s, a)$. We then define our overall policy $\pi_{\theta,\phi}(s, t) = p_{\theta,\phi}(a \mid s, t)$ as conditioned on this task context variable $t \in \mathcal{T}$ in addition to the state $s \in \mathcal{S}$. Since this is a multi-task reinforcement learning setting, the objective changes to maximize the expected total discounted reward averaged over all the tasks $\mathcal{T}$.

Figure 2.1: Block diagram of proposed architecture for joint learning of a latent space of robotic skills and policies for those skills.

2.3.2 Overview

To introduce our skill embedding method, we further decompose the overall policy into two components: a skill embedding distribution $p_\phi(t) = p_\phi(z \mid t)$, parameterized by $\phi$, mapping task $t$ to a distribution of latent vectors $z$, and a latent-conditioned policy $\pi_\theta(s, z) = p_\theta(a \mid s, z)$. Note that the true skill identity $t$ is hidden from the policy behind the embedding distribution $p_\phi$. This forces the policy to learn high-reward primitives that are capable of interpreting many different $z$'s, as opposed to a single task $t \in \mathcal{T}$. In particular, we optimize the embedding distribution $p_\phi$ to be of high entropy given the task, resulting in a one-to-many mapping between the task $t$ and its latent representation $z$.

To aid in learning the embedding distribution, we introduce an inference function $q_\psi(\tau) = p(z \mid \tau)$ which, given a state-action trajectory window $\tau = (s_i, a_i)^H$ of length $H$, predicts the latent vector $z$ which produced the trajectory. We sample $z$ from $p_\phi(z \mid t)$ once at the beginning of the rollout and then feed it to the low-level skill policy $\pi_\theta(s, z)$, ensuring that all time steps in $\tau$ are correlated with the same $z$. The goal of the additional inference network is to encourage diversity of solutions for a specific task. In particular, we use it to formulate an RL objective that encourages the policy to produce a different trajectory for different latents $z$. We achieve this goal by augmenting the reward function as discussed in the next section.

This parameterization allows us to separate the representation of skills $p_\phi$ from their controllers $\pi_\theta(s, z)$. This separation enables our algorithm to explore and manipulate the skill space as a set of abstract vectors representing skill variations. In Sec. 2.5.1.2, we show how we can use simulation to predict the performance of those skill variations on new tasks. Once equipped with these two tools – a representation of learned robot skills, and a model for predicting those skills' outcomes given a world state – we can apply many proven and efficient robotics algorithms to the high-level representation while still using learning-based methods for low-level control.

2.3.3 Skill Embedding Learning Algorithm

The derivation of our method is based on entropy-regularized RL, where there is an additional policy entropy reward term that encourages diverse solutions to a given task (Equation 2.1),

$$\max_{\pi} \; \mathbb{E}_{\pi, p_0, t \in \mathcal{T}} \Big[ \sum_{i=0}^{\infty} \gamma^i \big( r_t(s_i, a_i) + \alpha H[\pi(a_i \mid s_i, t)] \big) \;\Big|\; a_i \sim \pi(\cdot \mid s_i, t), \; s_{i+1} \sim p(s_{i+1} \mid a_i, s_i) \Big], \tag{2.1}$$

where $p_0(s_0)$ is the initial state distribution and $\alpha$ is a reward weighting term. As shown by Hausman et al. [95], the policy entropy part of the objective can be lower-bounded using Jensen's inequality for the latent policy class as follows (Equation 2.2).
$$H[\pi(a_i \mid s_i, t)] = \mathbb{E}_{\pi}\big[ -\log \pi(a_i \mid s_i, t) \big] \geq \mathbb{E}_{\pi(a, z \mid s, t)}\Big[ \log q_\psi\big( z \mid (a_i, s_i)^H \big) \Big] + H\big[ p_\phi(z \mid t) \big] + \mathbb{E}_{p_\phi(z \mid t)}\Big[ H[\pi_\theta(a_i \mid s_i, z)] \Big]. \tag{2.2}$$

This objective (Equations 2.3 and 2.4) results in an augmented reward function $\hat{r}(\cdot)$ that can be optimized using any parametric reinforcement learning architecture,

$$\mathcal{L} = \max_{\theta, \phi} \; \mathbb{E}_{\pi(a, z \mid s, t), \, t \in \mathcal{T}} \Bigg[ \sum_{i=0}^{\infty} \gamma^i \, \hat{r}(s_i, a_i, z, t) \Bigg], \tag{2.3}$$

where

$$\hat{r}(s_i, a_i, z, t) = r_t(s_i, a_i) + \alpha_1 \, \mathbb{E}_{t \in \mathcal{T}}\big[ H\big( p_\phi(z \mid t) \big) \big] + \alpha_2 \, \log q_\psi\big( z \mid (s_i, a_i)^H \big) + \alpha_3 \, H\big( \pi_\theta(a_i \mid s_i, z) \big). \tag{2.4}$$

We describe the resulting reward augmentations that allow the RL algorithm to train our skill embedding framework below:

1. Embedding entropy – $\mathbb{E}_{t \in \mathcal{T}}\big[ H\big( p_\phi(z \mid t) \big) \big]$. We reward the algorithm for choosing an embedding function that produces a wide distribution of latent vectors $z$ when fed a single task one-hot vector $t$. This allows the latent space to represent many variations on a skill, rather than assigning each skill to just a single latent.

2. Inference cross-entropy – $\log q_\psi\big( z \mid \tau = (s_i, a_i)^H \big)$. We reward the agent for following trajectories which achieve low cross-entropy with the inference distribution. This encourages the policy to produce distinct trajectories for different latent vectors (identifiability). We discuss this component in more detail below.

3. Policy entropy – $H\big( \pi_\theta(a_i \mid s_i, z) \big)$. This term encourages the policy to take many different actions in response to the same state $s$ and latent vector $z$, so that the policy does not collapse to a single solution for each task. This ensures that the policy can represent many different ways of achieving each task.

During training, rather than revealing the task $t$ to the policy, once per rollout we feed the task ID $t$, encoded as a one-hot vector, through the stochastic embedding function $p_\phi$ to produce a latent vector $z$. The same value of $z$ is fed to the policy for the entire rollout, so that all steps in a trajectory are correlated with the same value of $z$. During pre-training, we ensure that each pre-training task $t$ is represented equally in each training batch by sampling skills in a round-robin fashion. This prevents the algorithm from overfitting to any one particular skill.

We train the policy and embedding networks using Proximal Policy Optimization [233], though our method may be used by any parametric reinforcement learning algorithm. We use the MuJoCo physics engine [264] to implement our Sawyer robot simulation environments. We represent the policy, embedding, and inference functions using multivariate Gaussian distributions whose mean and diagonal covariance are parameterized by the output of a multi-layer perceptron. The policy and embedding distributions are jointly optimized by the reinforcement learning algorithm, while we train the inference distribution using supervised learning and a simple cross-entropy loss.

2.3.3.1 Skill Embedding Criterion

In order for the learned latent space to be useful for completing unseen tasks, we seek to constrain the policy and embedding distributions to satisfy two important properties:

1. High Entropy: Each task ID $t$ should induce a distribution over latent vectors $z$ which is as wide as possible, representing many variations of a single skill.

2. Identifiability: Given a state-action trajectory window $(s_i, a_i)^H$ produced by the policy, the inference network should be able to predict with high confidence the latent vector $z$ which was fed to the policy to produce that trajectory. This means the policy must act differently in response to different latent vectors.
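To make the relationship between these two criteria and the augmented reward of Eq. 2.4 concrete, the following sketch shows how the per-step reward terms could be combined in code. It is illustrative only: the function and argument names are placeholders, and the weight values are assumptions rather than the settings used in our experiments.

def augmented_reward(task_reward, embedding_entropy, inference_logprob,
                     policy_entropy, alpha1=0.01, alpha2=0.01, alpha3=0.01):
    """Combine the terms of Eq. 2.4 into a single scalar reward.

    task_reward:        r_t(s_i, a_i) from the current primitive skill
    embedding_entropy:  estimate of E_t[H(p_phi(z | t))]
    inference_logprob:  log q_psi(z | (s_i, a_i)^H) for the executed window
    policy_entropy:     H(pi_theta(a_i | s_i, z)) at the current state
    alpha1..alpha3:     reward weights (hypothetical values, tuned per task)
    """
    return (task_reward
            + alpha1 * embedding_entropy
            + alpha2 * inference_logprob
            + alpha3 * policy_entropy)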
When applied together, these properties ensure that during training the embedding produces many dif- ferent latent parameterizations of each skill (high-entropy), while simultaneously ensuring that the policy learns distinct high-reward variations of that skill in response to each latent parameterization (identifiabil- ity). This dual constraint, as implemented by the augmented reward function in Eq. 2.3 is the key for using model predictive control or other proven algorithms in the latent space, as discussed in Secs. 2.4 and 2.5. 2.3.4 Simulation-to-Real Training and Transfer Method The full robot training and transfer method consists of three stages, and an optional fourth stage. 2.3.4.1 Stage 1: Pre-Training in Simulation while Learning Skill Embeddings We begin by training in simulation a multi-task policy for all low-level skills, and a composable param- eterization of that libraryp (zjt) (i.e. a skill embedding). This stage may be performed using any deep RL algorithm, along with the modified policy architecture and loss function described above. Our implementa- tion uses Proximal Policy Optimization [233] and the MuJoCo physics engine [264]. 2.3.4.2 Stage 2: Learning Hierarchical Policies In the second stage, we learn a high-level “composer” policy, represented in general by a probability distri- bution (zjs) over the latent vectorz. The composer actuates the low-level policy (ajs;z) by choosing among latent skills z. In effect, each latent skill z represents a unique feedback control policy, and the latent space a continuous and explorable space of these controllers, which is supported by a basis set of the pre-training skillsT . This hierarchical organization admits our novel approach to transfer learning and 16 A g ent Re a l En vironment Em b eddin g In ference Comp oser Simulated En vironment task Figure 2.2: Block diagram of proposed architecture for transfer learning. Shared low-level skill components are shown in green. The high level task-specific component is shown in blue. 17 adaptation: by freezing the low-level skill policy and embedding functions, and exploring only in the pre- learned latent space to acquire new tasks, we can transfer a multitude of high-level task policies derived from the low-level skills. This stage can be performed directly on the the real robot or in simulation. As we show in Secs. 2.4 and 2.5, composer policies may treat the latent space as either a discrete or continuous space, and may be found using learning, search-based planning, or even manual sequencing and interpolation. To succeed, the composer policy must explore the latent space of pre-learned skills, and learn to exploit the behaviors the low-level policy exhibits when stimulated with different latent vectors. We hypothesize that this is possible because of the richness and diversity of low-level skill variations learned in simulation, which the composer policy can exploit by actuating the skill embeddingto choose among these variations. 2.3.4.3 Stage 3: Transfer and Execution Lastly, we transfer the low-level skill policy, embedding and high-level composer policies to a real robot, and execute the entire system to perform high-level tasks. Alternatively, as discussed in Section 2.5, we can transfer only the low-level skill policy and skill embedding to the real robot directly, and use an adaptation algorithm to achieve new tasks on the real robot directly, by exploiting the imperfectly-transferred latent skills. 
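As a concrete illustration of Stages 2 and 3, the sketch below shows one way a high-level composer could actuate the frozen low-level skill policy through its latent input, whether in simulation or on the real robot. The class and method names are placeholders, not part of our released implementation.

def composer_rollout(env, composer, skill_policy, relatent_every=10):
    """Run one episode in which a high-level composer chooses latent skills
    and the frozen low-level policy maps (state, latent) to joint actions.
    `composer`, `skill_policy`, and `relatent_every` are illustrative names
    and choices, not fixed by the method."""
    s = env.reset()
    done, t = False, 0
    while not done:
        if t % relatent_every == 0:
            z = composer.sample_latent(s)          # explore only in latent space
        a = skill_policy.sample_action(s, z)       # low-level weights stay frozen
        s, reward, done, _ = env.step(a)
        composer.observe(s, z, reward)             # only the composer is updated
        t += 1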
2.3.4.4 Post-smoothing Trajectories The trajectories generated by the stochastic control policy exhibit jerky behavior, as can be seen even for the basic reaching tasks in simulation (Fig. 2.5 top) and on the real robot (Fig. 2.5 center). While such noisy motions are crucial during the training of our reinforcement learning framework since it enables the exploration of effective policies – during execution time on the real robot we want to achieve trajectories that require less control effort. To realize this goal, we apply post-smoothing through a finite impulse response 18 (FIR) filter with a 10-step filter window to the action output of the control policy. Each dimension of the action vectora is filtered through a low-pass filter online during the execution on the real robot. 2.4 Characterizing the Behavior of the Latent Skill Space The goal of this chapter is to arrive at a method which will allow us to achieve sample efficiency and adaptation of sim2real policies by learning latent spaces of robot controllers. Prior work (as addressed in Section 2.2) defines an algorithm for learning such spaces and briefly explores ways of manipulating them, but does not address their sample efficiency and, importantly for our goal, does not characterize the behavior of these spaces, i.e. how the latent controllers behave as we visit points in the space which do not coincide with the pre-chosen basis skillsT . As such, our goal in the following experiments is then to characterize the behavior of these RL-learned latent skill spaces outside of the well-known basis skills, to evaluate the method’s suitability for sample- efficient learning and adaptation, and to inform the design of a more general adaptation method, which we describe in Sec. 2.5. 2.4.1 Experiments 2.4.1.1 Point Environment Before experimenting on complex robotics problems, we evaluate our approach in a point mass environment. Its low-dimensional state and action spaces, and high interpretability, make this environment our most basic test bed. We use it for verifying the principles of our method and tuning its hyperparameters before we deploy it to more complex experiments. Portrayed in Fig. 2.3 is a multi-task instantiation of this environment with four goals (skills). 19 −2 −1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 Embedding dimension 0 −2 −1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 Embedding dimension 1 −4 −2 0 2 4 x −4 −2 0 2 4 y Trajectories Figure 2.3: Skill embedding distribution which successfully disentangles the four different skills in the point mass environment using the embedding function. At each time step, the policy receives as state the point’s position and chooses an instantaneous two- dimensional velocity vector as its action, limited to 0.1 unit step . The policy receives a negative reward equal to the distance between the points and the goal positiong i , i.e.r i =jjsg i jj 2 . After 15,000 time steps, the embedding network learns a multimodal embedding distribution to repre- sent the four skills (Fig. 2.3). Introducing entropy regularization [95] to the policy alters the trajectories significantly: instead of steering to the goal position in a straight line, the entropy-regularized policy en- codes a distribution over possible solutions. Each latent vector produces a different solution. This illustrates that our approach is able to learn multiple distinct solutions for the same skill, and that those solutions are addressable using the latent vector input. 
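A minimal gym-style version of this point environment, sufficient for reproducing the experiment above, might look as follows. The goal layout, per-dimension step clipping, and completion threshold are assumptions chosen to match the description, not copied from our implementation.

import numpy as np

class MultiTaskPointEnv:
    """Four-goal point-mass environment (illustrative sketch)."""
    GOALS = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, 0.0], [0.0, -3.0]])  # assumed layout

    def reset(self, task_id):
        self.goal = self.GOALS[task_id]
        self.state = np.zeros(2)
        return self.state

    def step(self, action):
        # Velocity action, clipped to a 0.1-unit step per dimension.
        self.state = self.state + np.clip(action, -0.1, 0.1)
        distance = np.linalg.norm(self.state - self.goal)
        reward = -distance                      # r_i = -||s - g_i||_2
        done = distance < 0.1                   # assumed completion threshold
        return self.state, reward, done, {}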
2.4.1.2 Sawyer Experiment: Reaching Now that we have used the simple point environment to verify the fundamentals of the latent skill learning algorithm, we can begin to characterize the behavior of these latent spaces for real robotics problems. Our 20 Figure 2.4: Multi-task environment in simulation (left) and reality (right) for the reaching experiment. In pre-training, the robot learns a low-level skill for servoing its gripper to each of eight goal positions. goal in these experiments is to assess whether the method’s behavior in simple environments such as the point experiment transitions to the higher-dimensional action and observation spaces of real robots. We start by transitioning to pre-training using a 6-DOF reaching experiment, which is analogous to the point experiment in the previous section. In the following experiments, we then characterize the behavior of this pre-trained latent skill policy using several composition and interpolation techniques. We ask the Sawyer robot to move its gripper to within 5 cm of a goal point in 3D space. Like the point experiment, in these experiments the policy is pre-trained with a simple goal-reaching reward function for each basis skill, i.e. r i = jjeg i jj 2 2 , where e is the end-point position of the gripper and g i is a 3- dimensional Cartesian goal position. Unlike the point environment, the agent’s action space is incremental joint position movements of up to0.04 rad for each joint. Joint position control for this action space is implemented by the robot’s internal joint position controllers, and these position controllers are allowed to settle before the agent may take another action. The policy receives a state observation with the robot’s seven joint angles, plus the Cartesian position of the robot’s gripper. Importantly, this agent must control the robot in joint space, not Cartesian space, while avoiding collisions with the environment and itself. We trained the low-level policy on eight goal positions in simulation, forming a 3D cuboid enclosing a volume in front of the robot (Fig. 2.4). As shown in Fig. 2.5, the embedding policy successfully attains 21 x 0.45 0.50 0.55 0.60 0.65 0.70 y −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 z 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Task 0 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 x 0.45 0.50 0.55 0.60 0.65 0.70 y −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 z 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Task 0 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Figure 2.5: Gripper position trajectories for each of the eight skills from the low-level pre-trained reaching policy, after 100 training iterations. Top-left: Simulation. Top-right: Real Robot. Bottom: Smoothed trajectories on the real robot. Shaded areas indicate 5 cm goal regions. accurate goal reaching performance in simulation (left), and without any fine-tuning generates trajectories that reach within the 5 cm goal radius for most skills on the real robot (center). While the trajectories appear jerky, through online filtering (described in Sec. 2.3.4.4), the trajectories on the real robot (Fig. 2.5, right) become significantly smoother. Although the FIR filter causes a delay in the control response, we did not observe a reduction in accuracy for the skills we investigated throughout our experiments. The composer policies feed latent vectors to the pre-trained low-level skill policy to achieve high-level tasks such as reaching previously-unvisited points (Fig. 2.6). 
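The online filtering referenced above (Sec. 2.3.4.4) is a short moving-average FIR filter applied per action dimension before each command is sent to the robot. A sketch is below; the uniform 10-tap window is an assumption, and any other low-pass FIR coefficients could be substituted.

from collections import deque
import numpy as np

class ActionSmoother:
    """10-step FIR (moving-average) filter applied online to the policy's
    action output. During the first few steps the average is taken over the
    commands seen so far."""
    def __init__(self, window=10):
        self.buffer = deque(maxlen=window)

    def __call__(self, action):
        self.buffer.append(np.asarray(action, dtype=np.float64))
        return np.mean(self.buffer, axis=0)     # smoothed command for this step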
22 Composition Experiments Having verified that the latent skill learning method transitions to controlling a real robot using a sim2real policy, we experiment with several techniques for manipulating the robot in latent skill space. Recall from Sec. 2.3.3 that each vector in the latent skill space represents a unique feedback controller for the robot, so these experiments are exploring a space of controllers. As these experiments are descriptive rather than prescriptive, there is no reward function per se for each of these experiments, except where otherwise noted. All Sawyer composition experiments use the same low-level skill policy, pre-trained in simulation. We experimented both with composition methods which directly train new high-level skills on the robot (direct), and with methods which train new high-level skills using a second stage of pre-training in simulation before transfer (sim2real). Skill interpolation in the latent space (direct) The goal of this experiment is to characterize the “shape” or algebraic behavior of the skill latent space, e.g. “Is combining two or more skills to form a third skill as simple as taking their linear combination, or does the latent space have a non-linear algebra?” We evaluate the embedding function to obtain the mean latent vector for each of the 8 pre-training skills, then feed linear interpolations of these means to the latent input of the low-level skill policy, transferred directly to the real robot. For a latent pair (z a ;z b ), our experiment feedsz a for 1s, thenz i = i z a + (1 i )z b for i 2 [0; 1] for 1s, and finallyz b for 1s. We observe that the linear interpolation in latent space induces an implicit collision-free motion plan between the two points, despite the fact that pre-training never experienced this state trajectory. In one experiment, we used this method iteratively to control the Sawyer robot to draw a U-shaped path around the workspace (Fig. 2.6). This experiment demonstrates that, while the algorithm learns a latent space which correctly orients the the relative locations of pre-training skills in latent space (Fig. 2.5), the behavior of the controllers lying in-between these well-known skills does not satisfy a simple linear algebra. 23 −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 x 0.50 0.55 0.60 0.65 0.70 y Start Goal 3 Goal 4 Goal 8 Goal 7 Figure 2.6: Gripper position trajectory while interpolating latents between skills 3, 4, 8 and 7 for the em- bedded reaching skill policy on the real robot. End-to-end learning in the latent space (sim2real) This experiment seeks to assess whether we may use these skill latent spaces to learn quickly using model-free reinforcement learning algorithms like the agent was pre-trained with, e.g. “Can the latent skillz be used as the action space for another model-free reinforcement learning algorithm, allowing us to deploy this method recursively?.” Note that this experiment uses a different algorithm (DDPG) from a different family of algorithms (off-policy) from the pre-training step (PPO and on-policy policy gradients, respectively). Using Deep Deterministic Policy Gradients (DDPG) [151], an off-policy reinforcement learning algo- rithm, we trained a composer policy to modulate the latent vector to reach a previously-unseen point. The reward function and state spaces for this experiment are identical to the pre-training phase but using an unseen goal, and the action space is simply the latent skill input vector z. We then transferred the com- poser and low-level policies to the real robot. 
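Concretely, this experiment amounts to wrapping the frozen skill policy in an environment whose action space is the latent vector z, so that an off-policy learner such as DDPG can be applied unchanged. The sketch below is a hedged illustration; the interface names and the 10-step sub-horizon are assumptions.

class LatentActionEnv:
    """Exposes the latent skill vector z as the action space, so a composer
    can be trained with an ordinary model-free RL algorithm."""
    def __init__(self, env, skill_policy):
        self.env, self.skill_policy = env, skill_policy

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, z):
        total_reward, done, info = 0.0, False, {}
        for _ in range(10):                                   # low-level steps per latent
            a = self.skill_policy.sample_action(self.obs, z)  # frozen skill policy
            self.obs, r, done, info = self.env.step(a)
            total_reward += r
            if done:
                break
        return self.obs, total_reward, done, info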
The policy achieved an end-effector distance error of 5 cm, the threshold for skill completion as defined by our reward function (Fig. 2.7). As off-policy reinforcement 24 x 0.45 0.50 0.55 0.60 y 0.00 0.05 0.10 0.15 0.20 z 0.15 0.20 0.25 Start position Desired position End position Figure 2.7: End-effector position trajectory for composed reaching policy, trained with DDPG to reach an unseen point (green). The green ball around the desired position visualizes the goal radius of 5 cm. learning algorithms consider a task complete as soon as this completion threshold is achieved, DDPG does not learn to get any closer to the goal. This experiment demonstrates that the latent skill space can still be used with model-free reinforce- ment learning algorithms to quickly learn new policies and then transfer them to real robots. In the next experiment, we assess whether we can use even-simpler algorithms to achieve the same effect. Search-based planning in the latent space (sim2real and direct) Having shown we can efficiently ex- plore the latent skill skill spaces using a fairly complex algorithm (DDPG), our goal in this experiment is to assess whether that complexity is warranted, e.g. “Given a latent skill space, can we use a very simple algorithm to achieve sample-efficient adaptation?. We used Uniform Cost Search [219, Chapter II] (UCS) in the latent space to find a motion plan (i.e. sequence of latent vectors) for moving the robot’s gripper along a triangular trajectory. UCS is a specialized form of Dijkstra’s algorithm that finds the shortest path between a single start and goal node in a graph, and is perhaps the simplest general graph-based planning algorithm imaginable. Our search space treats the 25 −0.4 −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 0.4 y 0.10 0.15 0.20 0.25 0.30 0.35 Gripper position Latent 0 Latent 1 Latent 2 Latent 3 Latent 4 Latent 6 z Figure 2.8: Gripper position trajectory reaching points (black) in a triangle within a radius of 5 cm (grey) composed by search-based planning in the latent space. latent vector corresponding to each skill ID as a discrete option. We execute a plan by feeding each latent in in the sequence to the low-level policy for1s, during which the low-level policy executes in closed-loop. All joint position control commands are processed by the robot’s on-board controllers as described before, but we do not necessarily allow each command to settle before issuing a new one. The cost (reward) function used for this task is identical to the one from pre-training, but using an unseen goal location. In simulation, this strategy found a plan for tracing a triangle in less than 1min, and that plan successfully transferred to the real robot (Fig. 2.8). We replicated this experiment directly on the real robot, with no intermediate simulation stage. It took 24 min of real robot execution time to find a motion plan for the triangle tracing task. This experiment demonstrates that learning and then exploring in the latent skill space dramatically increases sample efficiency of acquiring a new skill, and that we can achieve this sample efficiency using a very simple algorithm. Training this policy without the latent space on a real robot would require an intractable number of trials. 26 Figure 2.9: Multi-task environment in simulation (left) and reality (right) for the box pushing experiment. Four goals are located on the table at a distance of 15 cm from the block’s initial position. 
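Returning to the search-based planner described above, the discrete search over skill latents can be written as a few lines of uniform-cost search; the sketch below assumes a scoring function that rolls out a candidate latent sequence in simulation (or on the robot) and reports its cost.

import heapq

def ucs_latent_plan(num_skills, rollout_score, max_depth=6):
    """Uniform Cost Search over sequences of pre-trained skill latents.
    `rollout_score(seq)` should reset the simulator (or robot), execute the
    skill indices in `seq`, and return (cost, reached_goal). The function
    name and the depth limit are illustrative assumptions, not our API."""
    frontier = [(0.0, False, [])]                 # (cost, reached goal?, plan so far)
    while frontier:
        cost, reached_goal, seq = heapq.heappop(frontier)   # cheapest partial plan
        if reached_goal:
            return seq                            # lowest-cost plan that reaches the goal
        if len(seq) < max_depth:
            for i in range(num_skills):
                child = seq + [i]
                heapq.heappush(frontier, (*rollout_score(child), child))
    return None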
2.4.1.3 Sawyer Experiment: Box Pushing Our Sawyer Reacher experiments demonstrated that the fundamentals of the skill embedding learning and transfer method can be applied to real robots and algorithms for sample-efficient learning and adaptation, but they only assessed the method’s performance using a free-space motion task, which is comparatively easy to model in simulation. In these experiments, we seek to characterize the behavior of the method on a skill which is intractable to perfectly model in simulation, because it involves robot-object contact forces and object-object friction. The goal of the following experiments is to characterize the ability of simple composition strategies to adapt in a sim2real setting, despite the fact that we know that the pre-training policy cannot transfer perfectly from simulation to real training. “Can the skill embedding learning method be used to achieve useful tasks using a real robot, even if the pre-trained skills from simulation do not transfer perfectly?” Carefully note that the skill embedding learning algorithm can be applied in tandem with any number of other sim2real transfer methods (Sec. 2.2), but we intentionally omit these methods as they would confound the results of these experiments. We ask the Sawyer robot to push a box to a goal location relative to its starting position, as defined by a 2D displacement vector in the table plane. As in the reaching experiments from Sec. 2.4.1.2, the policy 27 receives a state observation with the robot’s seven joint angles, but this time the robot’s gripper position is replaced with a relative Cartesian position vector between the robot’s gripper and the box’s centroid. As before, the policy chooses incremental joint movements (up to0.04 rad) as actions, and these actions are executed by the robot’s on-board controllers are joint position commands which are allowed to settle between each action. As in the reaching experiment, we use for the basis skills a simple distance reward function defined on the x-y position of the box’s centroid, i.e. r i =jjbg i jj 2 2 , whereb is the x-y position of the box centroid andg i is a 2-dimensional Cartesian goal position for the box. In the real experimental environment, we track the position of the box using motion capture and merge this observation with proprioceptive joint angles from the robot. In this chapter we do not yet assess the method for end-to-end learning, e.g. from image pixels to joint commands, so the use of motion capture is representative of many other perception methods used by robots in the field, e.g. reconstruction from 3D point clouds or stereoscopic vision. 0.15 0.20 0.25 0.30 0.600 0.625 0.650 0.675 0.700 0.725 0.750 0.775 0.800 0.15 0.20 0.25 0.30 0.35 0.50 0.55 0.60 0.65 0.70 0.75 target block gripper block start gripper start 0.20 0.25 0.30 0.35 0.50 0.55 0.60 0.65 0.70 0.10 0.15 0.20 0.25 0.30 0.55 0.60 0.65 0.70 0.75 0.80 Figure 2.10: Trajectories of the block (blue) and gripper position (orange) created by execution of the embedded skill policy for each of the four pre-trained skills (red dots). We trained the low-level pushing policy on four goal positions in simulation: up, down, left, and right of the box’s starting point (Fig. 2.10). Clearly, this skill does not transfer perfectly from simulation to the real robot. 
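For reference, the observation used in these pushing experiments can be assembled as below, with the box pose supplied by motion capture; the argument names and the direction of the relative vector are assumptions.

import numpy as np

def pushing_observation(joint_angles, gripper_xyz, box_xyz):
    """Seven joint angles plus the relative gripper-to-box vector."""
    relative = np.asarray(box_xyz) - np.asarray(gripper_xyz)
    return np.concatenate([np.asarray(joint_angles), relative])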
28 Composition Experiments Now that we have established that the simulation policy transfer imperfectly to the real robot as expected, we characterize the behavior of the same composition methods from Sec- tion 2.4, but now in an environment where the policy does not transfer perfectly from simulation to real. We omit many details of the composition methods here for brevity. For detailed explanation and motivation for the composition method experiments explored here, please refer to Section 2.4.1.2. 0.20 0.25 0.30 0.35 0.625 0.650 0.675 0.700 0.725 0.750 0.775 task a task b block gripper block start gripper start 0.20 0.25 0.30 0.35 0.50 0.55 0.60 0.65 0.70 0.10 0.15 0.20 0.25 0.30 0.50 0.55 0.60 0.65 0.70 0.05 0.10 0.15 0.20 0.25 0.30 0.60 0.65 0.70 0.75 0.80 Figure 2.11: Trajectories of the block (blue) and gripper position (orange) created by executing the pre- trained embedded skill policy while feeding the mean latent vector of two neighboring skills. The robot pushes the block to positions between the goal locations of the two pre-trained skills. Skill interpolation in the latent space (direct) We evaluated the embedding function to obtain the mean latent vector for each of the four pre-trained pushing skills (i.e. up, down, left, and right of start position). We then fed the mean latent of adjacent skills (e.g.z up-left = 1 =2(z up +z left )) while executing the pre-trained policy directly on the robot (Fig. 2.11). The composer policies feed latent vectors to the pre-trained skill policy to push the box to positions which were never seen during training (Figs. 2.11 and 2.12). We find that in general this strategy induces the policy to move the block to a position between the two pre-trained skill goals. However, as in the interpolative reaching experiment (Sec. 2.4.1.2, Fig. 2.6), the magnitude and direction of block movement was not easily predictable from the pre-trained goal locations. This behavior was not reliable for half of the synthetic goals we tried, indicating that simple interpolation is not a robust adaptation strategy for these tasks. 29 0.4 0.5 0.6 0.7 x 0.05 0.10 0.15 0.20 0.25 0.30 0.35 y Composing task 1 Latent 0 Latent 1 Latent 2 Latent 3 0.4 0.5 0.6 0.7 x 0.00 0.05 0.10 0.15 0.20 0.25 0.30 y Composing task 2 0.6 0.7 0.8 x 0.05 0.10 0.15 0.20 0.25 0.30 0.35 y Composing task 3 0.6 0.7 0.8 x 0.00 0.05 0.10 0.15 0.20 0.25 0.30 y Composing task 4 Figure 2.12: Search-based planning in the latent space achieves plans for pushing tasks where the goal position (orange dot) for the block is outside the region of tasks (red dots) on which the low-level skill was trained. Colored line segments trace the path of the robot gripper, and indicate which skill latent was used for that trajectory segment. The block’s position is traced by the gray lines. Search-based planning in the latent space (sim2real) Similar to the search-based composer on the reaching experiment, we used Uniform Cost Search in the latent space to find a motion plan (sequence of latent vectors) for pushing the block to unseen goal locations (Fig. 2.12). The cost (reward) function for this task is identical to that from pre-training, but using an unseen box goal location. We found that search- based planning was able to find a latent-space plan to push the block to any location within the convex hull formed by the four pre-trained goal locations. Additionally, our planner was able to push blocks to some targets significantly outside this area (up to 20 cm). 
Unfortunately, we were not able to reliably transfer these composed policies to the robot. We attribute these failures to transfer partially to the non-uniform geometry of the embedding space, and partially to the difficulty of transferring contact-based motion policies learned in simulation, and discuss these results further in Sec. 2.4.2. In Sec. 2.5, we develop a method which can transfer these sim2real composition policies, despite imperfect transfer of skill dynamics from simulation to real. 2.4.2 Analysis The point environment experiments verify the principles of our method, and the single-skill Sawyer experi- ments demonstrate its applicability to real robotics tasks. Recall that all Sawyer skill policies used only joint 30 space control to actuate the robot, meaning that the skill policies and composer needed to learn how using the robot to achieve task-space goals without colliding the robot with the world or itself. The Sawyer composition experiments provide the most insight into the potential of latent skill decompo- sition methods for scaling simulation-to-real transfer in robotics. The method allows us to reduce a complex control problem–joint-space control to achieve task-space objectives–into a simpler one: control in latent skill-space to achieve task-space objectives. We found that the method performs best on new skills which are interpolations of existing skills. We pre-trained on just eight reaching skills with full end-to-end learning in simulation, and all skills were always trained starting from the same initial position. Despite this narrow initialization, our method learned a latent representation which allowed later algorithms to quickly find policies which reach to virtually any goal inside the manifold of the pre-training goals. Composed policies were also able to induce non-colliding joint-space motion plans between pre-trained goals (Fig. 2.8). Secondly, a major strength of the method is its ability to combine with a variety of existing, well- characterized algorithms for robotic autonomy. In addition to model-free reinforcement learning, we suc- cessfully used manual programming (interpolation) and search-based planning on the latent space to quickly reach both goals and sequences of goals that were unseen during pre-training (Figs. 2.8, 2.11, and 2.12). In- terestingly, we found that the latent space is useful for control not only in its continuous form, but also via a discrete approximation formed by the mean latent vectors of the pre-training skills. This enables our method to leverage a large array of efficient discrete planning and optimization algorithms, for sequencing low-level skills to achieve long-horizon, high-level goals. Conversely, algorithms which operate on full continuous spaces can exploit the continuous latent space. We find that a DDPG-based composer with access only to a discrete latent space (formed from the latent means of eight pre-trained reaching skills and interpolations of those skills) is significantly outperformed by a DDPG composer that leverages the entire embedding space as its action space (Fig. 2.13). This implies 31 0 100 200 300 400 Steps −40 −30 −20 −10 0 10 Average Return Training Progress Full embedding Interpolated latents Figure 2.13: Training returns for a composing policy using an embedding trained on eight-goal reacher to reach a new point between the trained goals. that the embedding function contains information on how to achieve skills beyond the instantiations the skill policy was pre-trained on. 
The method in its current form has two major challenges. First is the difficulty of the simulation-to-real transfer problem even in the single-skill domain. We found in the Sawyer box-pushing experiment (Fig. 2.10) that our ability to train transferable policies was limited by our simulation environment’s ability to accurately model friction. This is a well-known weakness in physics simulators for robotics. A more subtle challenge is evident in Fig. 2.5, which shows that our reaching policy did not transfer with complete accuracy to the real robot despite it being free-space motion skill. We speculate that this is a consequence of the policy overfitting to the latent input during pre-training in simulation. If the skill latent vector provides all the information the policy needs to execute an open-loop trajectory to reach the goal, it is unlikely to learn closed-loop behavior. The second major challenge is constraining the properties of the latent space, and reliably training good embedding functions, which we found somewhat unstable and hard to tune. The present algorithm formula- tion places few constraints on the algebraic and geometric relationships between different skill embeddings. 32 This leads to counterintuitive results, such as the mean of two pushing policies pushing in the expected direction but with unpredictable magnitude (Fig. 2.11), or the latent vector which induces a reach directly between two goals (e.g A and B) actually residing much closer to the latent vector for goal A than for goal B (Fig. 2.6). This lack of constraints also makes it harder for composing algorithms to plan or learn in the latent space. 2.5 Using Model Predictive Control for Zero-Shot Sequencing in Skill Latent Space Now that we have characterized the behavior of the learned skill latent space, we can develop a fully- specified algorithm for exploiting that space to achieve unseen tasks. Our algorithm allows policies learned under our method to adapt to unseen, long-horizon robotics tasks with no additional real-world samples of those tasks, which is why we refer to it as a “zero-shot” learning method. Before we discuss our proposed method, we recall some insights from our experiments in Sec. 2.4 to motivate its design. These experiments showed that skill embedding learning methods can be used for sim2real robot learning, and be used to quickly acquire new skills which are higher-order than their pre- training skills using a variety of methods, from the complex (DDPG) to the very simple (UCS). However, these experiments also showed that the latent skill learning algorithm produces latent spaces which do not necessarily conform to notions of linearity or smoothness (e.g. Fig. 2.6). This makes defining a general procedure for adapting in latent space challenging. The characterization experiments also showed that the latent skill learning algorithm is still susceptible to failure to transfer across the sim2real dynamics gap (e.g. Sec. 2.4.1.3) which affects other RL algorithms. As we discussed in Section 2.2, there are many approaches in the literature for addressing both of these problems individually. For instance, one could use e.g. dynamics randomization for the reality gap 33 problem [195], and a meta-reinforcement learning method to enable fast adaptation [66]. Applying one of these augmentations to reinforcement learning adds an additional level of considerable design and sample complexity, and applying more than one at a time produces a proposed algorithm which is very complex. 
With an eye towards this complexity conundrum, we seek a third, simpler path which recasts the transfer and adaptation problem as one of control rather than learning. We consider these learned latent skills to be partial descriptions of controllers we need to achieve high-level tasks, such as drawing in free space. They are partial because they only define controllers which can achieve part of a task (e.g. transitioning between waypoints) and their implicit internal dynamics models do not represent the real world. “Given partial low-level skill controllers, can we still achieve useful high-level tasks?” We propose SPC as a method for achieving success in the face of partially-useful controllers. SPC com- bines a proven robotics algorithm—model-predictive control (MPC)—with pre-trained latent skill policies and online simulation, to predict the behavior of simulation-trained controllers in the real world, and to plan around the irregular shape of the latent skill space. We initialize our online simulations with just a single state observation from the real world at each planning step, to prevent the planning process from drifting too far from the real robot state. This allows us to utilize these locally-useful pre-trained embedded skill policies to achieve longer-horizon tasks using simple algorithmic components rather than more complex ones. 2.5.1 Method 2.5.1.1 Reusing the Simulation for Foresight In order to evaluate the fitness of pre-trained latent skills for new tasks, we take advantage of the simulation that we used to train the primitive skills. For each adaptation step, we set the state of the simulation envi- ronment to the observed state of the real environment. This equips our robot with the ability to predict the behavior of pre-learned latent skills in response to new situations. Since the policy is trained in simulation, we can reuse the simulation from the pre-training step as a tool for approximate foresight when adapting to 34 Agent Simulated Environment Real Environment MPC ... Agent Figure 2.14: Architecture of our simulator-predictive control (SPC) framework where the optimal embed- ding is found in simulation to be executed in the real environment. unseen tasks. This allows us to select a latent skill that is locally-optimal for a task at the current timestep, even if that task was seen not during training. We show that this scheme allows us to perform zero-shot task execution and composition for families of related tasks. This is in contrast to existing methods, which have mostly focused on direct alignment between simulation and real [271], or data augmentation to generalize the policy using brute force [196, 223]. Despite much work on sim2real methods, neither of these approaches has demonstrated the ability to provide the adaptation capability needed for general-purpose robots in the real world. Our method enables a third path towards adaptation which warrants exploration, as a higher-level complement to these effective low-level approaches. 2.5.1.2 Adaptation Algorithm 35 Algorithm 1 Simulator-Predictive Control (SPC) Require: Latent-conditioned policy (ajs;z), Skill embedding distributionp (zjt), Skill distribution priorp(t), Simulation environmentS(s 0 js;a), Real environmentR(s 0 js;a), New taskt new with associated reward functionr new (s;a), RL discount factor , MPC horizonT , Real environment horizonN. 
whilet new is not complete do SampleZ =fz 1 ;:::;z k gp(z) =E tp(t) p (zjt) Observes real fromR forz i 2Z do Set inital state ofS tos real forj2f1;:::;Tg do Samplea j (a j js j ;z i ) Execute simulations j+1 S(s j ;a j ) end for CalculateR new i = P T j=0 j r new (s j ;a j ) end for Choosez = arg max z i R new i forl2f1;:::;Ng do Samplea l (a l js l ;z ) Execute real environments l+1 R(s l ;a l ) end for end while 36 Formally, we denote the new task t new corresponding to reward function r new , the real environment in which we attempt this taskR(s 0 js;a), and the RL discount factor . We use the simulation environ- mentS(s 0 js;a), frozen skill embeddingp (zjt), and latent-conditioned primitive skill policy (ajs;z), all trained as described in Sec. 2.3.3, to apply model-predictive control (MPC) in the latent space as described in Algorithm 1. We first samplek candidate latentsZ =fz 1 ;:::;z k g according top(z) =E tp(t) p (zjt). We observe the states real in the real environmentR. For each candidate latent z i , we set the initial state of the simulationS to s real . For a horizon of T time steps, we sample the frozen policy , conditioned on the candidate latent a j (a j js j ;z i ), and execute the actionsa j in the simulation environmentS, yielding the total discounted rewardR new i = P T j=0 j r new (s j ;a j ) for each candidate latent. We then choose the candidate latent acquiring the highest reward z = arg max i R new i , and use it to condition and sample the frozen policy a l (a j js j ;z ) to control the real environmentR for a horizon ofN <T timesteps. We repeat this MPC process to choose and execute new latent skills in sequence, until the new task has been achieved. Owing its novel use of online simulation, we call our algorithm Simulator-Predictive Control (SPC). The choice of MPC horizonT has a significant effect on the performance of our approach. Since our latent variable encodes a skill which only partially completes the task, executing a single skill for too long unnecessarily penalizes a locally-useful skill for not being globally optimal. Hence, we set the MPC horizon T to not more than twice the number of stepsN that a latent is actuated in the real environment. 2.5.2 Experiments Now that we have defined our new SPC method, we seek to assess its ability to simultaneously enable zero-shot sim2real transfer and adaptation. To do this, we use more-elaborate versions of the composition 37 experiments from Sec. 2.4. Unlike the experiments from the previous section, the goals of these experiments is prescriptive, i.e. we have a particular behavior in mind and seek to assess whether our new SPC method can be used to achieve this behavior. Recall once more that the pre-trained policies are never trained on the tasks in these experiments, and that all of these experiments are in a zero-shot setting with respect to the real robot, i.e. the new task is to be performed by the method with no on-robot training time. We evaluate our SPC approach by completing two sequencing tasks on a Sawyer robot: drawing a sequence of points and pushing a box along a sequential path, which are analogous to their matching exper- iments in Sec. 2.4. For each of the experiments, the robot must complete an overall task by sequencing skills learned during the embedding learning process. Sequencing skills poses a challenge to conventional RL algorithms due to the sparsity of rewards in sequencing tasks [7]. 
Because the agent only receives a reward for completing several correct complex actions in a row, exploration under these sequencing tasks is very difficult for conventional RL algorithms. By reusing the skills we have consolidated in the embedding space, we show a high-level controller can effectively compose these skills to achieve long-horizon tasks. In these experiments, the pre-training reward functions are identical to those from Sec. 2.4, and the adaptation-time (i.e. sequencing) reward function uses a simple switching scheme: for each waypoint, the reward function is the simple goal-distance reward function from pre-trainng. Once the robot achieves a waypoint, the reward function switches to the goal-distance reward for the next waypoint. For instance, for the drawing experniments below, the reward function while the robot moves from the origin to the first waypointg 1 isr =jjeg 1 jj 2 2 wheree is the gripper end-point position andg 1 is the 3-dimensional position of the first waypoint, once it reachesg 1 the reward function isr =jjeg 2 jj 2 2 whereg 2 is the 3-dimensional position of the second waypoint, etc. Note that in the results below, the high-entropy (i.e. semi-random) skill policy forces the gripper to move along oscillating paths between points, rather than straight lines. While the sampled actions from the 38 stochastic control policy result in trajectories that exhibit significant jerkiness, to smooth the motions, the same filtering technique we show in Section 2.3.4.4 can be applied to the output obtained from SPC. 2.5.2.1 Sawyer: Drawing a Sequence of Points The goal of this experiment is to assess whether we can use SPC to achieve a new higher-order task with zero-shot adaptation, by using SPC with our skill embedding learning algorithm. This tests one hypothesis from Sec. 2.1, which is that we can use skill embedding learning algorithms to quickly achieve more complex tasks than those we pre-trained on. We ask the Sawyer robot to move its end-effector to a sequence of points in 3D space using joint-space control to draw a shape. The policy receives as observation the robot’s seven joint angles as well as the Cartesian position of the robot’s gripper, and outputs incremental joint positions (up to 0.04 rads) as actions. We use the Euclidean distance between the gripper position and the current target as the cost function for the pre-trained skills. We pre-trained the policy and the embedding networks on eight goal positions in simulation, forming a 3D rectangular prism enclosing the workspace. This induces a reaching skill set in latent space. We then use our method to choose a sequence of latent vectors which allow the robot to draw an unseen shape. For both simulation and real robot experiments, we attempted two unseen tasks: drawing a rectangle in 3D space (Fig. 2.15 and 2.16) and drawing a triangle in 3D space (Fig. 2.17 and 2.18). Using SPC and despite imperfect skill policy transfer, the robot successfully draws approximate rect- angle and triangle shapes in 3D space. It does this in a single trial with no new on-robot learning time (zero-shot), using latent skill policies which were trained only in simulation on 8 pre-training skills which enclose the workspace, visiting many goals it was never trained on during pre-training. This experiment confirms our hypothesis in Sec. 
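The SPC loop of Algorithm 1, as used in the experiments below, can be paraphrased in a few lines of Python. The environment and policy interfaces here are illustrative stand-ins for our actual implementation, and the default values of k, T, and N are taken from the drawing experiments reported in Sec. 2.5.3.

import numpy as np

def spc_step(sim_env, real_env, skill_policy, sample_latents, new_task_reward,
             k=15, T=4, N=2, gamma=0.99):
    """One iteration of Simulator-Predictive Control (Algorithm 1):
    score k candidate latents in simulation starting from the current real
    state, then execute the best one briefly on the real robot."""
    s_real = real_env.observe()
    candidates = sample_latents(k)                   # z_1, ..., z_k ~ p(z)
    returns = []
    for z in candidates:
        s = sim_env.set_state(s_real)                # start simulation at the real state
        ret = 0.0
        for j in range(T):                           # roll out the frozen skill policy
            a = skill_policy.sample_action(s, z)
            ret += (gamma ** j) * new_task_reward(s, a)
            s = sim_env.step(a)
        returns.append(ret)
    z_star = candidates[int(np.argmax(returns))]     # locally-best latent skill
    for _ in range(N):                               # execute it on the real robot
        a = skill_policy.sample_action(s_real, z_star)
        s_real = real_env.step(a)
    return s_real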
2.1, which is that latent skill spaces encode partial information on how 39 0.50 0.55 0.60 0.65 0.70 0.75 −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 0.4 x-y plane target gripper gripper start 0.50 0.55 0.60 0.65 0.70 0.75 0.10 0.15 0.20 0.25 0.30 x-z plane target gripper gripper start −0.3 −0.2 −0.1 0.0 0.1 0.2 0.3 0.4 0.10 0.15 0.20 0.25 0.30 y-z plane target gripper gripper start Figure 2.15: Gripper position plots for the unseen rectangle-drawing experiment in simulation. The unseen task is to move the gripper to draw a rectangle, and the pre-trained skill set is reaching in 3D space using joint-space control. 0.50 0.55 0.60 0.65 0.70 −0.2 −0.1 0.0 0.1 0.2 0.3 x-y plane target gripper gripper start 0.50 0.55 0.60 0.65 0.70 0.20 0.25 0.30 0.35 0.40 0.45 x-z plane target gripper gripper start −0.2 −0.1 0.0 0.1 0.2 0.3 0.20 0.25 0.30 0.35 0.40 0.45 y-z plane target gripper gripper start Figure 2.16: Gripper position plots for unseen rectangle-drawing experiment on the real robot. The unseen task is to move the gripper to draw a rectangle, and the pre-trained skill set is reaching in 3D space using joint-space control. to complete whole families of tasks, and they can be used with other methods which augment that infor- mation to quickly complete never-before-seen higher-order tasks. Below in Sec.2.5.3, we show that these sequencing tasks are extremely difficult to learn, even in very simple RL environments. 2.5.2.2 Sawyer: Pushing the Box through a Sequence of Waypoints Our goal in this experiment is to test whether SPC can adapt sim2real skill embedding policies, even in the face of imperfect dynamics transfer. Recall in Sec. 2.4.1.3 that the box-pushing policy composed in simulation failed to transfer due to dynamics gaps between the simulation and real robot. In this experiment, 40 0.450 0.475 0.500 0.525 0.550 0.575 0.600 0.625 −0.3 −0.2 −0.1 0.0 0.1 x-y plane target gripper gripper start 0.450 0.475 0.500 0.525 0.550 0.575 0.600 0.625 0.05 0.10 0.15 0.20 0.25 0.30 x-z plane target gripper gripper start −0.3 −0.2 −0.1 0.0 0.1 0.05 0.10 0.15 0.20 0.25 0.30 y-z plane target gripper gripper start Figure 2.17: Gripper position plots in unseen triangle-drawing experiment in simulation. The unseen task is to move the gripper to draw a triangle, and the pre-trained skill set is reaching in 3D space using joint-space control. 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 −0.3 −0.2 −0.1 0.0 0.1 0.2 x-y plane 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 0.15 0.20 0.25 0.30 0.35 0.40 x-z plane target gripper gripper start −0.3 −0.2 −0.1 0.0 0.1 0.2 0.15 0.20 0.25 0.30 0.35 0.40 y-z plane Figure 2.18: Gripper position plots in unseen triangle-drawing experiment on the real robot. The unseen task is to move the gripper to draw a triangle, and the pre-trained skill set is reaching in 3D space using joint-space control. we see whether SPC can cross the sim2real dynamics gap and still achieve zero-shot adaptation. This experiment tests a hypothesis we discussed in Sec. 2.5, which is that we can use simulation as a tool for approximate foresight in sim2real problems. We are also testing whether we can use a proven robotics algorithm (MPC) with experimental RL algorithms, to make a system which can achieve this new capability (zero-shot sim2real transfer and adaptation). We ask the robot to push a box along a sequence of points in the table plane using task-space control. As in Sec. 
2.4.1.3, we use the Euclidean distance between the position of the box and the current target position as the skill reward function, and switch reward functions between waypoints as described above. 41 Push block right, up, left Push block left, up, back to start Figure 2.19: Plot of block positions and gripper positions for the box-pushing experiments in simulation using SPC (Algorithm 1). Left: In the first experiment, the robot pushes the box to the right, up and then left. Right: In the second experiment, the robot pushes the box to the left, then up, and then back to its starting position. In these experiments, the unseen task is to move a box through a series of waypoints, and the pre-trained skillset is pushing a box along a straight line using task-space control. The policy receives a state observation with the relative position vector between the robot’s gripper and the box’s centroid, and outputs incremental gripper movements (up to0.03 cm) as actions. We first pre-train a policy to push the box to one of four goal locations relative to its starting position in simulation. This induces a planar pushing skill set. We trained the low-level multi-task policy with four skills in simulation: 20 cm up, down, left, and right of the box starting position. We then use our adaptation algorithm to choose latent vectors online which will push the box through a series of waypoints. In the simulation experiments, we use our method to push the box through three waypoints. In the real robot experiments, we use our method to complete two unseen tasks: pushing up-then-left and pushing left-then- down. These experiments confirm our hypothesis, which is that the SPC method can be used to augment em- bedded skill policies to cross a sim2real gap and simultaneously achieve new, higher-order tasks. 42 Push block left, then down Push block up, then left Figure 2.20: Trajectories of the positions of the block and the gripper in thexy-plane for the box-pushing experiments using SPC (Algorithm 1) on the real robot. In these experiments, the unseen task is to move a box through a series of waypoints, and the pre-trained skillset is pushing a box along a straight line using task-space control. Left: In the first experiment, the robot pushes the box to the left, and then down. Right: In the second experiment, the robot pushes the box up, and then to the left. 2.5.3 Results 2.5.3.1 Sawyer: Drawing a Sequence of Points In the unseen drawing experiments, we sampledk = 15 vectors from the skill latent distribution, and for each of them performed an MPC optimization with a horizon ofT = 4 steps. We then execute the latent with highest reward forN = 2 steps on the target robot. In simulation experiments, the Sawyer robot successfully drew a rectangle by sequencing 54 latents (Fig. 2.15) and drew a triangle with 56 latents (Fig. 2.17). In the real robot experiments, the Sawyer Robot successfully completed the unseen rectangle-drawing task by choosing 62 latents (Fig. 2.16) in two minutes of real time and completed the unseen triangle-drawing task by choosing 53 latents (Fig. 2.18) in less than two minutes. We investigate the performance of PPO, a state-of-the-art reinforcement learning algorithm, optimiz- ing a stochastic policy which does not learn a skill embedding function on a sequential point mass task 43 (Fig. 2.21 bottom). In this environment, a two-dimensional point is the state space and the action space is the two-dimensional change in state. 
If the current waypoint (goal) is reached, the state is reset to the origin from where the next waypoint is to be reached, until all waypoints are reached and the episode ends. In contrast to our earlier sequencing tasks using the composer policy, designing an adequate reward function that encourages the baseline algorithm to learn a policy that can move the point along the four waypoints poses a major challenge. A naive attempt of rewarding the negative distance between the current state and the next waypoint has the side-effect that the policy keeps making circles around the initial goal point without actually hitting it. After touching a waypoint, the reward will be highly negative before the next waypoint is reached, making it highly unlikely for a conventional algorithm, which maximizes the expected future reward, to ever explore the desired sequencing behavior. Instead, we carefully design the reward function to be monotonically increasing even when a waypoint is reached (Fig. 2.21 top). We keep track of the number of waypoints achieved and provide a very high completion bonus: r pointseq (s) = #(goals reached)jjsg i jj 2 2 : (2.5) As shown in one of the rollouts in Fig. 2.21 (bottom), despite extensive experimentation with the training set up, our baseline failed to achieve motions that reach to more than a single waypoint. 2.5.3.2 Sawyer: Pushing a Box along a Sequence of Waypoints In the pusher sequencing experiments, we sample k = 50 vectors from the latent distribution. We use an MPC optimization with a simulation horizon of T = 30 steps, and execute each chosen latent in the environment forN = 10 steps. In simulation experiments (Fig. 2.19), the robot completed the unseen up- left task less than 30 seconds of equivalent real time and the unseen right-up-left task less than 40 seconds of equivalent real time. In the real robot experiments (Fig. 2.20), the robot successfully completed the unseen 44 Time step Reward Evolution for Perfect Trajectory 1 2 3 4 Baseline Point Sequencing Policy x y Figure 2.21: Left: Evolution of the reward functionr pointseq (Eq. 2.5) is monotonic for the optimal trajectory that directly reaches to all goals in sequence. Bottom: Baseline policy that is trained to achieve a sequencing task in a 2D point mass environment, where the objective is to move the two-dimensional point clock-wise between four waypoints (red). Even with the complex hand-engineered reward function given in Eq. 2.5 which provides a monotonically reward landscape as the waypoints are reached, PPO with a policy that is not equipped with a skill embedding function fails to find a successful sequencing behavior. 45 left-down task by choosing three latents over approximately one minute of real time, and the unseen push up-left task by choosing eight latents in about 1.5 minutes of real time. 2.5.4 Analysis The experimental results show that SPC can learn composable skills and then quickly compose them online to achieve unseen tasks. Our approach is fast in wall clock time because we perform the model prediction in simulation instead of on the real robot. Note that our approach can utilize the continuous space of latent skills, whereas previous search methods only use an artificial discretization of the continuous latent space. In the unseen box-pushing real robot experiment (Fig. 2.20, Right), the Sawyer robot pushes the box towards the bottom-right right of the workspace to fix an error it made earlier in the task. 
This reactive behavior was never explicitly trained during the pre-training phase in simulation. This demonstrates that by exploring in the latent space, our adaptation method successfully composes skills to accomplish tasks even when it encounters situations which were never seen during the pre-training process. The experiments show that latent skill learning and SPC can adapt to mild sim2real dynamics gaps, however the primary advantage of these methods is sample-efficient adaptation for sim2real robot learning, not dynamics transfer. As we mentioned in Section 2.3, a real implementation of this method in the field would likely combine embedded skill learning and adaptation with a dynamics-transfer focused sim2real method to achieve the best transfer performance. 2.6 Conclusion The experiments in this chapter illustrate the promise and challenges of applying state-of-the-art deep re- inforcement learning to real-world robotics problems. For instance, our policies were able to learn and generalize task-space control and even motion planning skills, starting from joint-space control, with no 46 hand-engineering for those use cases. Simultaneously, the training and transfer process requires careful engineering and some hand-tuning. This control-oriented perspective provides a promising example of how to combine new learning-based robotics methods with proven algorithms like MPC, by demonstrating how to intentionally learn robotic policies which are manipulable by those algorithms without sacrificing their best properties, such as expres- siveness and the ability to learn from data. 47 Chapter 3 Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta- reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks. Therefore, if the aim of these methods is enable faster acqui- sition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. This chapter proposes an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. It evaluates 6 state-of-the-art meta-reinforcement learning and multi-task learning algo- rithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the 48 way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods. * . 
3.1 Introduction
While reinforcement learning (RL) has achieved some success in domains such as assembly [143], ping pong [177], in-hand manipulation [6], and hockey [33], state-of-the-art methods require substantially more experience than humans to acquire even a single narrowly-defined skill. If we want robots to be broadly useful in realistic environments, we instead need algorithms that can learn a wide variety of skills reliably and efficiently. Fortunately, in most specific domains, such as robotic manipulation or locomotion, many individual tasks share common structure that can be reused to acquire related tasks more efficiently. For example, most robotic manipulation tasks involve grasping or moving objects in the workspace. However, while current methods can learn individual skills like screwing on a bottle cap [143] and hanging a mug [165], we need algorithms that can efficiently learn shared structure across many related tasks, and use that structure to learn new skills quickly, such as screwing on a jar lid or hanging a bag. Recent advances in machine learning have provided unparalleled generalization capabilities in domains such as images [135] and speech [48], suggesting that this should be possible; however, we have yet to see such generalization to diverse tasks in reinforcement learning settings.
Recent works in meta-learning and multi-task reinforcement learning have shown promise for addressing this gap. Multi-task RL methods aim to learn a single policy that can solve multiple tasks more efficiently than learning the tasks individually, while meta-learning methods train on many tasks and optimize for fast adaptation to a new task. While these methods have made progress, the development of both classes of approaches has been limited by the lack of established benchmarks and evaluation protocols that reflect realistic use cases. On one hand, multi-task RL methods have largely been evaluated on disjoint and overly diverse tasks such as the Atari suite [103], where there is little efficiency to be gained by learning across games [192]. On the other hand, meta-RL methods have been evaluated on very narrow task distributions. For example, one popular evaluation of meta-learning involves choosing different running directions for simulated legged robots [66], which then enables fast adaptation to new directions. While these are technically distinct tasks, they are a far cry from the promise of a meta-learned model that can adapt to any new task within some domain. In order to study the capabilities of current multi-task and meta-reinforcement learning methods, and to make it feasible to design new algorithms that actually generalize and adapt quickly on meaningfully distinct tasks, we need evaluation protocols and task suites that are broad enough to enable this sort of generalization, while containing sufficient shared structure for generalization to be possible.
The key contributions of this chapter are a suite of 50 diverse simulated manipulation tasks and an extensive empirical evaluation of how previous methods perform on sets of such distinct tasks. We contend that multi-task and meta-reinforcement learning methods that aim to efficiently learn many tasks and quickly generalize to new tasks should be evaluated on distributions of tasks that are diverse and exhibit shared structure.
* Videos of the benchmark tasks are on the project page: meta-world.github.io. Open-source code for the benchmark is hosted at: https://github.com/rlworkgroup/metaworld
To this end, we present a benchmark of simulated manipulation tasks with everyday objects, all of which are contained in a shared, table-top environment with a simulated Sawyer arm. By providing a large set of distinct tasks that share common environment and control structure, we believe that this benchmark will allow researchers to test the generalization capabilities of the current multi-task and meta RL methods, and help to identify new research avenues to improve the current approaches. Our empirical evaluation of existing methods on this benchmark reveals that, despite some impressive progress in multi-task and meta-reinforcement learning over the past few years, current methods are generally not able to learn diverse task sets, much less generalize successfully to entirely new tasks. We provide an evaluation protocol with evaluation modes of varying difficulty, and observe that current methods only show success in the easiest modes. This opens the door for future developments in multi-task and meta reinforcement learning: instead of focusing on further increasing performance on current narrow task suites, we believe that it is essential for 50 Train tasks Test tasks 46. hand insert 47. close box 48. lock door 49. unlock door 50. pick bin 3. basketball 4. sweep into hole 41. close window 42. open window 45. open drawer 28. close drawer 12. assemble nut 43. open door 44. close door 19. hammer 16. place onto shelf 27. retrieve plate 26. retrieve plate side 40. unplug peg 2. sweep 38. pick & place 6. push 21. slide plate side 20. slide plate 11. pull handle side 18. press handle side 23. press handle 24. pull handle 1. turn on faucet 5. turn off faucet 7. pull lever8. turn dial 9. push with stick 10. get coffee 31. press button top w/ wall 13. pull with stick 14. pick out of hole 22. press button wall 25. soccer 29. press button top 30. reach 32. reach with wall 33. insert peg side 34. push 35. push with wall 36. pick & place w/ wall 37. press button 17. push mug 39. pull mug 15. disassemble nut Figure 3.1: Meta-World contains 50 manipulation tasks, designed to be diverse yet carry shared structure that can be leveraged for efficient multi-task RL and transfer to new tasks via meta-RL. In the most difficult evaluation, the method must use experience from 45 training tasks (left) to quickly learn distinctly new test tasks (right). future work in these areas to focus on increasing the capabilities of algorithms to handle highly diverse task sets. By doing so, we can enable meaningful generalization across many tasks and achieve the full potential of meta-learning as a means of incorporating past experience to make it possible for robots to acquire new skills as quickly as people can. 3.2 Related Work Previous works that have proposed benchmarks for reinforcement learning have largely focused on single task learning settings [22, 36, 255]. One popular benchmark used to study multi-task learning is the Arcade Learning Environment, a suite of dozens of Atari 2600 games [160]. While having a tremendous impact 51 on the multi-task reinforcement learning research community [192, 220, 103, 59, 239], the Atari games in- cluded in the benchmark have significant differences in visual appearance, controls, and objectives, making it challenging to acquire any efficiency gains through shared learning. In fact, many prior multi-task learn- ing methods have observed substantial negative transfer between the Atari games [192, 220]. 
In contrast, we would like to study a case where positive transfer between the different tasks should be possible. We therefore propose a set of related yet diverse tasks that share the same robot, action space, and workspace. Meta-reinforcement learning methods have been evaluated on a number of different problems, including maze navigation [56, 274, 173], continuous control domains with parametric variation across tasks [66, 215, 208, 65], bandit problems [274, 56, 173, 214], levels of an arcade game [187], and locomotion tasks with varying dynamics [178, 225]. Complementary to these evaluations, we aim to develop a test-bed of tasks and an evaluation protocol that are reflective of the challenges in applying meta-learning to robotic manipulation problems, including both parametric and non-parametric variation in tasks. There is a long history of robotics benchmarks [25], datasets [142, 67, 289, 32, 84, 164, 238], compe- titions [40] and standardized object sets [24, 34] that have played an important role in robotics research. Similarly, there exists a number of robotics simulation benchmarks including visual navigation [227, 133, 23, 226, 279], autonomous driving [53, 278, 212], grasping [121, 122, 79], single-task manipulation [63], among others. In this chapter, our aim is to continue this trend and provide a large suite of tasks that will allow researchers to study multi-task learning, meta-learning, and transfer in general. Further, unlike these prior simulation benchmarks, we particularly focus on providing a suite of many diverse manipulation tasks and a protocol for multi-task and meta RL evaluation. 52 3.3 The Multi-Task and Meta-RL Problem Statements Our proposed benchmark is aimed at making it possible to study generalization in meta-RL and multi-task RL. In this section, we define the meta-RL and multi-task RL problem statements, and describe some of the challenges associated with task distributions in these settings. We use the formalism of Markov decision processes (MDPs), where each taskT corresponds to a differ- ent finite horizon MDP, represented by a tuple (S;A;P;R;H; ), wheres2S correspond to states,a2A correspond to the available actions,P (s t+1 js t ;a t ) represents the stochastic transition dynamics,R(s;a) is a reward function,H is the horizon and is the discount factor. In standard reinforcement learning, the goal is to learn a policy(ajs) that maximizes the expected return, which is the sum of (discounted) rewards over all time. In multi-task and meta-RL settings, we assume a distribution of tasksp(T ). Different tasks may vary in any aspect of the Markov decision process, though efficiency gains in adaptation to new tasks are only possible if the tasks share some common structure. For example, as we describe in the next section, the tasks in our proposed benchmark have the same action space and horizon, and structurally similar rewards and state spaces. † 3.3.1 Multi-task RL problem statement. The goal of multi-task RL is to learn a single, task-conditioned policy (ajs;z), where z indicates an encoding of the task ID. This policy should maximize the average expected return across all tasks from the task distributionp(T ), given byE Tp(T ) [E [ P T t=0 t R t (s t ;a t )]]. The information about the task can be provided to the policy in various ways, e.g. using a one-hot task identification encodingz that is passed in addition to the current state. 
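To make this one-hot conditioning concrete, the sketch below shows one common way to implement π(a|s,z) as a Gaussian MLP that simply concatenates the state with the one-hot task vector. This is an illustrative PyTorch sketch, not the reference implementation used in our experiments; the network sizes and the 9-dimensional observation, 4-dimensional action, and 10-task dimensions are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OneHotTaskPolicy(nn.Module):
    """Gaussian MLP policy conditioned on a one-hot task encoding z."""

    def __init__(self, state_dim: int, num_tasks: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.num_tasks = num_tasks
        # The one-hot task ID is simply concatenated with the state.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state: torch.Tensor, task_id: torch.Tensor):
        z = F.one_hot(task_id, num_classes=self.num_tasks).float()
        h = self.trunk(torch.cat([state, z], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())


# Illustrative dimensions only (assumptions, not a benchmark specification).
policy = OneHotTaskPolicy(state_dim=9, num_tasks=10, action_dim=4)
dist = policy(torch.randn(32, 9), torch.randint(0, 10, (32,)))
actions = dist.sample()  # one action per (state, task) pair in the batch
```

Richer task encodings, such as a learned embedding, can be substituted for z without changing the rest of the network.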
There is no separate test set of tasks, and multi-task RL algorithms are typically evaluated on their average performance over the training tasks. † In practice, the policy must be able to read in the state for each of the tasks, which typically requires them to at least have the same dimensionality. In our benchmarks, some tasks have different numbers of objects, but the state dimensionality is always the same, meaning that some state coordinates are unused for some tasks. 53 3.3.2 Meta-RL problem statement. Meta-reinforcement learning aims to leverage the set of training task to learn a policy(ajs) that can quickly adapt to new test tasks that were not seen during training, where both training and test tasks are assumed to be drawn from the same task distributionp(T ). Typically, the training tasks are referred to as the meta-training set, to distinguish from the adaptation (training) phase performed on the (meta-) test tasks. During meta- training, the learning algorithm has access to M tasksfT i g M i=1 that are drawn from the task distribution p(T ). At meta-test time, a new taskT j p(T ) is sampled that was not seen during meta-training, and the meta-trained policy must quickly adapt to this task to achieve the highest return with a small number of samples. A key premise in meta-RL is that a sufficiently powerful meta-RL method can meta-learn a model that effectively implements a highly efficient reinforcement learning procedure, which can then solve entirely new tasks very quickly – much more quickly than a conventional reinforcement learning algorithm learning from scratch. However, in order for this to happen, the meta-training distribution p(T ) must be sufficiently broad to encompass these new tasks. Unfortunately, most prior work in meta-RL evaluates on very narrow task distributions, with only one or two dimensions of parametric variation, such as the running direction for a simulated robot [66, 215, 208, 65]. 3.4 Meta-World If we want meta-RL methods to generalize effectively to entirely new tasks, we must meta-train on broad task distributions that are representative of the range of tasks that a particular agent might need to solve in the future. To this end, we propose a new multi-task and meta-RL benchmark, which we call Meta- World. In this section, we motivate the design decisions behind the Meta-World tasks, discuss the range of tasks, describe the representation of the actions, observations, and rewards, and present a set of evaluation protocols of varying difficulty for both meta-RL and multi-task RL. 54 Parametric Task Variation Non-Parametric Task Variation Reach Puck Open Window Figure 3.2: Parametric vs. non-parametric variation. Left: All “reach puck” tasks (left) can be parameter- ized by the puck position. Right: Conversely, the difference between “reach puck” and “open window” is non-parametric. 3.4.1 The Space of Manipulation Tasks: Parametric and Non-Parametric Variability Meta-learning makes two critical assumptions: first, that the meta-training and meta-test tasks are drawn from the same distribution,p(T ), and second, that the task distributionp(T ) exhibits shared structure that can be utilized for efficient adaptation to new tasks. If p(T ) is defined as a family of variations within a particular control task, as in prior work [66, 208], then it is unreasonable to hope for generalization to entirely new control tasks. 
For example, an agent has little hope of being able to quickly learn to open a door, without having ever experienced doors before, if it has only been trained on a set of meta-training tasks that are homogeneous and narrow. Thus, to enable meta-RL methods to adapt to entirely new tasks, we propose a much larger suite of tasks consisting of 50 qualitatively-distinct manipulation tasks, where continuous parameter variation cannot be used to describe the differences between tasks. With such non-parametric variation, however, there is the danger that tasks will not exhibit enough shared structure, or will lack the task overlap needed for the method to avoid memorizing each of the tasks. Motivated by this challenge, we design each task to include parametric variation in object and goal 55 positions, as illustrated in Figure 3.2. Introducing this parametric variability not only creates a substantially larger (infinite) variety of tasks, but also makes it substantially more practical to expect that a meta-trained model will generalize to acquire entirely new tasks more quickly, since varying the positions provides for wider coverage of the space of possible manipulation tasks. Without parametric variation, the model could for example memorize that any object at a particular location is a door, while any object at another location is a drawer. If the locations are not fixed, this kind of memorization is much less likely, and the model is forced to generalize more broadly. With enough tasks and variation within tasks, pairs of qualitatively- distinct tasks are more likely to overlap, serving as a catalyst for generalization. For example, closing a drawer and pushing a block can appear as nearly the same task for some initial and goal positions of each object. Note that this kind of parametric variation, which we introduce for each task, essentially represents the entirety of the task distribution for previous meta-RL evaluations [66, 208], which test on single tasks (e.g., running towards a goal) with parametric variability (e.g., variation in the goal position). Our full task distribution is therefore substantially broader, since it includes this parametric variability for each of the 50 tasks. To provide shared structure, the 50 environments require the same robotic arm to interact with different objects, with different shapes, joints, and connectivity. The tasks themselves require the robot to execute a combination of reaching, pushing, and grasping, depending on the task. By recombining these basic behavioral building blocks with a variety of objects with different shapes and articulation properties, we can create a wide range of manipulation tasks. For example, the open door task involves pushing or grasping an object with a revolute joint, while the open drawer task requires pushing or grasping an object with a sliding joint. More complex tasks require a combination of these building blocks, which must be executed in the right order. We visualize all of the tasks in Meta-World in Figure 3.1, and include a description of all tasks in Appendix 3.7. 56 All of the tasks are implemented in the MuJoCo physics engine [264], which enables fast simulation of physical contact. To make the interface simple and accessible, we base our suite on the Multiworld interface [183] and the OpenAI Gym environment interfaces [22], making additions and adaptations of the suite relatively easy for researchers already familiar with Gym. 
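As a concrete illustration of this Gym-style interface, the short sketch below constructs the ML1 benchmark for one task family, samples a parametric variation, and rolls out a random policy. It follows the benchmark's public README; exact class and task names (e.g., 'pick-place-v1') and the tuple returned by step may differ between package versions.

```python
import random
import metaworld

# Construct the ML1 benchmark for a single task family. Its meta-training
# "tasks" are random object/goal configurations of that family, and its
# meta-test tasks are held-out configurations.
ml1 = metaworld.ML1('pick-place-v1')

env = ml1.train_classes['pick-place-v1']()   # instantiate the environment
task = random.choice(ml1.train_tasks)        # sample one parametric variation
env.set_task(task)                           # fixes the object and goal positions

obs = env.reset()
for _ in range(150):                         # illustrative episode length
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
```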
3.4.2 Actions, Observations, and Rewards In order to represent policies for multiple tasks with one model, the observation and action spaces must contain significant shared structure across tasks. All of our tasks are performed by a simulated Sawyer robot, with the action space corresponding to 3D end-effector positions. For all tasks, the robot must either manipulate one object with a variable goal position, or manipulate two objects with a fixed goal position. The observation space is represented as a 3-tuple of either the 3D Cartesian positions of the end-effector, the object, and the goal, or the 3D Cartesian positions of the end-effector, the first object, and the second object, and is always 9 dimensional. Designing reward functions for Meta-World requires two major considerations. First, to guarantee that our tasks are within the reach of current single-task reinforcement learning algorithms, which is a prereq- uisite for evaluating multi-task and meta-RL algorithms, we design well-shaped reward functions for each task that make each of the tasks at least individually solvable. More importantly, the reward functions must exhibit shared structure across tasks. Critically, even if the reward function admits the same optimal pol- icy for multiple tasks, varying reward scales or structures can make the tasks appear completely distinct for the learning algorithm, masking their shared structure and leading to preferences for tasks with high- magnitude rewards [103]. Accordingly, we adopt a structured, multi-component reward function for all tasks, which leads to effective policy learning for each of the task components. For instance, in a task that involves a combination of reaching, grasping, and placing an object, leto2R 3 be the object position, where o = (o x ;o y ;o z ),h2R 3 be the position of the robot’s gripper,z target 2R be the target height of lifting the 57 ….. Pick and place with goal Pick and place with goal Pick and place with goal Pick and place Pushing Reaching Door opening Button press Peg insertion side Window opening Window closing Drawer opening Drawer closing Pick and place Pushing Reaching Door opening Button press Peg insertion side Window opening Window closing Drawer opening Drawer closing Window opening Button press Pick and place Reaching Pushing Peg insertion side Drawer closing Dial turning Sweep into goal Drawer opening Door closing Shelf placing Sweep an object off table Lever pulling Pick and place with unseen goal Basketball Train tasks Test tasks ML1 MT10 ML10 Figure 3.3: Visualization of three of our multi-task and meta-learning evaluation protocols, ranging from within task adaptation in ML1, to multi-task training across 10 distinct task families in MT10, to adapting to new tasks in ML10. Our most challenging evaluation mode ML45 is shown in Figure 3.1. object, andg2R 3 be goal position. With the above definition, the multi-component reward functionR is the additive combination of a reaching rewardR reach , a grasping rewardR grasp and a placing rewardR place , or subsets thereof for simpler tasks that only involve reaching and/or pushing. With this design, the reward functions across all tasks have similar magnitude and conform to similar structure, as desired. The full form of the reward function and a list of all task rewards is provided in Appendix 3.8. 
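The sketch below reconstructs this shared reward structure from the formulas listed in Appendix 3.8. The constants and indicator thresholds follow the values given there, but the code is an illustrative reconstruction rather than the benchmark's exact reward implementation.

```python
# Reconstructed sketch of the shared multi-component reward (see Appendix 3.8).
# Constants are taken from that appendix; treat signs and thresholds as
# illustrative rather than a drop-in replacement for the benchmark code.
import numpy as np

EPS, C1, C2, C3 = 0.05, 100.0, 1000.0, 0.01


def reach_grasp_place_reward(h, o, g, z_target):
    """h: gripper position, o: object position, g: goal position (3-vectors)."""
    reach_dist = np.linalg.norm(h - o)
    r_reach = -reach_dist                                     # move the gripper to the object
    r_grasp = (reach_dist < EPS) * C1 * min(o[2], z_target)   # lift once the object is reached
    r_place = (abs(o[2] - z_target) < EPS) * C2 * np.exp(-np.linalg.norm(o - g) ** 2 / C3)
    return r_reach + r_grasp + r_place


def reach_push_reward(h, o, g):
    """Simpler tasks drop the grasping term and reward pushing o toward g."""
    reach_dist = np.linalg.norm(h - o)
    return -reach_dist + (reach_dist < EPS) * C2 * np.exp(-np.linalg.norm(o - g) ** 2 / C3)
```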
3.4.3 Evaluation Protocol With the goal of providing a challenging benchmark to facilitate progress in multi-task RL and meta-RL, we design an evaluation protocol with varying levels of difficulty, ranging from the level of current goal-centric meta-RL benchmarks to a setting where methods must learn distinctly new, challenging manipulation tasks based on diverse experience across 45 tasks. We hence divide our evaluation into five categories, which we describe next. We then detail our evaluation criteria. Meta-Learning 1 (ML1): Few-shot adaptation to goal variation within one task. The simplest evaluation aims to verify that previous meta-RL algorithms can adapt to new object or goal configurations 58 on only one type of task. ML1 uses single Meta-World Tasks, with the meta-training “tasks” corresponding to 50 random initial object and goal positions, and meta-testing on 10 held-out positions. This resembles the evaluations in prior works [66, 208]. We evaluate algorithms on three individual tasks from Meta-World: reaching, pushing, and pick and place, where the variation is over reaching position or goal object position. The goal positions are not provided in the observation, forcing meta-RL algorithms to adapt to the goal through trial-and-error. Multi-Task 10, Multi-Task 50 (MT10, MT50): Learning one multi-task policy that generalizes to 10 and 50 training tasks. A first step towards adapting quickly to distinctly new tasks is the ability to train a single policy that can solve multiple distinct training tasks. The multi-task evaluation in Meta-World tests the ability to learn multiple tasks at once, without accounting for generalization to new tasks. The MT10 evaluation uses 10 tasks: reach, push, pick and place, open door, open drawer, close drawer, press button top-down, insert peg side, open window, and open box. The larger MT50 evaluation uses all 50 Meta-World tasks. The policy is provided with a one-hot vector indicating the current task. Meta-Learning 10, Meta-Learning 45 (ML10, ML45): Few-shot adaptation to new test tasks with 10 and 50 meta-training tasks. With the objective to test generalization to new tasks, we hold out 5 tasks and meta-train policies on 10 and 45 tasks. We randomize object and goals positions and intentionally select training tasks with structural similarity to the test tasks. Task IDs are not provided as input, requiring a meta-RL algorithm to identify the tasks from experience. Success metrics. Since values of reward are not directly indicative how successful a policy is, we define an interpretable success metric for each task, which will be used as the evaluation criterion for all of the above evaluation settings. Since all of our tasks involve manipulating one or more objects into a goal configuration, this success metric is based on the distance between the task-relevant object and its final goal pose, i.e.kogk 2 <, where is a small distance threshold such as 5 cm. For the complete list of success metrics and thresholds for each task, see Appendix 3.8. 59 3.5 Experimental Results and Analysis The first, most basic goal of our experiments is to verify that each of the 50 presented tasks are indeed solveable by existing single-task reinforcement learning algorithms. We provide this verification in Ap- pendix 3.9. 
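All of the evaluations that follow report this binary success metric averaged over evaluation episodes and tasks. The sketch below illustrates that bookkeeping; the helper names are hypothetical and are not part of the benchmark's evaluation code.

```python
import numpy as np


def episode_success(obj_pos, goal_pos, threshold=0.05):
    """True if the task-relevant object ends within `threshold` meters of its goal."""
    return np.linalg.norm(np.asarray(obj_pos) - np.asarray(goal_pos)) < threshold


def average_success_rate(per_task_episodes):
    """per_task_episodes: dict mapping task name -> list of per-episode booleans."""
    per_task = {name: float(np.mean(flags)) for name, flags in per_task_episodes.items()}
    return per_task, float(np.mean(list(per_task.values())))


# Example: two tasks, three evaluation episodes each.
per_task, overall = average_success_rate({
    'reach': [True, True, False],
    'push': [False, True, False],
})
```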
Beyond verifying the individual tasks, the goals of our experiments are to study the following questions: (1) can existing state-of-the-art meta-learning algorithms quickly learn qualitatively new tasks when meta-trained on a sufficiently broad, yet structured task distribution, and (2) how do different multi- task and meta-learning algorithms compare in this setting? To answer these questions, we evaluate various multi-task and meta-learning algorithms on the Meta-World benchmark. We include the training curves of all evaluations in Figure 3.7 in the Appendix 3.10. Videos of the tasks and evaluations, along with all source code, are on the project webpage ‡ . In the multi-task evaluation, we evaluate the following RL algorithms: multi-task proximal policy opti- mization (PPO) [234]: a policy gradient algorithm adapted to the multi-task setting by providing the one-hot task ID as input, multi-task trust region policy optimization (TRPO) [232]: an on-policy policy gradient algorithm adapted to the multi-task setting using the one-hot task ID as input, multi-task soft actor-critic (SAC) [90]: an off-policy actor-critic algorithm adapted to the multi-task setting using the one-hot task ID as input, multi-task multi-head soft actor-critic (SAC) [90]: an off-policy actor-critic algorithm similar to multi-task SAC but using a multi-head policy with one head per task, and an on-policy version of task em- beddings (TE) [95]: a multi-task reinforcement learning algorithm that parameterizes the learned policies via shared skill embedding space. For the meta-RL evaluation, we study three algorithms: RL 2 [56, 274]: an on-policy meta-RL algorithm that corresponds to training a LSTM network with hidden states maintained across episodes within a task and trained with PPO, model-agnostic meta-learning (MAML) [66, 215]: an on-policy gradient-based meta-RL algorithm that embeds policy gradient steps into the meta-optimization, and is trained with PPO, and probabilistic embeddings for actor-critic RL (PEARL) [208]: an off-policy ‡ Videos are on the project webpage, atmeta-world.github.io 60 Figure 3.4: Comparison of algorithms using our simplest meta-RL evaluation, ML1. actor-critic meta-RL algorithm, which learns to encode experience into a probabilistic embedding of the task that is fed to the actor and the critic. We show results of the simplest meta-learning evaluation mode, ML1, in Figure 3.4. We find that there is room for improvement even in this very simple setting. Next, we look at results of multi-task learning across distinct tasks, starting with MT10 in the top left of Figure 3.5 and in Table 3.1. We find that multi-task multi-head SAC is able to learn the MT10 task suite well, achieving around 88% success rate averaged across tasks, while multi-task SAC that has a single head can only solve around 40% of the tasks, indicating that adopting a multi-head architecture can greatly improve multi-task learning performance. On-policy methods such as task embeddings, multi-task PPO, and multi-task TRPO perform significantly worse, achieving less than 30% success across tasks. However, as we scale to 50 distinct tasks with MT50 (Figure 3.5, bottom left, and average results in Table 3.1), we find that multi-task multi-head SAC achieves only 35.85% average performance across the 50 tasks, while the other four methods have less than 30% success rates, indicating significant room for improvement. 
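To clarify what distinguishes the two SAC variants compared above, the sketch below contrasts the multi-head design with the one-hot-input design shown earlier: a shared trunk feeds one output head per task, and only the head corresponding to the current task is used. The architecture and layer sizes here are assumptions for illustration, not the exact configuration used in these experiments.

```python
import torch
import torch.nn as nn


class MultiHeadPolicy(nn.Module):
    """Shared trunk with one Gaussian output head per task."""

    def __init__(self, state_dim: int, action_dim: int, num_tasks: int, hidden: int = 400):
        super().__init__()
        # Trunk parameters are shared by all tasks.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Each task gets its own head producing (mean, log_std) for the action.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2 * action_dim) for _ in range(num_tasks)]
        )

    def forward(self, state: torch.Tensor, task_id: int):
        h = self.trunk(state)
        mean, log_std = self.heads[task_id](h).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-20, 2).exp())


policy = MultiHeadPolicy(state_dim=9, action_dim=4, num_tasks=10)
dist = policy(torch.randn(8, 9), task_id=3)
action = dist.rsample()  # reparameterized sample, as used by actor-critic methods like SAC
```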
61 Finally, we study the ML10 and ML45 meta-learning benchmarks, which require learning the meta- training tasks and generalizing to new meta-test tasks with small amounts of experience. From Figure 3.5 and Table 3.1, we find that the prior meta-RL methods, MAML and RL 2 reach 36% and 10% success on ML10 test tasks, while PEARL is unable to generalize to new tasks on ML10. On ML45, PEARL manages to accomplish around 30% success rate on the test set, which suggests that having more meta-training tasks is conducive for PEARL to learn the underlying shared structure and adapt to unseen tasks. MAML and RL 2 solve around 20% of the meta-test tasks, potentially due to the additional optimization challenges in this regime. Note that, on both ML10 and ML45, the meta-training performance of all methods also has considerable room for improvement, suggesting that optimization challenges are generally more severe in the meta-learning setting. The fact that some methods nonetheless exhibit meaningful generalization suggests that the ML10 and ML45 benchmarks are solvable, but challenging for current methods, leaving considerable room for improvement in future work. Method MT10 MT50 Multi-Task PPO 25% 8.98% Multi-Task TRPO 29% 22.86% Task Embeddings 30% 15.31% Multi-Task SAC 39.5% 28.83% Multi-Task Multi-Head SAC 88% 35.85% Method ML10 ML45 Meta-Train Meta-Test Meta-Train Meta-Test MAML 25% 36% 21.14% 23.93% RL 2 50% 10% 43.18% 20% PEARL 42.78% 0% 11.36% 30% Table 3.1: Average success rates over all tasks for MT10, MT50, ML10, and ML45. The best performance in each benchmark is bolden. For MT10 and MT50, we show the average training success rate and multi-task multi-head SAC outperforms other methods. For ML10 and ML45, we show the meta-train and meta-test success rates. RL 2 achieves best meta-train performance in ML10 and ML45, while MAML and PEARL get the best generalization performance in ML10 and ML45 meta-test tasks respectively. 3.6 Conclusion and Directions for Future Work This chapter introduced an open-source benchmark for meta-reinforcement learning and multi-task learning, which consists of a large number of simulated robotic manipulation tasks. Unlike previous evaluation bench- marks in meta-RL, this benchmark specifically emphasizes generalization to distinctly new tasks, not just in terms of parametric variation in goals, but completely new objects and interaction scenarios. While meta-RL 62 can in principle make it feasible for agents to acquire new skills more quickly by leveraging past experience, previous evaluation benchmarks utilize very narrow task distributions, making it difficult to understand the degree to which meta-RL actually enables this kind of generalization. The aim of this benchmark is to make it possible to develop new meta-RL algorithms that actually exhibit this sort of generalization. The experiments in this chapter show that current meta-RL methods in fact cannot yet generalize effectively to entirely new tasks and do not even learn the meta-training tasks effectively when meta-trained across multiple distinct tasks. This suggests a number of directions for future work, which we describe below. Future directions for algorithm design. The main conclusion from our experimental evaluation with our proposed benchmark is that current meta-RL algorithms generally struggle in settings where the meta- training tasks are highly diverse. 
This issue mirrors the challenges observed in multi-task RL, which is also challenging with our task suite, and has been observed to require considerable additional algorithmic development to attain good results in prior work [192, 220, 59]. A number of recent works have studied algo- rithmic improvements in the area of multi-task reinforcement learning, as well as potential explanations for the difficulty of RL in the multi-task setting [103, 230]. Incorporating some of these methods into meta-RL, as well as developing new techniques to enable meta-RL algorithms to train on broader task distributions, would be a promising direction for future work to enable meta-RL methods to generalize effectively across diverse tasks, and our proposed benchmark suite can provide future algorithms development with a useful gauge of progress towards the eventual goal of broad task generalization. Future extensions of the benchmark. While the presented benchmark is significantly broader and more challenging than existing evaluations of meta-reinforcement learning algorithms, there are a number of extensions to the benchmark that would continue to improve and expand upon its applicability to realistic robotics tasks. First, in many situations, the poses of objects are not directly accessible to a robot in the real world. Hence, one interesting and important direction for future work is to consider image observations and sparse rewards. Sparse rewards can be derived already using the success metrics, while support for image 63 rendering is already supported by the code. However, for meta-learning algorithms, special care needs to be taken to ensure that the task cannot be inferred directly from the image, else meta-learning algorithms will memorize the training tasks rather than learning to adapt. Another natural extension would be to consider including a breadth of compositional long-horizon tasks, where there exist combinatorial numbers of tasks. Such tasks would be a straightforward extension, and provide the possibility to include many more tasks with shared structure. Another challenge when deploying robot learning and meta-learning algorithms is the manual effort of resetting the environment. To simulate this case, one simple extension of the benchmark is to significantly reduce the frequency of resets available to the robot while learning. Lastly, in many real- world situations, the tasks are not available all at once. To reflect this challenge in the benchmark, we can add an evaluation protocol that matches that of online meta-learning problem statements [69]. We leave these directions for future work, either to be done by ourselves or in the form of open-source contributions. To summarize, the proposed form of the task suite represents a significant step towards evaluating multi-task and meta-learning algorithms on diverse robotic manipulation problems that will pave the way for future research in these areas. 3.7 Task Descriptions In Table 3.2, we include a description of each of the 50 Meta-World tasks. 64 Figure 3.5: Full quantitative results on MT10, MT50, ML10, and ML45. Note that, even on the challenging ML10 and ML45 benchmarks, current methods already exhibit some degree of generalization, but meta- training performance leaves considerable room for improvement, suggesting that future work could attain better performance on these benchmarks. We also show the average success rates for all benchmarks in Table 3.1. 65 Task Description turn on faucet Rotate the faucet counter-clockwise. 
Randomize faucet positions sweep Sweep a puck off the table. Randomize puck positions assemble nut Pick up a nut and place it onto a peg. Randomize nut and peg positions turn off faucet Rotate the faucet clockwise. Randomize faucet positions push Push the puck to a goal. Randomize puck and goal positions pull lever Pull a lever down 90 degrees. Randomize lever positions turn dial Rotate a dial 180 degrees. Randomize dial positions push with stick Grasp a stick and push a box using the stick. Randomize stick positions. get coffee Push a button on the coffee machine. Randomize the position of the coffee machine pull handle side Pull a handle up sideways. Randomize the handle positions basketball Dunk the basketball into the basket. Randomize basketball and basket positions pull with stick Grasp a stick and pull a box with the stick. Randomize stick positions sweep into hole Sweep a puck into a hole. Randomize puck positions disassemble nut pick a nut out of the a peg. Randomize the nut positions place onto shelf pick and place a puck onto a shelf. Randomize puck and shelf positions push mug Push a mug under a coffee machine. Randomize the mug and the machine positions press handle side Press a handle down sideways. Randomize the handle positions hammer Hammer a screw on the wall. Randomize the hammer and the screw positions slide plate Slide a plate into a cabinet. Randomize the plate and cabinet positions slide plate side Slide a plate into a cabinet sideways. Randomize the plate and cabinet positions press button wall Bypass a wall and press a button. Randomize the button positions press handle Press a handle down. Randomize the handle positions pull handle Pull a handle up. Randomize the handle positions soccer Kick a soccer into the goal. Randomize the soccer and goal positions retrieve plate side Get a plate from the cabinet sideways. Randomize plate and cabinet positions retrieve plate Get a plate from the cabinet. Randomize plate and cabinet positions close drawer Push and close a drawer. Randomize the drawer positions press button top Press a button from the top. Randomize button positions reach reach a goal position. Randomize the goal positions press button top wall Bypass a wall and press a button from the top. Randomize button positions reach with wall Bypass a wall and reach a goal. Randomize goal positions insert peg side Insert a peg sideways. Randomize peg and goal positions pull Pull a puck to a goal. Randomize puck and goal positions push with wall Bypass a wall and push a puck to a goal. Randomize puck and goal positions pick out of hole Pick up a puck from a hole. Randomize puck and goal positions pick and place with wall Pick a puck, bypass a wall and place the puck. Randomize puck and goal positions press button Press a button. Randomize button positions pick and place Pick and place a puck to a goal. Randomize puck and goal positions pull mug Pull a mug from a coffee machine. Randomize the mug and the machine positions unplug peg Unplug a peg sideways. Randomize peg positions close window Push and close a window. Randomize window positions open window Push and open a window. Randomize window positions open door Open a door with a revolving joint. Randomize door positions close door Close a door with a revolvinig joint. Randomize door positions open drawer Open a drawer. Randomize drawer positions insert hand Insert the gripper into a hole. close box Grasp the cover and close the box with it. 
Randomize the cover and box positions lock door Lock the door by rotating the lock clockwise. Randomize door positions unlock door Unlock the door by rotating the lock counter-clockwise. Randomize door positions pick bin Grasp the puck from one bin and place it into another bin. Randomize puck positions Table 3.2: List of the Meta-World tasks and a description of each task. 66 3.8 Task Rewards and Success Metrics The form of the reward function is shared across tasks. In particular, the multi-component reward function R is a combination of a reaching reward R reach , a grasping reward R grasp and a placing reward R place as follows: R =R reach +R grasp +R place (3.1) =khok 2 | {z } R reach +1 khok 2 < c 1 min(o z ;z target ) | {z } R grasp +1 jozz targetj< c 2 e kogk 2 2 c 3 | {z } R place (3.2) where;c 1 ;c 2 ;c 3 are constant for all tasks. For tasks that involve reaching and pushing, the rewardR can be formed as a combination of a reaching rewardR reach and a pushing rewardR push : R =R reach +R push (3.3) =khok 2 | {z } R reach +1 khok 2 < c 2 e kogk 2 2 c 3 | {z } R push (3.4) With this design, the reward functions across all tasks have similar magnitude and conform to similar struc- ture, as desired. In Table 3.3, we include a complete list of reward functions of each of the 50 Meta-World tasks. In Table 3.4, we include a complete list of success metrics of each of the 50 Meta-World tasks. 67 Table 3.3: Reward functions used for each of the Meta-World tasks. Task Reward Function turn on faucet khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 sweep khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pick out of hole khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 turn off faucet khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 push with stick khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 get coffee khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pull handle side khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 basketball khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 pull with stick khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 sweep into hole khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 disassemble nut khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 assemble nut khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 place onto shelf khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 push mug khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 press handle side khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 hammer khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 slide plate khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 slide plate side khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 press button wall khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 press handle khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pull handle khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 soccer khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 retrieve plate side khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 retrieve plate khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 close drawer khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 reach 1000e khgk 2 2 0:01 press button top wall khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 reach with wall 1000e khgk 2 2 0:01 insert peg side khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 push khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 push with wall khok 2 +1 khok2<0:05 
1000e khgk 2 2 0:01 pick and place with wall khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 press button khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 press button top khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 68 Task Reward Function pick andplace khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 pull khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pull mug khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 unplug peg khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 turn dial khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pull lever khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 close window khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 open window khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 open door khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 close door khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 open drawer khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 insert hand 1000e khgk 2 2 0:01 close box khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 lock door khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 unlock door khok 2 +1 khok2<0:05 1000e khgk 2 2 0:01 pick bin khok 2 +1 khok2<0:05 100 min(o z ;z target ) +1 jozz targetj<0:05 1000e khgk 2 2 0:01 69 Task Success Metric turn on faucet 1 kogk2<0:05 sweep 1 kogk2<0:05 pick out of hole 1 kogk2<0:08 turn off faucet 1 kogk2<0:05 push 1 kogk2<0:07 push with stick 1 kogk2<0:08 get coffee 1 kogk2<0:02 pull handle side 1 kogk2<0:04 basketball 1 kogk2<0:08 pull with stick 1 kogk2<0:08 sweep into hole 1 kogk2<0:05 disassemble nut 1 kogk2<0:08 assemble nut 1 kogk2<0:08 place onto shelf 1 kogk2<0:08 push mug 1 kogk2<0:07 press handle side 1 kogk2<0:04 hammer 1 kogk2<0:05 slide plate 1 kogk2<0:07 slide plate side 1 kogk2<0:07 press button wall 1 kogk2<0:02 press handle 1 kogk2<0:02 pull handle 1 kogk2<0:04 soccer 1 kogk2<0:07 retrieve plate side 1 kogk2<0:07 retrieve plate 1 kogk2<0:07 close drawer 1 kogk2<0:08 reach 1 kogk2<0:05 press button top wall 1 kogk2<0:02 reach with wall 1 kogk2<0:05 insert peg side 1 kogk2<0:07 push with wall 1 kogk2<0:07 pick and place with wall 1 kogk2<0:07 press button 1 kogk2<0:02 press button top 1 kogk2<0:02 pick and place 1 kogk2<0:07 pull 1 kogk2<0:07 pull mug 1 kogk2<0:07 unplug peg 1 kogk2<0:07 turn dial 1 kogk2<0:03 pull lever 1 kogk2<0:05 close window 1 kogk2<0:05 open window 1 kogk2<0:05 open door 1 kogk2<0:08 close door 1 kogk2<0:08 open drawer 1 kogk2<0:08 insert hand 1 kogk2<0:05 close box 1 kogk2<0:08 lock door 1 kogk2<0:05 unlock door 1 kogk2<0:05 pick bin 1 kogk2<0:08 Table 3.4: Success metrics used for each of the Meta-World tasks. All units are in meters. 70 3.9 Benchmark Verification with Single-Task Learning This section aims to verify that each of the benchmark tasks are individually solvable provided enough data. To do so, it considers two state-of-the-art single task reinforcement learning methods, Proximal Policy Optimization (PPO) [234] and Soft Actor-Critic (SAC) [90]. This evaluation is purely for validation of the tasks, and not an official evaluation protocol of the benchmark. We provide hyperparameters for these experiments Section 3.11. Figure 3.6 illustrates the results of these experiments. We find that SAC can learn to perform all of the 50 tasks to some degree, while PPO can solve a large majority of the tasks. Figure 3.6: Performance of independent policies trained on individual tasks using soft actor-critic (SAC) and proximal policy optimization (PPO). 
We verify that SAC can solve all of the tasks and PPO can also solve most of the tasks. 71 3.10 Learning Curves In evaluating meta-learning algorithms, we care not just about performance but also about efficiency, i.e. the amount of data required by the meta-training process. While the adaptation process for all algorithms is extremely efficient, requiring only 10 trajectories, the meta-learning process can be very inefficient, particu- larly for on-policy algorithms such as MAML, RL 2 . In Figure 3.8, we show full learning curves of the three meta-learning methods on ML1. In Figure 3.7, we show full learning curves of MT10, ML10, MT50 and ML45. The MT10 and MT50 learning curves show the efficiency of multi-task learning, a critical evaluation metric, since sample efficiency gains are a primary motivation for using multi-task learning. Unsurprisingly, we find that off-policy algorithms such as soft actor-critic and PEARL are able to learn with substantially less data than on-policy algorithms. Figure 3.7: Learning curves of all methods on MT10, ML10, MT50, and ML45 benchmarks. Y-axis rep- resents success rate averaged over tasks in percentage (%). The dashed lines represent asymptotic per- formances. Off-policy algorithms such as multi-task SAC and PEARL learn much more efficiently than off-policy methods, though PEARL underperforms MAML and RL 2 . 72 Figure 3.8: Comparison of PEARL, MAML, and RL 2 learning curves on the simplest evaluation, ML1, where the methods need to adapt quickly to new object and goal positions within the one meta-training task. 73 3.11 Hyperparameter Details This section provides hyperparameter values for each of the methods in our experimental evaluation. 3.11.1 Single Task SAC Hyperparameter Value Batch size 128 Non-linearity ReLU Policy initialization Standard Gaussian Exploration parameters Uniform exploration policy for 1;000 steps Ratio of environment and gradient steps per iteration 1 environment step per gradient step Policy learning rate 310 −4 Q-function learning rate 310 −4 Optimizer Adam Discount factor 0:99 Horizon 150 Reward scale 1:0 Temperature Learned Table 3.5: Hyperparameters for Single Task SAC experiments on Meta-World 3.11.2 Single Task PPO Hyperparameter Value Non-linearity ReLU Batch size 4;096 Policy initial standard deviation 2:0 Entropy regularization Coefficient 110 −3 Value function Linear feature baseline Table 3.6: Hyperparameters for Single Task PPO experiments on Meta-World 74 3.11.3 Multi-Task SAC Hyperparameter Value Network architecture MLP Network size (400; 400; 400) Batch size 128 Number of tasks Non-linearity ReLU Policy initialization Standard Gaussian Exploration parameters Uniform exploration policy for 1;000 steps Ratio of environment and gradient steps per iteration 1 environment step per gradient step Policy learning rate 310 −4 Q-function learning rate 310 −4 Optimizer Adam Discount factor 0:99 Horizon 150 Reward scale 1:0 Temperature Learned and disentangled for each task Table 3.7: Hyperparameters for Single Task PPO experiments on Meta-World 3.11.4 Multi-Task Multi-Headed SAC Hyperparameter Value Network architecture MLP Network size (400; 400; 400) Batch size 128 Number of tasks Non-linearity ReLU Policy initialization Standard Gaussian Exploration parameters Uniform exploration policy for 1;000 steps Ratio of environment and gradient steps per iteration 1 environment step per gradient step Policy learning rate 310 −4 Q-function learning rate 310 −4 Optimizer Adam Discount factor 0:99 Horizon 150 Reward scale 1:0 
Temperature Learned and disentangled for each task Table 3.8: Hyperparameters for Single Task PPO experiments on Meta-World 75 3.11.5 Multi-Task PPO Hyperparameter Value Batch size Number of tasks 10 150 Policy initial standard deviation 2:0 Entropy regularization Coefficient 0.002 Value function Linear feature baseline, fit with observations and returns Table 3.9: Hyperparameters for Multi-Task PPO experiments on Meta-World 3.11.6 Multi-Task TRPO Hyperparameter Value Batch size Number of tasks 10 150 Policy initial standard deviation 2:0 Step size 110 −2 Value function Linear feature baseline, fit with observations and returns Table 3.10: Hyperparameters for Multi-Task TRPO experiments on Meta-World 3.11.7 Task Embeddings Hyperparameter Value Non-linearity tanh Batch size Number of tasks 10 150 Latent dimension 6 Inference window length 20 Embedding maximum standard deviation 2:0 Value function Gaussian MLP, fit with observations, latent variables and returns Table 3.11: Hyperparameters for Task Embeddings experiments on Meta-World 76 3.11.8 PEARL Hyperparameter Value Policy learning rate 310 −4 Q-function learning rate 310 −4 Discount factor 0:99 Horizon 150 Ratio of environment and gradient steps per iteration 22;500 environment steps per 4;000 training steps KL-divergence loss coefficient 0:1 Non-linearity ReLU Table 3.12: Hyperparameters for PEARL experiments on Meta-World 3.11.9 RL 2 Hyperparameter Value Non-linearity tanh Policy initialization Standard Gaussian Policy initial standard deviation 2:0 Value function Linear fit, with polynomial features (n = 2) of observation and time step Meta-batch size 40 Number of roll-outs per task 10 Horizon 150 Optimizer Adam Learning rate 110 −3 Discount factor 0:99 Batch size Number of tasks 10 150 Table 3.13: Hyperparameters for RL 2 experiments on Meta-World 77 3.11.10 MAML Hyperparameter Value Non-linearity tanh Meta-batch size 20 Number of roll-outs per task 10 Inner gradient step learning rate 510 −2 Discount factor 0:99 Table 3.14: Hyperparameters for MAML experiments on Meta-World 78 Chapter 4 Garage: Reproducibility, Stability, and Scale for Experimental Meta- and Multi-Task Reinforcement Learning Meta- and multi-task reinforcement learning (meta-RL and MTRL) show great promise for achieving long- term research goals like continual learning, and real-world applications like robotics, by making adaptation a first-class capability of intelligent agents. Strong quantitative evaluation is key for realizing these methods’ potential. However, establishing the significance of quantitative results in RL is difficult, and this is com- pounded by the additional complexity of meta-RL and MTRL. To tackle this complexity, this chapter ana- lyzes several design decisions each author must make when they implement a meta-RL or MTRL algorithm, and show that small details make a big difference. This chapter uses over 500 experiments, with a set of meta- and MTRL algorithm implementations with consistent hyperparameter definitions and a unified eval- uation procedure, to (1) show that seemingly-small implementation details can create statistically-significant variations in a single algorithm’s performance that exceed the performance differences between algorithms themselves, and (2) highlight important algorithm design challenges which are easily overlooked, and show- case open research problems which are unique to meta- and MTRL. 
To promote strong empirical research in meta- and MTRL, we share with the community an open source package of these unified reference implementations, which use consistent hyperparameters and evaluation procedures, achieve state-of-the-art performance, and seek to follow the works introducing the algorithms as closely as possible.

4.1 Introduction

The rapidly-emerging sub-fields of meta- and multi-task reinforcement learning (meta-RL and MTRL, respectively) have tremendous potential to help machine learning achieve long-standing goals, such as robotic agents which live, serve, and learn every day in the real world. These methods seek to capitalize on the success of single-task RL at many individual tasks, such as playing static video games [17, 175], or controlling a robot to manipulate known objects in a known environment [146], and extend it to diverse and ever-changing environments. Both sub-fields focus on learning shared structure among tasks, and leveraging it to succeed at new tasks. Meta-RL methods focus on rapidly learning new skills at test time given a small amount of labeled experience, while MTRL methods aim to accelerate learning of many skills by learning them simultaneously. As they are so new, these sub-fields do not yet benefit from rigorous definitions of performance and correct implementation. Establishing these definitions over time creates a shared understanding of the state of the art, and allows the research community to achieve new and unforeseen milestones by building on each other's work.

Ensuring that empirical results are reproducible is central to creating this shared understanding. This topic was summarized by [101] for single-task RL by asking the question, "[How can we] ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted?" Consistently implementing and measuring progress for meta- and MTRL presents numerous challenges and complexities beyond those that affect single-task RL: they require defining a training and evaluation procedure over multiple tasks, which often have disparate specifications (such as reward magnitudes); they are multi-objective optimization problems, and therefore not all metrics used in single-task RL can be directly applied to meta- and MTRL; and these complex algorithms have complex implementations, with many design decisions which are unique to meta- and MTRL. We show in Section 4.4 that these implementation decisions matter, and can create performance differences for a single method which statistically explain between 30 and 100% of the observed performance difference between state-of-the-art algorithms.
The fields of supervised learning and single-task RL have achieved continued suc- cess and real-world impact by adopting three key pillars of empirical research: standardized challenges (benchmarks) [218, 22, 272, 255], well-defined performance measurements on those challenges [191], and widely-shared reference implementations, which exhibit repeatable performance measurements on these benchmarks [50, 21, 269]. For meta- and MTRL progress to continue past initial success stories, it is essen- tial to quantitatively measure how much each method’s success can be attributed to each individual design decision, how much can be attributed to each individual implementation decision, and how much can be attributed to a particular choice of experiment. These measurements help our community identify the most essential research challenges that need to be solved next. This chapter contributes to this process by example, and gives a window onto several algorithm design, implementation, and evaluation challenges in meta-RL and MTRL. It draws this picture using over 500 experiments, performed on 7 implementations of state-of-the-art meta- and MTRL algorithms. Section 4.4.1 uses 10 random seeds to demonstrate the substantial seed sensitivity of these algorithms. Section 4.4.2 then uses statistically-tested ablation experiments to carefully control for this seed sensitivity, and demonstrate that meta- and MTRL algorithms have many “small choices with large effects, ”, which when applied to a single algorithm can be attributed 30-100% of observed performance differences between algorithms. Knowing which of these details matters gives us insights into some of the challenges ahead of meta-RL and MTRL, which Section 4.4 discusses inline, and Section 4.5 elaborates on. In addition to this chapter, we contribute to the community a set of rigorously-tested, open source reference implementations of all the above algorithms, experiments, and evaluation procedures, and a well-defined community process 81 for contributing new reference implementations to the set. For comprehensive experimental results see Sections 4.7 and 4.8, and for code see the project’s GitHub page. * 4.2 Algorithms and Approach This chapter uses a few of the most frequently cited meta- and multi-task RL algorithms to use for our experiments. In particular, for meta-RL it chooses RL 2 and MAML because of their age and high relevance within the field [56, 66]. It also chooses PEARL, because we wanted to highlight how similar changes can affect on-policy and off-policy meta-RL differently [208]. For multi-task RL, this chapter uses the same algorithms as analyzed in [292]. However, this focuses on MT-SAC in particular, since it performed by far the best in our experiments. When comparing two algorithms, this chapter typically chooses pairs with as much similarity as possible. This chapter uses the Meta-World benchmark [292], which is a suite of 50 simulated robotic arm tasks with shared problem structure and internal variations, arranged into three levels of difficulty, and designed to assess the ability of meta-RL and multi-task RL algorithms to exploit shared structure between tasks. This chapter focuses on this benchmark because existing algorithms perform very well on most other benchmarks, which makes it difficult to analyze performance differences between methods. 4.3 Garage: An Expressive Software Library for Reproducible RL Research This section discusses the design of the Garage reinforcement learning library. 
It first motivates our rea- sons for developing the library, and the design philosophy we use throughout its implementation. It then * For code, seehttps://github.com/rlworkgroup/garage/ 82 proceeds to a quick overview of Garage’s design, and follows with detailed explanations of some important components. Later sections will illustrate by example the quantitative, one-to-one comparisons between algorithms and design choices Garage’s design enables. These comparisons provide insights into designing meta-RL and MTRL algorithms and benchmarks. Such aggressively empirical research is impractical without a system like Garage. 4.3.1 Motivation, Audience, and Scope As modern reinforcement learning has grown as a research field, the methods explored by RL researchers have grown in complexity. Modern RL, IL, and robot learning methods make use of diverse sources of data (e.g. demonstrations, on-policy experience, off-policy experience, third-person observations), model agents with increasingly-complex function approximators (e.g. ensemble methods, multi-headed architec- tures, networks-of-networks, hypernetworks), learn with many optimization methods (e.g. SGD, LBFGS, trajectory optimization, MPC) in a single algorithm, and target myriad environments (e.g. board games, computer games, robotic manipulation, power system optimization) and problem settings (e.g. on-policy RL, off-policy RL, imitation learning, multi-task RL, meta-RL, offline RL, continual learning). With this explosion in methodological complexity comes an explosion in the complexity of software implementing these methods, posing great risks to reproducibility for the field as a whole, and productivity for individual researchers. Many RL methods are developed using an integrationist perspective, in which researchers modify and combine previously-successful designs into new, more adept agents. This admits a software design which emphasizes modularity, composability, and legibility. Such a design would phrase an RL algorithm as a combination of verifiably-correct and reusable components. It would enable a high level of abstraction, ideally allowing a researcher to compare two methods side by side, and easily spot commonalities and 83 differences in their approaches. It would it use consistent interfaces, both to allow combining algorithm building blocks in unexpected ways, and to clearly isolate changes to the building blocks themselves from changes to the overall system. Garage is a working hypothesis for how to build such a system. Its goal is to promote high-quality empirical research and the free exchange of ideas, by allowing researchers to easily define, modify, dissem- inate, and reproduce RL experiment workflows. Audience and Scope Explicitly choosing the audience for a software library is important, because identi- fying the user is the first step to practicing empathy for the user, and helps us focus on what is most important for that user. Garage is a tool built for RL researchers, by RL researchers. It prioritizes researcher productiv- ity on a single machine first, and assumes the expertise level and computing resources of a typical PhD-level RL researcher. Notably, Garage is not designed to be a model system for teaching or learning about RL as a novice, it is not designed for commercial use cases where integration with very large code bases is important, and it is not designed for large-scale distributed training across machine clusters. 
While we do not specifically forestall these use cases in the design, we gladly sacrifice them in pursuit of maximizing researcher productivity on a single workstation, where the bulk of RL algorithm experimentation happens today. 4.3.2 Design Principles Much of Garage’s design is a direct application of the widely-used SOLID [166] design principles to the unique problem domain of a reinforcement learning library. Additionally, as a community-oriented open source project in the Python ecosystem, Garage strives to be a good citizen of that ecosystem by following the Python community’s design ethos, as summarized in “PEP 20: The Zen of Python” [198]. Besides design standards, Garage attempts to follow Python community concensus wherever possible, such as style guides, which versions of Python to support, packaging release and versioning practices, etc. Rather than 84 recapitulating SOLID and Python design philosophies here, we will focus on those design decisions which are completely addressed by neither, and are instead artifacts of the unique RL problem domain and research audience which Garage serves. The following list of aphorisms summarizes Garage’s design philosophy, in no particular order. • Empathy for the user is the most important feature Empathy for the user is the hallmark of a successful software library. If Garage is to be successful, it must be adopted by researchers. To be adopted by researchers, it must be designed with researchers problems, stresses, and fears in mind. The maintainers must be responsive to the researchers’ needs, or else the researchers will look elsewhere. Therefore, to design and maintain Garage well, we have to put ourselves in the users’ shoes, and scrutinize the library through their eyes. • Everything is optional Engineering research is in the business of trying new and unexpected things. If Garage were designed perfectly, but inflexibly, for our understanding of RL today, it would be of little use to RL researchers who are designing the learning systems of tomorrow. As such, Garage is designed as a toolkit of interlocking parts which have little knowledge of each other (so-called “loosely coupled” [166]). This allows researchers to pick and choose which parts of the toolkit are useful for their research. Rather than forcing a researcher to commit to phrasing their project as a variant-of or addition-to the main Garage code base—as is the norm for most other RL research libraries—Garage encourages the researcher to own create her own code base, import Garage as a library, and mix and match other libraries as she sees fit. It strives to be a worthy addition to her research code, and a good neighbor to other libraries she may need. Not only does this design principle encourage greater adoption by avoiding researchers’ well-warranted fears of lock-in, it increases productivity. It allows researchers to quickly swap out “stock” Garage components with their own variant, either because that component 85 is the object of their research, or because the Garage version has some limitation they would like to overcome. • To the greatest extent possible, flow control should remain with the user Taking flow control away from the user is one of the great temptations for any complex software library. It is a particularly common affliction of libraries like Garage, which are used to express long- running computations, involving multiple elements which require coordination in time, and some- times access exclusive resources such as GPUs. 
Programs written with these libraries are very tightly coupled to the library and, in a sense, no longer written in the host programming language. They instead are written in the constructs of the library, with the host programming language as a substrate. It makes these libraries easier for the author to write, by drastically limiting the time and manner in which APIs are called to a fixed set of scenarios, pre-imagined by the author. However, this simplic- ity comes at great expense to the user’s flexibility. It removes the author’s incentive to design the system’s components for independent use in unexpected ways, and makes it difficult to for the au- thor incorporate support for new ideas which might require more complex flow control than a simple training pipeline, e.g. ensembles, evolutionary methods, or multi-agent RL. What’s more, it makes it impossible for the user to implement many new ideas herself without modifying the library. Garage allows researchers to write their RL algorithms and experiments in Python, not in Garage, so flow control always stays with the user. An excellent example of this principle in action is the design of experiment launchers and the @wrap_experiment decorator, which allows users to express any RL experiment as a simple Python function, to be called in the same manner as any other function. For instance, this allows a hyperparameter sweep to be phrased as a for loop which calls an experiment function multiple times, with different hyperparameters each time. • Do not stand between the user and her numerical library 86 Garage is a library for RL research, which in its modern form often implies training neural networks and other advanced numerical techniques. Learning to use a numerical library well is a substantial time investment for most researchers, for which they enjoy little immediate career benefit. Garage provides extensive support for using three of the most popular modern numerical libraries, in the form of ready-to-use components, helper functions, examples, and whole baseline algorithm libraries implemented using them. However, at no time does it hide those APIs behind its own, or substitute a Garage API when one already exists in the numerical library. When Garage does add a numerical tool, such as a custom neural network module, optimizer, probability distribution, or helper function, it does so by formally implementing the existing APIs provided by the numerical library, rather than by inventing its own. This forces Garage to stick to what it is best at, which is RL. It allows users to take advantage of their existing expertise with numerical libraries to be productive using Garage, rather than forcing them to learn a new one. • Code is for reading Garage is a library for expert users, whose use for the software likely involves extending it, but are not experts on Garage itself. Their starting point for extending a component will likely be reading Garage’s own implementation of that component. This means it is imperative for Garage code to be easy to read and understand, and to follow as straightforwardly as possible from its origins as an abstract algorithm. In addition to the benefits to the user, writing readable code forces the authors to organize their implementations in a conventional and easy-to-follow style, making the software easier to modify and maintain in the future. 
• Don’t confuse performance with productivity Every library must choose a point within the triad trade-off between raw performance, simplicity of implementation, and expressivity for users. Garage chooses to prioritize expressivity for users 87 first, simplicity of implementation second, and raw performance third. This is not to say that Garage has low performance—indeed, it usually achieves higher throughout than most other non-distributed RL libraries—but instead is a statement of priorities. We implement algorithms and components in their most reusable and easy-to-understand forms first. Only then do we address performance bottlenecks. Usually, this takes the form of providing optional, more complex versions of default components. These exhibit higher performance, but often at the cost of increased complexity and lower legibility in the implementation, and more complex configuration and limitations on their use. An excellent example of this pattern can be seen in the garage.sampler package, which provides environment sampling facilities at every level of performance and complexity, from a glorified for loop, to sampling across multiple processes on a single machine, to sampling across a distributed computing cluster, all behind a single unified API. • If it’s not reproducible, it didn’t happen; if it’s not tested, it doesn’t work One of Garage’s main motivations is to improve the reproducibility in RL research. This is only possi- ble if Garage itself provides reproducible algorithm implementations, and only possible if researchers trust and use Garage in their research. With an eye towards these imperatives, Garage maintains high software testing and quality standards for all contributions, because software which breaks frequently will not be trusted by the community. All contributions are tested against the full test suite before they are committed, and must be accompanied by tests which cover 95% of the code they introduce. Likewise, algorithmic contributions to Garage are heavily scrutinized. They are only committed to the library once the contributor has provided evidence to maintainers that implementation meets or exceeds the performance of the best existing implementation we can find, does so reproducibly, and is faithful to the original work introducing the method. • Logging should be easy, reliable, and customizable 88 Garage is a library for defining and executing experiments with RL systems. The purpose of those experiments is usually to collect measurements of performance and other algorithm properties, so the library has not served its purpose until those measurements are stored and displayed to the user. In other words, what can often be an afterthought in other software is a primary design goal in Garage. Logging virtually anything should be easy for users, and they should have confidence that nothing they log will ever get lost, which would make their time and compute investment in an experiment all for nought. Additionally, as we cannot anticipate how RL research will evolve in the future, we cannot anticipate what and to where users will want to log in the future, so logging facilities should best just as customizable as the rest of Garage. • Debugging is a primary use case Debugging is an integral part of writing software, and debugging software written with Garage should be as painless as possible. Users should be able to stop the program at any time, and trace its entire execution step-by-step, without having to cross process or machine boundaries. 
When Garage has no choice but to raise an error, it should tell the user in excruciating detail what went wrong and how. If possible, it should provide her with hints on how to fix it. This allows researchers to fix problems and get back to experimentation as quickly as possible. Garage achieves this through extensive checking of arguments for common pitfalls, and by ensuring that there is a simple, single-process version of all components available for debugging.

• Make correctness easy

If Garage produces a measurement of algorithm performance, it should be correct, even if the algorithm itself has a bug. Specifically, it should be difficult to accidentally forget to count environment samples used in training, or to compute a performance metric of a policy inconsistently. Garage ensures this by using environment samples consistently as the default x-axis for reporting metrics, by automatically counting and logging environment samples for the user, and by providing policy evaluation modules which always compute and log performance metrics in exactly the same way.

• Stand on the shoulders of giants

A great software library is good at one and only one thing. For Garage, that one thing is defining and executing RL experiments. This must be balanced with the goal of using as few and as high-quality dependencies as possible. We follow the guiding principle that where a high-quality solution to a problem already exists, Garage tries to use it before inventing its own. This principle applies recursively to adopting dependencies: if a solution is available in the Python Standard Library, we use it before a third-party package; if a solution is available in an existing dependency, we use that before introducing a new dependency; if no high-quality dependency provides a solution to our problem, only then do we implement our own solution, and only after asking ourselves again whether this is a problem which is really worth solving. This decision rule keeps Garage focused on solving problems for which our users do not already have solutions. Examples of this design principle in action include Garage's use of TensorBoard [167] for monitoring experiments, and its use of the pickle package from the Python Standard Library for storing algorithm state, which allows Garage to implement pausing and resuming of experiments.

4.3.3 Design Overview

Figure 4.1: Architecture overview diagram for the Garage reinforcement learning library. Samples are collected at the top of the diagram, and are gradually transformed by the algorithm and Garage into policies and evaluation metrics, which are stored and displayed for the user at the bottom of the diagram.

At its core, Garage is a machine for turning algorithm and experiment designs into policies and evaluation metrics. To tame complexity, the design partitions this machine into prototypical modules with a single responsibility, whose interactions and data flow among each other are well-defined by APIs.
For instance, there is a single Sampler API, but several implementations of that API. See Figure 4.1 for a graphical overview of the architecture.

The Experiment Launcher

#!/usr/bin/env python3
"""An example to train TD3 algorithm on InvertedDoublePendulum using PyTorch."""
import torch
from torch.nn import functional as F

from garage import wrap_experiment
from garage.envs import GymEnv, normalize
from garage.experiment.deterministic import set_seed
from garage.np.exploration_policies import AddGaussianNoise
from garage.np.policies import UniformRandomPolicy
from garage.replay_buffer import PathBuffer
from garage.sampler import FragmentWorker, LocalSampler
from garage.torch import prefer_gpu
from garage.torch.algos import TD3
from garage.torch.policies import DeterministicMLPPolicy
from garage.torch.q_functions import ContinuousMLPQFunction
from garage.trainer import Trainer


@wrap_experiment(snapshot_mode='none')
def td3_pendulum(ctxt=None, seed=1, n_epochs=750, steps_per_epoch=40,
                 sampler_batch_size=100):
    set_seed(seed)
    num_timesteps = n_epochs * steps_per_epoch * sampler_batch_size
    trainer = Trainer(ctxt)
    env = normalize(GymEnv('InvertedDoublePendulum-v2'))

    policy = DeterministicMLPPolicy(env_spec=env.spec,
                                    hidden_sizes=[256, 256],
                                    hidden_nonlinearity=F.relu,
                                    output_nonlinearity=torch.tanh)
    exploration_policy = AddGaussianNoise(env.spec,
                                          policy,
                                          total_timesteps=num_timesteps,
                                          max_sigma=0.1,
                                          min_sigma=0.1)
    uniform_random_policy = UniformRandomPolicy(env.spec)
    qf1 = ContinuousMLPQFunction(env_spec=env.spec,
                                 hidden_sizes=[256, 256],
                                 hidden_nonlinearity=F.relu)
    qf2 = ContinuousMLPQFunction(env_spec=env.spec,
                                 hidden_sizes=[256, 256],
                                 hidden_nonlinearity=F.relu)
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
    sampler = LocalSampler(agents=exploration_policy,
                           envs=env,
                           max_episode_length=env.spec.max_episode_length,
                           worker_class=FragmentWorker)

    td3 = TD3(env_spec=env.spec,
              policy=policy,
              qf1=qf1,
              qf2=qf2,
              replay_buffer=replay_buffer,
              sampler=sampler,
              policy_optimizer=torch.optim.Adam,
              qf_optimizer=torch.optim.Adam,
              exploration_policy=exploration_policy,
              uniform_random_policy=uniform_random_policy,
              target_update_tau=0.005,
              discount=0.99,
              policy_noise_clip=0.5,
              policy_noise=0.2,
              policy_lr=1e-3,
              qf_lr=1e-3,
              steps_per_epoch=steps_per_epoch,
              start_steps=1000,
              grad_steps_per_env_step=1,
              min_buffer_size=int(1e4),
              buffer_batch_size=100)

    prefer_gpu()
    td3.to()
    trainer.setup(algo=td3, env=env)
    trainer.train(n_epochs=n_epochs, batch_size=sampler_batch_size)


td3_pendulum()

Listing 4.1: Example of a Garage experiment launcher, written in Python. This experiment uses the TD3 algorithm to train a policy which solves the InvertedDoublePendulum-v2 environment from the Open AI Gym benchmark suite.

The experiment launcher is the main interface between Garage and a user, and exemplifies the readable, composable, and flexible design philosophy of the library. See Listing 4.1 for an example, which uses TD3 [74] to train a policy to solve an inverted double pendulum environment from Open AI Gym [22]. In Figure 4.1, the experiment launcher is portrayed as the outer frame. Observe that the launcher itself is a fully-executable Python script, and that the experiment definition (i.e. td3_pendulum) is a simple Python function, which is called at the final line of the script to execute the experiment. Keeping configuration and execution of experiments as simple Python constructs gives the user incredible flexibility in how to define and execute her experiments.
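One payoff of keeping launchers as plain Python is that workflows which usually require special library support reduce to ordinary code. For example, the hyperparameter sweep mentioned in Section 4.3.2 can be phrased as a simple for loop over the launcher above. The snippet below is a minimal sketch rather than part of Garage itself: the seed and batch-size values are illustrative, and it assumes Listing 4.1 has been saved as a module named td3_pendulum.py with its trailing td3_pendulum() call removed, so that importing it does not start a run.

# Sketch: a seed and batch-size sweep written as a plain Python for loop.
# Assumes the launcher from Listing 4.1 is importable from td3_pendulum.py
# without auto-running; all sweep values below are illustrative.
from td3_pendulum import td3_pendulum

for seed in (1, 2, 3):
    for batch_size in (100, 250):
        # Each call runs one complete experiment. @wrap_experiment gives every
        # run its own labeled output directory and records the arguments used,
        # so the loop needs no extra bookkeeping.
        td3_pendulum(seed=seed, n_epochs=50, sampler_batch_size=batch_size)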
For instance, she can parameterize an experiment simply by adding function parameters (as is the case in this example), she can export experiments as functions in a larger library, and she can even define functions which themselves return experiment functions. Inside the experiment function, users configure the system (e.g. set_seed() sets the random seed of all random number generators in the system), construct to their specification the environments, samplers, and algorithm building blocks their experiment 93 will use, and construct an algorithm instance which will use those building blocks and hyperparameters. We will discuss the purpose of the function decorator @wrap_experiment and Trainer class later in this section. Environment Interface Garage’s Environment interface closely mirrors the classic MDP formulation of reinforcement learning introduced by Sutton and Barto (Figure 4.1, top and lower-right). It very similar to the gym.Env interface from the popular Open AI Gym benchmark suite [22]. garage.Environment adds a few restrictions and changes to the Gym version, most notably: instances of garage.Environment must be serializable by the built-in pickle module from the Python Standard Library, to ensure Garage may distribute environment sampling across processes or machines; unlike Gym, it enforces a formal distinction in its step data type between final transitions resulting from entering a terminal state, and those which are final because the agent has run for the maximum number of transitions allowed by the system, because this has caused reproducibility problems for many off-policy algorithms in the past; and it distinguishes between render()ing an environment and visualize()ing one, where for former generates data for algorithm consumption, and the latter launches a human-readable visualization of the environment. Sampler Interface The purpose of thegarage.sampler module is to turn policies and environments into experience samples from which algorithms will learn policies (Figure 4.1, top). There is a single Sampler API, and several sampler implementations with different trade-offs between complexity and performance. Internally, a Sampler is organized into one or more Workers, which actually execute the policy and envi- ronment to generate samples. These Workers may be local to the process and executed serially, they may be executed in parallel on remote processes on the same machine to make better use of resources, or they might be run on remote machines on a computing cluster. The outer Sampler module is responsible for allocating, initializing, and de-allocating workers, communicating updates to the policy and environment to 94 Workers, requesting samples from Workers, and collating them into unified batches for presentation to the caller, which is typically an algorithm. Composable Algorithm Building Blocks Garage provides a large array of algorithm building blocks to users, most notably in the form of pre-defined neural networks or other function approximators for modeling common components from RL and IL, such as policies, Q-functions, value functions, encoders/decoders, and symbolic distributions (Figure 4.1, in red). 
These building blocks are implemented using the first-party APIs of one of the three numerical libraries which Garage supports, and typically include variants which use popular neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), multi-layer perceptrons (MLPs), and stochastic function approximators modeled as a symbolic distribution (e.g. Gaussian, Categorical, etc.) parameterized by one or more of the aforemen- tioned neural network architectures. Garage implements algorithms which are agnostic to the internals of these building blocks. For instance, Garage’s implementation of PPO [233] can be switched from using an MLP-based policy to an RNN-based policy simply by changing the policy instance passed to PPO in the experiment launcher from an MLP-based building block to an RNN-based one. This composition pattern applies similarly to non-neural network based building blocks used by algorithms, such as replay buffers, ex- ploration strategies, and task samplers: algorithms depend only on a small, well-defined interface exported by these building blocks, and users may pass any object which implements that interface to the algorithm when they construct the algorithm object in the experiment launcher. Experiment Management, Logging, Plotting, and Snapshotting Most of Garage’s most researcher- friendly features work silently below the surface. The experiment manager, which is accessed succinctly by decorating any experiment function with @wrap_experiment creates a unique labeled directory for the experiment’s output files, configures the 95 snapshotting and logging modules, and automatically saves a record of hyperparameters which were passed to the experiment. These features help ensure data and important experiment parameters are never lost. The Logger module provides a simple API which allows users to log any key-value or text data once using a single line of code (e.g. logger.log(‘ExampleKey‘, 10.0)) and output that data to numerous relevant formats, such as files on disk, messages and tables printed to the console, and metric plotting tools such as TensorBoard. Where relevant, metrics are always logged and plotted automatically along with key independent variables such as number of gradient steps applied and total number of environment steps sampled, helping ensure correctness of measurements without additional user effort. The Snapshotter is a small module with an important purpose: at the conclusion of each step of training, it saves the state of the entire algorithm and building blocks to disk. This allows Garage to later resume experiments which are interrupted by machine failures, user error, or intentionally. It also allows users to load and inspect the state of the algorithm and policies at every step of training after-the-fact. The Trainer module coordinates the flow control between Garage and algorithms, allowing Garage to expose important features to algorithms with minimal interruption to the user’s flow control. The Trainer is responsible for saving snapshots of algorithms, displaying a graphical visualisation of policy behavior (if requested), and tracking important global measurement state such as number of optimization epochs and the total number of environment steps sampled. It is exposed to the algorithm in the form of a function which the algorithm calls once per training epoch, which hands flow control to Garage briefly to perform these logging and supervisory functions. 
Importantly, Garage only receives this flow control when the user wants it to. Baseline Algorithms The library provides a large and growing collection of 22 baseline algorithm imple- mentations for users to experiment with. These are implemented using constructs from Garage itself, with no special access to library state or components. That is, they function both as features of the library, and as 96 examples to users for how to implement custom algorithms in their own code using the Garage library. Al- gorithms and their implementations must meet a high bar for inclusion into the library: algorithms must have an authoritative description published in a peer-reviewed venue, and must have received substantial adoption in a research community. Implementations must follow Garage’s design philosophy, style, and code qual- ity standards, reflect the source literature on that algorithm as closely as possible, and must reproducibly meet or exceed the performance (on relevant benchmarks) of the best publicly-available implementation the maintainers can find. These standards keep Garage’s library of algorithms relevant and high-quality. Testing and Continuous Integration To maintain high quality and speed development, Garage maintains an extensive unit and integration testing suite which covers 91% of the lines of code in the library, i.e. virtually all semantically-important lines of code. This suite tests all components both in isolation, and under many combinations, to ensure that the library remains functional and correct as new contributions are added and dependencies are updated. All contributions (phrased as “Pull Requests” to the GitHub project which hosts Garage † ) are auto- matically tested by Garage’s continuous integration (CI) system when they are proposed. In addition to running the full test suite, this system checks contributions for conformance with Garage’s code style and design standards, and ensures that new code is accompanied by tests which cover 95% of its lines. The system blocks contributions which break tests, or are non-conforming, from merging into the library’s main development repository until authors fix them. This same system is used to automatically merge contribu- tions which pass the CI and have received approval from Garage maintainers, to automatically generate the project’s documentation website ‡ , and to post new releases of the code to the Python Package Index for download by users § . † https://github.com/rlworkgroup/garage ‡ https://garage.readthedocs.io/ § https://pypi.org/project/garage/ 97 In addition to unit and integration tests, Garage maintains a set of benchmark experiments ¶ for each algorithm, which reproduce that algorithm’s performance against baselines from the literature and other software libraries. This allows Garage developers and users to verify that changes to the library do not compromise the performance of the included algorithm implementations. This high degree of automation is essential to Garage’s technical strategy. It allows Garage to integrate the changes of many contributors at once (most Garage releases include hundreds of changes authored by 15-20 unique contributors) while maintaining a high quality bar, and avoids burdening maintainers with large amounts of unproductive toil coordinating and testing changes. 
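To make this testing bar concrete, the test accompanying a new algorithm or example can be as simple as a smoke test which runs a launcher for a couple of tiny epochs and fails if anything raises. The sketch below is illustrative only and is not taken from Garage's actual test suite: the test name and the reduced settings are ours, and it again assumes the launcher from Listing 4.1 is importable without auto-running.

# Sketch of a minimal pytest-style smoke test for an experiment launcher.
# The test name and tiny settings are illustrative, not from Garage's suite;
# it assumes td3_pendulum from Listing 4.1 is importable without auto-running.
from td3_pendulum import td3_pendulum


def test_td3_pendulum_smoke():
    # A couple of very small epochs exercise sampling, optimization, logging,
    # and snapshotting end-to-end; any uncaught exception fails the test.
    td3_pendulum(seed=1, n_epochs=2, steps_per_epoch=2, sampler_batch_size=100)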
4.3.4 Bringing Garage to the World: Community-Building and Documentation

Figure 4.2: GitHub "stars" engagement over time for the Garage reinforcement learning library, compared with deepmind/acme, IntelLabs/coach, tensorflow/agents, astooke/rlpyt, chainer/chainerrl, and SurrealAI/surreal. Garage garners comparable levels of engagement and growth to other popular reinforcement learning libraries of similar scope, all but two of which are supported with a full-time development team and marketing from a major corporation. Garage's community-led development model allows it to grow steadily and organically, primarily through word-of-mouth and citations.

¶ https://github.com/rlworkgroup/garage/tree/master/benchmarks

Community Garage is a community-oriented project which is developed by researchers for researchers. Unlike most other popular RL libraries of similar quality and scope [77, 82, 75, 247, 63], it does not receive full-time developers, marketing support, or a brand halo from a major corporation. Garage instead relies on the generosity of its users, primarily researchers and academic research labs, who donate their time and resources to its development, in recognition of how Garage benefits their work. We believe this community-oriented governance and development approach is the most sustainable model for Garage's future. So far it has been successful, as is demonstrated by Garage's large and steadily-growing adoption on GitHub (Figure 4.2), which keeps pace with projects with devoted marketing and development resources. A community-led model means that the research community can evolve the software along with the research field, ensuring it will remain relevant to contemporary RL research.

As community-building is so central to Garage's mission and survival, the Garage project focuses on cultivating community in as many ways as possible. The software is developed in public on GitHub, including planning tools such as issue trackers and feature road maps. Any user in the world can propose a code change or report an issue there, and receive a response from the Garage development team. Additionally, Garage maintains an announcement mailing list ||, and a public chat server **, where users can seek support from each other in using the software, and communicate with maintainers about more difficult issues.

Documentation Documentation is key to building community around an open source project, so Garage places great emphasis on maintaining high-quality documentation of all of the project's features (Figure 4.3). This documentation is developed in the same repository as the software, which allows users anywhere in the world to easily propose updates, corrections, and additions. Like other parts of Garage, the documentation website is immediately updated by the project's automation software when new changes are committed to the software repository, ensuring users always have the most up-to-date documentation on the project.

|| https://groups.google.com/g/garage-announce
** https://rlworkgroup-discuss.slack.com

Figure 4.3: Screenshot of Garage documentation website (retrieved April 2021).

Adoption Estimates As of April 2021, Garage is followed ("starred") by ~1,200 users, has been copied for modification ("forked") by 220 users, and is a dependency to 17 other projects on GitHub.
The project's chat server has 174 members, and its announcement mailing list has 179 members. It is mentioned in 34 publications indexed by Google Scholar ††. The project's GitHub page is visited by ~1,000 unique users per month, and its source code is downloaded by ~400 users per month. In addition to downloading the source code directly from GitHub, users may install Garage directly as a Python library into their own project, using tools included with the Python language itself. The Python Package Index, the central global repository for Python open source software, reports ‡‡ that Garage is downloaded by ~600 users per month using those tools. These download statistics imply that there are between approximately 500 and 1,000 active Garage users.

†† https://scholar.google.com
‡‡ https://pypistats.org/packages/garage

Figure 4.4: Top: Training success rate of 7 meta- and multi-task RL algorithms on the ML45 (RL²-PPO, MAML-TRPO, PEARL-SAC) and MT50 (MT-PPO, MT-TRPO, MT-SAC, TE-PPO) benchmarks. Bottom: Test success rate of the 3 meta-RL algorithms on the ML45 benchmark. Mean and 95% confidence interval over 10 random seeds.

4.4 Experiments

4.4.1 Random Seed Sensitivity of Meta- and Multi-Task RL

As it is essential to establishing the significance of other empirical results, the first property of algorithms this chapter studies is the consistency of their performance across random variations in the training and evaluation process, which are unrelated to any design or implementation decision. To measure this, we vary the global random seed used by the training process, and measure the resulting success rate on the ML45 and MT50 benchmarks [292], as presented in Figure 4.4. As can be seen from the plot, these are very challenging meta-RL and multi-task RL benchmarks: no method exceeds 50% performance during train or test.

The primary feature of these results, which is clearly visible in the figure, is the high variability in performance across seeds among all algorithms, due only to random process variation. We must carefully control for this variation when attempting to measure algorithm performance, and do this for all subsequent experiments in three ways: (1) we perform all experiments using at least 5 random seeds to establish confidence intervals for all measurements, (2) we use statistical tests (depicted using dotted lines) to verify the significance of these measurements despite these confidence intervals, and to establish that we have run a sufficient number of seeds to make a conclusion, and (3) we make extensive use of ablation experiments which give us confidence that our statistical tests are only measuring the effect of a single design or implementation decision.

We apply our statistical method (as described in the Appendix) to compare the performance of different algorithms on this benchmark. Although we find statistically significant differences between MT-SAC and the other multi-task algorithms (p = 0.001 for MT-PPO, p = 0.01 for MT-TRPO, and p = 0.0009 for TE-PPO), we do not find any other statistically-significant performance differences between the methods.
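The full procedure is described in the Appendix; at its core, comparing two methods reduces to a two-sample test over per-seed performance measurements. As a rough illustration only, the sketch below applies Welch's t-test from SciPy to made-up per-seed success rates for two hypothetical methods.

# Sketch: comparing two methods using per-seed success rates (%).
# The values below are illustrative placeholders, not experimental results.
import numpy as np
from scipy import stats

method_a = np.array([42.0, 38.5, 45.1, 40.2, 36.9, 44.0, 39.3, 41.8, 37.5, 43.2])
method_b = np.array([25.3, 28.1, 22.7, 30.4, 26.5, 24.9, 27.8, 23.6, 29.0, 25.1])

# Welch's t-test does not assume equal variance across the two seed samples.
t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=False)
print(f't = {t_stat:.2f}, p = {p_value:.4f}')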
4.4.2 Implementation Sensitivity in Meta- and Multi-Task RL: Small Details Make a Big Difference

This section uses ablation experiments to investigate a series of design and implementation choices which authors make when implementing meta- and multi-task RL algorithms.

This section shows that several minor details, most of which are not specified in the original works introducing these algorithms, can have a large impact on their relative performance. In particular, it shows that implementation decisions can create performance differences for a single algorithm which can statistically explain absolute performance differences of 10-40% with 95% confidence on the Meta-World benchmarks. This is larger than most performance gaps we observe between different algorithms, which range between 0-25%. This section also explains why the same decisions can have different impacts on different algorithms.

Figure 4.5: Reward normalization study of RL²-PPO success rates on ML10, with and without per-task reward normalization. Vertical lines indicate the best epoch overall for each experimental variant. Mean and 95% confidence interval over 5 random seeds. We find a statistically significant result that reward scaling improves the peak performance of RL²-PPO (p = 0.04%).

Figure 4.6: Reward normalization study of MT-SAC success rates on MT10, with and without per-task reward normalization. The right axis shows the Q loss of the reward normalization variant, which exhibits training instability.

4.4.2.1 Reward Normalization

Figure 4.5 shows the effects of naive per-task reward normalization on RL²-PPO and MT-SAC. We find a 75% increase in the performance of RL², a statistically significant difference (p = 0.04%).

In multi-task RL and meta-RL, differences in reward scale between tasks can cause the gradient scales of different tasks to differ significantly. This can lead the RL algorithm to "fixate" on solving a subset of the tasks which have larger rewards (and therefore correspondingly larger gradient magnitudes) to the detriment of others. Even if the absolute scales of achievable rewards are similar across tasks, during training the reward scales of different tasks are likely to differ, as the different tasks are learned at different rates [103]. In the Meta-World benchmarks [292], rewards on different tasks can differ by multiple orders of magnitude during training, which causes meta-RL and multi-task RL algorithms to fixate on a small subset of tasks.

A very straightforward method that might address this problem is to apply a rolling average normalization (as described in [58]) to the reward of each task independently. This is sufficient to nearly double the performance of RL², as shown in Figure 4.5. However, the performance of MT-SAC appears to become unstable with this change, and typically collapses to nearly complete failure.
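For concreteness, this kind of per-task normalization amounts to a small amount of running-statistics bookkeeping. The sketch below is a simplified stand-in which rescales each task's rewards by a running estimate of their standard deviation; the variant actually evaluated in this chapter follows [58] and is not necessarily identical.

# Sketch: naive per-task reward normalization using running statistics.
# Simplified for illustration; the variant evaluated in this chapter follows [58].
import numpy as np


class PerTaskRewardNormalizer:
    """Tracks a running mean and variance of rewards separately for each task."""

    def __init__(self, eps=1e-8):
        self._count, self._mean, self._m2 = {}, {}, {}
        self._eps = eps

    def normalize(self, task_id, reward):
        # Welford's online update of the per-task mean and variance.
        n = self._count.get(task_id, 0) + 1
        mean = self._mean.get(task_id, 0.0)
        m2 = self._m2.get(task_id, 0.0)
        delta = reward - mean
        mean += delta / n
        m2 += delta * (reward - mean)
        self._count[task_id], self._mean[task_id], self._m2[task_id] = n, mean, m2
        std = np.sqrt(m2 / n) if n > 1 else 1.0
        return reward / (std + self._eps)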
The instability of MT-SAC under this change highlights the need for new methods to address overfitting and the non-stationarity of rewards in MTRL, such as PopArt [103], which normalizes per-task gradient updates in the face of varying per-task rewards, and PCGrad [291], an optimizer which seeks to prevent gradients from one task from interfering with gradients from other tasks. This is a significant open research problem. The authors of both of these methods cite poor resource scaling properties, and degrading asymptotic performance in the face of large numbers of tasks, as challenges their methods could not overcome, suggesting the community has not yet found a scalable long-term solution to non-stationary and varying reward magnitudes in MTRL.

Although the exact cause of MT-SAC's instability with this change is unclear, one plausible explanation is as follows. When performance improves dramatically on a task, samples with very large rewards are placed in the replay buffer. Then, once the rolling average over the rewards adapts to that high reward, samples with the same states and actions but much lower rewards are placed into the replay buffer. This leads to the Q function catastrophically under-fitting, as can be seen by the Q loss increasing by over seven orders of magnitude after the initial successful period, as shown in Figure 4.6.

Figure 4.7: Ablation study of environment time limits on PEARL and MT-SAC success rates on ML10/MT10. By removing the terminal signal at the time limit, MT-SAC performance improves by 110% (p = 1.5%). Mean and 95% confidence interval over 5 random seeds; 20 experiments.

4.4.2.2 Time limits in off-policy meta- and multi-task RL algorithms

Figure 4.7 shows the effect of removing the time limit signal from two Meta-World benchmarks. In all Meta-World benchmarks, the time limit signal is conflated with the termination signal. Because off-policy algorithms rely on this signal in order to bootstrap terminal states correctly [175], this conflation makes it unnecessarily difficult for off-policy algorithms, such as MT-SAC and PEARL, to perform well in Meta-World benchmarks. Removing this conflation increases the performance of MT-SAC by 110% (p = 1.5%).

Given that PEARL uses SAC extensively as a subroutine, one might expect that this change would also improve the performance of PEARL, but it does not. This indicates that PEARL's performance is limited on this benchmark for different reasons than MT-SAC's. A likely culprit is the inference network, which is used in PEARL to infer a latent code that represents the task, and which is then used in a way similar to the task one-hot encoding in MT-SAC. Task inference is essential for performance in meta-RL [204]: if a meta-learner cannot successfully infer which task it is trying to achieve, it cannot possibly solve that task.

Figure 4.8: Visualization of latent codes learned by PEARL on ML10. This 2D embedding was generated from the 7D latent codes using t-SNE.
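A visualization like the one in Figure 4.8 can be produced with standard tools. The sketch below is illustrative only: the file names are hypothetical, and it assumes the per-episode latent means inferred by PEARL were logged as an array of shape (num_episodes, 7) together with matching task labels.

# Sketch: projecting PEARL's 7-D task latents to 2-D with t-SNE.
# The .npy file names are hypothetical stand-ins for logged meta-test data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

z = np.load('pearl_ml10_latents.npy')         # shape (num_episodes, 7)
task_names = np.load('pearl_ml10_tasks.npy')  # shape (num_episodes,)

z_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(z)

for task in np.unique(task_names):
    mask = task_names == task
    plt.scatter(z_2d[mask, 0], z_2d[mask, 1], s=4, label=str(task))
plt.legend(fontsize=6)
plt.title('t-SNE of inferred task latents')
plt.savefig('pearl_latents_tsne.png', dpi=200)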
Figure 4.8 shows a two-dimensional embedding, generated using t-SNE, of the seven-dimensional latent codes inferred by PEARL. Although some tasks are well separated by the embedding, several tasks are not separated, or appear as multiple modes in the embedding. This suggests that task inference is a bottleneck in meta-RL, particularly in a shared-structure task set such as the Meta-World benchmark, and that future research should focus as much on successful task inference as on raw asymptotic performance.

4.4.2.3 Maximum Entropy RL

Figure 4.9 evaluates the effect of maximum-entropy RL on the performance of MT-PPO and RL²-PPO. Applying our statistical method, we find that the added maximum entropy term impedes RL²-PPO from reaching the same peak performance on this benchmark (p = 0.6%). We also find no evidence that the maximum entropy term has affected MT-PPO on this benchmark.

Figure 4.9: Ablation study of MaxEnt on RL²-PPO and MT-PPO. We find statistically significant evidence that MaxEnt prevents RL²-PPO from reaching its peak performance (p = 0.6%). Horizontal lines indicate the mean performance measure for each experiment variant. Mean and 95% confidence interval over 5 random seeds.

PPO can solve all of these tasks in the single-task setting, so we believe the main difficulty in this setting is exploring as-yet unsolved tasks while maintaining high performance on the tasks solved so far. The most successful multi-task method this chapter studies, MT-SAC, is a maximum entropy RL method, so it might be expected that entropy maximization would be sufficient to overcome this difficulty. Unfortunately, this experiment shows that maximum entropy RL is not sufficient to gain the same performance as MT-SAC. Looking at the individual runs of RL²-PPO shows that most runs perform the same as without entropy maximization, although the minimum entropy of the policy never becomes as low. In a small number of runs, the entropy remains high throughout training, which prevents learning some of the relatively easy tasks. This is the cause of the decreased average performance.

Figure 4.10: Comparison of using TRPO or PPO as the "subroutine algorithm" for RL² and MAML. As can clearly be seen in the graph, the selection of subroutine algorithm has a significant effect on performance (p = 6.3 × 10⁻⁶ for RL² and p = 0.016% for MAML). Mean and 95% confidence interval over 5 random seeds.

4.4.2.4 Effect of the "Subroutine Algorithm"

Figure 4.10 compares the performance of MAML and RL² using different "subroutine algorithms." Most meta-RL and multi-task RL algorithms use a single-task RL algorithm as some form of "subroutine" (often for solving some meta-objective as a variant of single-task RL) [66, 56]. In some cases, the meta-RL algorithm does not require a specific single-task algorithm, allowing implementers to choose one of a family of single-task algorithms. For example, the original description of MAML [66] uses the classical policy gradient method to optimize a meta-objective. In practice it is typically replaced with TRPO [232], but in some cases (such as some comparisons in [153]), it is replaced with PPO.
Recent work has indicated that PPO and TRPO can perform very similarly [58], which might suggest that the choice between them will not significantly affect the behavior of meta-RL algorithms that use them as subroutines. This experiment tests the effects of using either PPO or TRPO in two meta-RL algorithms, MAML and RL², on the ML10 benchmark. We find that MAML performs significantly worse when using PPO (p = 0.016%). This confirms the findings of [215], who cite the low performance of MAML-PPO as one of the primary motivations for the creation of ProMP, which performs MAML-style meta updates without requiring constrained optimization. We also find that RL² performs much better with PPO (p = 6.3 × 10⁻⁶). Looking at the per-task performance gives some indication why. RL²-PPO improves performance on different tasks throughout the training process, at the cost of a (sometimes temporary) loss in performance on other tasks. In comparison, RL²-TRPO always spends the first 10 million timesteps improving significantly on a particular set of three tasks while completely ignoring the others. It spends the remaining 90 million timesteps slowly improving on only a single particularly easy task, without improving on any other task. It appears from this experiment that constrained optimization, as used in TRPO, prevents the temporary sacrifice in performance which is necessary to improve on additional tasks.

4.5 Analysis

4.5.1 Importance of Implementation Decisions in Meta- and MTRL

Figure 4.4 shows both the train and (where applicable) test performances of several meta- and MTRL algorithms across 10 random seeds. It is apparent from looking at the graph that many very different algorithms have similar performance (within 25% of each other, with largely overlapping confidence intervals) on the ML45/MT50 benchmarks. However, several of the experiments in Section 4.4 showed small, often one-line changes that can cause performance changes from 30% to 110% of the baseline on these benchmarks. This implies that making comparisons between these algorithms requires great care in implementation as well as experimentation, to discover potential changes caused by implementation differences.

Figure 4.11: Adaptation sample efficiency of MAML-TRPO, RL²-PPO, and PEARL on the ML45 benchmark. Mean and 95% confidence interval over 10 random seeds.

4.5.1.1 Transfer and Adaptation Sample Efficiency

One property we would like to measure is how well the meta-RL algorithms adapt to unseen tasks. In particular, we are interested in how many samples from the new task it takes to adapt. We evaluate this by varying the number of meta-test exploration trajectories. We evaluate every power of two up to 128 trajectories, as well as the performance without any adaptation. The resulting success rates with respect to adaptation experience are shown in Figure 4.11. We find no evidence that any meta-RL algorithm successfully adapts to tasks in this benchmark over any of a wide range of adaptation batch sizes. However, given that MAML's inner adaptation loop is itself an RL algorithm (either PPO or TRPO), which has been independently verified to be able to solve these tasks [292], it should eventually reach near 100% success on the new task.
Figure 4.12: Study of MAML-TRPO gains from adaptation on ML10 and ML45, using an adaptation batch size of 4800 samples. Mean and 95% confidence interval for 5 seeds.

4.5.2 Does MAML-TRPO actually adapt to shared-structure tasks?

Figure 4.12 shows the test-time performance of MAML-TRPO on ML10 and ML45 over 5 random seeds, before and after the adaptation step. That is, the "Pre-adaptation" success rates are for the un-adapted exploration policy. Our results suggest with high confidence that on the Meta-World benchmarks, MAML-TRPO does not actually adapt to new tasks, and in fact its performance degrades with adaptation in these environments. Rather, it learns an exploration policy which has high zero-shot performance on the task distribution. These results confirm some of the findings of [204], who find that MAML's effectiveness is largely attributable to learning representations which are reusable between tasks in the distribution (useful even for zero-shot performance), and not necessarily to rapidly adapting between members of the task distribution (few-shot adaptation).

4.5.3 Improving meta-RL and multi-task RL benchmarks

Designing benchmarks is an important area of research in an emerging field, and is often neglected. The results of the environment time limits ablation study (Fig. 4.7) suggest that benchmark authors must be careful not to conflate desired environment time limits with the MDP termination signal. Conflating them can make benchmarks less representative of true performance for off-policy algorithms such as MT-SAC, which rely on an accurate signal for MDP termination. Intra-task reward scales, as discussed in Section 4.4.2.1, are a very important benchmark design detail. As creating globally-normalized dense reward functions for many tasks is extremely difficult, these results suggest that future meta- and MTRL benchmarks should focus on dense energy-based rewards, which are easier to normalize, or sparse success-failure rewards, which are trivially normalized. More creative avenues, such as benchmarks which provide only demonstrations, should also be considered.

4.6 Background and Related Work

Meta- and multi-task RL: The meta- and multi-task RL research community has identified many promising approaches for creating adaptive agents which combine RL with techniques such as recurrent neural networks [56, 273], attention mechanisms [174], variational inference [208, 95, 225, 282, 86], second-order and constrained optimization [66, 215], adversarial methods [249], and transfer learning and distillation [259].

Reproducibility of deep reinforcement learning: While some works address the reproducibility problem directly [101], others study different aspects of the problem, such as determining appropriate statistical significance tests and norms for presenting RL results [37], or quantifying the sensitivity of RL algorithms to small implementation decisions, hyperparameters, and random seeds [181, 154, 8, 102, 111]. These decisions are often ambiguous given the formal definition of the algorithm, and are not uniformly applied across implementations. Henderson et al. [101] and Engstrom et al. [58] show that the performance differences created by these decisions can exceed the measured performance margins between single-task RL algorithms, and Deleu, Guiroy, and Hosseini [45] showed similar results on small challenges for a single meta-RL algorithm (MAML).
This chapter studies differences in implementation which are unique to meta- and MTRL by studying 7 algorithms, and shows that implementation decisions alone can explain between 30% and 100% of the measured performance differences between these algorithms.

Benchmarks for RL algorithms: The first pillar of reproducible RL is standardized challenges, and in RL that corresponds to environments. Following from the success of single-task RL benchmarks in many domains [13, 11, 124, 255, 22, 52, 187, 133, 23, 200, 283, 246], researchers have begun to propose a new generation of benchmarks which are specifically designed to evaluate meta- and multi-task RL methods. In addition to introducing the notion of a train-test split among the tasks, tactics used by these new benchmarks include challenging agents to adapt to new environments created with random variations on existing continuous control challenges [297], diverse multi-task robotic environments [114], procedural generation of game levels [35, 118], and principled variations on small tasks which are specifically designed to assess RL agent generalization in several ways [189].

Empirical performance studies of RL: Empirical studies that seek to characterize the baseline performance of model-free [55] and model-based [275] deep RL algorithms on a suite of simulated continuous control tasks [22], and of model-free RL algorithms on real robots [163], provide important goal posts for future proposed methods. This chapter provides a multi-seed performance comparison of meta- and MTRL algorithms on a standard benchmark, in hopes of establishing reproducible performance standards for the community.

Metrics and Evaluations for RL: The second pillar of reproducible RL is standardized performance evaluation procedures and metrics. Most works proposing new RL algorithms evaluate them by comparing their performance on benchmark tasks (i.e., average discounted return) as a function of total samples collected, and the choice of hyperparameters and random seeds for both the proposed and baseline methods is not specified by any shared standard [125]. Some authors have proposed new metrics and evaluation protocols in an effort to make measurements more consistent [265], or more predictive of other important aspects of performance, such as generalization and adaptation [189], safety and reliability [29], or performance in the real world [57]. Among the contributions of this chapter are precise definitions of important hyperparameters, and an evaluation procedure which is relevant to any meta- or multi-task RL method.

Reference implementations for RL: The third and final pillar of reproducible research is widely-disseminated reference implementations. The recent popularity of neural networks and deep RL set off a rapid proliferation of open source libraries for single-task RL [50, 149, 81, 63, 229, 132, 156, 75]. A few of these specifically state adherence to the original algorithm definitions as an explicit goal of their implementations [105, 28, 2], which Section 4.4 will show is an important property of reference implementations. While there exist open source implementations of individual meta-RL algorithms [215, 66, 208, 46, 44], no repository implements more than 3 algorithms, and each uses different design decisions and definitions of key hyperparameters. This chapter shows that this makes it challenging to obtain benchmark results which researchers can be confident are comparable (even if the individual implementations are strong).
To address this, we contribute an open source repository of 7 reference implementations with shared hyperparameter definitions and design decisions.

The most closely related work to this one is Yu et al. [292], which introduces the Meta-World benchmark for meta-RL and multi-task RL algorithms, which this chapter uses extensively. Whereas [292] present single-seed performance results to demonstrate the usefulness of the proposed benchmark, this chapter provides results over 9 or more random seeds to establish a reproducible standard of performance on that benchmark. Furthermore, Yu et al. were forced to use algorithm implementations from several disparate sources; we provide a unified set of open source reference implementations for 7 multi-task and meta-RL algorithms, so that the community can benefit from a shared definition of each algorithm and its hyperparameters. Additionally, this chapter precisely defines, both mathematically and in source code, an evaluation procedure for calculating the performance of any meta-RL or multi-task RL algorithm on any adaptation benchmark such as Meta-World.

Figure 4.13: Top: Training success rate of 5 meta- and multi-task RL algorithms on the ML45 (RL²-PPO, MAML-TRPO, PEARL-SAC) and MT50 (MT-PPO, MT-SAC) benchmarks. Bottom: Test success rate of the 3 meta-RL algorithms on the ML45 benchmark. Mean and 95% confidence interval over 4 random seeds.

4.7 Additional Experiments

4.7.1 Seed Sensitivity on ML10 and MT10

Figure 4.13 shows the seed sensitivity of several meta- and multi-task RL algorithms on ML10 and MT10, respectively. The plots show that RL²-PPO consistently outperforms the other meta-RL methods in meta-training success rate on this benchmark. As with MT50, MT-SAC is able to reach high performance on MT10, although with high variance between runs.

Figure 4.14: Effect of holding out k tasks from MT-PPO on MT10. Mean and 95% confidence interval for 5 seeds. All levels of hold-out show similar rise times, although holding out 8 tasks impedes performance. Besides 8 held-out, there do not appear to be any differences.

4.7.2 Task Sampling Policies

In real-world settings, being able to train with experience from only a subset of the available tasks may be desirable, since it would allow incorporating experience in an online fashion. Figure 4.14 shows the effect of omitting a subset of k tasks from each meta-batch on MT-PPO. Besides 8 held-out (in which only 2 tasks are used in each meta-batch), there does not appear to be any significant difference between experiments. This suggests that MT-PPO could be used as part of an online or continual multi-task RL algorithm, wherein not all tasks are available to learn from at train time. It also suggests that there is a lower limit on the number of tasks which can be sampled during any one training epoch, or else MT-PPO will become unstable.
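As a concrete illustration, a k-held-out task sampling strategy like the one ablated above might look like the following sketch; the function is a hypothetical stand-in, not the exact sampler used in our experiments.

```python
import random


def k_held_out_meta_batch(all_tasks, k, rng=random):
    """Return a meta-batch that omits k randomly chosen tasks for this epoch.

    With MT10 (10 tasks) and k = 8, only 2 tasks contribute experience each
    epoch, which is the regime where MT-PPO degraded in Figure 4.14.
    """
    if not 0 <= k < len(all_tasks):
        raise ValueError("k must leave at least one task in the meta-batch")
    held_out = set(rng.sample(list(all_tasks), k))
    return [task for task in all_tasks if task not in held_out]


# Example: one MT-PPO training epoch samples experience from only these tasks.
mt10_tasks = [f"task-{i}" for i in range(10)]
epoch_tasks = k_held_out_meta_batch(mt10_tasks, k=4)
```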
4.7.3 Returns in RL²

Figure 4.15 compares two slightly different variants of RL². The RL² algorithm [56] implements meta-learning by training a recurrent neural network (RNN) to maximize return over a series of consecutive episodes. However, the most commonly used implementation of RL², which was originally written for comparison in [215], instead computes the return separately for each episode within the trial, then optimizes the sum of those returns. Although there is a slight trend indicating that this modification has improved RL² performance, it is not statistically significant.

Figure 4.15: Return processing ablation study for RL²-PPO, with a single-line difference in how the algorithm processes returns. Mean and 95% confidence interval across 5 random seeds.

4.8 Comparing Garage Reference Implementations to Popular Baselines

This section compares our implementations of meta- and multi-task RL algorithms to existing available implementations. Of note is that our implementations perform similarly or slightly better in almost all cases, with the exception of MT-TRPO. Also note that this section excludes a baseline comparison for Task Embeddings [95], since a baseline implementation is not available.

4.8.1 MT-PPO

Figure 4.16: Comparison of the garage implementation of MT-PPO to one based on OpenAI Baselines, on ML1-push. Mean and 95% confidence interval over 10 random seeds.

4.8.2 MT-TRPO

Figure 4.17: Comparison of the garage implementation of MT-TRPO to one based on OpenAI Baselines, on ML1-push. Mean and 95% confidence interval over 10 random seeds.

4.8.3 MT-SAC

Figure 4.18: Comparison of the garage implementation of MT-SAC to one based on rlkit, on ML1-push. Mean and 95% confidence interval over 5 random seeds.

Figure 4.19: Comparison of the garage implementation of MT-SAC to one based on rlkit, on ML1-reach. Mean and 95% confidence interval over 5 random seeds.

4.8.4 MAML

Figure 4.20: Comparison of the garage implementation of MAML-TRPO to the widely-cited implementation from ProMP [215] on the ML1-push benchmark from [292]. Our implementation exhibits higher meta-train success rates on this benchmark (p = 5.7%), but the change in meta-test success rates is not statistically significant (p = 5%). Mean and 95% confidence interval over 5 random seeds.

Figure 4.21: Comparison of the garage implementation of MAML-TRPO to the widely-cited implementation from ProMP [215] on the ML1-reach benchmark from [292].
Our implementation performs similarly to the ProMP implementation, although it exhibits higher variance in the meta-test success rate. Mean and 95% confidence interval over 5 random seeds.

4.8.5 RL²

Figure 4.22: Comparison of the garage implementation of RL²-PPO to the widely-cited implementation in ProMP [215] on the ML1-push benchmark from [292]. Mean and 95% confidence interval over 5 random seeds. Both implementations have equal asymptotic performance, although ours exhibits slightly faster rise time. Specifically, at 15 million total environment steps, the garage implementation has a slightly higher meta-train success rate (p = 3.2%).

Figure 4.23: Comparison of the garage implementation of RL²-PPO to the widely-cited implementation in ProMP [215] on the ML1-reach benchmark from [292]. Mean and 95% confidence interval over 5 random seeds. Both implementations have equal asymptotic performance, although ours exhibits measurably faster rise time. Specifically, at 15 million total environment steps, our implementation has a slightly higher meta-train success rate (p = 0.8%) and meta-test success rate (p = 1.9%).

4.8.6 PEARL-SAC

Figure 4.24: Comparison of the garage implementation of PEARL-SAC to that of the original authors [208] on the ML1-push benchmark from [292]. The two implementations appear to perform similarly. Mean and 95% confidence interval over 5 random seeds.

Figure 4.25: Comparison of the garage implementation of PEARL-SAC to that of the original authors [208] on the ML1-reach benchmark from [292]. The two implementations appear to perform similarly. Mean and 95% confidence interval over 5 random seeds.

Figure 4.26: Comparison of our implementation of PEARL-SAC to that of the original authors [208] on the ML1-pick-place benchmark from [292]. The two implementations appear to perform similarly. Mean and 95% confidence interval over 5 random seeds.

4.9 Hyperparameters

This section summarizes in as much detail as possible the hyperparameters used for each experiment in this chapter. Seed values were individually chosen at random for each experiment.
4.9.1 MT-PPO

Description | ML1 | MT10 | MT50 | variable_name

Normal Hyperparameters:
Batch size | 7,500 | 15,000 | 75,000 | batch_size
Number of epochs | 800 | 1,334 | 1,334 | n_epochs
Path length per roll-out | 150 | 150 | 150 | max_path_length
Discount factor | 0.99 | 0.99 | 0.99 | discount

Algorithm-Specific Hyperparameters:
Policy mean hidden sizes | (64, 64) | (64, 64) | (64, 64) | hidden_sizes
Policy standard deviation hidden sizes | (32, 32) | (32, 32) | (32, 32) | std_hidden_sizes
Activation function of mean hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Activation function of standard deviation hidden layers | tanh | tanh | tanh | std_hidden_nonlinearity
Optimizer learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | learning_rate
Likelihood ratio clip range | 0.2 | 0.2 | 0.2 | lr_clip_range
Trainable standard deviation | True | True | True | learn_std
Advantage estimation (GAE λ) | 0.97 | 0.97 | 0.97 | gae_lambda
Use layer normalization | False | False | False | layer_normalization
Use trust region constraint | False | False | False | use_trust_region
Entropy method | no_entropy | no_entropy | no_entropy | entropy_method
Loss function | surrogate_clip | surrogate_clip | surrogate_clip | pg_loss
Maximum number of epochs for update | 4 | 4 | 4 | max_epochs
Minibatch size for optimization | 30 | 30 | 30 | batch_size

Value Function Hyperparameters:
Policy hidden sizes | (64, 64) | (64, 64) | (64, 64) | hidden_sizes
Activation function of hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Trainable standard deviation | True | True | True | learn_std
Initial value for standard deviation | 1 | 1 | 1 | init_std
Use layer normalization | False | False | False | layer_normalization
Use trust region constraint | False | False | False | use_trust_region
Normalize inputs | True | True | True | normalize_inputs
Normalize outputs | True | True | True | normalize_outputs

Table 4.1: Hyperparameters used for Garage experiments with Multi-Task PPO

4.9.2 MT-TRPO

Description | ML1 | MT10 | MT50 | variable_name

General Hyperparameters:
Batch size | 7,500 | 15,000 | 75,000 | batch_size
Number of epochs | 800 | 1,334 | 1,334 | n_epochs
Path length per roll-out | 150 | 150 | 150 | max_path_length
Discount factor | 0.99 | 0.99 | 0.99 | discount

Algorithm-Specific Hyperparameters:
Policy mean hidden sizes | (64, 64) | (64, 64) | (64, 64) | hidden_sizes
Policy standard deviation hidden sizes | (32, 32) | (32, 32) | (32, 32) | std_hidden_sizes
Activation function of mean hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Activation function of standard deviation hidden layers | tanh | tanh | tanh | std_hidden_nonlinearity
Trainable standard deviation | True | True | True | learn_std
Advantage estimation (GAE λ) | 0.97 | 0.97 | 0.97 | gae_lambda
Maximum KL divergence | 1×10⁻² | 1×10⁻² | 1×10⁻² | max_kl_step
Number of CG iterations | 10 | 10 | 10 | cg_iters
Regularization coefficient | 1×10⁻⁵ | 1×10⁻⁵ | 1×10⁻⁵ | reg_coeff
Use layer normalization | False | False | False | layer_normalization
Use trust region constraint | False | False | False | use_trust_region
Entropy method | no_entropy | no_entropy | no_entropy | entropy_method
Loss function | surrogate | surrogate | surrogate | pg_loss

Value Function Hyperparameters:
Hidden sizes | (64, 64) | (64, 64) | (64, 64) | hidden_sizes
Activation function of hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Trainable standard deviation | True | True | True | learn_std
Initial value for standard deviation | 1 | 1 | 1 | init_std
Use layer normalization | False | False | False | layer_normalization
Use trust region constraint | False | False | False | use_trust_region
Normalize inputs | True | True | True | normalize_inputs
Normalize outputs | True | True | True | normalize_outputs

Table 4.2: Hyperparameters used for Garage experiments with Multi-Task TRPO

4.9.3 MT-SAC

Description | ML1 | MT10 | MT50 | variable_name

General Hyperparameters:
Batch size | 150 | 1,500 | 7,500 | batch_size
Number of epochs | 501 | 512 | 533 | epochs
Path length per roll-out | 150 | 150 | 150 | max_path_length
Discount factor | 0.99 | 0.99 | 0.99 | discount

Algorithm-Specific Hyperparameters:
Policy hidden sizes | (400, 400, 400) | (400, 400, 400) | (400, 400, 400) | hidden_sizes
Activation function of hidden layers | ReLU | ReLU | ReLU | hidden_nonlinearity
Policy learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | policy_lr
Q-function learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | qf_lr
Policy minimum standard deviation | e⁻²⁰ | e⁻²⁰ | e⁻²⁰ | min_std
Policy maximum standard deviation | e² | e² | e² | max_std
Gradient steps per epoch | 250 | 150 | 250 | gradient_steps_per_itr
Number of epoch cycles | 266 | 26 | 5 | epoch_cycles
Soft target interpolation parameter | 5×10⁻³ | 5×10⁻³ | 5×10⁻³ | target_update_tau
Use automatic entropy tuning | True | True | True | use_automatic_entropy_tuning

Table 4.3: Hyperparameters used for Garage experiments with Multi-Task SAC

4.9.4 TE-PPO

Description | ML1 | MT10 | MT50 | variable_name

General Hyperparameters:
Batch size | 1,500 | 15,000 | 75,000 | batch_size
Number of epochs | 1×10⁷ | 1×10⁷ | 1×10⁷ | n_epochs

Algorithm-Specific Hyperparameters:
Policy hidden sizes | (200, 200) | (200, 200) | (200, 200) | hidden_sizes
Activation function of hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Likelihood ratio clip range | 0.2 | 0.2 | 0.2 | lr_clip_range
Latent dimension | 6 | 6 | 6 | latent_length
Inference window length | 20 | 20 | 20 | inference_window
Embedding maximum standard deviation | 2 | 2 | 2 | embedding_max_std
Gradient steps per epoch | 2,000 | 2,000 | 2,000 | n_itr
Value function | Gaussian MLP fit with observations, latent variables, and returns | | | baseline

Table 4.4: Hyperparameters used for Garage experiments with Task Embeddings PPO

4.9.5 MAML

Description | MAML-PPO | MAML-TRPO | argument_name

Meta-/Multi-Task Hyperparameters:
Meta-batch size | 20 | 20 | meta_batch_size
Roll-outs per task | 10 | 10 | rollouts_per_task

General Hyperparameters:
Path length per roll-out | 100 | 100 | max_path_length
Discount factor | 0.99 | 0.99 | discount
Expected action scale | 10 | 10 | expected_action_scale

Algorithm-Specific Hyperparameters:
Policy hidden sizes | (100, 100) | (100, 100) | hidden_sizes
Activation function of hidden layers | tanh | tanh | hidden_nonlinearity
Inner algorithm learning rate | 5×10⁻² | 5×10⁻² | inner_lr
Optimizer learning rate | 1×10⁻³ | 1×10⁻³ | outer_lr
Maximum KL divergence | 1×10⁻² | N/A | max_kl_step
Likelihood ratio clip range | N/A | 0.5 | lr_clip_range
Number of inner gradient updates | 1 | 1 | num_grad_update

Table 4.5: Hyperparameters used for Garage experiments with MAML

4.9.6 RL²

Description | ML1 | ML10 | ML45 | argument_name

Meta-/Multi-Task Hyperparameters:
Meta-batch size | 40 | 40 | 45 | meta_batch_size
Roll-outs per task | 10 | 10 | 10 | rollouts_per_task

General Hyperparameters:
Path length per roll-out | 150 | 150 | 150 | max_path_length
Discount factor | 0.99 | 0.99 | 0.99 | discount

Algorithm-Specific Hyperparameters:
Policy hidden sizes | (64,) | (200, 200) | (300, 300, 300) | hidden_sizes
Activation function of hidden layers | tanh | tanh | tanh | hidden_nonlinearity
Activation function of recurrent layers | sigmoid | sigmoid | sigmoid | recurrent_nonlinearity
Optimizer learning rate | 1×10⁻³ | 1×10⁻³ | 1×10⁻³ | optimizer_lr
Likelihood ratio clip range | 0.2 | 0.2 | 0.2 | lr_clip_range
Advantage estimation (GAE λ) | 1 | 1 | 1 | gae_lambda
Normalize advantages | True | True | True | normalize_advantage
Optimizer maximum epochs | 5 | 5 | 5 | optimizer_max_epochs
RNN cell type used in policy | GRU | GRU | GRU | cell_type
Value function | Linear feature baseline | | | baseline

Table 4.6: Hyperparameters used for Garage experiments with RL²

4.9.7 PEARL

Description | ML1 | ML10 | ML45 | argument_name

Meta-/Multi-Task Hyperparameters:
Meta-batch size | 16 | 16 | 16 | meta_batch_size
Tasks sampled per epoch | 15 | 15 | 15 | num_tasks_sample
Number of independent evaluations | 5 | 5 | 5 | num_evals
Steps sampled per evaluation | 450 | 1,650 | 1,650 | num_steps_per_eval

General Hyperparameters:
Batch size | 256 | 256 | 256 | batch_size
Path length per roll-out | 150 | 150 | 150 | max_path_length
Reward scale | 10 | 10 | 10 | reward_scale
Discount factor | 0.99 | 0.99 | 0.99 | discount

Algorithm-Specific Hyperparameters:
Policy hidden sizes | 300 | 300 | 300 | net_size
Activation function of hidden layers | ReLU | ReLU | ReLU | hidden_nonlinearity
Policy learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | policy_lr
Q-function learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | qf_lr
Value function learning rate | 3×10⁻⁴ | 3×10⁻⁴ | 3×10⁻⁴ | vf_lr
Latent dimension | 7 | 7 | 7 | latent_dimension
Policy mean regularization coefficient | 1×10⁻³ | 1×10⁻³ | 1×10⁻³ | policy_mean_reg_coeff
Policy standard deviation regularization coefficient | 1×10⁻³ | 1×10⁻³ | 1×10⁻³ | policy_std_reg_coeff
Soft target interpolation parameter | 5×10⁻³ | 5×10⁻³ | 5×10⁻³ | soft_target_tau
KL coefficient | 0.1 | 0.1 | 0.1 | KL_lambda
Use information bottleneck | True | True | True | use_information_bottleneck
Use next observation in context | False | False | False | use_next_observation_in_context
Gradient steps per epoch | 4,000 | 4,000 | 4,000 | num_steps_per_epoch
Steps sampled in the initial epoch | 4,000 | 4,000 | 4,000 | num_initial_steps
Prior steps sampled per epoch | 750 | 750 | 750 | num_steps_prior
Posterior steps sampled per epoch | 0 | 0 | 0 | num_steps_posterior
Extra posterior steps sampled per epoch | 750 | 750 | 750 | num_extra_steps_posterior
Embedding batch size | 64 | 64 | 64 | embedding_batch_size
Embedding minibatch size | 64 | 64 | 64 | embedding_mini_batch_size

Table 4.7: Hyperparameters used for Garage experiments with PEARL

4.10 Experiment Procedure Details

4.10.1 Compute Infrastructure

We ran experiments using a dozen local computers, as well as several dozen Amazon EC2 instances (primarily c5.4xlarge instances, although we had to run much larger instances for several comparisons). Our framework logs results into a consistent tabular file format, which we synchronize during experiment execution with an Amazon S3 bucket containing all of our experiment results. We then use executable Python notebooks (using Google Colab) that produce each plot from the experiment data while the experiment is executing.

4.10.2 Data Processing

We apply differing levels of smoothing to the runs in our plots in order to improve readability. We exclude some incomplete runs which were terminated early due to machine failures. We truncated the x-axis of several plots to more clearly show the differences between algorithms. We use seaborn [276] to compute confidence intervals on the unsmoothed data. Except where otherwise noted, these are 95% confidence intervals.

Hyperparameter | Definition
Meta-batch size | A batch of some number of tasks sampled from the task distribution T
Meta-train or meta-test batch size | Size of a meta-train or meta-test batch, measured in number of tasks
Adaptation batch size | Number of samples used by a meta-RL algorithm to adapt to a specific task
Exploration policy | Policy used by the meta-RL algorithm to acquire adaptation samples on a new task
Adapted policy | Policy computed by a meta-RL algorithm's adaptation procedure, expected to perform well on a single new task
Evaluation sample size | Number of trajectories used to measure an adapted policy's success rate
Task sampling strategy | Rule or algorithm used to choose the members of each meta-batch, e.g. round-robin, uniform sampling with replacement, k-held-out sampling, etc.
Table 4.8: Definitions of key hyperparameters in meta-RL and MTRL

4.11 Defining Consistent Metrics and Hyperparameters for Meta- and Multi-Task Reinforcement Learning

4.11.1 Preliminaries

We presume some prior task distribution T, which may be discrete or continuous, and from which the algorithm can sample new tasks. We further presume two disjoint sets, T_train and T_test, have been drawn from this distribution, and we sample meta-train and meta-test batches from them, respectively. In the case of multi-task RL, T_train = T_test and we may neglect the distinction between meta-train and meta-test evaluation.

4.11.2 Definitions and Hyperparameters for Meta- and Multi-Task RL

We precisely define important hyperparameters for meta- and multi-task RL, so that we can consistently evaluate and compare algorithms from these families.

4.11.3 Evaluation Procedure for Meta-RL

We want to compare meta-RL algorithms fairly (so that we do not give too much preference to any algorithm) and consistently (so that we can draw meaningful conclusions across algorithms). To that end, we define a straightforward but precise evaluation method for meta-RL algorithms, and use it consistently in our experiments. The meta-evaluation procedure allows the meta-RL algorithm to explore for a fixed number of time steps to produce an adapted policy. The adapted policy is then cloned to produce as many adapted policies as the evaluation batch size. Then, each adapted policy is rolled out in the environment for one trajectory, and its success or failure is recorded. See Algorithm 2 for pseudocode describing the procedure in more detail. When finding the success rate for a single task, we still perform this evaluation procedure as described, with the meta-test task sampling strategy simply repeating the same task multiple times. We do not collect multiple test trajectories per adapted policy, because this could overrepresent the performance of methods that can continue to adapt during the test rollouts (in particular, RL²).

4.11.4 Evaluation Metrics for Meta- and Multi-Task RL

Average task success rate: Mean of the success rates for each individual task in a meta-batch.

Train / test success rate: Average task success rate for the tasks in a meta-train/meta-test batch, respectively.

Single-task sample complexity: The number of total environment steps S_t needed for an RL algorithm to achieve a given success rate (usually 100%) on a given task t.

Multi-task sample complexity (MTRL): The number of total environment steps (across all tasks) S_MT needed for a multi-task algorithm to achieve a given success rate (usually 100%) on average across a distribution of tasks T.

Adaptation sample efficiency (meta-RL): Number of adaptation samples necessary to achieve a particular test success rate. Only applies to meta-RL. May be presented as a curve or summary statistics.

Algorithm 2: Meta-RL Evaluation Procedure
Input: Meta-RL algorithm A, with
    exploration subroutine π_explore ← A_explore(),
    adaptation subroutine π_t ← A_adapt(D_t),
    adaptation batch size N_adapt,
    task rollout function (D, success) ← rollout(π, t, N),
    evaluation sample size N_eval,
    maximum trajectory length N_traj
n_success ← 0
for all t ∈ T_test do
    π_explore ← A_explore()
    (D_t, ·) ← rollout(π_explore, t, N_adapt)
    π_t ← A_adapt(D_t)
    Clone π_{t,k} ← clone(π_t), N_eval times
    for k ← 1 to N_eval do
        (·, s) ← rollout(π_{t,k}, t, N_traj)
        n_success ← n_success + s
    end for
end for
Output: Average Success Rate = n_success / (|T_test| · N_eval)
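The following Python rendering of Algorithm 2 may be easier to follow than the pseudocode. The `algo` object is a hypothetical interface standing in for a concrete meta-RL implementation, so this is a sketch of the protocol rather than executable evaluation code for any particular method.

```python
def meta_rl_evaluation(algo, test_tasks, n_adapt, n_eval, max_traj_len):
    """Average success rate over T_test, as defined in Algorithm 2."""
    n_success = 0
    for task in test_tasks:
        exploration_policy = algo.explore()                      # A_explore()
        adaptation_data, _ = algo.rollout(exploration_policy,    # D_t
                                          task, n_adapt)
        adapted_policy = algo.adapt(adaptation_data)             # A_adapt(D_t)
        for _ in range(n_eval):
            # One trajectory per fresh clone, so methods that keep adapting at
            # test time (e.g. RL^2) cannot accumulate extra experience.
            policy_clone = algo.clone(adapted_policy)
            _, success = algo.rollout(policy_clone, task, max_traj_len)
            n_success += int(success)
    return n_success / (len(test_tasks) * n_eval)
```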
4.11.5 Statistical Comparisons for Meta- and MTRL Algorithms

The methods and metrics above allow us to precisely measure the performance of multi-task and meta-RL algorithms. However, they do not describe how to evaluate differences between algorithms (or variants of the same algorithm) given these measurements. Consistently analyzing performance requires that we define a statistical framework for deriving high-confidence results from our measurements. [38] describes a methodology for evaluating RL algorithms which we follow as closely as possible. That methodology requires deriving a scalar performance measure from each run (which typically corresponds to a training experiment on a single environment with a single random seed) of an algorithm. Although neither [38] nor [37] advocate for a particular performance measure, they use the average discounted return of the last 100 iterations of the run in their examples.

Average discounted return is misleading in meta- and multi-task RL: Meta- and multi-task RL are multi-objective optimization problems which seek to maximize the average discounted return over a distribution of tasks. However, the reward magnitude of any single task is arbitrary, and the reward magnitudes of tasks can vary wildly. For instance, the dense reward magnitudes of successful trajectories for two different tasks in the Meta-World benchmark can range between 10² and 10⁵. This means the predictive power of any average return metric calculated across tasks in meta- or multi-task RL is easily corrupted by tasks with very large or very small reward magnitudes. Because the return between tasks on multi-task benchmarks can be misleading (see Section 4.4.2.1), we follow [292] and use average success rate, a sparse binary success signal provided by each environment, for all algorithm comparisons in this chapter.

Contending with asymptotic instability: Many meta- and multi-task RL algorithm implementations we evaluate, despite still being useful, are not always asymptotically stable. This means that always measuring on the last 100 iterations of the run, as in [37], would not accurately represent these algorithms' performance. To minimize the impact of asymptotic instability, we instead use the average success rate of the best 100 sequential iterations as the performance measure of a run. We use this performance measure throughout Section 4.4, and find that it has relatively low variance across runs and is consistent with our empirical observations of performance, as well as with the practices of prior works in reporting meta- and multi-task RL performance results. This measure allows us to make meaningful statistical statements even for highly asymptotically unstable algorithm variants. When we use it, we also typically plot the mean of the performance values of each variant's runs as a dashed line across the plot. As recommended in [38], we apply Welch's t-test to the performance measures of the runs in order to measure the probability, p, of the null hypothesis (i.e., that both sets of runs are drawn from the same distribution). This section unambiguously defines our evaluation procedure for meta-RL and MTRL algorithms.
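To make this comparison procedure concrete: given the per-run performance measure (the best-100-iteration average success rate) for two experiment variants, Welch's t-test yields the p-values reported throughout Section 4.4. The sketch below uses SciPy; the numbers are made up for illustration.

```python
from scipy import stats

# Performance measure (best 100-iteration average success rate) per random seed.
variant_a = [0.31, 0.28, 0.35, 0.30, 0.27]
variant_b = [0.22, 0.25, 0.19, 0.24, 0.21]

# equal_var=False selects Welch's t-test, which does not assume equal variances
# or equal numbers of runs between the two variants.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
print(f"p = {p_value:.3%}")
```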
4.11.6 Training Procedure for Meta- and Multi-Task RL

At a high level, the training procedure we use for meta-RL and multi-task RL simply involves alternating a training step and a testing step. For multi-task RL, the training task sampling strategy is generally the same as the testing task sampling strategy, but this need not be the case in general. See Algorithm 3 for pseudocode describing the procedure.

Algorithm 3: MTRL Training Procedure
Input: MTRL algorithm A, with
    training procedure A_train,
    evaluation procedure A_evaluate,
    train task sampling strategy S_train,
    test task sampling strategy S_test
repeat
    T_train ← S_train
    A_train(T_train)
    T_test ← S_test
    A_evaluate(T_test)
until convergence of A

4.12 Conclusion

This chapter shows that in meta- and MTRL research, small implementation decisions make a big difference. Our results from Section 4.4.2 show that implementation decisions alone in meta- and MTRL algorithms can create absolute performance differences of 10-35% on the Meta-World benchmarks for a single algorithm. Combined with the results from Section 4.4.1, which show that statistically-significant absolute performance differences between these algorithms range from 0-30%, we can conclude that it is possible for implementation decisions to confound performance improvement claims in meta- and MTRL algorithms research. What's more, this chapter only examines a small fraction of possible design decisions and algorithms in meta- and MTRL, and it is very likely there exist many other subtle design decisions which create significant performance differences in these algorithms. This underscores the need for widely-disseminated reference implementations and consistent evaluation protocols, so that the community can be confident in algorithmic performance comparisons in meta- and MTRL research.

This chapter also provides evidence that particular research directions are essential for progress towards practical meta- and MTRL application: namely, simple handling of differing reward scales between tasks (Section 4.4.2.1), inference of useful task structure (Figure 4.8), and experimentally fair benchmarks (Section 4.4.2.2).

In the future, we plan to expand our reference implementation library to include more meta- and multi-task RL methods such as PCGrad [291], MAESN [86], ProMP [215], REPTILE [186], SNAIL [174], and QT-Opt [119]. We also look forward to welcoming contributions of reference implementations from throughout the research community, and have created a well-defined process for accepting new reference implementations into the library.

Chapter 5

Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning

One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most of the robot learning systems today produce static policies that are not further adapted during deployment, because the algorithms which produce those policies are not designed for continual adaptation. This chapter presents an adaptation method, and empirical evidence that it supports a robot learning framework for continual adaptation. We show that this very simple method, fine-tuning off-policy reinforcement learning using offline datasets, is robust to changes in background, object shape and appearance, lighting conditions, and robot morphology.

Figure 5.1: Left: Original robot configuration used for pre-training. Right: Adaptation challenges (highlighted in pink) studied in this chapter. Top: Associated performance improvements obtained using our fine-tuning method.
We demonstrate how to adapt vision-based robotic manipulation policies to new variations using less than 0.2% of the data necessary to learn the task from scratch. Furthermore, we demonstrate that this robustness holds in an episodic continual learning setting. We also show that pre-training via RL is essential: training from scratch and adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 60 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps. For video results and an overview of the methods and experiments in this study, see the project website at https://ryanjulian.me/continual-fine-tuning.

5.1 Introduction

Surviving and existing in the real world inevitably requires nearly every living being to constantly learn, adapt, and evolve. Similarly, to thrive in the real world, robots should be able to continuously learn and adapt throughout their lifetime to the ever-changing environments in which they are deployed. This is a widely recognized requirement. In fact, there is an entire academic sub-field of lifelong learning [260] which studies how to create agents that never stop learning. Despite the wide interest in this ability, most of the intelligent agents deployed today are not tested for their adaptation capabilities. Even though reinforcement learning theoretically provides the ability to perpetually learn from trial and error, this is not how it is typically evaluated. Instead, the predominant method of acquiring a new task with reinforcement learning is to initialize a policy from scratch, collect entirely new data in a stationary environment, and evaluate a static policy that was trained with this data. This static paradigm does not evaluate the robot's capability to adapt, and traps robotic reinforcement learning in the worst-case regime for sample efficiency: the cost to acquire a new task is dominated by the complexity of the task. Most machine learning models successfully deployed in the real world, such as those used for computer vision and natural language processing (NLP), do not live in this regime. For instance, the predominant
The main contributions of this chapter are (1) a careful real-world study of the problem of end-to-end skill adaptation for a continually-learning robot, and (2) evidence that a very simple fine-tuning method can achieve that adaptation. To our knowledge, this work is the first to demonstrate that simple fine-tuning of off-policy reinforcement learning can successfully adapt to substantial task, robot, and envi- ronment variations which were not present in the training distribution (i.e. off-distribution). 5.2 Related Work Reinforcement learning is a long-standing approach for enabling robots to autonomously acquire skills [127] such as locomotion [130, 258], pushing objects [162, 68], ball-in-cup manipulation [128], peg insertion [83, 144, 231, 140, 296], throwing objects [78, 295], or grasping [199, 119]. We specifically focus on the problem of deep reinforcement learning from raw pixel observations [144], as it allows us to place very few restrictions on state representation. A number of works have also considered this problem setting [70, 78, 139 68, 296, 3, 176]. However, a key challenge with deep RL methods is that they typically learn each skill from scratch, disregarding previously-learned skills. If we hope for robots to generalize to a broad range of real world environments, this approach is not practical. We instead consider how we might transfer knowledge for efficient learning in new conditions [256, 190, 252], a widely-studied problem particularly outside the domain of robotics [51, 106, 49, 42, 205]. Prior works in robotics have considered how we might transfer information from models trained with supervised learning on ImageNet [47] by fine-tuning [144, 70, 84, 199] or other means [236, 97]. Our experiments show that transfer from pre-trained conditions is significantly more successful than transfer from ImageNet. Other works have leveraged experience in simulation [223, 263, 224, 253, 188, 222, 196, 104, 92] or representations learned with auxiliary losses [213, 172, 228] for effective transfer. While successful, these approaches either require significant engineering effort to construct an appropriate simulation or significant supervision. Most relevantly, recent work in model-based RL has used predictive models for fast transfer to new experimental set-ups [30, 87], i.e. by fine-tuning predictive models [43], via online search of a pre- learned representation of the space models, policies, or high-level skills [31, 41, 123, 171], or by learning physics simulation parameters from real data [209, 115]. We show how fine-tuning is successful with a model-free RL approach, and show how a state-of-the-art grasping system can be adapted to new conditions. Other works have aimed to share and transfer knowledge across tasks and conditions by simultaneously learning across multiple goals and tasks [216, 221]. For example, prior works in model-based RL [68, 286, 180] and in goal-conditioned RL [3, 183, 194, 201, 293] have shared data and representations across multiple goals and objects. Along a similar vein, prior work in robotic meta-learning has aimed to learn representations that can be quickly adapted to new dynamics [178, 4, 179] and objects [71, 112, 290, 20]. We consider adaptation to a broad class of changes including dynamics, object classes, and visual observations. We include conditions that shift substantially from the training conditions, and we do not require the full set of conditions to be represented during the initial training phase. 
140 5.3 Identifying Adaptation Challenges To study the problem of continual adaptation, we evaluate a pre-trained grasping policy in five different conditions that were not encountered during pre-training. In this section, we will describe the pre-training process and test the robustness of the pre-trained policy to various robot and environment modifications. We choose these modifications to reflect changes we believe a learning robot would experience, and should be expected to a adapt to, when deployed “on the job” in the real world. In Section 5.4, we will describe a simple fine-tuning based adaptation process, and we evaluate it using these modifications in Sections 5.4 and 5.5. 5.3.1 Pre-training process We pre-train the grasping policy, which we refer to as the “base policy,” using the QT-Opt algorithm in two stages, as described in [119]. This procedure yields a final base policy that achieves 96% accuracy on a set of previously-unseen test objects. We use a challenging subset of six of these test objects for most experiments in this chapter. On this set, our base model achieves a success rate of 86% on the baseline grasping task. 5.3.2 Robustness of the pre-trained policy We begin by choosing a set of significant modifications to the robot and environment, which we believe are characteristic of a real-world continual learning scenario. We then evaluate the performance of the base policy on increasingly-severe versions of these modifications. This process allows us to assess the robustness of policies trained using the pre-training method. Once we find a modification that is severe enough to compromise the base policy’s performance in each category, we use it to define a “Challenge Task” for our study of adaptation methods. Next, we describe these challenges and the corresponding performance of the base policy. 141 Challenge Task Type Base Policy Checkerboard Backing Background 50% -36% Harsh Lighting Lighting conditions 31% -55% Extend Gripper 1 cm Gripper shape 76% -10% Offset Gripper 10 cm Robot morphology 47% -39% Transparent Bottles Unseen objects 49% -37% Table 5.1: Summary of modifications to the robot and environment, and their effect on the performance of the base policy. Changing the background lighting, morphology, and objects leads to substantial degradation in performance compared to the original training conditions. Background: We introduce a black-white 1 inch checkerboard pattern that we glue to the bottom of the robot’s workspace (see Fig. 5.1, fourth from left). This often fools the robot into grasping at checkerboard edges rather than objects. Lighting conditions: We introduce a high-intensity halogen light source parallel with the workspace (see Fig. 5.1, second from left), creating a bright spot in the robot’s camera view, and intense light-dark contrasts along the plane of the workspace. Gripper shape: We extend the parallel gripper attached to the robot by 1 cm and significantly narrow its width and compliance in the process (see Fig. 5.1, fifth from left). This changes the robot’s kinematics (lengthening the gripper in the distal direction), while also lowering the relative pose of the robot with respect to the workspace surface by 1 cm. Robot morphology: We translate the gripper laterally by 10 cm (approximately a full gripper or arm link width) (see Fig. 5.1, far-right). Unseen objects: We introduce completely-transparent plastic beverage bottles (see Fig. 5.1, third from left) that were not present in the training set. 
This causes the robot to often grasp where two bottles are adjacent, as though it cannot differentiate which parts of the scene are inside vs outside a bottle. See Table 5.1 for a summary of the modification experiments, and their effect on base policy perfor- mance. 142 Harsh Lighting Transparent Bottles Checkerboard Backing Extend Gripper 1cm Offset Gripper 10cm Base Grasping Figure 5.2: Views of from the robot camera for each of our five Challenge Tasks and the base grasping task. 5.4 Large-Scale Experimental Evaluation We define then evaluate a simple technique for offline fine-tuning. Our experiments model an “on the job” adaptation scenario, where a robot is initially trained to perform a general task (in our case, grasping diverse objects), and then the conditions of the task change in a drastic and substantial way as the robot performs the task, e.g. significantly brighter lighting, or a peculiar and unexpected type of object. The robot must adapt to this change quickly in order to recover a proficient policy. Handling these changes reflects what we expect to be a common requirement of reinforcement learning policies deployed in the real world: since an RL policy can learn from all of the experience that it has collected, there is no need to separate learning into clearly distinct training and deployment phases. Instead, it is likely desirable to allow the policy to simply continue learning “on the job” so as to adapt to these changes. We define a very simple fine-tuning procedure for off-policy RL, as follows (Fig. 5.3). 143 Adapted Q-function QT-Opt QT-Opt 50% 50% 2. Explore ℒ Base Q-function Target Q-function ℒ 1. Pre-Train 4. Adapt 5. Evaluate Target Data <800 Base Data ≈608,000 43% Success 98% Success 3. Initialize 86% Success Figure 5.3: Schematic of the simple method we test in Sections 5.4 and 5.5. We pre-train a policy using the old data from the pre-training task, which is then adapted using the new data from the fine-tuning task. 144 5.4.1 A very simple fine-tuning method First, we (1) pre-train a general grasping policy, as described in Section 5.3.1 and [119]. To fine-tune a policy onto a new target task, we (2) use the pre-trained policy to collect an exploration dataset of attempts on the target task. We then (3) initialize the same off-policy reinforcement learning algorithm used for pre-training (QT-Opt, in our case) with the parameters of the pre-trained policy, and both the target task and base task datasets * as the data sources (e.g. replay buffers). Using this training algorithm, we (4) update the policy, using a reduced learning rate, and sample training examples with equal probability from the base and target task datasets, for some number of update steps. Finally, we (5) evaluate the fine-tuned policy on the target task. Our method is offline, i.e. it uses a single dataset of target task attempts and requires no robot interaction after the initial dataset collection to compute a fine-tuned policy, which may then be deployed onto a robot. 5.4.2 Evaluating fine-tuning in simulation We evaluate the simple fine-tuning method we introduced in Section 5.4.1 on simulated adaptation challenge, and compare its performance to simpler and more-complex state-of-the-art alternatives (Fig. 5.4). In this simulation experiment, we first pre-train a policy to grasp randomly-generated opaque objects (Fig. 5.4b), using a total of 3500 grasp attempts. We then change the task by turning the objects transparent (Fig. 
5.4c, similar to the “Transparent Bottles” Challenge Task in Sec. 5.3), and challenge state-of-the-art methods to adapt to the transparent objects using only 75 grasp attempts, which is 0:5% of the data used to train from scratch. In addition to testing our simple fine-tuning method (“FT”) and motivated by our desire to find an adaptation method for a continual learning setting, we test a 2-task instantiation of a Progressive Neural Networks [221] (“PNN”) model, and the combination of the PNN model with our simple fine-tuning method * We assume this dataset was saved during training of the base policy 145 0 100000 200000 300000 400000 500000 Gradient Steps 0 20 40 60 80 100 Grasp Success Rate Scratch PNN FT FT + PNN Grasping Performance (10 seeds, 95% confidence interval) (a) Performance (b) Base Task (c) Target Task Figure 5.4: Grasping performance (a) of four transfer methods on a simulated fine-tuning scenario, in which the robot is trained to grasp opaque objects (b) using 3500 attempts, but must adapt to transparent objects (c) using only 75 attempts. Scratch: Train a new policy from scratch (3500 attempts). PNN: Progressive Neural Networks. FT: Fine-tuning method from Sec. 5.4.1. FT + PNN: PNN model trained using the FT method. 146 (“FT + PNN”). For comparison, we also train a policy on the target task from scratch, using 3500 new attempts. The simple fine-tuning method is able to achieve the baseline performance using the 75 exploration grasps in about 200; 000 gradient steps. The PNN model is unable to adapt to the new task with access to so few grasp attempts, and never exceeds 10% success rate, and believe this is likely due to the additional sample complexity of training its randomly-initialized “adapter” layers. Applying our simple offline fine- tuning method to the PNN model allows it to converge to a much higher 50% success rate, but still never achieve baseline performance. Note that an important feature of the PNN model is that it is guaranteed to never experience performance degradation in the base task due to adaptation (i.e. “forgetting” or negative backward transfer). Though our much simpler method has no such guarantee, we find no evidence of forgetting in both simulated and real- world fine-tuning and continual learning experiments. See the Appendix for more details. 5.4.3 Evaluating offline fine-tuning for real-world grasping We now turn our attention to evaluating this simple method’s effectiveness as an adaptation procedure for end-to-end robot learning, and perhaps continual learning. Our goal is to determine whether the method is sample efficient, whether it works over a broad range of possible variations, and to determine whether it performs better than simpler ways of acquiring the target tasks. With this goal in mind, we conduct a large panel of ablation experiments experiments on a real 7 DoF Kuka arm. These experiments evaluate the performance of our method across the diverse range of previously-defined Challenge Tasks and a continuum of target task dataset sizes, and compare this perfor- mance to two comparison methods. For videos of our experimental results, see the project website. 
† † For video results, seehttps://ryanjulian.me/continual-fine-tuning 147 Challenge Task Original Policy Ours (exploration grasps) Comparisons 25 50 100 200 400 800 Best () Scratch ImageNet Checkerboard Backing 50% 67% 48% 71% 47% 89% 90% 90% (+40) 0% 0% Harsh Lighting 32% 23% 16% 52% 44% 58% 63% 63% (+31) 4% 2% Extend Gripper 1 cm 75% 93% 67% 80% 51% 90% 69% 93% (+18) 0% 14% Offset Gripper 10 cm 43% 73% 50% 60% 56% 91% 98% 98% (+55) 37% 47% Transparent Bottles 49% 46% 43% 65% 65% 58% 66% 66% (+17) 27% 20% Baseline Grasping Task 86% 98% 81% 84% 78% 93% 89% 98% (+12) 0% 12% Table 5.2: Summary of grasping success rates (N 50) for the experiments by challenge task, fine-tuning method, and number of exploration grasps. Collect exploration datasets First, we collect a dataset of 800 grasp attempts for each of our 5 challenge tasks (see Table 5.1) plus the base grasping task. We then partitioned each dataset into 6 tiers of difficulty by number of exploration grasps (25, 50, 100, 200, 400, and 800 grasp attempts), yielding 36 individual datasets. Adapt policies using fine-tuning We train a fine-tuned policy for each of these 36 datasets using the procedure described above. We execute the fine-tuning algorithm for 500,000 gradient steps (see Sec. 5.6 for more information on how we chose this number) and use a learning rate of 10 4 , which is 25% of the learning rate used for pre-training. This yields 36 fine-tuned policies, each trained with a different combination of target task and target dataset size. This set of 36 policies includes 6 policies fine-tuned on data from the base grasping task, for validation. Train comparison policies To provide points of comparison, we train two additional policies for each challenge task and the base grasping task, yielding 12 additional policies, for a total of 54. The first comparison (“Scratch”) is a policy trained using the aforementioned fine-tuning procedure and an 800-grasp data set, but using a randomly-initialized Q-function rather than the Q-function obtained from pre-training. The purpose of this comparison is to help us assess the contribution of the pre-trained parameters to the fine-tuning process’ performance. 148 The second comparison (“ImageNet”) is also trained using an identical fine-tuning procedure and the 800-grasp dataset, but initialized with the weights obtained by training the network to classify images from the ImageNet dataset [47]. The purpose of this comparison is to provide a comparison to a strong alternative to end-to-end RL for obtaining pre-training parameters. It uses a modified Q-function architecture in which we replace the convolutional trunk of the network with that of the popular ResNet50 architecture [99]. Refer to to Fig. 5.8 for a diagram of the Q-function architecture, and the Appendix for more details. Evaluate performance Finally, we evaluate all 54 policies on their target task by deploying them to the robot and executing 50 or more grasp attempts to calculate the policy’s final performance. The full experiment required more than 15,000 grasp attempts and 14 days of real robot time, and was conducted over approximately one month. The experiments are very challenging. For example, the “Transparent Bottles” task presents a major challenge to most grasping systems: the transparent bottles generally confuse depth-based sensors and, es- pecially in cluttered bins, require the robot to singulate individual items and position the gripper in the right orientation for grasping. 
Although our base policy uses only RGB images, it is still not able to grasp the transparent bottles reliably, because they differ so much from the objects it observed during training. However, after fine-tuning with only 1 hour (100 grasp attempts) of experience, we observe that the transparent bottles can be picked up with a success rate of 66%, 20% better than the base policy. Similarly, the “Checkerboard Backing” challenge task asks the robot to differentiate edges associated with real objects from edges on an adversarial checkerboard pattern. It never needed this capability to succeed during pre-training, where the background is always featureless and grey, and all edges can be assumed to be associated with a graspable object. After 1 hour (100 grasp attempts) of experience, using our method the robot can grasp objects on the checkerboard background with a 71% success rate, 21% better than the base policy, and this success rate reaches 90% after 8 hours of experience (800 grasp attempts).

We present a full summary of our results in Table 5.2. Across the board, we observe substantial benefits from fine-tuning, suggesting that the robot can indeed adapt to drastically new conditions with a modest amount of data: our most data-intensive experiment uses just 0.2% of the data used to train the base grasping policy to similar performance. Our method consistently outperforms both the “ImageNet” and “Scratch” comparison methods. We provide more detailed analysis of this experiment in Section 5.6.

5.5 Evaluating Offline Fine-Tuning for Continual Learning

Figure 5.5: Flow chart of the continual learning experiment, in which we fine-tune on a sequence of conditions. Every transition to a new scenario happens after 800 grasps.

Now that we have defined and evaluated a simple method for offline fine-tuning, we evaluate its suitability for use in continual learning, which could allow us to achieve the goal of a robot that adapts to ever-changing environments and tasks.

To do so, we define a simple continual learning challenge as follows (Fig. 5.5). As in the fine-tuning experiments, we begin with a base policy pre-trained for general object grasping. Likewise, we also use our fine-tuning method to adapt the base policy to a target task, in this case “Harsh Lighting.” We then use this adapted policy – not the base policy – as the initialization for another iteration of our fine-tuning algorithm, this time targeting “Transparent Bottles.” We repeat this process until we have run out of new tasks, ending at the task “Offset Gripper 10cm,” at which point we evaluate the policy on the last task.

Challenge Task | Continual Learning | Δ vs. Base | Δ vs. Single
Harsh Lighting | 63% | +32% | -
Transparent Bottles | 74% | +25% | +8%
Checkerboard Backing | 86% | +36% | -4%
Extend Gripper 1 cm | 88% | +12% | -5%
Offset Gripper 10 cm | 91% | +44% | -7%

Table 5.3: Summary of grasping success rates (N ≥ 50) for the continual learning experiment by challenge task, and comparison to single-step fine-tuning.
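As a concrete illustration of this chained procedure, the sketch below strings together successive rounds of offline fine-tuning. It is only an illustrative sketch: collect_exploration_grasps and finetune_offline are hypothetical stand-ins for the QT-Opt data-collection and training infrastructure, and the step counts and learning rate simply echo the settings described in Sec. 5.4.3.

```python
TASK_SEQUENCE = ["Harsh Lighting", "Transparent Bottles", "Checkerboard Backing",
                 "Extend Gripper 1cm", "Offset Gripper 10cm"]

def continual_finetuning(base_policy, collect_exploration_grasps, finetune_offline,
                         tasks=TASK_SEQUENCE, grasps_per_task=800,
                         gradient_steps=500_000, learning_rate=1e-4):
    """Chain offline fine-tuning across a sequence of conditions."""
    policy = base_policy      # Start from the pre-trained grasping policy.
    adapted = {}
    for task in tasks:
        # Explore the new condition with the current (most recently adapted) policy.
        dataset = collect_exploration_grasps(policy, task, grasps_per_task)
        # Offline fine-tuning on the exploration data; the result seeds the next epoch.
        policy = finetune_offline(policy, dataset, steps=gradient_steps, lr=learning_rate)
        adapted[task] = policy
    return adapted
```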
We perform this experiment using the 800-exploration-grasp datasets for each Challenge Task from our ablation study of offline fine-tuning with real robots. We summarize the results in Table 5.3.

Recall that our goal for this experiment is to determine whether continual fine-tuning incurs a significant performance penalty compared to the single-step variant, because we are interested in finding a building block for continual learning algorithms. We find that continual fine-tuning does not impose a drastic performance penalty compared to single-step fine-tuning. The continual fine-tuning policies for the “Checkerboard Backing,” “Extend Gripper 1 cm,” and “Offset Gripper 10 cm” challenges succeeded in grasping between 4% and 7% less often than their single-step fine-tuning counterparts, whereas the policy for the challenging “Transparent Bottles” case actually succeeded 8% more often. These small deltas are within the margin of error of our evaluation procedure, so we conclude that the effect of continual fine-tuning on performance compared to single-step fine-tuning is very small. Additionally, we find that the continually fine-tuned policies do not experience forgetting of the base task. This experiment demonstrates that our method can perform continual adaptation, and may serve as the basis for a continual end-to-end robot learning method.

Figure 5.6: Sample efficiency of our fine-tuning method on selected real-robot challenge tasks.

Figure 5.7: Evaluation performance of a single offline fine-tuning experiment at different numbers of gradient steps (optimization epochs). The blue curve is real robot performance on the target task (Offset Gripper 10cm) when trained using 400 exploration grasps. The green dotted line is the performance of training the same policy from scratch (random initialization) using 800 exploration grasps, and the yellow dotted line the performance of the base policy. The red dotted line portrays the number of gradient steps we choose to use for our large-scale fine-tuning study.

5.6 Empirical Analysis

In this section, we aim to further investigate the efficiency, performance, and characteristics of our large-scale real-world adaptation experiments.

5.6.1 Performance and sample efficiency of fine-tuning

Figure 5.6 shows the success rates for our method from Table 5.2 against the amount of data used to achieve that success rate for selected tasks. The data indicate that a simple fine-tuning method can adapt policies to many new tasks using modest amounts of data. For instance, “Extend Gripper 1cm” and “Offset Gripper 10cm” both needed only 25 exploration grasps to achieve substantial gains in performance (+18% and +30%, respectively).

All policies attain substantial performance gains over the base policy as they are fine-tuned with increasing amounts of data, but the relationship between data and performance is not linear. All policies experience a substantial improvement in performance after 100 or fewer exploration grasps. However, we observe that these performance improvements in the very low-data regime (e.g. 200 grasp attempts) are
also unstable. We attribute both of these phenomena to the early stopping problem, which we did not solve in this chapter, and discuss in detail below.

5.6.2 Limitation: the early stopping problem

Our results indicate that offline fine-tuning can adapt robotic policies to substantial performance improvements with modest amounts of data. An additional benefit of offline methods such as this one is that they are not limited by the need to preserve an always-sufficient exploration policy, as with online methods. However, we identify one significant drawback to the method compared to online fine-tuning.

A pure offline fine-tuning method has no built-in evaluation step which would inform us when the robot’s performance on the target task has stopped improving, and therefore when we should stop fine-tuning with a fixed set of target task data. This is a subset of the off-policy evaluation problem [110]. Knowing when the policy stops improving is important: fine-tuning exists in a low-data regime, and repeatedly updating a neural network model with small amounts of data leads to overfitting onto that data. Not only does this degrade the performance on the target task, but also the ability of the network to adapt to new tasks later (i.e. for continual learning).

We can see this phenomenon in Figure 5.7, which shows a real robot’s performance on the “Offset Gripper 10cm” target task at different numbers of steps into an offline fine-tuning process that uses 400 exploration grasps. Performance quickly rises until around 500,000 gradient steps. Past this point, it precipitously drops and never recovers, dropping below even the initial performance of the base policy from which it was trained, as the initialization is being overwritten by overfitting to the target samples. The point at which overfitting begins is a function of the initialized model, target dataset, learning algorithm, and many other factors, and is not necessarily stable or easily predictable.

For the purposes of our large-scale fine-tuning study, we use this experiment and several others to determine that 500,000 gradient steps was an acceptable choice for the real-world experiments. The variance in the results in Table 5.2 and Figure 5.6, however, shows that this choice was not necessarily optimal for all of our tasks and datasets. In addition to off-policy evaluation metrics, one practical solution to this problem for a continual learning robot is to use a mix of offline fine-tuning and online evaluation. The point at which performance stops improving represents when the training process has exhausted the fine-tuning dataset of new information, and the robot must return to exploring online to continue improving.

5.6.3 Comparing initializing with RL to initializing with supervised learning

In order to answer the question of whether RL is better suited than supervised learning for pre-training a continually-learning robotic agent, in Section 5.4 we compared our results to an ImageNet-pretrained baseline. As shown in Table 5.2, the best performing ImageNet-based agent achieves a success rate of 47% on “Offset Gripper 10cm,” a 4% improvement over the base policy performance, far smaller than the +55% improvement our method achieves on the same task. This result confirms our hypothesis that our RL-based pre-training is crucial for subsequent fine-tuning.
Note that we first attempted to fine-tune these ImageNet-based policies while holding the ImageNet feature layers constant, but this procedure failed to achieve any non-zero success rates. This suggests that, unlike adapting computer vision networks to new visual tasks, adapting end-to-end robot learning to new sensorimotor tasks may require changing the features used to represent the problem, and not just the post-processing of said features.

Figure 5.8: Analysis of parameter changes induced by different fine-tuning target tasks. This plot portrays the cosine distance between the parameters of the pre-trained and fine-tuned networks for our 5 fine-tuning target tasks. The bar heights are normalized by the magnitude of parameter changes induced in the Q-function network by fine-tuning the baseline grasping task.

Figure 5.8 highlights some of the changes that happen during the RL-based fine-tuning in greater detail. While it is unsurprising that primarily-visual challenges such as “Checkerboard Backing” and “Harsh Lighting” induce large changes in the parameters of the convolutional parts of the network, we observe that even “Offset Gripper 10cm,” a purely-morphological change to the robot, induces substantial changes to the network’s image-processing parameters (e.g. layers conv2-conv7). We attribute this to the successful agent’s need for hand-eye coordination to complete the task: offsetting the gripper not only changes robot morphology, it changes the location of the robot in its own visual field drastically. In order to perform effective visual servoing with a new morphology, both the image and action-processing parts of the network must be updated.

5.7 Conclusion and Future Work

For robots to be able to operate in unconstrained environments, they must be able to continuously adapt to new situations. We empirically studied this challenge by evaluating a state-of-the-art vision-based robotic grasping system, and testing its robustness to a range of new conditions such as varying backgrounds, lighting conditions, the shape and appearance of objects, and robot morphologies. We found that these new conditions degraded performance of the trained grasping system substantially.

Motivated by this initial study, we explored how to adapt vision-based robotic manipulation policies by fine-tuning with off-policy reinforcement learning. Our large-scale study shows that combining off-policy RL with a very simple fine-tuning procedure is an effective adaptation method, and this method is capable of achieving remarkable improvements in robot performance on new tasks with very little new data. Furthermore, our continual learning experiment shows that using this simple method in a continual setting imposes very little performance penalty compared to the single-step setting. This suggests that the combination of off-policy RL and fine-tuning can serve as a building block for future continual learning methods.

Our results comparing supervised learning-based initialization to initialization with RL highlight a familiar truism about robotics: that robotic agents must do more than perceive the world, they must also act in it.
The ability to learn the combination of these two capabilities is what makes RL well-suited for creating continually-learning robots.

While this chapter demonstrated promising results on a real-world robotic grasping system under a wide range of scenarios, both perceptual and physical, further work is needed to understand how such adaptation performs on a broader range of robotic manipulation tasks. In future work, we would like to further assess our method’s suitability for continual adaptation, by using longer continual learning sequences, and measuring forward and backward transfer during continual fine-tuning. We also plan to open source the models and datasets used in this work, to aid research in offline RL and off-policy evaluation metrics.

5.8 Project Website and Experiment Videos

Please see our project website at https://ryanjulian.me/continual-fine-tuning for experiment videos and an overview of the method. To access the experiment videos directly, see https://youtu.be/pPDVewcSpdc. Experiment videos start at 2:52.

5.9 Additional Experiments on Forgetting

Figure 5.9 presents the opaque-to-transparent object grasping simulation experiment from Figure 5.4a, but includes a lower pane showing the performance of the fine-tuned policy on the base task (opaque object grasping) at every step of training. The simple fine-tuning method presented in Section 5.4.1 (“FT”) retains base task performance, even as it is adapted to solve the target task. When trained with our method, the target task column of the PNN model (“FT + PNN”) starts at around 25% performance, drops to below 10% performance, then recovers base task performance in about 300,000 gradient steps. As the training process does not modify the parameters of the policy for the base task, the base task column of the PNN model (“PNN”) retains its original performance on the base task.

Figure 5.9: Grasping performance of four transfer methods on a simulated fine-tuning scenario, in which the robot is trained to grasp opaque objects using 3500 attempts, but must adapt to transparent objects using only 75 attempts. Top: Performance on the target task (grasping transparent objects). Bottom: Performance on the base task (grasping opaque objects). Scratch: Train a new policy from scratch (3500 attempts). PNN: Progressive Neural Networks. FT: Fine-tuning method from Sec. 5.4.1. FT + PNN: PNN model trained using the FT method.

5.10 Assessing Fine-Tuning Techniques for End-to-End Robotic Reinforcement Learning

We propose a conceptual framework for fine-tuning algorithms, and use simulation experiments to assess the suitability of some algorithm variations for end-to-end robot learning. “Fine-tuning” refers to a family of transfer learning techniques in which we seek to acquire a neural network for one task (which we will refer to as the “target” task) by making use of some or all of a network trained on a related task (the “base” task). This is a very common technique for quickly acquiring new tasks in computer vision [51, 108, 134] and natural language processing [107]. As collecting new robot experience data is expensive, our goal is to use as little target task data as possible.
In this section, we first describe the general algorithmic sketch for fine-tuning, then enumerate some of the most common fine-tuning techniques. We evaluate the suitability of these techniques for end-to-end robot learning in Sections 5.4, 5.5, and 5.6.

5.10.1 Fine-Tuning: Conceptual Framework

We can organize fine-tuning for end-to-end reinforcement learning into five essential steps (see Fig. 5.3 for a detailed schematic). Different fine-tuning techniques change the details of one of these steps.

1. Pre-training: Pre-train a policy to perform some base task, which is related to our target task. In the experiments in this chapter, the base task is always indiscriminate object grasping. In computer vision and NLP, this step can often be skipped by making use of one of many pre-trained and publicly-available state-of-the-art vision and language models. We hope for a future in which this is possible in robotics.

2. Exploration: Explore in the new target task, to collect data for adaptation. In principle, any policy may be used for exploration in off-policy reinforcement learning. In our study, however, we always use the pre-trained policy for exploration, which we believe to be most representative of a real-world continual learning scenario.

3. Initialization: Initialize the policy for the target task using some or all of the weights from the pre-trained policy. The standard implementation of this step is to start with the entire pre-trained network. Some techniques may choose to use only a subset of the pre-trained network (e.g. truncating the last few layers of a CNN).

4. Adaptation: Use the exploration data to update the initialized policy to perform the new task. The standard version of this step continues updating the entire initialized policy with the same algorithm and hyperparameters as was used for the pre-training process, but with the target task data. There are many variations on this step, including which parts of the network to update, at what learning rate, with what data, with which optimization algorithm, whether to add additional network layers, etc.

5. Evaluation: Assess performance of the fine-tuned network on the new task. If this step only happens once, we refer to such a technique as “offline fine-tuning,” because the adaptation step never uses data from an updated policy. If this step happens repeatedly (e.g. exploration and evaluation are one-and-the-same), and its result is used for further adaptation to the same target task, we refer to a technique as “online fine-tuning.” We explore both variations in our experiments.

Using this fine-tuning framework, we consider several variations of fine-tuning, and assess their suitability for end-to-end robotic RL. Notably, we neglect an analysis of pre-training techniques for fine-tuning reinforcement learning (i.e. step (1)), which has a large and rapidly-growing body of research in the meta- and multi-task RL communities (see Sec. 5.2). Instead, we focus on initialization (3) and adaptation (4). All of our experiments use end-to-end off-policy reinforcement learning of an indiscriminate object grasping task for their pre-training step. Refer to Section 5.3.1 for details on our pre-training process.
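The skeleton below makes this five-step framework concrete as a generic driver loop. It is only an illustrative sketch: the callables (pretrain, explore, initialize, adapt, evaluate) are hypothetical placeholders, not the QT-Opt implementation used in our experiments, and the offline/online distinction follows the description in step 5.

```python
def finetune(base_task, target_task, pretrain, explore, initialize, adapt, evaluate,
             online=False, max_rounds=1):
    """Generic sketch of the five-step fine-tuning framework."""
    base_policy = pretrain(base_task)              # 1. Pre-training on the base task.
    dataset = explore(base_policy, target_task)    # 2. Exploration with the pre-trained policy.
    policy = initialize(base_policy)               # 3. Initialization from pre-trained weights.
    success_rate = None
    for _ in range(max_rounds):
        policy = adapt(policy, dataset)            # 4. Adaptation using target-task data.
        success_rate, new_data = evaluate(policy, target_task)  # 5. Evaluation on the target task.
        if not online:
            break                                  # Offline fine-tuning: evaluate once and stop.
        dataset = dataset + new_data               # Online fine-tuning: fold evaluation data back in.
    return policy, success_rate
```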
Figure 5.10: Comparison of fine-tuning performance for a policy which uses all base parameters, and a policy which initializes the head parameters from scratch. Re-initializing parameters has a negative effect on sample efficiency for fine-tuning.

5.10.2 Experiments in simulation

We use simulation experiments to evaluate the suitability of some fine-tuning variations, along the axes we defined in Section 5.10.1.

5.10.2.1 Adding a new head and other selective initialization techniques

Selective-initialization techniques start the fine-tuning process with a policy that has some of its parameters initialized randomly. For example, a popular variant is to “add a head” to a pre-trained neural network by omitting its last few layers from initialization, so that the new head can be trained to perform the target task.

Figure 5.10 portrays a study of partial initialization for online fine-tuning using a simulated grasping experiment. In this experiment, the base task is “grasp opaque blocks” and the target task is “grasp semi-transparent blocks.” The base policy performance is 98% when trained from scratch on 43,000 grasp attempts. Both fine-tuned policies begin with low performance, around 15%. After 5000 exploration grasps (12% of the data used for the base policy), the performance of the full initialization policy has reached the base policy performance, while the policy with a new head has barely reached 30%. This gap shows that the combination of off-policy RL and selective initialization is unsuitable for sample-efficient fine-tuning.

Our experiments immediately make apparent the downsides of selective initialization for fine-tuning. In particular, online fine-tuning requires maintaining a policy that can competently explore the target task at all times. Any method which compromises the performance of such a policy – even temporarily – has a high risk of failing as a sample-efficient fine-tuning technique. The resulting performance gap, once created, is hard to recover. As a consequence, we find in simulation experiments that online fine-tuning with selective re-initialization takes a significant fraction of the pre-training samples to converge to baseline performance, making this family of fine-tuning methods sample inefficient.

5.10.3 Training with a mix of data from the base and target tasks

We experiment with mixing data from the pre-training task into the fine-tuning process (Fig. 5.11), and find this has a predictable relationship with sample efficiency in simulation: higher shares of target task data allow the fine-tuning policy to achieve higher performance faster.

Figure 5.11: Performance curve for an online fine-tuning simulation experiment. The base policy is pre-trained to grasp opaque colored blocks, and the target task is to grasp semi-transparent blocks. Each curve represents a different fraction of target task data, and the remaining data is sampled from the base task. In simulation, the amount of target task data has a straightforward relationship with sample efficiency.
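The mixing procedure itself is simple; the sketch below shows one way to draw such mixed batches. It is an illustrative sketch only: the transition-list representation and batch size are assumptions, not the replay machinery used in our experiments.

```python
import random

def sample_mixed_batch(target_data, base_data, target_fraction, batch_size=32):
    """Draw a training batch containing a fixed fraction of target-task transitions.

    `target_data` and `base_data` are lists of transitions; `target_fraction`
    values such as 0.01, 0.24, 0.44, 0.71, and 0.99 mirror the curves in Fig. 5.11.
    """
    n_target = round(target_fraction * batch_size)
    batch = random.choices(target_data, k=n_target)
    batch += random.choices(base_data, k=batch_size - n_target)
    random.shuffle(batch)  # Avoid ordering effects within the batch.
    return batch
```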
Our goal is to design a fine-tuning algorithm for real robots which might be used for continual learning, and our conclusion from this brief study is that online fine-tuning is a poor fit for this goal. In particular, the experiments with selective re-initialization highlight the primary challenge of online fine-tuning: it only allows us to use algorithms which preserve the exploration ability of the policy at all times. Offline fine-tuning is more practical than online fine-tuning, due to the inherent complexity of placing a robot in the loop of a reinforcement learning algorithm. If used as part of a continual learning method, an offline method would also allow a robot to collect data on a new task piecemeal, and only attempt to adapt to that new task when it has collected enough data to succeed.

5.11 Additional Experiment Details

5.11.1 Pre-Training Procedure

First, we train a Q-function network offline using data from 580,000 real grasp attempts over a corpus of 1,000 visually and physically diverse objects. Second, we continue training this network online‡ over the course of 28,000 real grasp attempts on the same corpus of objects. That is, we use a real robot to collect trials using the current network, update the network using these new trials, deploy the updated network to the real robot, and repeat.

‡ Following the example set by Kalashnikov et al. [119], we refer to this procedure as “online” rather than “on-policy” as the policy is still updated by the off-policy reinforcement learning algorithm.

5.11.2 Robustness Experiments with the Pre-Trained Policy

Background: We observe that conventional variations in the workspace surface, such as uniform changes in color or specularity, have no effect on the base policy’s performance.

Lighting conditions: The base policy was trained in standard indoor lighting conditions, with no exposure to natural light or significant variation. We observe that mild perturbations in lighting conditions (i.e., those created by standard-intensity household lights) have no effect on the base policy’s performance.

Gripper shape: No additional details.

Robot morphology: Note that during training this policy experienced absolutely no variation in robot morphology. We observe that translating the gripper laterally by up to 5 cm has no impact on performance.

Unseen objects: Based on our experiments, the system is robust to a broad variety of previously-unseen objects, as long as they have significant opaque components. For example, even though there are no drinking bottles in the training set, we find the system is able to pick up labeled drink bottles with a 98% success rate. Success rates for other novel, opaque objects are similarly consistent with the baseline performance on the test set.

5.11.3 Comparison Methods from Section 5.4

The experiments “Scratch” and “ResNet 50 + ImageNet” both use 800 exploration grasps and the same update process as the other experiments.

5.11.3.1 “Scratch”

“Scratch” starts the grasping network with randomly-initialized parameters.

5.11.3.2 “ResNet 50 + ImageNet”

“ResNet 50 + ImageNet” refers to training a grasping network with an equivalent architecture to the other experiments, but with its convolutional layers replaced with a ResNet 50 architecture and pre-loaded with ImageNet features. We initialize the remaining fully-connected layers with random parameters, and concatenate the action input features at the end of the CNN (rather than adding them in the middle of the CNN, as in the original architecture).
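To make this architectural variant concrete, the sketch below assembles a ResNet 50 + ImageNet Q-function of the kind described above using Keras. It is only a rough sketch under stated assumptions: the fully-connected layer sizes, action dimensionality, and input resolution are illustrative choices, not the configuration used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_resnet50_q_function(image_shape=(224, 224, 3), action_dim=4):
    # ImageNet-pretrained convolutional trunk replacing the original grasping CNN.
    trunk = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=image_shape, pooling="avg")

    image_in = layers.Input(shape=image_shape, name="image")
    action_in = layers.Input(shape=(action_dim,), name="action")

    z = trunk(image_in)                                 # Image features from ResNet 50.
    a = layers.Dense(64, activation="relu")(action_in)  # Action features.
    h = layers.Concatenate()([z, a])                    # Concatenated at the end of the CNN.
    h = layers.Dense(256, activation="relu")(h)         # Randomly-initialized FC layers.
    q = layers.Dense(1, name="expected_return")(h)      # Q-value (expected return).
    return Model(inputs=[image_in, action_in], outputs=q)
```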
Note that in this comparison, the fine-tuning process still updates all parameters, including those of the ResNet50 sub-network.

5.11.4 Data Collection and Performance Evaluation

To reduce the variance of our evaluation statistics, we shuffle the contents of the bin between each trial by executing a randomly-generated sequence of sweeping movements with the end-effector.

5.11.5 Continual Learning Experiment

“Base” refers to the baseline grasping policy before fine-tuning, and “Single” refers to the best performance from the single-step fine-tuning experiment in Table 5.2. Note that because it is the first step of the continual learning experiment, the policy for “Harsh Lighting” is identical to that of the 800-grasp variant of the single-step experiment.

Chapter 6

Skill Builder: A Simple Approach to Continual Learning for Manipulation

In order to be effective general purpose machines in real world environments, robots not only will need to adapt their existing manipulation skills to new circumstances, they will need to acquire entirely new skills on-the-fly. A great promise of continual learning is to endow robots with this ability, by using their accumulated knowledge and experience from prior skills. We take a fresh look at this problem, by considering a setting in which the robot is limited to storing that knowledge and experience only in the form of learned skill policies. We show that storing skill policies, careful pre-training, and fine-tuning with on-policy reinforcement learning (RL) are sufficient for building a continual manipulation skill learner. We call this approach “Skill Builder,” because it acquires new skills by building on old ones, such that learning a new skill can never cause the robot to forget an old one. We first use simple experiments to understand how to best re-use a skill policy library. We introduce the method, and use the challenging Meta-World simulation benchmark to demonstrate Skill Builder’s ability to continually acquire robotic manipulation skills without forgetting, and using far fewer samples than needed to train them from scratch. We then reduce efficient continual learning with our method to the problem of inferring the optimal skill pre-training and fine-tuning curriculum. Finally, we introduce a novel policy class for rapid skill acquisition with a skill library, and share promising results towards using it for continual learning.

6.1 Introduction

Reinforcement learning (RL) with rich function approximators—so-called “deep” reinforcement learning (DRL)—has been used in recent years to automate tasks which were previously impossible with computers, such as beating humans in board and video games [175, 176] and navigating high-altitude balloons [12]. In the field of robotics, DRL has shown promise by allowing robots to automatically learn sophisticated manipulation behaviors end-to-end from high-dimensional multi-modal sensor streams [144], and to quickly adapt these behaviors to new environments and circumstances [116]. What remains to be seen is whether DRL can bridge the significant gap from efficiently adapting existing skills to efficiently acquiring entirely new skills. If such a capability could be applied repeatedly throughout the life of a robot (i.e. continual learning), we stand the chance of unlocking new possibilities for physical automation with general purpose robots, much as general purpose computers unlocked theretofore unforeseen possibilities for information automation half a century ago.
While not the only relevant formulation, episodic, continual, multi-task reinforcement learning is a worthy problem setting, because it describes this skill acquisition capability we seek. In this setting, we ask the robot to acquire new manipulation skills repeatedly, using time-delineated experiences of attempts at those skills (episodes), and some durable store of previously-acquired knowledge. The possibilities for the form of this store seem endless, but are actually bound to only two possibilities by construction: an RL system consumes raw data in the form of experiences, and outputs processed data in the form of parameters, which either directly specify a policy function, or condition a policy decision rule by specifying one or more other functions* (e.g. a value function, Q-function, transition model, etc.). So a continual reinforcement learning robot can store one or both of (1) experience data or (2) parameters.

Among these two options, there are many reasons to prefer parameters over data. Keeping a comprehensive dataset of all prior experience in local storage quickly becomes intractable for a single robot. Even if stored remotely in a “cloud” and retrieved, no server could quickly locate and retrieve a subset of that data relevant to a new task without first processing it into parameters itself. Parameters not only allow a single robot to store all of its skills locally, they are also more practical to share and disseminate than datasets (precisely because they are already processed). Consider the success of pre-trained models in language and computer vision: these are computed at great expense by institutions with immense datasets, storage, and compute resources. These institutions share them for the benefit of the entire community, who can then quickly re-use them for myriad applications. The parameters for state-of-the-art language and computer vision models fit on a cell phone in the palm of your hand, but require warehouse-sized machines to compute. While the best continual learning robot will likely store a mix of both data and parameters, we believe parameters deserve an especially-enthusiastic study. For the sake of simplicity, in this chapter we shall focus on them in isolation.

* Neglecting the degenerate case of a policy computed just-in-time from raw data.

If we imagine a continual learning robot which can store only skill policies but no experiences, how can it use them for acquiring new skills? Skill policies have two uses: their parameters directly and—uniquely to reinforcement learning—the behaviors they indirectly encode. In this chapter, we will show that under this skill storage-only assumption, efficient skill acquisition requires using both skill policies’ parameters and the behaviors they encode.

In Section 6.3, we first use experiments with imitation learning and fine-tuning to gain insight into how we can re-use both skill policy parameters and their behaviors to acquire new skills rapidly, by exploiting shared structure between tasks. We formalize our continual learning setting in Section 6.4, and describe a simple continual learning method which uses an expanding library of skill policies as its knowledge store, and on-policy fine-tuning with Proximal Policy Optimization (PPO) [234] for new skill acquisition. We call this simple approach “Skill Builder” because it acquires new skills by building on old ones, such that learning a new skill can never cause the robot to forget an old one.
Using experiments with the challenging Meta-World benchmark, we show that Skill Builder can achieve continual learning for robotic manipulation skills. We reduce the 168 efficient continual learning problem in this setting to determining the optimal skill-skill transfer curriculum for a given set of target tasks in Section 6.5, and use further continual learning experiments to verify our curriculum selection algorithm. We then use this analysis to formulate a promising policy class for continual learning in Section 6.6, which allows our learner to use both policy parameters and behavior from all skills at once, rather than just one at a time, and show promising early results towards using it for continual learning. 6.2 Related Work Reinforcement learning for robotics Reinforcement learning has been studied for decades as an approach for learning robotic capabilities [127, 162, 152, 245]. In addition to manipulation skills [145, 119, 199, 83, 78, 296], RL has been used for learning locomotion [131, 130, 280, 88], navigation [16, 301], motion planning [244, 60], autonomous helicopter flight [9, 1, 185], and multi-robot coordination [168, 284, 155]. The recent resurgence of interest in neural networks for use in supervised learning domains such as computer vision and natural language processing, (i.e. “deep learning” (DL)) [14], corresponded with a resurgence of interest in neural networks for reinforcement learning (i.e. “deep reinforcement learning” (DRL)) [73, 175]. With it came a wave of new research on using RL for learning in robotics and continuous control [176, 150], though the fields of neural networks, reinforcement learning, and robotics have overlapped continuously since each of their inceptions [127, 91]. Transfer, continual, and lifelong learning for robotics Transfer learning is a heavily-studied problem outside the robotics domain [51, 106, 49, 42, 205]. Many approaches have been proposed for rapid transfer of robot skill policies to new domains, including residual policy learning [240], simultaneously learning across multiple goals and tasks [216, 221], methods which use model-based RL [68, 286, 180, 30, 87, 43, 31, 41, 123, 171, 209, 115], and goal-conditioned RL [3, 183, 194, 201, 293]. All of these share data and 169 representations across multiple goals and objects, but not skills per se. Similarly, work in robotic meta- learning focuses on learning representations which can be quickly adapted to new dynamics [178, 4, 179] and objects [71, 112, 290, 20], but has thus far been less successful for skill-skill transfer [292]. Pre-training methods are particularly popular, including pre-training with supervised learning [47, 144, 70, 84, 199], experience in simulation [223, 263, 224, 253, 188, 222, 196, 104, 92], auxiliary losses [213, 172, 228], and other methods [236, 97]. While successful, these methods are often designed for domain transfer rather than skill-skill transfer, require significant engineering by hand to anticipate specific domain shifts, and are designed for single-step rather than continual transfer. Similar to Julian et al. and Nair et al., our work uses the very simple approach of on-line fine-tuning to achieve rapid adaptation. Lifelong and continual learning have long been recognized as an important capability for autonomous robotics [261]. Like Taylor, Stone, and Liu, our approach to continual learning relies on rapidly adapting policies for an already-acquired skill into a policy for a new skill. 
Much like Cao, Kwon, and Sadigh, Bod- nar et al., and Kumar et al., this chapter uses experiments to analyze different transfer techniques from a geometric perspective on the skill-skill adaptation problem. As in prior work [159, 287], this study observes that the selection of pre-training tasks is essential for preparing RL agents for rapid adaptation. Our work uses experiments to formulate a decision rule for how to pre-train our skills. A comprehensive overview of literature in continual reinforcement learning beyond robotics is beyond the scope of this work, but please see Khetarpal et al. for an excellent survey. Reusable skill libraries for efficient learning and transfer Learning reusable skill libraries is a classic approach [83] for efficient acquisition and transfer of robot motion policies. Prior to the popularity of DRL- based methods, Associative Skill Memories [193] and Probabilistic Movement Primitives [217, 300] were proposed for acquiring a set of reusable skills for robotic manipulation. In addition to manipulation [254, 285, 109, 277, 270, 26, 148, 158, 136], DRL-based skill decomposition methods are particularly popular 170 today for learning and adaptation in locomotion and whole-body humanoid control [197, 94, 170, 147, 262]. Our work argues that once decomposed, these skill libraries are useful for rapid adaptation, and ultimately continual learning for manipulation with real robots. Hausman et al. proposed learning reusable libraries of robotic manipulation skills in simulation using RL and learned latent spaces, and Julian et al. showed these skill latent spaces could be used for efficient simulation-to-real transfer and rapid hierarchical task acquisition with real robots. As we also study in this chapter, learning reusable skill libraries requires exploring how new skills are related to old ones. As other works have pointed out [15, 242, 18, 243, 5], this can be achieved efficiently by re-using policies, representations, and data from already-acquired skills. Continual robot learning with skill libraries and curriculums Like ours, recent works have begun to use DRL with skill libraries for continual robot learning. They have explored maintaining a skill library in form of factorized policy model classes [169], learned latent spaces [157, 129, 98], options [96], policy mod- els which partition the state space [281], movement primitives [161], or as per-skill or all-skill datasets [266, 157, 98]. As in this work and others [266, 248], Fernández and Veloso proposed directly storing and re-using policies for continual learning in the context of robot soccer. Once acquired, these works propose various methods for reusing these skills, such as via online model-based planning [157], via sequencing, mixture, selection, or generation with online inference [281, 248, 98, 161], as a high-level action space for hierar- chical RL [96], and (as in this work) keeping a specific policy network for each new skill [169, 266, 129]. Intertwined with how to maintain such skill libraries is the question of how to update them throughout the life of the robot. Recent works have proposed using on-policy RL algorithms to directly update skills [169, 281, 248, 129, 161], using a continually-growing skill data buffer to update skill networks [157, 98], and repeatedly distilling the policy library [96, 266]. We believe that continual learning for manipulation is achievable by using modular skill libraries and repeated efficient adaptation to new tasks. 
Alet, Lozano-Pérez, and Kaelbling, Sharma et al., and Raziei 171 and Moghaddam have all recently proposed rapid adaptation methods for manipulation which make use of modular skill learning and re-use. This work seeks to extend some of those ideas, in simplified form, to the continual learning setting. In addition to our analysis of skill policy transfer and skill curriculums, we propose policy class and pre-training procedure for skill re-use which is inspired by Peng et al. and Tseng et al., but for continual reinforcement learning rather than imitation learning or meta-RL. See Narvekar et al. for a survey of curriculum learning in reinforcement learning. Like Fabisch et al., this chapter observes that continual skill learning is an active learning problem, and that measuring task novelty is an important capability for efficient active skill learning. Like and Foglino et al., we highlight the importance of skill curriculum, and propose a method for computing the optimal skill curriculum given an oracle for relative skill novelty, and show that these curriculums indeed make continual learning more efficient. 6.3 Understanding the Role of Parameters and Behaviors in Skill Policy Reuse In this section, we use experiments with imitation learning to better-understand how to reuse skill policy parameters and behaviors, and to inform the design of Skill Builder, which we introduce in Section 6.4. 6.3.1 Transferring Skill Parameters We begin our design study by measuring the effectiveness of different parameter reuse strategies. Our goal is to determine whether skill policy parameters are primarily useful for (a) extracting high-level features of the state with which a controller is easily learned, or (b) for encoding closed-loop controllers which use relatively little state featurization. Hypothesis (a), feature reuse, would suggest that we can efficiently adapt by learning new controllers conditioned on the early layers of a neural network (action transformation and 172 0 250 500 750 1000 1250 1500 1750 2000 Total Environment Steps 0 20 40 60 80 100 Success Rate (%) CarGoal BC Observation Alignment (CEM) Action Alignment (CEM) Observation Alignment (SGD) Action Alignment (SGD) Action Re-Alignment (SGD) Action Re-Alignment (CEM) Figure 6.1: Left: Performance of different target policy classes used to understand transferring skill pa- rameters. The shaded regions represent a 95% confidence interval. Right: A screenshot of the CarGoal environment. The goal region shown in the image is not observable by the policy. action feature transformation, below). Hypothesis (b), closed-loop controllers, would suggest that we can efficiently adapt by learning a transformation of the state space which relocates the novel goal’s position to a goal position which the agent has already learned to reach (observation transformation), or that we can learn a transformation of the actions from a known goal to a novel goal (action transformation). We summarize the model classes considered in our experiments below, and then review the results. Observation Transformation Observation transformation produces a policy to solve a target task by ap- plying a transformationT (s) to the observation before passing it to the base skill policy to generate actions. In our experiments we use an affine (linear transformation plus bias) transformation of the observation to simplify optimization, and to exploit simple geometric priors such as rigid transformation. 
As the optimization process is efficient, we choose a base policy by training a transformation T_τ for all τ ∈ T_base and using the lowest-loss member of the ensemble for the final target policy.

π_target(a | s) = π_base(a | T_τ(s))    (6.1)

Action Transformation Similarly to observation transformation, action transformation learns a low-dimensional affine transformation function, but instead of transforming the state input of a base skill policy, this function T_τ(a) transforms the action output. As in observation transformation, we train an ensemble of these target policies for a target task, and choose the one with the lowest loss.

π_target(a | s) = π_base(T_τ(a) | s)    (6.2)

Action Feature Transformation We find that naïve action transformation performs poorly, because the final output layer of π_base often destroys necessary information. Action feature transformation instead uses all but the last layer of the base policy (hereafter referred to as π_base^{-1}) to transform the current state into a featurization z. We then use a learned transformation T_τ to compute the parameters of a normal action distribution (μ_z, σ_z) = T_τ(z), and sample actions from N(μ_z, σ_z).

π_target(a | s) = N(μ_z, σ_z),  where (μ_z, σ_z) = T_τ(z) and z = π_base^{-1}(s)    (6.3)

Experiments and Results To understand the effectiveness of the above parameter reuse methods, we train RL agents in a simple but dynamically-rich multi-task environment we call CarGoal (Figure 6.1, Right), which is based on the CarRacing-v0 environment from the OpenAI Gym suite [22]. The RL agent receives the car’s 2D position as state and must choose throttle and steering actions to drive the car to a hidden goal location. Each goal location is a different task. The environment models slip and friction, and it is easy to lose control of the car. As in robotic manipulation, the CarGoal agent must contend with non-linear dynamics, and choose control actions which are non-trivially related to the state and reward.

The results of our experiments strongly support hypothesis (a), efficient featurization, and weakly refute hypothesis (b), sophisticated controllers. Both action re-alignment and observation alignment are successful at efficiently adapting to new goals during fine-tuning, but only action realignment consistently achieves
To discourage switching too often, the skill switching policy enforces hysteresis during sampling: it maintains a state,h of the the most recently selected skill, and continues to use that skill until another skill has a weight which is larger than the current skill. target (ajs;h) = base;h 0 (ajs) (6.4) where h 0 = 8 > > > > < > > > > : h ifW (s;h) +>W (s;)82T base arg max W (s;) otherwise (6.5) Experiments and Results We experiment with this skill switching method, again in the CarGoal envi- ronment. In these experiments, we pre-train a skill policy for each of 3 corners in a space, then attempt to quickly learn a skill policy for the fourth corner using skill switching. Our results (Figure 6.2) show that switching is effective, even if we limit the rate at which skills are switched, validating our hypothesis that reusing policy behavior allows for efficient adaptation in 175 0 5000 10000 15000 20000 25000 30000 35000 Gradient Steps 0 20 40 60 80 100 Success Rate (%) Car-Goal Hard Switching - Success Rate 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 5000 10000 15000 20000 25000 30000 Gradient Steps 0 5 10 15 20 Switching Rate (%) Car-Goal Hard Switching - Switching Rate 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Figure 6.2: Left: Performance of skill switching on CarGoal for various values of , a regularization which discourages switching too often. Right: Switching probability learned for various values of. Skill switching learns new policies quickly be re-using behaviors from policies in the skill library, even when skills selected must we uses for relatively-long horizons. dynamically-rich environments. We find that we are able to efficiently learn skill policies only for goals which are inside the convex hull formed by the pre-training skills, which adds evidence to our finding above about the relative narrowness of skill policies. The tasks we use skill switching to achieve in the CarGoal experiments all share the same dynamics, and tasks corresponded to goal regions, as shown in Figure 6.1. We fid that using skill switching to transfer skill policies for tasks in the Meta-World benchmark, where new tasks have new dynamics in addition to new rewards, was not effective. 6.4 Skill Builder: Simple Continual Learning with Skill-Skill Transfer In this section, we establish our formal continual learning setting, and introduce Skill Builder, our simple approach to continual learning for manipulation. We use experiments with the challenging Meta-World ma- nipulation benchmark [292] to establish that Skill Builder can achieve continual learning without forgetting. Later in Section 6.5, we will show how skill curriculums can make continual learning with Skill Builder more efficient than learning skills from scratch. 176 6.4.1 Problem Setting We formalize our continual learning problem as iterated transfer learning for multi-task reinforcement learn- ing (MTRL) on a possibly-unbounded discrete space of tasksT . As we are interested in learning robot manipulation policies, we presume all tasks inT share a single continuous state spaceS and continuous action spaceA, and the MTRL problem is defined by the tuple (T;S;A) Each task2T is an infinite- horizon Markov decision process (MDP) defined by the tuple = (S;A;p (s;a;s 0 );r (s;a;s 0 )). As tasks are differentiated only by their reward functionsr and state transition dynamicsp , we may abbreviate this definition to simply = (r ;p ). 
Importantly, we do not presume that the robot ever has access to all tasks inT at once, or even a representative sample thereof, and can only access one task at a time. We shall refer to time between task transitions an “epoch” and count them from 0, but in general two different epochs can be assigned the same task (i.e. tasks may reappear). When solving a task (hereafter, the “target task”), the robot only has access to skill policies acquired while solving prior tasksM (the “skill library”). Extending our problem to include these assumptions, we can say that a single epoch of this continual multi-task learning problem is defined by an infinite-horizon MDP (S;A;M i ;p i ;r i ), wherei is the epoch number andM 0 is the (possibly-empty) set of manipulation skills with which the robot is initialized. In this chapter, we will assume that the robot can choose which task to learn in each epoch, and also when to stop learning that task and begin a new epoch. In Section 6.5 we will discuss at length the implications of such decisions, and in Section 6.6, we will discuss some implications of howM 0 is chosen. 6.4.2 Simple Continual Learning with Skill-Skill Transfer Having convinced ourselves in Section 6.3 that efficient skill reuse is possible with RL fine-tuning, we introduce a very simple framework we call Skill Builder (Figure 6.3). A Skill Builder starts with a pre- trained set of base skillsM 0 . On each epochi, it chooses a target task and a base skill policy base 2M i , 177 Pre-Train Adapt Expand Library pick and place π 1 π 1 π 2 π 3 insert peg π 2 push π 3 Skill Library π 1 π 2 π 3 open drawer π 4 π 1 π 2 π 3 open drawer π 4 ✓ ✓ ✓ ✓ π 4 ?? Figure 6.3: Overview diagram of Skill Builder. Starting with a library of pre-trained skills, the robot acquires a new skill by selecting an existing skill from its library and fine-tuning it with PPO [233] in the new training environment. Once acquired, the new skill is added to the library, and the robot repeats the process. and uses an RL algorithmF to fine-tune a clone of base to solve, which returns a new policy target and the success rate. It then adds target to the skill library, and continues (Algorithm 4). Algorithm 4 Skill Builder Continual Learning Framework 1: Input: Initial skill libraryM 0 , target task spaceT , RL fine-tuning algorithmF! (;), target task selection ruleChooseTargetTask, base skill selection ruleChooseBaseSkill 2: i 1 3: while not done do 4: ChooseTargetTask(T;M i1 ) 5: base ChooseBaseSkill(T;M i1 ) 6: target ; F(;clone( base )) 7: M i f target g[M i1 8: i i + 1 9: end while 10: Output: Skill libraryM i In this chapter, we use on-policy fine-tuning with PPO for the fine-tuning algorithmF, but in generalF may be any parametric RL algorithm, including an off-policy algorithm. 178 “Warm-Up” Procedure for Value Function Transfer In order to tune a skill on a task using on-policy reinforcement learning, we also need a value functionV ;pi 0 (s) which estimates the expected return of that skill policy on the task. Although PPO learns a value function as it trains the policy, we found that using a value function not fitted to the current task destroyed the skill policy’s parameters before they could be fine-tuned by PPO. Copying value functions of the same skill on prior tasks is particularly ineffective, since those value functions are very likely to over-estimate initial performance, and are thus not admissible. 
To avoid this issue, before applying any gradient updates to the policy, we sample a batch of the skill’s behavior on the new task, and train a new value function to convergence on the Monte Carlo return estimates from those samples. We found that this “value function warm-up” procedure was sufficient to produce an accurate value function of any skill on the new task, and was necessary to perform fine-tuning effectively with PPO. 6.4.3 Skill Builder Learns Continually without Forgetting Figure 6.4: Images of Meta-World ML10 and MT10 environments used for experiments with Skill Builder. Left: Training tasks. Right: Test tasks. Before we consider whether Skill Builder can achieve continual learning efficiently, we will first consider whether it can achieve continual learning at all. That is, can Skill Builder (1) acquire manipulation skills for new task while (2) not forgetting skills for old tasks? While (2) is guaranteed by construction, we use 179 experiments with the challenging Meta-World ML10 robot manipulation benchmark, along with a naïve implementation of Skill Builder, to verify (1) Naïve Skill Builder with Dueling Fine-Tuning In order to determine whether Skill Builder can achieve continual learning, we first need to specify theChooseTargetTask andChooseBaseSkill decision rules. While Section 6.5 will discuss a more intelligent way of choosing skill curriculums, here we use simple rules designed to ensure progress is always made without wasting too many samples on base skill policies which do not work well for a target task. These rules run a tournament between the best-known skill policy for a task and a challenger, so we call this version “Dueling Fine-Tuning” (Algorithm 5). Based on the observation that successful base skill policies generally either improve rapidly in performance during fine- tuning or not at all, these rules trade-off between searching for a successful base skill for fine-tuning, and using samples to train a policy from scratch, should no good base policy candidate exist. First, we define a success criteria success , which is the desired minimum success each task, and initialize a skill library with a one-to-one mapping between tasks2T and skills. Tasks with no pre-trained skill are mapped to randomly-initialized policies, as in from-scratch training. For ChooseTargetTask, we choose randomly among the all2T which do not yet have a corresponding skill policy which has met the success criteria † . ForChooseBaseSkill, we choose both the existing skill for the target task from the skill library (the “incumbent”, incumbent ), and a different skill from the library at random (the “challenger”, challenger ). Skill Builder then fine-tunes both policies for a limited number of environment steps (generally not enough to fully fine-tune a task). If the challenger policy achieves a higher success rate the incumbent policy, the challenger replaces the incumbent in the skill library for that task. † Tasks with pre-trained skills are assumed to have already met the success criteria (Algorithm 5, Line 4) 180 Algorithm 5 Skill Builder: Dueling Fine-Tuning 1: Input: Initial skill libraryM 0 , target task spaceT , RL fine-tuning algorithmF ! 
Algorithm 5 Skill Builder: Dueling Fine-Tuning
1: Input: Initial skill library M_0, target task space T, RL fine-tuning algorithm F(τ, π) → (π, ω), success criterion ω_success
2: i ← 1
3: M_0 ← {(τ, π_random) | τ ∈ T, τ ∉ M_0} ∪ M_0
4: W ← {τ | τ has a pre-trained skill in M_0}
5: while not done do
6:   τ ← random_selection({τ ∈ T | τ ∉ W})
7:   π_incumbent ← M_{i−1}[τ]
8:   π_challenger ← random_selection(M_{i−1})
9:   π′_incumbent, ω_incumbent ← F(τ, clone(π_incumbent))
10:  π′_challenger, ω_challenger ← F(τ, clone(π_challenger))
11:  if ω_challenger > ω_incumbent then
12:    π_target, ω_target ← π′_challenger, ω_challenger
13:  else
14:    π_target, ω_target ← π′_incumbent, ω_incumbent
15:  end if
16:  M_i ← {π_target} ∪ M_{i−1}
17:  if ω_target ≥ ω_success then
18:    W ← {τ} ∪ W
19:  end if
20:  i ← i + 1
21: end while
22: Output: Skill library M_i

Figure 6.5: Performance of Skill Builder on Meta-World ML10 and ML45 using Dueling Fine-Tuning selection rules. (The plot shows test-task success rate vs. total environment steps for Dueling Fine-Tuning, compared against the ML10 state of the art and ML10 from-scratch training.)

Figure 6.5 shows the results of using Skill Builder with Dueling Fine-Tuning on the Meta-World ML10 challenge, which has 10 pre-training tasks and 5 test tasks. Skill Builder with Dueling Fine-Tuning achieves a success rate of 51% on ML10 after 95 epochs, consuming approximately 14M environment steps in the process. Achieving the same success rate by training the 5 test tasks from scratch consumes approximately 8M environment steps. These results indicate that naïve Skill Builder successfully learns continually without forgetting, but does so less efficiently than learning from scratch. In the next section, we will discuss how using skill curriculums can make Skill Builder more efficient.

6.5 Efficient Continual Learning with Skill Curriculums

In this section, we show that Skill Builder can be more efficient than from-scratch learning if it has a curriculum specifying which skills should be used to solve each task. In the next section, we will introduce a skill policy model for online curriculum selection, and share some promising results towards using it for continual learning with Skill Builder.

We first develop a notion of skill-skill transfer cost, by counting the number of samples needed to acquire a target skill starting from a given base skill. We then show that, given this metric, we can reduce efficient continual learning with Skill Builder to solving a Minimum Spanning Tree (MST) problem. We show the effectiveness of our skill curriculum selection algorithm by using it to continually learn all skills in the Meta-World MT10 benchmark using a fraction of the total samples needed to learn each skill from scratch.

6.5.1 Measuring Skill-Skill Transfer Cost

We define efficient continual learning as acquiring skill policies for a given set of tasks while consuming the lowest number of environment samples possible. Since Skill Builder reduces the continual multi-task learning problem to one of repeated skill-skill transfer, it follows that efficient continual learning with Skill Builder is equivalent to minimizing the sum of the environment steps used for each adaptation step. Without any prior on the relationship between two tasks, estimating such a quantity is difficult [241]. This is compounded by the fact that skill-skill transitions are not independent: in Skill Builder, the robot only has access to skill policies it acquired during adaptation to tasks it has already seen, so the lowest-cost skill-skill transition for any given epoch depends on the skills which were acquired in previous epochs. In this section, we are interested in determining whether it is possible to continually learn efficiently with Skill Builder.
Inferring the skill-skill transfer cost for a pair of tasks offline is itself a rich research problem, which we leave for another day. Instead, to make progress on our central question, we make two simplifying assumptions: (1) before continual learning begins, we can access the task space to build a cost metric for skill-skill transfer, and (2) we assume that skill-skill transfer costs are conditionally independent (i.e., as long as the robot has a skill policy for a manipulation task, the skill-skill transfer cost is independent of the skill transfer sequence it used to acquire that policy). Our experiments below with Meta-World MT10 will validate the conditional independence assumption. In Section 6.6 we introduce a skill policy class which shows promising results towards inferring the skill-skill transfer cost online.

To build our offline skill-skill cost metric, before continual learning begins, we train a skill policy from scratch for each task τ ∈ T. We then use these as base skill policies, and fine-tune a copy of each to solve each other target task. As we want to select base-target skill pairs which fine-tune quickly, we calculate the advantage of fine-tuning as the ratio of the areas under the curve (AUC) of the success rate for the fine-tuning experiment and the from-scratch experiment. This represents the advantage of fine-tuning over learning from scratch. Our chosen cost metric is simply the negative of this advantage:

A_{base→target} = AUC_{base→target} / AUC_{scratch→target}    (6.6)

C_{base→target} = −A_{base→target}    (6.7)

Figure 6.6 portrays A_{base→target} for each skill-skill pair in the Meta-World MT10 benchmark.

6.5.2 Curriculum Selection Algorithm

Our curriculum selection algorithm is based on the observation that we can interpret the skill-skill transfer cost matrix in Figure 6.6 as the weighted adjacency matrix of a densely-connected directed graph, with our skill-skill transfer cost metric C_{base→target} as the directed edge weights. Under this interpretation, we can extract the lowest cost of visiting all tasks by solving for the Minimum Spanning Tree (MST) using Kruskal's Algorithm [138].

To complete this MST formulation, we need to take into account the possibility that it is most efficient to train a skill policy from scratch, rather than starting learning with one which already exists in the skill library M. To achieve this, we add a "scratch" vertex to our graph representation, with out-directed edges from the scratch vertex to each task with edge weight −1, representing no advantage (i.e. a ratio of 1) compared to training from scratch. This would be represented in Figure 6.6 as an additional row, whose values are all 1.0.
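The cost metric is straightforward to compute from logged training curves. The sketch below assumes a hypothetical dictionary `success_curves` in which `success_curves[(base, target)]` holds the per-iteration success rates of fine-tuning the base skill on the target task, and `success_curves[("scratch", target)]` holds the from-scratch curve; the names and data layout are illustrative, not the thesis code.

```python
import numpy as np


def auc(success_curve):
    """Area under the success-rate curve over training (trapezoidal rule)."""
    return float(np.trapz(success_curve))


def transfer_cost_matrix(tasks, success_curves):
    """cost[base][target] = -AUC(base->target) / AUC(scratch->target)."""
    cost = {}
    for base in tasks:
        cost[base] = {}
        for target in tasks:
            advantage = (auc(success_curves[(base, target)]) /
                         auc(success_curves[("scratch", target)]))  # Eq. 6.6
            cost[base][target] = -advantage                         # Eq. 6.7
    # The "scratch" vertex: advantage 1.0 (cost -1.0) to every task.
    cost["scratch"] = {target: -1.0 for target in tasks}
    return cost
```

Because the cost is the negated advantage, a minimum-cost tree over this matrix is exactly a maximum-advantage tree, which is what the MST construction in the next subsection exploits.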
Figure 6.6: Matrix of skill-skill transfer costs for Meta-World MT10, using our cost metric. This figure shows the ratio of time steps required to learn a task from scratch compared to learning it by fine-tuning, using each other possible skill policy as a base skill for fine-tuning. Note that we always run at least a single training iteration, and use at most 12M time steps. This prevents the matrix from containing infinity along the diagonal and 0 in any entries. (The figure is a 10×10 heatmap titled "Predicted Skill-Skill Transfer Cost for Meta-World MT10," with the base environment on one axis and the target environment on the other, over the MT10 tasks button-press-topdown, door-open, drawer-close, drawer-open, peg-insert-side, pick-place, push, reach, window-close, and window-open.)

In addition to solving for the MST to find the optimal curriculum predicted by our cost metric, we can also solve for the Maximum Spanning Tree to find a predicted pessimal curriculum. See Figure 6.7 for examples of optimal and pessimal curriculum trees computed using our MST-based method for Meta-World MT10, using the skill-skill transfer cost data in Figure 6.6.

Our experience indicates that these graphs make intuitive sense, and tell us a few things about the structure of the task space and the most efficient tasks with which to pre-train for continual learning. For instance, the optimal curriculum begins with peg-insert-side, empirically the most challenging task in the benchmark, then uses it to solve pick-place, which is empirically the second most challenging task in the benchmark. The optimal curriculum is an approximate ordering of the tasks in the benchmark from most to least challenging, eventually terminating in solving reach, the easiest task. The optimal tree branches at button-press-topdown because that skill requires reaching to a location and pushing the gripper down. Its terminal child node door-open can be achieved by using the button-press-topdown skill, then simply pulling the robot arm backwards. Similarly, its other terminal child node drawer-close is solved even more easily, by reaching forward to close the drawer.

For the pessimal curriculum, the same observations about inter-task structure hold. The pessimal curriculum instead begins by learning the empirically-easiest task, drawer-close, which can be solved very quickly by simply reaching forward. It then solves the hardest task, peg-insert-side, which requires essentially the same number of samples as scratch training when starting from drawer-close. The remainder of the curriculum alternates between solving easier tasks and harder tasks, terminating in harder tasks. Each harder-easier transition destroys behaviors useful for future tasks, making the learning process take as long as possible.
Not only do these results suggest that our pairwise fine-tuning cost metric discovers structure in the task space useful for learning skills, they also suggest a somewhat counter-intuitive result: that the most efficient curriculum begins by learning the hardest tasks first, then using them to solve the easier tasks. Most curriculum learning methods in RL instead learn the easiest tasks first, then use them to generalize to harder tasks. These results suggest we should initialize our skill library M by learning policies for the hardest tasks first. They also confirm one of the major hypotheses behind Skill Builder: that transferring policy behavior, not parameters, is most important for efficient continual manipulation skill learning. We empirically confirm these results below with continual learning experiments using these curricula.

Figure 6.7: Left: Predicted optimal curriculum MST for MT10. Note that the curriculum begins by learning the hardest task first, then transfers those skills to easier tasks. Right: Predicted pessimal curriculum MST for MT10. (Both trees are rooted at the "scratch" node and span the ten MT10 tasks.)

Given the solved optimal curriculum tree T_optimal, Skill Builder learns all tasks continually by learning skills in the sequence produced by the tree traversal on T_optimal, starting from the root "scratch" node (Algorithm 6)‡.

‡ Note that if we find that a skill fails to fine-tune as quickly as predicted by our cost metric, we end training early, remove that edge from the graph, re-compute the curriculum using MST to find a better curriculum online, and continue learning under the new curriculum.

Algorithm 6 Skill Builder with MST-Based Curriculum
1: Input: Initial skill library M_0, target task space T, RL fine-tuning algorithm F(τ, π) → (π, AUC)
2: V ← T ∪ {scratch}
3: E ← {}
4: for τ_base ∈ T do
5:   E ← {(scratch, τ_base, −1.0)} ∪ E
6:   π_base, AUC_{scratch→base} ← F(τ_base, π_random)
7:   for τ_target ∈ T do
8:     π, AUC_{base→target} ← F(τ_target, π_base)
9:     C_{base→target} ← −AUC_{base→target} / AUC_{scratch→target}
10:    E ← {(τ_base, τ_target, C_{base→target})} ∪ E
11:  end for
12: end for
13: G ← (V, E)
14: T_optimal ← kruskal(G)
15: i ← 1
16: π_base ← π_random
17: for τ_target ∈ traverse(T_optimal) do
18:   π_target, ω ← F(τ_target, clone(π_base))
19:   M_i ← {π_target} ∪ M_{i−1}
20:   i ← i + 1
21: end for
22: Output: Skill library M_i
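The sketch below illustrates the curriculum construction in Algorithm 6, under a few stated assumptions: `cost` is the dictionary produced by the earlier `transfer_cost_matrix` sketch (including the "scratch" vertex), Kruskal's greedy rule is applied directly to that directed cost graph as described above, and the selected edges are then rooted at "scratch" for traversal so that each child task is learned by fine-tuning its parent's skill. The helper names are illustrative, not the thesis implementation.

```python
from collections import defaultdict, deque


def kruskal_curriculum(cost):
    """Greedily pick the cheapest edges that do not close a cycle (Kruskal's rule)
    over the directed skill-skill cost graph; return the chosen (base, target) edges."""
    vertices = set(cost) | {t for row in cost.values() for t in row}
    parent = {v: v for v in vertices}

    def find(v):                                   # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = [(c, base, target)
             for base, row in cost.items()
             for target, c in row.items() if base != target]
    chosen = []
    for c, base, target in sorted(edges, key=lambda e: e[0]):
        if find(base) != find(target):             # this edge keeps the tree acyclic
            parent[find(target)] = find(base)
            chosen.append((base, target))
    return chosen


def curriculum_order(tree_edges, root="scratch"):
    """Root the selected edges at `root` and return (parent_task, child_task) pairs in
    breadth-first order: each child is learned by fine-tuning the parent's skill."""
    neighbors = defaultdict(list)
    for base, target in tree_edges:
        neighbors[base].append(target)
        neighbors[target].append(base)
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in neighbors[node]:
            if child not in seen:
                seen.add(child)
                order.append((node, child))
                queue.append(child)
    return order
```

For example, `curriculum_order(kruskal_curriculum(transfer_cost_matrix(tasks, curves)))` yields the sequence of fine-tuning steps for the predicted optimal curriculum; sorting edges in descending order of cost instead would yield a predicted pessimal curriculum, as used for the comparison below.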
6.5.3 Measuring the Effectiveness of Curriculum Selection

We demonstrate the effectiveness of our curriculum selection algorithm in Figure 6.8, in which the total cost in environment steps is shown for each success rate we could use as a success criterion, calculated using the data from the skill-skill cost metric training experiments. The optimal curriculum computed using our curriculum selection algorithm uses fewer samples than training from scratch at all performance criteria. Both pessimal and random curriculums are far less sample efficient than our predicted optimal curriculum or training from scratch. Skill Builder with MST (portrayed as points on the frontier, because we cannot evaluate every possible experiment) outperforms training from scratch in many experiments, though the performance has high variance, corresponding with the seed sensitivity of fine-tuning neural networks. Taking such sensitivity into account when computing the cost metric is an interesting question for future work.

Figure 6.8: Comparison of the performance-sample efficiency frontier for several curriculums for Meta-World MT10 with Skill Builder. The high level of agreement between the actual and predicted curricula indicates that our skill-skill transfer cost metric accurately predicts the true cost to transfer skills. (The plot shows total environment steps across all tasks vs. success rate for the predicted and actual optimal, pessimal, and random curriculums, and for from-scratch training.)

6.6 Towards a Skill Policy Model for Active Curriculum Selection

In this section we introduce a novel policy model which allows us to rapidly acquire new skills by mixing parameters and behavior from an entire skill library at once, rather than by reusing only one skill at a time. We share promising results towards using this policy model class with Skill Builder for continual learning. As this policy model infers online which skill behaviors and parameters to use for acquiring a new skill, combining it with Skill Builder can be interpreted as a form of active curriculum selection for continual learning [72].

6.6.1 Policy Model and Pre-Training Procedure

In Section 6.5, we showed that we could use an offline cost metric calculation on the task space to solve for an efficient continual learning curriculum for use with Skill Builder. This has the obvious downside of requiring exhaustive evaluation of all skill-skill pairs in the task space. Prior work has shown that, for large enough task spaces, it is possible to build an inference model to predict the skill-skill transfer cost by pre-training an inference model using data from a representative sample of the task space [241]. However, the ideal algorithm for this setting would learn to infer skill-skill transfer cost online, rather than requiring offline pre-training.

Figure 6.9: Diagram of the Skill Mixer policy model class for active skill curriculum selection (panels: 1. Pre-Train, 2. Adapt, 3. Expand Capacity; a shared library of skill policies π_1, ..., π_k is combined by per-task, observation-conditioned mixing functions W(s)). This model class allows RL agents to reuse both parameters and behavior from all skills in a library during fine-tuning, and uses RL supervision to actively select which skills to explore with and which skills' parameters to update.

In order to perform this inference online, we would need to collect data on the effectiveness of all skills in the skill library online during training, and dynamically select which skills are most beneficial for the current target task. Previous works [69] have described how to achieve this with implicit inference on neural network parameters. We recall our observation from earlier in this chapter: for RL for manipulation skills, it is reusing skill behavior which is most important for continual learning performance. "How can we infer online which skill behaviors are most beneficial for learning the current task?" If we can somehow execute all skill policies from the skill library at once as the base policy during fine-tuning, on-policy RL thankfully provides us with the answer to the inference question, in the form of the policy gradient estimator.

We propose a policy model architecture for rapid fine-tuning of skill behaviors based on a probabilistic product-of-experts model. Our policy model class is similar to that proposed by Peng et al.
for use in adaptation for imitation learning, and to that proposed by Tseng et al. for meta-RL. The policy model (Figure 6.9) is composed of a library of skill policies π_{i=1...k}(a|s)§ and a learned observation-conditioned mixing function w^τ for each target task τ. Note that, unlike in previous sections, the number of skill policies in the library is independent of the number of tasks. The mixing functions share the skill library networks, and mix them by computing a weighted product of their Gaussian distributions, where the weights are the output of the learned observation-conditioned mixing function for that task:

π^τ(a|s) = (1 / Z(s)) ∏_{i≤k} π_i(a|s)^{w^τ_i(s)}    (6.8)

where π_i = N(μ_i(s), σ_i(s))    (6.9)

§ i.e., each skill policy is a Gaussian distribution parameterized by observation-conditioned mean and variance functions.

Pre-training  During pre-training, we select a small subset of the task space T_pretrain ⊂ T with which to initialize our skill library, and jointly train the independent mixing functions and the shared skill library for all pre-train tasks τ_pretrain ∈ T_pretrain¶. Note that this joint training through the mixing functions is especially important: it forces the algorithm to learn skill policies which can be effectively mixed by the task-specific mixing functions while also solving the tasks, i.e. it forces the policy class to learn to decompose the task policies into independent skills. The evidence presented in Section 6.5 suggests that the best tasks to use for pre-training are the most difficult ones.

¶ We find that distilling single-task policies for the pre-train tasks using behavioral cloning is effective, but this step may also be performed with RL.

Adaptation  To adapt the skill library to a new task, we initialize a new task-specific mixing function w^τ for that task, and use on-policy fine-tuning to select its weights. We keep the skill policies frozen, so that they do not forget their pre-trained parameters, but new, unfrozen skill library policies may be initialized and learned at this time, to expand the library capacity. The policy gradient loss measures the overall effectiveness of the skill mixing policy's behavior, and adjusts the relative weights of the skills in the library to maximize return on the new task.
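Because a weighted product of Gaussians is itself Gaussian (its precision is the weighted sum of the experts' precisions), the mixed policy in Equation 6.8 has a closed form that is cheap to sample from. The sketch below illustrates that composition; the `skills` and `mixing_weights` callables are hypothetical stand-ins for the frozen skill library and the per-task mixing network, and the sketch assumes the weights are nonnegative and not all zero. It is an illustration of the model class, not the thesis implementation.

```python
import numpy as np


def mix_skills(state, skills, mixing_weights):
    """Compose the product-of-experts action distribution of Equation 6.8.

    skills: list of callables, each mapping a state to (mu_i(s), sigma_i(s)).
    mixing_weights: callable mapping a state to k nonnegative weights w_i(s).
    Returns the mean and standard deviation of the (Gaussian) mixed policy.
    """
    w = np.asarray(mixing_weights(state))                   # shape (k,)
    mus = np.stack([skill(state)[0] for skill in skills])   # shape (k, action_dim)
    sigmas = np.stack([skill(state)[1] for skill in skills])
    precisions = w[:, None] / (sigmas ** 2)                 # w_i / sigma_i^2 per expert
    mixed_var = 1.0 / precisions.sum(axis=0)                # weighted precisions add
    mixed_mu = mixed_var * (precisions * mus).sum(axis=0)   # precision-weighted mean
    return mixed_mu, np.sqrt(mixed_var)


def sample_action(state, skills, mixing_weights, rng=None):
    """Sample from the mixed policy. During adaptation, only the mixing
    function would be trained (e.g., by policy gradient); the skill experts
    themselves stay frozen."""
    rng = rng or np.random.default_rng()
    mu, sigma = mix_skills(state, skills, mixing_weights)
    return rng.normal(mu, sigma)
```

The closed form also makes the normalizer Z(s) in Equation 6.8 implicit, since renormalizing a product of Gaussians just yields another Gaussian with the parameters computed above.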
6.6.2 Early Results for Continual Learning

Figure 6.10: Examples of successful and failed policy executions for adapting to the window-open task. Left: The skill mixing policy reaches to the window and pushes, but chooses the wrong side from which to push. Right: The skill mixing policy successfully opens the window.

Figure 6.11 shows the results of using this policy class, pre-training procedure, and adaptation procedure for adaptation within Meta-World MT10 (Figure 6.4). In this experiment, the pre-train tasks are the 7 MT10 tasks not listed in the legend. We find that this skill mixing policy class is very effective as a prior for adaptation, partially solving two out of three test tasks in just 250,000 environment steps. All prior policies start training with equal or higher average returns than training from scratch. The drawer-close environment is solved on the first optimization epoch.

Figure 6.11: Skill Mixer policy performance during fine-tuning, as measured with average discounted return. The skill mixing policies fine-tune very quickly to high-return policies for all 3 target tasks. Despite this, some policies fail to achieve full success on the task due to inadequate exploration. (The plot shows average return vs. total environment steps for window-open, push, and drawer-close, each compared against training from scratch.)

The task this fine-tuning procedure does not solve is window-open, which has an interesting failure mode (Figure 6.10). The skill mixing policy immediately learns to reach to, then push on, the window handle. Unfortunately, because it was pre-trained with window-close, the prior policy behavior pushes from the right side of the window handle towards the left (Figure 6.10, Left). A successful policy would push from the left side of the window handle towards the right (Figure 6.10, Right). Our RL agent chose this policy because it failed to explore reaching to the left side of the window handle. We find that a naïve application of the policy model class from Peng et al. results in RL skill policies with low entropy, so though they encode excellent priors, they fail to explore. In future work, we will consider how to retain the excellent behavior reuse properties of this model class while encouraging online exploration, and combine it with Skill Builder to achieve online skill curriculum learning with structured exploration.

6.7 Conclusion

In this chapter, we introduced Skill Builder, a simple approach to continual learning for robotic manipulation, based around a growing library of skill policies and repeated skill-skill fine-tuning. We first informed the intuition behind Skill Builder's design with simple skill policy adaptation experiments, which show that both skill policy parameters and behavior are essential for efficient policy adaptation. We then introduced the method as an abstract framework, and showed that a naïve implementation of that framework (Dueling Fine-Tuning) achieves continual skill learning without forgetting, but is not more efficient than training all skills from scratch. We discussed the importance of skill curriculums for efficient continual learning, reduced the efficient continual learning problem with Skill Builder to minimizing total skill-skill transfer cost, developed an offline metric for measuring that cost based on skill-skill fine-tuning performance, and used this metric to solve for optimal and pessimal curriculums using an MST-based algorithm. We found that the optimal and pessimal curriculums produced by this MST-based curriculum algorithm are not only intuitive, but also make clear a rather unintuitive result: that it is best to pre-train continual learners with the hardest manipulation skills first. We tested these curriculums using a curriculum-based version of Skill Builder to continually learn Meta-World MT10, and verified that continual learning with the optimal curriculum outperforms training from scratch, and that the pessimal curriculum performs much worse than both training from scratch and a random curriculum. Finally, we introduced a policy model class for active online curriculum selection during fine-tuning, and shared promising results towards using it for continual learning.

In future work, we look forward to further developing our Skill Mixing policy model class for active online curriculum selection, and integrating it with Skill Builder to achieve efficient continual learning. We will also investigate different skill-skill transfer cost metrics, especially those which might be inferred online, or by learning an inference model during pre-training which predicts the cost of learning future skills given the pre-train tasks.
Chapter 7

Conclusions

Benchmarks, baselines, and novel methods together form the three-legged stool of research in artificial intelligence. Robotics can only make progress towards the goal of real-world, general-purpose, continually-learning robots by periodically advancing each of these legs, hopefully creating a virtuous cycle of new challenges, new systems, new solutions, and ultimately new knowledge. This thesis has followed one such cycle for a small slice of the field, namely continual learning for robotic manipulation.

We began in Chapter 2 with new possibilities for generalization in manipulation, brought about by advances in reinforcement learning. We learned that representation learning methods from reinforcement learning can allow us to amortize the cost of skill pre-training and simulation-to-real transfer over a combinatorial space of tasks, greatly reducing the per-task cost of training. We achieved this by using representation learning to separate the representation and parameterization of manipulation skills from their controllers. The representations are grounded in semantics, and thus transfer directly from simulation to the real world. The controllers are grounded in the physical dynamics of the robot and the world, and so take substantial effort and system complexity to transfer. Using representation learning for skill decomposition allowed us to minimize the number of skills we had to learn and transfer, and, once transferred, to maximize the number of tasks we could solve with those skills by composing them in the learned latent space. We also learned in Chapter 2 that, in practice, the then-state-of-the-art RL algorithms could only learn, and transfer to real-world robots, manipulation skills of very limited sophistication.

Taking what we learned about these limitations, in Chapter 3 we introduced Meta-World, a new benchmark and evaluation for multi-task and meta-reinforcement learning. This benchmark expanded the available RL benchmarks for generalization in robot manipulation from just a few goal-conditioned environments such as reaching, pushing, and pick-and-place, to 50 diverse environments grounded in real robotics problems and limitations, each having substantial intra-skill variation. Since its release, Meta-World has appeared in over 276 scholarly publications, and is continuously updated and improved by members of the research community. Our experiments applying state-of-the-art meta-RL and multi-task RL algorithms to Meta-World showed that current methods excel at generalizing to intra-skill variations such as different object and fixture locations, but fail to achieve inter-skill generalization, which is essential for the continual multi-task learning robots we seek. Meta-World also taught us that current RL systems, especially baseline algorithm libraries, struggle to train on such large challenges, and are not designed to contend with the complexity of meta-RL and multi-task RL in such diverse settings.

These lessons about system complexity led us in Chapter 4 to design Garage, a library which allows RL researchers to easily define, modify, disseminate, and reproduce complex RL experiments. We designed Garage for modularity, composability, and legibility, making it especially useful for multi-task and meta-RL problems. We demonstrated this by making it the single most comprehensive source for high-quality implementations of RL algorithms for generalization in continuous control.
We made the project an open-source, community-led effort from the start, so that it can grow and evolve with the needs of the research community. We disseminated Garage to the community, where it now has a rapidly-growing base of between 500 and 1,000 active users. We then used Garage to better understand why results in meta-RL and multi-task RL are so difficult to reproduce, and how to better design future algorithms. We showed that the statistically-significant performance differences between state-of-the-art meta-RL and MTRL methods are only between 0% and 25% on Meta-World. We then used Garage to show that several seemingly-innocuous implementation choices in meta-RL and MTRL algorithms, rather than the algorithm definitions themselves, can explain between 10% and 40% of the measured performance differences between these methods. We used these measurements to make recommendations to future researchers for how to carefully implement meta-RL and MTRL methods, and contributed standardized hyperparameter definitions and evaluation procedures for meta-RL and MTRL, in addition to the software itself.

In Chapter 5, we applied this systems-forward perspective to large-scale reinforcement learning with real robots, and introduced a simple method for intra-skill adaptation in off-policy reinforcement learning. Our method, called Never Stop Learning, allows a robot to adapt a visuomotor manipulation policy to a wide range of new circumstances, such as changing lighting conditions, backgrounds, novel objects, robot wear-and-tear, and even changes in the robot's morphology for which it was never trained. While designing Never Stop Learning, we showed that many adaptation methods from supervised machine learning are impractical to use with robotic reinforcement learning, primarily because they destroy the robot's ability to explore in its new environment. We showed that Never Stop Learning can achieve intra-skill adaptation in a single step and completely offline, using only a small dataset of exploration episodes in the changed environment. Notably, Never Stop Learning is one of the few methods for adapting robotic manipulation which can be used with off-policy RL, and is extremely simple compared to the alternatives. The success of Never Stop Learning on intra-skill adaptation also highlighted the weakness of multi-task RL on inter-skill adaptation.

Chapters 3 and 5 both revealed the persistent failure of meta-RL and multi-task RL algorithms to achieve inter-skill adaptation, so in Chapter 6 we embarked on a deeper study of this essential component of continual robot learning. We started from first principles, using simple experiments to better understand whether transferring skill parameters or behaviors is more important for intra-skill adaptation. We found that both are important, but that transferring skill behavior is especially important for rapid intra-skill adaptation, because previous skill behaviors are likely to explore much more efficiently than random. Many prior methods, which struggle with inter-skill adaptation, are built around transferring only skill parameters or the data used to train them. So we introduced a simple continual learning framework for robotic manipulation called Skill Builder, which is built around transferring skill behavior for continual inter-skill adaptation.
We showed how to formalize continual learning for manipulation as continual skill-skill transfer, and then used experiments with Meta-World to show that our simple framework actually achieves continual learning without forgetting. We then reduced the problem of efficient continual learning with Skill Builder to inferring an efficient skill-skill transfer curriculum, and used an offline implementation of such an algorithm to show that skill curriculums can indeed make Skill Builder efficient, even when using naïve fine-tuning as our adaptation algorithm. We then detailed a policy model class which may make online inference of such skill-skill curriculums possible with RL, and shared promising early results on using it for efficient skill-skill transfer.

7.1 Future Directions

Incorporating strong priors about the real world in robot learning
One thing all experiments in this thesis have in common is that the robots in question learn only from data from their own experiences. The limitations of this approach are apparent in Chapter 2, where the robot has to simultaneously learn to model the world semantically (by learning representations) while also learning to use those semantics to explore and manipulate the world (by learning controllers conditioned on those representations). This is an exceedingly challenging optimization problem, and makes task representation learners difficult to train and scale to real applications. This problem is also apparent in Chapter 6, which shows that the continual skill learning problem is difficult precisely because the robot has no a priori knowledge of tasks, their relationships, and their relative difficulties. This is not the case for other successful intelligent systems, such as those used for image classification or machine translation. Instead, those systems are trained on data which attempts to capture the entire world, such as datasets of millions of labeled images retrieved from the web, or parallel corpora of translated documents from law, technical documentation, websites, and even television subtitles. Data from the natural world allows these systems to encode strong priors about that world before ever interacting with it. Though the embodied nature of robotic learning makes learning from in-situ experience a required capability, it does not mean that embodied experience should be the only source of data used for training robotic agents. Thus it is imperative to perform more research on how to best imbue robot learners with strong priors about the natural world, such as the laws of physics, affordances on and between objects, and the relationships between tasks.

Skill reuse and structured exploration
Most research on adaptation and generalization for robotic manipulation focuses on intra-skill adaptation: using a learned manipulation skill in a different environment, under different lighting conditions, or with a different object than those the robot was trained with. However, as argued in Chapter 1, robots will have the greatest impact on the lives of people in the real world if we can make them general-purpose machines which can achieve many tasks. The experiments in Chapter 2 show the power of skill reuse for making robot learning efficient, and Chapter 6 shows how important structured exploration (i.e. exploration in the space of skills, rather than actions) is for rapidly acquiring new skills.
Skill-skill adaptation and structured exploration are an important research direction for robot learning over the next several years, as they will be required for achieving the continual robot learning we need to deploy general-purpose robots into the world.

Bridging the gap between discrete intent and continuous control
Robots move through a world with continuous time and space, but are asked to achieve fundamentally discrete intents. A human does not ask her robot "Please move the coffee carafe 10 cm up, then 5 cm to the right, then tilt it by 90 degrees, then reverse that motion"; she asks her robot "Please pour me a cup of coffee." This presents challenges to the current robot learning paradigm, which addresses manipulation as a form of continuous control. Never Stop Learning (Chapter 5) showcases the successes of this approach so far: the robots in that study achieve reliable grasping across a surprisingly wide range of environments and circumstances, without ever using manually-programmed perception or motion planning. However, Chapter 6 shows that reinforcement learning for continuous control has few answers for how to efficiently acquire new continuous skills for discrete intents, and task representation learning algorithms such as that in Chapter 2 do not specify how we should describe tasks to robot learning agents, only how we can learn the relationships between them. This coupled learning problem, recently summarized as Task and Motion Planning (TAMP) by Garrett et al., touches on some of the most fundamental issues facing robot learning for manipulation today: "How can we learn manipulation skill policies such that they are easy to combine and reuse, as in problem settings like TAMP?"

7.2 Recommendations

Simple benchmarks with high diversity, grounded in reality
As this thesis has gone to great lengths to show, benchmarks are essential for making progress in AI research. However, benchmarks are only useful to the extent that the community adopts them, and to the extent that they are easily interpreted by readers. Simpler benchmarks are more likely to be adopted, and easier to interpret. This is not to say that benchmarks should only encode simple problems. On the contrary, benchmarks should encode complex problems, but expose them to researchers in a simple, straightforward way, which makes as few assumptions as possible about the methods being used to solve them. For instance, if what we seek is robots which can manipulate a wide variety of objects in a wide variety of circumstances, a benchmark should present a robot with a wide variety of objects in a wide variety of circumstances and count its successes manipulating those objects, rather than attempting to isolate either objects or circumstances. Not only is this simpler to use than the alternative, it is closer to the goal the community would ultimately like to achieve.

Future benchmarks in robot learning should emphasize training and testing in environments with high diversity, because robot manipulation in structured environments which never change is essentially a solved problem. They should also emphasize faithfulness to real-world robot implementations, such as limited control bandwidth, controllers with jittery timing behavior, and precision limits, because these will help benchmarks identify the blind spots in current methods which prevent those methods from making their way into real-world robot systems.

Simple methods, shared systems
Robotics is a complex engineering problem domain, as is machine learning.
The combination of these two begets an exceedingly complex engineering problem. As robot learning grows in scope and maturity, ever more up-front engineering is necessary for a researcher to address the experimental hypothesis at hand. There are only two ways to address this problem, and the field should pursue both in tandem.

The first is a preference for simpler methods over more complex ones. It is possible to achieve incremental performance gains in many systems by adding inordinate complexity to them. In a field whose major design pillars have been long established and whose performance frontier is shallow, this might make sense. Robot learning is not in this regime, and such approaches rarely admit a path towards the major performance strides we need. Rather, it is the simple methods with outsized performance which lead to rapid progress. As Rich Sutton opined in his essay "The Bitter Lesson" [250], simple methods may tend to succeed because they can take better advantage of rapidly-falling costs of computation on rapidly-growing stores of data.

The second is a preference for shared systems over bespoke ones. Systems like ROS [203], NumPy [93], and OpenCV [21] greatly accelerated robotics research by freeing researchers from immense amounts of the up-front engineering needed to produce research results. By allowing researchers to share and contribute their engineering effort, shared systems amortize the cost of up-front engineering across an entire research community. While this moment has not yet come for robot learning, it will be necessary in the next several years to ensure continued progress.

Robotics as machine learning, not machine learning for robotics
A great deal of robot learning research, including some in this thesis, begins with the question "How can we take a new advancement in machine learning, and apply it to robot learning?" While attempting to answer this question has led to some great recent strides, it also begins with the supposition that robotics and machine learning are two different problems. Thus a more accurate rephrasing of this question might be "How can we take a new advancement from computer vision, natural language processing, reinforcement learning, etc. and apply it to robotics?" This rephrasing reveals a fundamental mismatch in purposes, and explains why so much robot learning research rewards great feats of engineering and applied mathematics with minor gains. Instead, the challenge for both fields is to embrace the perspective they started with, when they would have both been lumped under the umbrella term "cybernetics". That is, robot learning is machine learning. Specifically, robot learning is machine learning for the physical world.

The most notable strides in computer vision (CV), natural language processing (NLP), and reinforcement learning have been made by applying the tools of applied mathematics, computation, and system design to their learning problems from scratch, encoding important aspects of those problems along the way. For instance, convolutional neural networks (CNNs) predate hand-engineered features, but their modern form arose as a way of replacing successful hand-engineered image features, which were already applied using convolution, with learned features instead.
Recurrent neural networks (RNNs) and auto-regressive neural networks ("transformers") are learned instantiations of sequence-modeling algorithms that were hand-engineered over the course of decades to solve NLP problems without any learning at all. The most successful discrete reinforcement learning algorithms are learned relaxations of Q-learning and Monte Carlo Tree Search, two algorithms for solving MDPs and decision processes which are nearly as old as the field of AI itself. Rather than attempting to reduce robot learning to a CV problem, an NLP problem, an RL problem, or any other problem, robot learning should seek to develop approaches which treat robotics as a first-class member of the machine learning field. We can achieve great strides, as other fields did, by fusing decades of human ingenuity in robotics with contemporary advances in computation, data collection, and modeling to create something greater than the sum of its parts.

Bibliography

[1] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. "An application of reinforcement learning to aerobatic helicopter flight". In: Advances in neural information processing systems 19 (2007), p. 1.
[2] Joshua Achiam. "Spinning Up in Deep Reinforcement Learning". In: (2018).
[3] Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics". In: Advances in neural information processing systems. 2016, pp. 5074–5082.
[4] Ferran Alet, Tomás Lozano-Pérez, and Leslie P Kaelbling. "Modular meta-learning". In: arXiv preprint arXiv:1806.10166 (2018).
[5] Arthur Allshire, Roberto Martín-Martín, Charles Lin, Shawn Manuel, Silvio Savarese, and Animesh Garg. "LASER: Learning a Latent Action Space for Efficient Reinforcement Learning". In: arXiv preprint arXiv:2103.15793 (2021).
[6] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. "Learning dexterous in-hand manipulation". In: arXiv:1808.00177 (2018).
[7] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. 2017. arXiv: 1707.01495 [cs.LG].
[8] Antreas Antoniou, Harrison Edwards, and Amos Storkey. "How to train your maml". In: arXiv preprint arXiv:1810.09502 (2018).
[9] J Andrew Bagnell and Jeff G Schneider. "Autonomous helicopter control using reinforcement learning policy search methods". In: Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164). Vol. 2. IEEE. 2001, pp. 1615–1620.
[10] Samuel Barrett, Matthew E. Taylor, and Peter Stone. "Transfer Learning for Reinforcement Learning on a Physical Robot". In: Ninth International Conference on Autonomous Agents and Multiagent Systems - Adaptive Learning Agents Workshop (AAMAS - ALA). May 2010. URL: http://www.cs.utexas.edu/users/ai-lab/?AAMASWS10-barrett.
[11] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. 2016. arXiv: 1612.03801 [cs.AI].
“Autonomous navigation of stratospheric balloons using reinforcement learning”. In: Nature 588.7836 (2020), pp. 77–82. [13] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. “The arcade learning environment: An evaluation platform for general agents”. In: Journal of Artificial Intelligence Research 47 (2013), pp. 253–279. [14] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning. V ol. 1. MIT press Massachusetts, USA: 2017. [15] Fabien CY Benureau and Pierre-Yves Oudeyer. “Behavioral diversity generation in autonomous exploration through reuse of past experience”. In: Frontiers in Robotics and AI 3 (2016), p. 8. [16] Hee Rak Beom and Hyung Suck Cho. “A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning”. In: IEEE transactions on Systems, Man, and Cybernetics 25.3 (1995), pp. 464–477. [17] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. “Dota 2 with Large Scale Deep Reinforcement Learning”. In: arXiv preprint arXiv:1912.06680 (2019). [18] Ondrej Biza, Dian Wang, Robert Platt, Jan-Willem van de Meent, and Lawson LS Wong. “Action Priors for Large Action Spaces in Robotics”. In: arXiv preprint arXiv:2101.04178 (2021). [19] Cristian Bodnar, Karol Hausman, Gabriel Dulac-Arnold, and Rico Jonschkowski. “A Geometric Perspective on Self-Supervised Policy Adaptation”. In: arXiv preprint arXiv:2011.07318 (2020). [20] Alessandro Bonardi, Stephen James, and Andrew J Davison. “Learning One-Shot Imitation from Humans without Humans”. In: arXiv preprint arXiv:1911.01103 (2019). [21] G. Bradski. “The OpenCV Library”. In: Dr. Dobb´ s Journal of Software Tools (2000). [22] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. 2016. arXiv: arXiv:1606.01540 [cs.LG]. [23] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. “Home: A household multimodal environment”. In: arXiv preprint arXiv:1711.11017 (2017). [24] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. “The ycb object and model set: Towards common benchmarks for manipulation research”. In: International Conference on Advanced Robotics (ICAR). 2015. 204 [25] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. “Benchmarking in manipulation research: The YCB object and model set and benchmarking protocols”. In: arXiv:1502.03143 (2015). [26] Alberto Camacho, Jacob Varley, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. “Disentangled Planning and Control in Vision Based Robotics via Reward Machines”. In: arXiv preprint arXiv:2012.14464 (2020). [27] Zhangjie Cao, Minae Kwon, and Dorsa Sadigh. “Transfer Reinforcement Learning across Homotopy Classes”. In: IEEE Robotics and Automation Letters 6.2 (2021), pp. 2706–2713. [28] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. “Dopamine: A Research Framework for Deep Reinforcement Learning”. In: (2018). URL: http://arxiv.org/abs/1812.06110. [29] Stephanie CY Chan, Sam Fishman, John Canny, Anoop Korattikara, and Sergio Guadarrama. “Measuring the Reliability of Reinforcement Learning Algorithms”. In: arXiv preprint arXiv:1912.05663 (2019). [30] Konstantinos Chatzilygeroudis and Jean-Baptiste Mouret. 
“Using parameterized black-box priors to scale up model-based policy search for robotics”. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2018, pp. 1–9. [31] Konstantinos Chatzilygeroudis, Vassilis Vassiliades, and Jean-Baptiste Mouret. “Reset-free trial-and-error learning for robot damage recovery”. In: Robotics and Autonomous Systems 100 (2018), pp. 236–250. [32] Yevgen Chebotar, Karol Hausman, Zhe Su, Artem Molchanov, Oliver Kroemer, Gaurav Sukhatme, and Stefan Schaal. “Bigs: Biotac grasp stability dataset”. In: ICRA 2016 Workshop on Grasping and Manipulation Datasets. 2016. [33] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. “Combining model-based and model-free updates for trajectory-centric reinforcement learning”. In: ICML. JMLR. org. 2017, pp. 703–711. [34] Young Sang Choi, Travis Deyle, Tiffany Chen, Jonathan D Glass, and Charles C Kemp. “A list of household objects for robotic retrieval prioritized by people with ALS”. In: International Conference on Rehabilitation Robotics. 2009. [35] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. “Leveraging Procedural Generation to Benchmark Reinforcement Learning”. In: arXiv preprint arXiv:1912.01588 (2019). [36] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. “Quantifying generalization in reinforcement learning”. In: arXiv:1812.02341 (2018). [37] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. “A Hitchhiker’s Guide to Statistical Comparisons of Reinforcement Learning Algorithms”. In: arXiv preprint arXiv:1904.06979 (2019). 205 [38] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments. 2018. arXiv: 1806.08295[cs.LG]. [39] Peter A Cooper. “Paradigm shifts in designed instruction: From behaviorism to cognitivism to constructivism”. In: Educational technology 33.5 (1993), pp. 12–19. [40] Nikolaus Correll, Kostas E Bekris, Dmitry Berenson, Oliver Brock, Albert Causo, Kris Hauser, Kei Okada, Alberto Rodriguez, Joseph M Romano, and Peter R Wurman. “Analysis and observations from the first amazon picking challenge”. In: IEEE Transactions on Automation Science and Engineering 15.1 (2016), pp. 172–188. [41] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. “Robots that can adapt like animals”. In: Nature 521.7553 (2015), pp. 503–507. [42] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. “Boosting for transfer learning”. In: Proceedings of the 24th international conference on Machine learning. 2007, pp. 193–200. [43] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-Scale Multi-Robot Learning. 2019. arXiv: 1910.11215[cs.RO]. [44] Tristan Deleu. Model-Agnostic Meta-Learning for Reinforcement Learning in PyTorch. Available at: https://github.com/tristandeleu/pytorch-maml-rl. 2018. [45] Tristan Deleu, Simon Guiroy, and Seyedarian Hosseini. “On the reproducibility of gradient-based Meta-Reinforcement Learning baselines”. In: (2018). [46] Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Bengio. Torchmeta: A Meta-Learning library for PyTorch. Available at: https://github.com/tristandeleu/pytorch-meta. 2019. URL: https://arxiv.org/abs/1909.06576. [47] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. 
Conceptually similar
Scaling robot learning with skills
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Leveraging prior experience for scalable transfer in robot learning
Rethinking perception-action loops via interactive perception and learned representations
Data-driven acquisition of closed-loop robotic skills
Program-guided framework for interpreting and acquiring complex skills with learning robots
Data scarcity in robotics: leveraging structural priors and representation learning
Leveraging cross-task transfer in sequential decision problems
Learning from planners to enable new robot capabilities
Characterizing and improving robot learning: a control-theoretic perspective
Closing the reality gap via simulation-based inference and control
High-throughput methods for simulation and deep reinforcement learning
Sample-efficient and robust neurosymbolic learning from demonstrations
Learning affordances through interactive perception and manipulation
Leveraging structure for learning robot control and reactive planning
Quickly solving new tasks, with meta-learning and without
Machine learning of motor skills for robotics
Decision support systems for adaptive experimental design of autonomous, off-road ground vehicles
Learning objective functions for autonomous motion generation
Intelligent robotic manipulation of cluttered environments
Asset Metadata
Creator
Julian, Ryan Christopher (author)
Core Title
Algorithms and systems for continual robot learning
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2021-08
Publication Date
07/23/2021
Defense Date
05/11/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
benchmarks,continual learning,deep learning,end-to-end learning,few-shot learning,fine-tuning,latent space methods,machine learning,meta-learning,multi-task learning,OAI-PMH Harvest,reinforcement learning,representation learning,reproducibility,robotic manipulation,robotics,variational inference
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sukhatme, Gaurav (committee chair), Culbertson, Heather (committee member), Gupta, Satyandra (committee member), Hausman, Karol (committee member), Lim, Joseph (committee member), Nikolaidis, Stefanos (committee member)
Creator Email
rjulian@usc.edu,ryanjulian@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15618612
Unique identifier
UC15618612
Legacy Identifier
etd-JulianRyan-9841
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Julian, Ryan Christopher
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu