Leveraging Prior Experience for Scalable Transfer in Robot Learning

by

Karl Pertsch

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2023

Copyright 2023 Karl Pertsch

Acknowledgements

First and foremost, I want to thank my family for their incredible support throughout my PhD. My parents, Manuela and Patrick Pertsch, and my brother, Richard Pertsch, for their unwavering support, in good times and hard times. My grandparents, Annemarie and Siegmar Staps, for being my biggest cheerleaders and by far the most regular visitors to my academic website. My uncle, Thomas Pertsch, for providing me with academic shelter and a place to work when schools closed in 2020. And my grandpa, Peter Pertsch, for being an inspiration and role model, in academia and beyond.

I would also like to thank my academic advisors: my PhD advisor, Joseph Lim; my thesis co-chair, Gaurav Sukhatme; and the members of my thesis committee, Stefanos Nikolaidis, Jesse Thomason, and Somil Bansal. I am grateful for your advice and guidance over the last few years. Additionally, I want to thank my mentors during the various internships and research visits I did throughout my PhD: Sergey Levine and Chelsea Finn during my time at UC Berkeley, Akshara Rai and Dhruv Batra at FAIR, and Karol Hausman at Google Brain. Each of these experiences has made my PhD much richer, and I am grateful I got the opportunity to work with so many amazing researchers. A special thank you also to Kostas Daniilidis at the University of Pennsylvania, who hosted me during my Fulbright year before my PhD and advised my first first-author project.

Finally, I want to thank my many friends and collaborators in the CLVR lab at USC. I couldn't have wished for a more energetic and inspiring environment during my PhD!

TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
    1.1 A Framework for Skill-based Robot Learning
    1.2 Scalable Robot Data Collection
    1.3 Learning Skills from Diverse Datasets

I  Using Skills for Effective Transfer in Reinforcement Learning & Imitation

Chapter 2: Reinforcement Learning with Learned Skills
    2.1 Introduction
    2.2 Related Work
    2.3 Approach
        2.3.1 Problem Formulation
        2.3.2 Learning Continuous Skill Embedding and Skill Prior
        2.3.3 Skill Prior Regularized Reinforcement Learning
    2.4 Experiments
        2.4.1 Environments & Comparisons
        2.4.2 Maze Navigation
        2.4.3 Robotic Manipulation
        2.4.4 Ablation Studies
    2.5 Discussion
Chapter 3: Demonstration-Guided Reinforcement Learning with Learned Skills
    3.1 Introduction
    3.2 Related Work
    3.3 Approach
        3.3.1 Preliminaries
        3.3.2 Skill Representation Learning
        3.3.3 Demonstration-Guided RL with Learned Skills
    3.4 Experiments
        3.4.1 Experimental Setup and Comparisons
        3.4.2 Demonstration-Guided RL with Learned Skills
        3.4.3 Ablation Studies
        3.4.4 Robustness to Partial Demonstrations
        3.4.5 Data Alignment Analysis
    3.5 Discussion

Chapter 4: Cross-Domain Transfer via Semantic Skill Imitation
    4.1 Introduction
    4.2 Related Work
    4.3 Problem Formulation
    4.4 Approach
        4.4.1 Semantic Skill Representation
        4.4.2 Semantic Transfer Accelerated RL
        4.4.3 Cross-Domain State Matching
    4.5 Experiments
        4.5.1 Cross-Domain Imitation in Simulation
        4.5.2 Imitation from Human Demonstrations
        4.5.3 Robustness to Noisy Demonstrations and Labels
    4.6 Discussion

Chapter 5: Meta Reinforcement Learning with Learned Skills
    5.1 Introduction
    5.2 Related Work
    5.3 Problem Formulation and Preliminaries
    5.4 Approach
        5.4.1 Skill Extraction
        5.4.2 Skill-based Meta-Training
        5.4.3 Target Task Learning
    5.5 Experiments
        5.5.1 Experimental Setup
        5.5.2 Results
        5.5.3 Meta-training Task Distribution Analysis
    5.6 Discussion
II  Scalable Robot Data Collection

Chapter 6: Assisted Teleoperation for Scalable Robot Data Collection
    6.1 Introduction
    6.2 Related Work
    6.3 Approach
        6.3.1 Problem Formulation
        6.3.2 Learning Assistive Policies from Multi-Modal Data
        6.3.3 Deciding When to Request User Input
    6.4 Experiments
        6.4.1 User Study Design
        6.4.2 User Study Analysis
        6.4.3 Scaling Data Collection to Multiple Robots
        6.4.4 Ablation Study
    6.5 Discussion

III  Learning Skills from Diverse Datasets

Chapter 7: Model-Based Skill Learning via Goal-Conditioned Hierarchical Predictors
    7.1 Introduction
    7.2 Related Work
    7.3 Goal-Conditioned Prediction
        7.3.1 Goal-Conditioned Sequential Prediction
        7.3.2 Goal-Conditioned Prediction by Recursive Infilling
        7.3.3 Latent Variable Models for GCP
        7.3.4 Architectures for Goal-Conditioned Prediction
    7.4 Planning & Control with Goal-Conditioned Prediction
    7.5 Experimental Evaluation
        7.5.1 Goal-Conditioned Video Prediction
        7.5.2 Visual Goal-Conditioned Planning and Control
        7.5.3 Temporal Abstraction Discovery
    7.6 Discussion

Chapter 8: Language-Conditioned Offline Reinforcement Learning with Instruction Relabeling
    8.1 Introduction
    8.2 Related Work
    8.3 SPRINT: Scalable Policy Pre-Training with Language Instructions
        8.3.1 Instruction-Conditioned Offline RL
        8.3.2 Language-Model-Based Instruction Aggregation
        8.3.3 Cross-Trajectory Chaining
    8.4 Experiments
        8.4.1 The ALFRED-RL Benchmark
        8.4.2 Experimental Setup
        8.4.3 SPRINT Solves Long-Horizon Tasks Zero-Shot
        8.4.4 SPRINT Agents Finetune Effectively in Unseen Environments
        8.4.5 Ablation Studies
    8.5 Discussion

Chapter 9: Conclusion

References

Appendices

Appendix A  Reinforcement Learning with Learned Skills
    A Action-prior Regularized Soft Actor-Critic
    B Implementation Details
        B.1 Model Architecture and Training Objective
        B.2 Reinforcement Learning Setup
    C Environments and Data Collection
    D State-Conditioned Skill Decoder
    E Prior Regularization Ablation
    F Prior Initialization Ablation
    G Training with Sub-Optimal Data
    H Reuse of Learned Skill Priors

Appendix B  Demonstration-Guided Reinforcement Learning with Learned Skills
    A Full Algorithm
    B Implementation and Experimental Details
        B.1 Implementation Details: Pre-Training
        B.2 Implementation Details: Downstream RL
        B.3 Implementation Details: Comparisons
        B.4 Environment Details
    C Skill Representation Comparison
    D Demonstration-Guided RL Comparisons with Task-Agnostic Experience
    E Skill-Based Imitation Learning
    F Kitchen Data Analysis

Appendix C  Cross-Domain Transfer via Semantic Skill Imitation
    A Full Algorithm
    B Implementation Details
        B.1 Skill Learning
        B.2 Semantic Imitation
        B.3 Comparisons
        B.4 Environments and Data Collection
    C Imitation Learning Results
    D Label Noise Robustness Analysis
    E Detailed Ablation Description
    F Additional Ablation Experiments

Appendix D  Meta Reinforcement Learning with Learned Skills
    A Meta-reinforcement Learning Method Ablation
    B Learning Efficiency on Target Tasks with Few Episodes of Experience
    C Investigating Offline Data vs. Target Domain Shift
    D Extended Related Work
    E Implementation Details on Our Method
        E.1 Model architecture
        E.2 Training details
            E.2.1 Skill-based Meta-training
            E.2.2 Target Task Learning
    F Implementation Details on Baselines
    G Meta-training Tasks and Target Tasks
        G.1 Maze Navigation
        G.2 Kitchen Manipulation

Appendix E  Assisted Teleoperation for Scalable Robot Data Collection
    A Environments
    B Implementation Details
        B.1 Sub-goal Predictor
        B.2 Low-level Sub-goal Reaching Policy
        B.3 Q-Function (ThriftyDAgger)
    C User Study Setup

Appendix F  Model-Based Planning via Goal-Conditioned Hierarchical Predictors
    A Additional results
    B Evidence lower bound (ELBO) derivation
    C Architecture
    D Data processing and generation
    E Planning Experimental Setup
    F Adaptive Binding with Dynamic Programming
        F.1 An efficient inference procedure
        F.2 Bottleneck Discovery Experimental Setup
    G Training from Random Data
        G.1 Runtime Complexity

Appendix G  Language-Conditioned Offline Reinforcement Learning with Instruction Relabeling
    A Large Language Model Prompt
    B Baselines and Implementation
        B.1 Language-conditioned Behavior Cloning
        B.2 Episodic Transformers
        B.3 Actionable Models (AM)
        B.4 SPRINT
            B.4.1 Cross-trajectory chaining preserves the MDP
    C Dataset and Environment Details
        C.1 Dataset Details
        C.2 Evaluation Tasks
    D Extended Experiments, Results, and Analysis
        D.1 LLM Summary Examples
        D.2 Comparing Summaries from Different LLMs
        D.3 Step by Step Zero-shot Evaluations
        D.4 Extended Ablation Study
        D.5 Qualitative Comparison Results

List of Tables

6.1  Average number of completed side tasks and teleoperation time per demonstration during the real-robot teleoperation user study.
6.2  Average success rate of behavior cloning policy trained on data from single-robot vs. multi-robot teleoperation.
7.1  Long-term prediction performance of the goal-conditioned predictors compared to prior work on video interpolation. Additional evaluation on FVD / LPIPS (Unterthiner et al., 2018; Zhang et al., 2018a) in Appendix, Table F.1.
7.2  GCP runtime on 16×16 px H3.6M sequences in sec/training batch.
7.3  Ablation of prediction performance on pick&place.
7.4  Visual control performance on navigation tasks.
8.1  EVAL INSTRUCT and EVAL LENGTH returns. When combining chaining and instruction aggregation, SPRINT achieves the highest return across both evaluations.
A.1  Number of blocks stacked vs. fraction of random training data.
C.1  List of all pre-trained model components, their respective functionality, and their pre-training objectives. We use ⌊·⌋ to indicate stopped gradients and τ^T to denote demonstration trajectories relabeled with matched target domain states from Section 4.4.3.
C.2  Parametrization of the noise levels for the skill label robustness experiment.
E.1  PATO hyperparameters. Parameters that differ between environments are marked in red.
E.2  Post-execution survey (Likert scales with 7-option response format).
F.1  Prediction performance on perceptual metrics.
F.2  Hyperparameters for hierarchical planning with GCPs on 9-room and 25-room navigation tasks.
F.3  Average Trajectory Length. Planning with GCP finds shorter paths than the training distribution.
G.1  Evaluation Task Specifics. Note that the “number of env actions per task” corresponds to the number of environment actions the ALFRED expert planner required to complete that task.
G.2  EVAL INSTRUCT and EVAL LENGTH eval dataset per-length and overall skill completion rates. See Section 8.4 for experiment setup.
G.3  Step-by-step evaluation: number of completed sub-tasks.
G.4  Zero-shot Ablation Subtasks Completion Table
238 x List of Figures 1.1 We propose a scalable skill-based robot learning framework that leverages diverse prior experience data to accelerate learning new tasks. We extract a large number of reusable short-horizon behaviors, i.e., skills, from prior experience. Then, we introduce RL and imitation algorithms for efficiently learning new tasks with large skill repertoires. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Our skill-based framework can teach a simulated robot new tasks using just 3 min of human video demonstrations, collected in my kitchen. We use diverse human videos (EPIC Kitchens, Damen et al. (2021)) to guide the robot’s exploration over skills. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Intelligent agents can use a large library of acquired skills when learning new tasks. Instead of exploring skills uniformly, they can leverage priors over skills as guidance, based e.g., on the current environment state. Such priors capture which skills are promising to explore, like moving a kettle when it is already grasped, and which are less likely to lead to task success, like attempting to open an already opened microwave. In this work, we propose to jointly learn an embedding space of skills and a prior over skills from unstructured data to accelerate the learning of new tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Deep latent variable model for joint learning of skill embedding and skill prior. Given a state-action trajectory from the dataset, the skill encoder maps the action sequence to a posterior distribution q(z|a a a i ) over latent skill embeddings. The action trajectory gets reconstructed by passing a sample from the posterior through the skill decoder. The skill prior maps the current environment state to a prior distribution p a a a (z|s 1 ) over skill embeddings. Colorful arrows indicate the propagation of gradients from reconstruction, regularization and prior training objectives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 For each environment we collect a diverse dataset from a wide range of training tasks (examples on top) and test skill transfer to more complex target tasks (bottom), in which the agent needs to: navigate a maze (left), stack as many blocks as possible (middle) and manipulate a kitchen setup to reach a target configuration (right). All tasks require the execution of complex, long-horizon behaviors and need to be learned from sparse rewards. . . . . . . . . . . . . . . . . . . . . . . . 18 xi 2.4 Downstream task learning curves for our method and all comparisons. Both, learned skill embedding and skill prior are essential for downstream task performance: single-action priors without temporal abstraction (Flat Prior) and learned skills without skill prior (SSP w/o Prior) fail to converge to good performance. Shaded areas represent standard deviation across three seeds. . . . . . . . . . . . . . . . . 20 2.5 Exploration behavior of our method vs. alternative transfer approaches on the downstream maze task vs. random action sampling. Through learned skill embeddings and skill priors our method can explore the environment more widely. We visualize positions of the agent during 1M steps of exploration rollouts in blue and mark episode start and goal positions in green and red respectively. . . . . . . 21 2.6 Ablation analysis of skill horizon and skill space dimensionality on block stacking task. . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 We leverage large, task-agnostic datasets collected across many different tasks for efficient demonstration-guided reinforcement learning by (1) acquiring a rich motor skill repertoire from such offline data and (2) understanding and imitating the demonstrations based on the skill repertoire. . . . . . . . . . . . . . . . . . . 24 3.2 Our approach, SkiLD, combines task-agnostic experience and task-specific demonstrations to efficiently learn target tasks in three steps: (1) extract skill representation from task-agnostic offline data, (2) learn task-agnostic skill prior from task-agnostic data and task-specific skill posterior from demonstrations, and (3) learn a high-level skill policy for the target task using prior knowledge from both task-agnostic offline data and task-specific demonstrations. Left: Skill embedding model with skill extractor (yellow) and closed-loop skill policy (blue). Middle: Training of skill prior (green) from task-agnostic data and skill posterior (purple) from demonstrations. Right: Training of high-level skill policy (red) on a downstream task using the pre-trained skill representation and regularization via the skill prior and posterior, mediated by the demonstration discriminator D(s). . . 27 3.3 We leverage prior experience dataD and demonstration data D demo . Our policy is guided by the task-specific skill posterior q ζ (z|s) within the support of the demonstrations (green) and by the task-agnostic skill prior p a a a (z|s) otherwise (red). The agent also receives a reward bonus for reaching states in the demonstration support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 Left: Test environments, top to bottom: 2D maze navigation, robotic kitchen manipulation and robotic office cleaning. Right: Target task performance vs environment steps. By using task-agnostic experience, our approach more efficiently leverages the demonstrations than prior demonstration-guided RL approaches across all tasks. The comparison to SPiRL shows that demonstrations improve efficiency even for agents with access to a lot of prior experience. . . . . . 33 xii 3.5 Visualization of our approach on the maze navigation task (visualization states collected by rolling out the skill prior). Left: the given demonstration trajectories; Middle left: output of the demonstration discriminator D(s) (the greener, the higher the predicted probability of a state to be within demonstration support, red indicates low probability). Middle right: policy divergences to the skill posterior and Right: divergence to the skill prior (blue indicates small and red high divergence). The discriminator accurately infers the demonstration support, the policy successfully follows the skill posterior only within the demonstration support and the skill prior otherwise. . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Ablation studies. We test the performance of SkiLD for different sizes of the demonstration dataset|D demo | on the maze navigation task (left) and ablate the components of our objective on the kitchen manipulation task (right). . . . . . . . 36 3.7 Left: Robustness to partial demonstrations. SkiLD can leverage partial demonstrations by seamlessly integrating task-agnostic and task-specific datasets (see Section 3.4.4). Right: Analysis of data vs. task alignment. 
The benefit of using demonstrations in addition to prior experience diminishes if the prior experience is closely aligned with the target task (solid), but gains are high when data and task are not well-aligned (dashed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 We address semantic imitation, which aims to leverage demonstrations from a source domain, e.g., human video demonstrations, to accelerate the learning of the same tasks in a different target domain, e.g., controlling a robotic manipulator in a simulated kitchen environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Model overview for pre-training (left) and target task learning (right). We pre-train a semantic skill policy π l (grey) and use it to decode actions from the learned high-level policiesπ sem andπ lat (blue and yellow) during target task learning. See training details in the main text. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3 State matching between source and target domain. For every source domain state from the demonstrations, we compute the task-agnostic semantic skill distribution p TA (k|s) and find the target domain state with the most similar semantic skill distribution from the task-agnostic datasetD T . We then relabel the demonstrations with these matched states from the target domain. . . . . . . . . . . . . . . . . . . 48 4.4 We evaluate on three pairs of source (top) and target (bottom) environments. Left: maze navigation. The agent needs to follow a sequence of colored rooms (red path) but the maze layout changes substantially between source and target domains. Middle: kitchen manipulation. A robotic arm executes a sequence of skills, but the layout of the kitchens differs. Right: Same as before, but with human demonstrations from a real-world kitchen. . . . . . . . . . . . . . . . . . . . . . . 49 xiii 4.5 Left: Performance on the simulated semantic imitation tasks. STAR, matches the performance of the oracle, SkiLD, which has access to target domain demonstrations and outperforms both SPiRL, which does not use demonstrations, and SkillSeq, which follows the demonstrated semantic skills sequentially. Right: Ablations in the kitchen environment, see main text for details. . . . . . . . . . . . 50 4.6 Semantic imitation from human demonstrations. Left: Qualitative state matching results. The top row displays frames subsampled from a task demonstration in the human kitchen source domain. The bottom row visualizes the states matched to the source frames via the procedure described in Section 4.4.3. The matched states represent corresponding semantic scenes in which the agent e.g., opens the microwave, turns on the stove or opens the cabinet. Right: Quantitative results on the kitchen manipulation task from human video demonstrations. . . . . . . . . . 52 4.7 Semantic imitation with missing skills in the demonstrations. Our approach STAR still learns the full task faster than learning without demonstrations (SPiRL), while SkillSeq get stuck at the missing skill. . . . . . . . . . . . . . . . . . . . . 53 5.1 We propose a method that jointly leverages (1) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (2) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks. Our method extracts reusable skills from the offline dataset and meta-learn a policy to quickly use them for solving new tasks. . . . . . . . . . . . . . . . . . . . . . . 55 5.2 Method Overview. 
Our proposed skill-based meta-RL method has three phases. (1) Skill Extraction: learns reusable skills from snippets of task-agnostic offline data through a skill extractor (yellow) and low-level skill policy (blue). Also trains a prior distribution over skill embeddings (green). (2) Skill-based Meta-training: Meta-trains a high-level skill policy (red) and task encoder (purple) while using the pre-trained low-level policy. The pre-trained skill prior is used to regularize the high-level policy during meta-training and guide exploration. (3) Target Task Learning: Leverages the meta-trained hierarchical policy for quick learning of an unseen target task. After conditioning the policy by encoding a few transitions c ∗ from the target taskT ∗ , we continue fine-tuning the high-level skill policy on the target task while regularizing it with the pre-trained skill prior. . . . . . . . 60 5.3 Environments. We evaluate our proposed framework in two domains that require the learning of complex, long-horizon behaviors from sparse rewards. These environments are substantially more complex than those typically used to evaluate meta-RL algorithms. (a) Maze Navigation: The agent needs to navigate for hundreds of steps to reach unseen target goals and only receives a binary reward upon task success. (b) Kitchen Manipulation: The 7DoF agent needs to execute an unseen sequence of four subtasks, spanning hundreds of time steps, and only receives a sparse reward upon completion of each subtask. . . . . 64 xiv 5.4 Target Task Learning Efficiency. SiMPL demonstrates better sample efficiency compared to all the baselines, verifying the efficacy of meta-learning on long-horizon tasks by leveraging skills and skill prior extracted from an offline dataset. For both the two environments, we train each model on each target task with 3 different random seeds. SiMPL and PEARL-ft first collect 20 episodes of environment interactions (vertical dotted line) for conditioning the meta-trained policy before fine-tuning it on target tasks. . . . . 66 5.5 Qualitative Results. All the methods that leverage the offline dataset (i.e., SiMPL, SPiRL, and MTRL) effectively explore the maze in the first episode. Then, SiMPL converges with much fewer episodes compared to SPiRL and MTRL. In contrast, PEARL-ft is not able to make learning progress. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.6 Meta-training Task Distribution Analysis. (a) With sparser meta-training task distributions (i.e., fewer numbers of meta-training tasks), SiMPL still achieves better sample efficiency compared to SPiRL, highlighting the benefit of leveraging meta-training tasks. (b) When trained on a meta-training task distribution that aligns better with the target task distribution, SiMPL achieves improved performance. (c) When trained on a meta-training task distribution that is mis-aligned with the target tasks, SiMPL yields worse performance. For all the analyses, we train each model on each target task with 3 different random seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.1 Policy Assisted TeleOperation (PATO) enables large-scale data collection by minimizing human operator inputs and mental efforts with an assistive policy, which autonomously performs repetitive subtasks. This allows a human operator to simultaneously manage multiple robots. . . . . . . . . . . . . . . . . . . . . . . . 
72 6.2 PATO is hierarchical: a high-level subgoal predictor p(s g |s,z) and a low-level subgoal-reaching policyπ LL (a|s,s g ). To decide when to follow the assistive policy, we measure uncertainty of both high-level (subgoal predictor) and low-level (subgoal-reaching policy) decisions. The task uncertainty is estimated using the subgoal predictor’s variance, and the policy uncertainty is estimated as a disagreement among an ensemble of subgoal-reaching policies. . . . . . . . . . . . 75 6.3 Our hierarchical assistive policy is trained using a pre-collected datasetD pre . From a sampled trajectory (s 1 ,a 1 ,...,a H− 1 ,s H ) of lengthH , a subgoal predictor p(s g |s 1 ,z) is trained as a conditional V AE to cover a multi-modal subgoal distribution, where s g = s H . Then, an ensemble of subgoal-reaching policies π (k) LL (a t |s t ,s g ) are trained to predict the ground truth actions. . . . . . . . . . . . . . 76 6.4 Our approach asks for human inputs when the assistive policy is uncertain about which subtask or action to take. If both the task uncertainty and policy uncertainty are lower than their thresholds, our assistive policy can reliably perform a subtask, reducing the workload of the human operator. . . . . . . . . . . . . . . . . . . . . 78 xv 6.5 User study setup. (left) A Kinova Jaco arm, front-view and in-hand cameras, and objects for kitchen-inspired tasks are placed on the workspace. (right) A human operator can watch a monitor, which shows either the camera inputs or a side task. The operator uses a gamepad to control the robot, and uses a keyboard to solve the side task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.6 Visualization of PATO on a task from the real-robot user study: place red block in bowl; place bowl on plate; put green block in oven. PATO autonomously executes familiar behaviors, but asks for user input in frames (2) and (4) to determine where to place bowl and green block (white vs. yellow arrow). In these cases, the task uncertainty surpasses the thresholdω since the subgoal predictor produces subgoals for both possible targets. Further, PATO asks for user input in frame (3) since the required transition between placing the bowl and picking up the green block was not part of its training data. Thus, the policy uncertainty estimate surpasses its thresholdγ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.7 We use up to four simulated robots to collect demonstrations. The assistive policy asks for human input using the red indicator. The green indicator represents which environment is being controlled by a user. . . . . . . . . . . . . . . . . . . . . . . 83 6.8 Average number of demonstrations collected in 4 minutes using multiple robots in simulation. (a) With PATO, users can manage multiple robots simultaneously and collect more demonstrations with four robots. (b) The ablated systems with four robots show inferior demonstration collection throughput. . . . . . . . . . . . . . . 84 7.1 When planning towards faraway goals, we propose to condition the prediction of candidate trajectories on the goal, which significantly reduces the search space of possible trajectories (left vs. middle) and enables hierarchical planning approaches that break a long-horizon task into a series of short-horizon tasks by placing subgoals (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
87 7.2 Graphical models for state-space sequence generation: forward prediction (left) and the proposed goal-conditioned predictors (GCPs). Shaded circles denote observations, white circles denote unobserved latent states. Center: a sequential goal-conditioned predictor with structure similar to forward prediction. Right: a hierarchical goal-conditioned predictor that recursively applies an infilling operator to generate the full sequence. All our models leverage stochastic latent states in order to handle complex high-dimensional observations. . . . . . . . . . . . . . . 92 7.3 Architecture for two-layer hierarchical goal-conditioned predictor (GCP). Skip connections to first node’s decoder omitted for clarity. . . . . . . . . . . . . . . . 94 xvi 7.4 Samples from GCP-tree on the 25-room data. Left: hierarchical prediction process. At each layer, the infilling operator is applied between every two frames, producing a sequence with a finer and finer temporal resolution. Three layers out of eight are shown. Right: visualization of the trajectory on the map together with a plan execution (see Section 7.5.2). Bottom: two image sequences sampled given the same start and goal (subsampled to 20 frames for visualization). Our model leverages stochastic latent states that enable modeling multimodal trajectories. See additional video results on the supplementary website: orybkin.github.io/video-gcp. . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.6 Temporal abstraction discovery on pick&place. Discovered tree structure with adaptive binding: nodes from the first two layers (yellow/red) bind to semantically consistent bottlenecks across sequences, e.g. in which the robot is about to drop the object into the bin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.5 Comparison between planning methods. Trajectories (red) sampled while planning from start (blue) to goal (green). All methods predict image trajectories, which are shown as 2d states for visualization. Left: visual MPC (Ebert et al., 2018) with forward predictor, middle: non-hierarchical planning with goal-conditioned predictor (GCP), right: hierarchical planning with GCP (ours) recursively optimizes subgoals (yellow/red) in a coarse-to-fine manner and finally plans short trajectories between the subgoals. Goal-conditioning ensures that trajectories reach the long-horizon goal, while hierarchical planning decomposes the task into shorter segments which are easier to optimize. . . . . . . . . . . . . . . . . . . . . . . . . 101 8.1 We propose SPRINT, a scalable approach for policy pre-training via offline instruction-conditioned RL. Given a dataset of agent experience with a limited set of costly human-annotated language instructions, SPRINT drastically expands the pre-training task set without additional human effort via language-model-based instruction relabeling and cross-trajectory skill chaining. SPRINT policies enable efficient finetuning on unseen target tasks. . . . . . . . . . . . . . . . . . . 104 8.2 SPRINT overview. We assume access to a dataset of agent experience with language instructions for the performed skills (1). Collecting such instructions with human hindsight annotation is a flexible yet costly approach for defining pre-training tasks. 
Thus, SPRINT introduces two approaches for automatically growing the set of pre-training tasks without additional human effort: (2) by aggregating language instructions with an LLM and adding the relabeled trajectories back into the pre-training dataset (Section 8.3.2), (3) by performing cross-trajectory chaining of skills to enable pre-training of skills that are unseen in the offline agent experience (Section 8.3.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 8.3 A shortened example of the LLM prompt. See the full prompt in appendix, Section A.109 xvii 8.4 SPRINT chains skills from two trajectories τ 1 and τ 2 into a new composite trajectory. The two trajectories do not get concatenated, but added to the replay buffer as two separate trajectories with updated language instruction and the appropriate reward values on their terminal states. . . . . . . . . . . . . . . . . . 110 8.5 The ALFRED household task simulator (Shridhar et al., 2020) provides a rich set of long-horizon, semantically meaningful tasks and a dataset of 6.6k demonstrations with language instructions. We propose the ALFRED-RL Benchmark, which tests finetuning of RL agents on unseen tasks and scenes in ALFRED. Figure drawn from Shridhar et al. (2020), with permission. . . . . . . . . . . . . . . . . . . . . . 113 8.6 Evaluation results on the ALFRED-RL benchmark. Left: Zero shot performance on the EV AL INSTRUCT and EV AL LENGTH task sets. SPRINT (green) is able to complete substantially more subtasks than prior pre-training approaches. Middle: Breakdown of zero shot performance by task length. SPRINT shows strong zero-shot performance, especitally on challenging, long-horizon tasks. See appendix, Table G.2 for numerical results. Right: Finetuning performance in unseen floor plans of the EV AL SCENE task set. Policies pre-trained with SPRINT learn tasks in new floor plans much quicker as semantic skills learned from language instructions generalize better than skills learned via unsupervised approaches (AM). 115 8.7 Example successful task execution of our pre-trained SPRINT agent for the challenging “Warm up a piece of apple” task. Successful execution requires solving 8 subtasks in sequence and a total of 50 steps. This sequence of subtasks was never observed in the training data. SPRINT uses cross-trajectory stitching and LLM aggregation to learn unseen tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . 116 A.1 Image-based state representation for maze (left) and block stacking (right) environment (downsampled to 32× 32 px for policy). . . . . . . . . . . . . . . . 150 A.2 Comparison of policy execution traces on the kitchen environment. Following Fu et al. (2020), the agent’s task is to (1) open the microwave, (2) move the kettle backwards, (3) turn on the burner and (4) switch on the light. Red frames mark the completion of subtasks. Our skill-prior guided agent (top) is able to complete all four subtasks. In contrast, the agent using a flat single-action prior ( middle) only learns to solve two subtasks, but lacks temporal abstraction and hence fails to solve the complete long-horizon task. The skill-space policy without prior guidance (bottom) cannot efficiently explore the skill space and gets stuck in a local optimum in which it solves only a single subtask. Best viewed electronically and zoomed in. For videos, see: clvrai.com/spirl. . . . . . . . . . . . . . . . 152 xviii A.3 Results for state-conditioned skill decoder network. left: Exploration visualization as in Fig. 
2.5. Even with state-conditioned skill decoder, exploration without skill prior is not able to explore a large fraction of the maze. In contrast, skills sampled from the learned skill prior lead to wide-ranging exploration when using the state-conditioned decoder. right: Downstream learning performance of our approach and skill-space policy w/o learned skill prior: w/ vs. w/o state-conditioning for skill decoder. Only guidance through the learned skill prior enables learning success. State-conditioned skill-decoder can make the downstream learning problem more challenging, leading to lower performance (”ours” vs. ”ours w/ state cond.”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 A.4 Ablation of prior regularization during downstream RL training. Initializing the high-level policy with the learned prior but finetuning with conventional SAC is not sufficient to learn the task well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 A.5 Ablation of prior initialization. Initializing the downstream task policy with the prior network improves training stability and convergence speed. However, the ”w/o Init” runs demonstrate that the tasks can also be learned with prior regularization only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 A.6 Success rate on maze environment with sub-optimal training data. Our approach, using a prior learned from sub-optimal data generated with the BC policy, is able to reliably learn to reach the goal while the baseline that does not use the learned prior fails. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 A.7 Reuse of one learned skill prior for multiple downstream tasks. We train a single skill embedding and skill prior model and then use it to guide downstream RL for multiple tasks. Left: We test prior reuse on three different maze navigation tasks in the form of different goals that need to be reached. (1)-(3): Agent rollouts during training; the darker the rollout paths, the later during training they were collected. The same prior enables efficient exploration for all three tasks, but allows for convergence to task-specific policies that reach each of the goals upon convergence. 156 B.1 Qualitative results for GAIL+RL on maze navigation. Even though it makes progress towards the goal (red), it fails to ever obtain the sparse goal reaching reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 B.2 We compare the exploration behavior in the maze. We roll out skills sampled from SPiRL’s task-agnostic skill prior (left) and our task-specific skill posterior ( right) and find that the latter leads to more targeted exploration towards the goal (red). . 161 B.3 Office cleanup task. The robot agent needs to place three randomly sampled objects (1-7) inside randomly sampled containers (a-c). During task-agnostic data collection we apply random noise to the initial position of the objects. . . . . . . . 162 xix B.4 Comparison of our closed-loop skill representation with the open-loop representation of Pertsch et al. (2020a). Top: Skill prior rollouts for 100 k steps in the maze environment. Bottom: Subtask success rates for prior rollouts in the kitchen environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 B.5 Downstream task performance for prior demonstration-guided RL approaches with combined task-agnostic and task-specific data. 
All prior approaches are unable to leverage the task-agnostic data, showing a performance decrease when attempting to use it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 B.6 Imitation learning performance on maze navigation and kitchen tasks. Compared to prior imitation learning methods, SkiLD can leverage prior experience to enable the imitation of complex, long-horizon behaviors. Finetuning the pre-trained discriminator D(s) further improves performance on more challenging control tasks like in the kitchen environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 B.7 Subtask transition probabilities in the kitchen environment’s task-agnostic training dataset from Gupta et al. (2019). Each dataset trajectory consists of four consecutive subtasks, of which we display three (yellow: first, green: second, grey: third subtask). The transition probability to the fourth subtask is always near 100 %. In Section 3.4.5 we test our approach on a target task with good alignment to the task-agnostic data (Microwave - Kettle - Light Switch - Hinge Cabinet) and a target task which is mis-aligned to the data (Microwave - Light Switch - Slide Cabinet - Hinge Cabinet). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 C.1 Qualitative maze results. We visualize the trajectories of the different policies during training. The SkiLD approach leverages in-domain demonstrations to quickly learn how to reach the goal. SPiRL leverages skills from the task-agnostic dataset to widely explore the maze, but fails to reach the goal. SkillSeq makes progress towards the goal, but can get stuck in intermediate rooms, leading to a substantially lower success rate than the oracle method. Our approach, STAR, is able to match the performance of the oracle baseline and quickly learn to reach the goal while following the sequence of demonstrated rooms. . . . . . . . . . . . . . 173 C.2 Three semantically equivalent kitchen environments. Left: FrankaKitchen environment (Gupta et al., 2019), middle: rearranged FrankaKitchen environment, right: real-world kitchen. In all three environments we define the same set of seven semantic object manipulation skills like “open the microwave”, “turn on the stove” etc. The two simulated kitchen environments require different robot joint actuations to perform the same semantic skills. The real-world kitchen has a different agent embodiment (robot vs. human), layout and observation domain (low-dimensional state vs image observations). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 xx C.3 Source and target semantic maze navigation environments. A room’s color indicates its semantic ID. The red trajectory shows the traversal of semantic rooms demonstrated in the source domain and the corresponding trajectory in the target domain. The low-level planar velocity commands required to follow the demonstration in the target domain is substantially different. . . . . . . . . . . . . 174 C.4 Imitation learning on the simulated FrankaKitchen task. Our approach STAR is able to learn the target task even without access to any environment rewards, while common imitation learning approaches fail to learn the task due to the large domain gap between source demonstrations and target task environment. . . . . . . . . . . 176 C.5 Robustness of our approach, STAR, to different forms of noise in the semantic skill labels. 
Our approach is robust to noise in the length of annotated skills and uncertainty between different skills. While STAR is also shows some robustness to completely incorrect skill labels, frequent and confident mis-detections / mis-labelings can lead to errors during the cross-domain matching and thus impair learning performance on the target task. . . . . . . . . . . . . . . . . . . . . . . . 178 C.6 Visualization of all model components. The colors indicate the objective type used for training. Only the high-level policy is trained with online RL on the downstream task, all other components are pre-trained fully offline via supervised learning and frozen during downstream training. . . . . . . . . . . . . . . . . . . . . . . . . . . 179 C.7 Overview of some of the performed ablations. -DemoReg: removes the demonstration regularization for the high-level semantic policy and uses only the task-agnostic prior for regularization, -D-weight: removes the discriminator-based weighting between demonstration and task-agnostic regularization and uses only the former for guiding the policy during downstream training, -z: removes the latent variable z from the low-level policy and instead uses a deterministic low-level policy and no execution latent policy. . . . . . . . . . . . . . . . . . . . . . . . . . 180 D.1 Task Distributions for Task Length Ablation. We propose three meta-training task distributions of increasing difficulty to compare different meta-RL algorithms: T TRAIN-EASY uses short-horizon tasks with adjacent goal locations, making exploration easier during meta-training,T TRAIN-MEDIUM uses similar task horizon but increases the goal position variance,T TRAIN-HARD contains long-horizon tasks with high variance in goal position and thus is the hardest of the tested task distributions. . . . . . . . . . . . . . . . . . . . . 185 D.2 Meta-Training Performance for Task Length Ablation. We find that most meta-learning approaches can solve the simplest task distribution, but using prior experience in BC+PEARL and SiMPL helps for the more challenging distributions (b) and (c). We find that only our approach, which uses the prior data by extracting temporally extended skills, is able to learn the challenging long-horizon tasks efficiently. . . . . . . . . . . . . . . 185 xxi D.3 Qualitative Result of Meta-reinforcement Learning Method Ablation. Top. All the methods can learn to solve short-horizon tasksT TRAIN-EASY . Middle. On medium-horizon tasksT TRAIN-MEDIUM , PEARL struggles at exploring further, while BC+PEARL exhibits more consistent exploration yet still fails to solve some of the tasks. SiMPL can explore well and solve all the tasks. Bottom. On long-horizon tasksT TRAIN-HARD , PEARL falls into a local minimum, focusing only on one single task on the left. BC+PEARL explores slightly better and can solve a few more tasks. SiMPL can effectively learn all the tasks. . 186 D.4 Performance with few episodes of target task interaction. We find that our skill-based meta-RL approach SiMPL is able to learn complex, long-horizon tasks within few episodes of online interaction with a new task while prior meta-RL approaches and non-meta-learning baselines require many more interactions or fail to learn the task altogether.187 D.5 Image-Based Maze Navigation with Distribution Shift. (a-b): Meta-training and target task distributions. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. 
The yellow cross represent the initial location of the agent, which is equivalent to the one used in Pertsch et al. (2020a). (c): Performance on the target task. Our approach SiMPL can leverage skills learned from offline data for efficient meta-RL on the maze navigation task and is robust to the domain shift between offline data environments and the target environment. . . . . . . . . . . . 189 D.6 Maze Meta-training and Target Task Distributions. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represent the initial location of the agent. . . . . . . . . . . . . . . . . 194 D.7 Maze Meta-training and Target Task Distributions for Meta-training Task Distribution Analysis. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represent the initial location of the agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 E.1 Visualization of the front view (top) and wrist view (bottom) of a (serve meat, serve veggies) task trajectory in the real-world kitchen environment. . . . . . . . . . . . 197 E.2 Participant’s screen when (a) teleoperating the robot and (b) performing the side task200 F.1 Predictions on Human 3.6M. We see that the GCP models are able to faithfully capture the human trajectory. The optical flow-based method (DVF) captures the background but fails to generate complex motion needed for long-term goal- conditioned prediction. Causal InfoGan also struggles to capture the structure of these long sequences and produce implausible interpolations. Full qualitative results are on the supplementary website: sites.google.com/view/gcp-hier/home. . 203 F.2 Prior samples from GCP-tree on the Human 3.6M dataset. Each row is a different prior sample conditioned on the same information. . . . . . . . . . . . . . . . . . 204 xxii F.3 Comparison of visual planning & control approaches. Execution traces of Visual Foresight (left), GCP-tree with non-hierarchical planning (middle) and GCP-tree with hierarchical planning (right) on two 25-room navigation tasks. Visualized are start and goal observation for all approaches as well as predicted subgoals for hierarchical planning. Both GCP-based approaches can reach faraway goals reliably, but GCP with hierarchical planning finds shorter trajectories to the goal. . 205 F.4 Example trajectory distributions between fixed start (red) and goal (green) rooms on the 25-room navigation task. The example goal-reaching behavior is highly suboptimal, with both strong multimodality in the space of possible solutions as well as low-level noise in each individual trajectory. . . . . . . . . . . . . . . . . . 207 F.5 Left: random exploration data. Right: execution of our method trained on random data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 G.1 The full prompt that we use for summarization. Following the suggestions of Ahn et al. (2022) for prompt design, we explicitly number each step. The LLM completion task begins after “Summary:”. For brevity, we omit the new line characters between all numbered steps. . . . . . . . . . . . . . . . . . . . . . . . . 216 G.2 Left: distribution of the number of skills in each trajectory in the original ALFRED dataset. Middle: distribution of skills per trajectory in the “merged” dataset with merged navigation skills. 
Right: distribution of number of actions per skill in the “merged” dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 G.3 Data collection jupyter notebook page. Note that there is a “Skip” button so that human annotators can skip an instruction sequence if they do not feel it is semantically meaningful or easy to summarize. . . . . . . . . . . . . . . . . . . . 227 G.4 Randomly sampled, human language instruction annotations from the EVAL INSTRUCT task set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 G.5 Randomly sampled, human language instruction annotations from the EVAL SCENE task set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 G.6 Randomly sampled, human language instruction annotations from the EVAL LENGTH task set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 G.7 Comparison of OPT-350m, 1.3b, and 13b summaries on 3 randomly sampled sequences from our dataset. In general, as the model size increases, the summary becomes better. However, there are some sequences all three on which all 3 models do not generate good summaries for, such as the first sequence in these examples. . 233 G.8 LLM Summary examples. We randomly sample 12 language combinations generated by the OPT-13B text generation model (Zhang et al., 2022) and display them without cherry-picking specific examples. The LLM often creates a high level description by skipping implied steps (like humans do), e.g., “Put the remote on the couch” implies the need to pick up the remote. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 xxiii G.9 Visualizations of zero-shot policy rollouts on three tasks in the EVAL LENGTH task set.240 G.10 Visualizations of policy rollouts on two tasks in the EVAL SCENE task set, after finetuning each method. These floor plans were originally unseen to all agents until finetuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 xxiv Abstract Humans can efficiently learn a wide range of complex tasks, like cooking new dishes or using unfamiliar tools, from only a few trials. In order to build autonomous agents that can assist us in our everyday lives, we need to develop approaches that can learn a comparably wide range of long-horizon tasks efficiently. Recently, deep reinforcement learning works have demonstrated impressive manipulation capabilities in unstructured environments. Yet, a major bottleneck of existing learning algorithms is their sample efficiency: since agents are often trained from scratch, they require lots of online interactions to explore complex environments and learn long-horizon tasks. But why train from scratch? Humans use a large repertoire of skills for learning new tasks – can we enable the same in artificial agents? In this thesis we present a skill-based robot learning framework that can effectively reuse prior experience to accelerate the learning of new tasks. We first present a scalable approach for extracting a large skill repertoire from offline agent experience. We then introduce algorithms that leverage the skill repertoire to accelerate the learning of new tasks in reinforcement learning (RL), imitation and meta-RL settings. We further demonstrate how our skill-based framework can incorporate diverse human video data to teach robotic agents. 
Finally, we introduce assisted teleoperation approaches for collecting the large robotic experience datasets we use for skill learning and discuss methods for skill learning from diverse-quality data via model-based and language-conditioned offline reinforcement learning. The work presented here is a step towards data driven learning of diverse, long-horizon robotic tasks. We hope that it will contribute to building agents that can eventually assist humans in a wide range of tasks in their everyday lives. xxv Chapter 1 Introduction A long-standing goal of robotics research is to build intelligent agents that can assist humans in the tasks of their everyday lives. To fulfill this promise, such agents need to leave their confined lab environments and venture out into the the real world, into peoples’ homes and workplaces. Yet, programming robots to operate in such diverse, unstructured environments is a challenging task and to this day most robots are deployed in structured environments like factories or logistics centers. In recent years, the introduction of modern deep learning techniques into the field of robotics has opened a promising path towards building such intelligent assistants. Learned policies have been shown to perform a variety of robot tasks in unstructured environments (Levine et al., 2016; Kalashnikov et al., 2018; Lu et al., 2021), yet are so far confined to simple, short-horizon tasks like pick-and-place. At the same time, these algorithms require a large amount of experience to learn such behaviors. In reality, this means operating a fleet of robots for months to learn how to pick objects from a cluttered bin (Levine et al., 2016; Kalashnikov et al., 2018). While the initial results are impressive, it is clear that the current learning algorithms will not scale to the wide range of tasks we would want an intelligent assistant to learn. In contrast, humans are remarkably efficient at learning new tasks. We can learn how to master a new skill, like cooking a new dish or using a new tool, within only a few tries. Equipping artificial agents with similarly efficient learning capabilities is crucial in order to develop the generally capable intelligent assistants of the future. So what is lacking in current machine learning approaches that allows humans to be so efficient at learning new tasks? A key problem with current approaches is 1 that they aim to learn new tasks from scratch, i.e., without leveraging prior experience (Levine et al., 2016; Kalashnikov et al., 2018; Lu et al., 2021). In contrast, humans rely on a large amount of prior experience, e.g., from cooking different dishes or working with different tools, to accelerate task learning. Thus, the goal of my research is to learn a wide range of long-horizon robotic tasks efficiently by leveraging prior experience. Pick up Bread Diverse Offline Data Extract Skills Accelerate Learning of Target Task "Set the Table” 1 2 Place in Oven Put in Sink - Place food on plate - Put plate on table - Put bowl next to plate Figure 1.1: We propose a scalable skill-based robot learning framework that leverages diverse prior experience data to accelerate learning new tasks. We extract a large number of reusable short- horizon behaviors, i.e., skills, from prior experi- ence. Then, we introduce RL and imitation algo- rithms for efficiently learning new tasks with large skill repertoires. A key question we are trying to answer in this thesis is: how can we leverage prior experi- ence efficiently? 
Humans use prior experience by extracting a set of skills, which we use to learn new tasks faster. Skills distill a diverse range of prior experience into reusable short- horizon behaviors. Inspired by this, my research has focused on developing a scalable skill- based framework for learning long-horizon tasks efficiently using diverse prior experience in the form of offline data (see Figure 1.1). Par- ticularly, the work I will present in this the- sis has concentrated on answering three central questions: (1) How can we extract a large skill repertoire from prior experience data and use it to learn new tasks faster? (Part I) (2) How can we collect such robot experience data at scale? (Part II) (3) Finally, how can we extend skill learning to diverse datasets with varying data quality? (Part III) 1.1 A Framework for Skill-based Robot Learning We introduce a framework for skill-based robot learning, comprised of an approach for extracting a large set of skills from diverse prior experience datasets, and algorithms for leveraging these skills in reinforcement learning, imitation learning and meta-learning tasks. Our approach substantially 2 accelerates learning of new tasks using prior experience and can incorporate robot and human demonstrations. Learning Skills from Unstructured Data (Chapter 2) To learn diverse long-horizon tasks, we need a rich repertoire of pre-trained skills. Thus, the first question is: how can we learn such a repertoire of skills at scale? Prior works have manually defined a small number of skills and trained separate policies for each of them (Oh et al., 2017; Lee et al., 2019b; 2020). However, this requires a lot of human effort and does not scale to a large number of skills. Instead, we propose to learn skills from large amounts of prior experience data without need for human skill definition. Such experience data can for example be collected via human teleoperation. Given a dataset of observation-action trajectories, we propose an end-to-end trainable model that jointly learns a continuous representation of skills and a policy to execute them (Pertsch et al., 2020a). By learning skills from prior experience instead of hand-defining them and by embedding them into a continuous representation, we can scale skill learning to datasets with tens of thousands of trajectories, comprising a wide variety of skills. This repertoire builds the foundation for our skill-based reinforcement learning (RL) and imitation learning (IL) approaches. Task Learning with Large Skill Repertoires (Chapter 2, Chapter 3) Once we learned a large number of skills, the natural next question is: how can we effectively use this skill repertoire to learn new tasks? A typical way to leverage a set of pre-trained skills is by training a high-level policy that chooses which skill to perform at a given time step (Sutton et al., 1999b; Bacon et al., 2017; Kulkarni et al., 2016). Indeed, many prior works with small skill sets train such policies successfully via RL (Kulkarni et al., 2016; Nachum et al., 2018; Merel et al., 2020). However, in comparison to these prior works, the repertoire of skills we learn from prior experience is much larger, posing a challenging exploration problem: because there are so many skills to choose from, it becomes challenging for a high-level policy to exhaustively explore the space of available skills. So how can we learn effectively with large skills sets? 
3 Our solution is inspired by how humans leverage skills: we naturally have a very large set of skills we can perform, yet we are efficient at exploring this skill set by leveraging a strong skill prior, an intuition that tells us which skills make sense to explore in any given situation. For example it is intuitive to put pasta into boiling water instead of throwing it onto the kitchen floor, even though we know how to execute both skills. To provide our robotic agents with a similar prior, we propose SPiRL (Skill-Prior RL, Pertsch et al. (2020a)), an approach that learns such a skill prior directly from prior experience data. SPiRL learns a distribution over the continuous skill representation, which captures the likelihood of a given skill being executed in the current state. Our approach modifies the common maximum entropy RL framework (Haarnoja et al., 2018a) to use this skill prior as guidance for the high-level policy during training. In our experiments we show that by combining the rich set of learned skills with our learned skill prior, we can learn complex chains of manipulation tasks, e.g., with a 7DOF Franka Panda in a simulated kitchen. In Chapter 3 we show that our skill-based learning approach can naturally incorporate demon- strations of the target task (Pertsch et al., 2021). Even with a handful of demonstrations we are able to learn new tasks substantially faster. In contrast to prior works in demonstration-guided RL (Rajeswaran et al., 2018; Vecerik et al., 2017; Zhu et al., 2018), our skill-based approach is able to learn from fewer demonstrations, even for long-horizon tasks. Intuitively, by leveraging prior experience via pre-trained skills, we merely need to use the demonstrations to learn how to chain familiar skills instead of learning how to perform the skills themselves. Learning Robot Tasks from Human Demonstrations (Chapter 4) In the previous two chapters, we have introduced a scalable framework for skill-based RL and imitation that can effectively leverage prior experience to learn new tasks quickly. However, robotic data, especially demonstra- tions, are notoriously expensive to collect. Thus we asked: can our skill-based learning framework be used to incorporate other sources of data that are more widely available, like human video demonstrations? Intuitively, the semantic abstraction afforded by skills allows transfer across environments and even agent embodiments: for example both, humans and robots, can perform a 4 skill like opening a microwave in their respective environment. Thus, can skills help to incorporate human data? Diverse Human Video Data Open Microwave Turn on Stove Open Cupboard 3 min of Human Video Demonstrations 1 2 3 Fast Robot Task Learning Open Microwave Open Cupboard Turn on Stove Figure 1.2: Our skill-based framework can teach a simulated robot new tasks using just 3 min of human video demonstrations, collected in my kitchen. We use diverse human videos (EPIC Kitchens, Damen et al. (2021)) to guide the robot’s exploration over skills. In Chapter 4, we leverage our existing skill- based framework to enable joint learning from human and robot data (Pertsch et al. (2022), see Figure 1.2). We use an existing large-scale human video dataset, EPIC Kitchens (Damen et al., 2021), to learn a prior over skills that informs robot exploration. EPIC Kitchens con- sists of 100 hours of human ego-centric videos of kitchen interactions, annotated with the per- formed skills. 
We additionally collect 3 minutes of human target task demonstrations in my own kitchen with a headmounted camera. We show that we can leverage our skill-based framework to jointly use the diverse EPIC Kitchens data and the small number of human target task demonstrations to guide robot learning of new tasks. With just 3 minutes of videos from my kitchen we speed up learning of long-horizon kitchen tasks with a simulated robotic arm by hundreds of thousands of environment interactions. This is a first step towards scalable usage of human data for learning robot policies. Meta-Learning Skill Usage (Chapter 5) A promising approach for further accelerating task learning is meta-learning, i.e., learning to learn. Meta-learning approaches explicitly train agents to quickly adapt to new tasks (Duan et al., 2016b; Finn et al., 2017a), yet when applied to control learning, meta-reinforcement learning approaches have typically been limited to simple, short- horizon tasks (Finn et al., 2017a; Rothfuss et al., 2019; Rakelly et al., 2019). In Chapter 5, we extend our skill-based learning framework to meta-learning tasks and demonstrate that by using skills extracted from large amounts of prior experience we can meta-learn policies on task distributions that are much longer horizon and diverse than prior meta-RL approaches. We also demonstrate that 5 the resulting approach can learn new tasks substantially faster than the non-meta-learning skill-RL approaches we introduced in Chapter 2, taking a further step towards sample efficient learning. 1.2 Scalable Robot Data Collection In Part I we introduced a skill-based learning framework for using large datasets of prior robot and human experience to accelerate robot learning. Yet, so far we have assumed access to a large and diverse dataset of robotic experience. However, unlike for images or text, in robotics we cannot download such data from the internet. Instead, it needs to be laboriously collected. In this second part of the thesis, we will focus on this data collection itself and discuss an approach for effectively collecting large amounts of robot data via human teleoperation. Conven- tionally, such teleoperation requires a human operator to fully control a single robot to demonstrate the tasks (Zhang et al., 2018b). As a result, robot data collection is slow and expensive. Yet, intuitively, when collecting large datasets via teleoperation, a lot of a human operator’s time is spent demonstrating the same behavior over and over, e.g., repeatedly picking up spoons to set a table. We substantially increase the effectiveness of human teleoperation by automating such repeated behav- iors. In Chapter 6, we introduce an approach that pairs a human teleoperator with a learned assistive policy during data collection. We demonstrate, that we can increase data collection throughput substantially, and even enable a single operator to teleoperate multiple robots in parallel (Dass et al., 2022). This is a crucial step towards the collection of large and diverse real-world robot datasets. 1.3 Learning Skills from Diverse Datasets In the first two parts of this thesis we have introduced approaches for collecting high-quality robot experience datasets and using them for accelerated task learning by extracting reusable skills. However, real-world datasets can contain large amounts of sub-optimal data, either due to failures in teleoperation or because they were collected at least in part with much cheaper autonomous agent 6 exploration. 
In Chapter 2, we showed that our skill extraction approaches suffer from such sub- optimal data. Yet, being able to learn from diverse-quality data is crucial for real-world applications. Thus, in this chapter we focus on skill-learning approaches that can extract high-performing skills from diverse-quality training datasets. In Chapter 7, we first explore using model-based techniques to learn a hierarchy of skills from sub-optimal data. Model-based reinforcement learning (Finn et al., 2016c; Ebert et al., 2018) is particularly suited for this application, since even sub-optimal data can be used to learn an accurate dynamics model. Concretely, we propose a goal-conditioned video prediction model, than learns a hierarchy of variable-length skills, purely from a simple prediction objective. We demonstrate that we can use this model to plan skill chains that successfully solve long-horizon tasks even when trained from highly sub-optimal data. In Chapter 8, we investigate another approach for skill learning from diverse data: language- conditioned offline reinforcement learning. Offline RL (Levine et al., 2020; Kumar et al., 2020; Kostrikov et al., 2022) uses temporal difference learning to extract high-performance policies even from sub-optimal data. By conditioning the learned policy on language commands, we can parameterize a wide range of skills. Yet, training a wide range of language-conditioned skills requires robot data with costly language instruction annotations. To minimize the required human effort for instruction annotation, we introduce approaches for automatic instruction re-labeling via pre-trained large language models (Brown et al., 2020; Zhang et al., 2022). We show that this leads to a wider range of learned skills and more effective downstream task learning. 7 Part I Using Skills for Effective Transfer in Reinforcement Learning & Imitation 8 Chapter 2 Reinforcement Learning with Learned Skills 2.1 Introduction Intelligent agents are able to utilize a large pool of prior experience to efficiently learn how to solve new tasks (Woodworth and Thorndike, 1901). In contrast, reinforcement learning (RL) agents typically learn each new task from scratch, without leveraging prior experience. Consequently, agents need to collect a large amount of experience while learning the target task, which is expensive, especially in the real world. On the other hand, there is an abundance of collected agent experience available in domains like autonomous driving (Caesar et al., 2020), indoor navigation (Mo et al., 2018), or robotic manipulation (Dasari et al., 2019; Cabi et al., 2019). With the widespread deployment of robots on streets or in warehouses the available amount of data will further increase in the future. However, the majority of this data is unstructured, without clear task or reward definitions, making it difficult to use for learning new tasks. In this work, our aim is to devise a scalable approach for leveraging such unstructured experience to accelerate the learning of new downstream tasks. One flexible way to utilize unstructured prior experience is by extracting skills, temporally extended actions that represent useful behaviors, which can be repurposed to solve downstream tasks. Skills can be learned from data without any task or reward information and can be transferred to new tasks and even new environment configurations. 
Prior work has learned skill libraries from data collected by humans (Schaal, 2006; Merel et al., 2019; 2020; Shankar et al., 2019; Lynch 9 et al., 2020) or by agents autonomously exploring the world (Hausman et al., 2018; Sharma et al., 2020b). To solve a downstream task using the learned skills, these approaches train a high-level policy whose action space is the set of extracted skills. The dimensionality of this action space scales with the number of skills. Thus, the large skill libraries extracted from rich datasets can, somewhat paradoxically, lead to worse learning efficiency on the downstream task, since the agent needs to collect large amounts of experience to perform the necessary exploration in the space of skills (Jong et al., 2008). The key idea of this work is to learn a prior over skills along with the skill library to guide exploration in skill space and enable efficient downstream learning, even with large skill spaces. Intuitively, the prior over skills is not uniform: if the agent holds the handle of a kettle, it is more promising to explore a pick-up skill than a sweeping skill (see Fig. 2.1). To implement this idea, we design a stochastic latent variable model that learns a continuous embedding space of skills and a prior distribution over these skills from unstructured agent experience. We then show how to naturally incorporate the learned skill prior into maximum-entropy RL algorithms for efficient learning of downstream tasks. To validate the effectiveness of our approach, SPiRL (Skill-Prior RL), we apply it to complex, long-horizon navigation and robot manipulation tasks. We show that through the transfer of skills we can use unstructured experience for accelerated learning of new downstream tasks and that learned skill priors are essential to efficiently utilize rich experience datasets. In summary, our contributions in this chapter are threefold: (1) we design a model for jointly learning an embedding space of skills and a prior over skills from unstructured data, (2) we extend maximum-entropy RL to incorporate learned skill priors for downstream task learning, and (3) we show that learned skill priors accelerate learning of new tasks across three simulated navigation and robot manipulation tasks. 10 Open Microwave Move Kettle Slide Cabinet Door Skill Library Microwave Opened Kettle Grasped slide move open slide move open Skill Priors 1 2 3 Efficient Downstream Task Learning Figure 2.1: Intelligent agents can use a large library of acquired skills when learning new tasks. Instead of exploring skills uniformly, they can leverage priors over skills as guidance, based e.g., on the current environment state. Such priors capture which skills are promising to explore, like moving a kettle when it is already grasped, and which are less likely to lead to task success, like attempting to open an already opened microwave. In this work, we propose to jointly learn an embedding space of skills and a prior over skills from unstructured data to accelerate the learning of new tasks. 2.2 Related Work The goal of our work is to leverage prior experience for accelerated learning of downstream tasks. Meta-learning approaches (Finn et al., 2017b; Rakelly et al., 2019) similarly aim to extract useful priors from previous experience to improve the learning efficiency for unseen tasks. However, they require a defined set of training tasks and online data collection during pre-training and therefore cannot leverage large offline datasets. 
In contrast, our model learns skills fully offline from unstructured data. Approaches that operate on such offline data are able to leverage large existing datasets (Dasari et al., 2019; Cabi et al., 2019) and can be applied to domains where data collection is particularly costly or safety critical (Levine et al., 2020). A number of works have recently explored the offline reinforcement learning setting (Levine et al., 2020; Fujimoto et al., 2019; Jaques et al., 2019; Kumar et al., 2019; Wu et al., 2019), in which a task needs to be learned purely from logged agent experience without any environment interactions. It has also been shown how offline RL can be used to accelerate online RL (Nair et al., 2020). However, these approaches require the experience to be annotated with rewards for the downstream task, which are challenging to provide for large, 11 real-world datasets, especially when the experience is collected across a wide range of tasks. Our approach based on skill extraction, on the other hand, does not require any reward annotation on the offline experience data and, once extracted, skills can be reused for learning a wide range of downstream tasks. More generally, the problem of inter-task transfer has been studied for a long time in the RL community (Taylor and Stone, 2009). The idea of transferring skills between tasks dates back at least to the SKILLS (Thrun and Schwartz, 1995) and PolicyBlocks (Pickett and Barto, 2002) algorithms. Learned skills can be represented as sub-policies in the form of options (Sutton et al., 1999a; Bacon et al., 2017), as subgoal setter and reacher functions (Gupta et al., 2019; Mandlekar et al., 2020a) or as discrete primitive libraries (Schaal, 2006; Lee et al., 2018b). Recently, a number of works have explored the embedding of skills into a continuous skill space via stochastic latent variable models (Hausman et al., 2018; Merel et al., 2019; Kipf et al., 2019; Merel et al., 2020; Shankar et al., 2019; Whitney et al., 2020; Lynch et al., 2020). When using powerful latent variable models, these approaches are able to represent a very large number of skills in a compact embedding space. However, the exploration of such a rich skill embedding space can be challenging, leading to inefficient downstream task learning (Jong et al., 2008). Our work introduces a learned skill prior to guide the exploration of the skill embedding space, enabling efficient learning on rich skill spaces. Learned behavior priors are commonly used to guide task learning in offline RL approaches (Fu- jimoto et al., 2019; Jaques et al., 2019; Wu et al., 2019) in order to avoid value overestimation for actions outside of the training data distribution. Recently, action priors have been used to leverage offline experience for learning downstream tasks (Siegel et al., 2020). Crucially, our approach learns priors over temporally extended actions (i.e., skills) allowing it to scale to complex, long-horizon downstream tasks. 2.3 Approach Our goal is to leverage skills extracted from large, unstructured datasets to accelerate the learning of new tasks. Scaling skill transfer to large datasets is challenging, since learning the downstream 12 task requires picking the appropriate skills from an increasingly large library of extracted skills. In this work, we propose to use learned skill priors to guide exploration in skill space and allow for efficient skill transfer from large datasets. 
We decompose the problem of prior-guided skill transfer into two sub-problems: (1) the extraction of skill embedding and skill prior from offline data, and (2) the prior-guided learning of downstream tasks with a hierarchical policy.

2.3.1 Problem Formulation

We assume access to a dataset $\mathcal{D}$ of pre-recorded agent experience in the form of state-action trajectories $\tau_i = \{(s_0, a_0), \ldots, (s_{T_i}, a_{T_i})\}$. This data can be collected using previously trained agents across a diverse set of tasks (Fu et al., 2020; Gulcehre et al., 2020), through agents autonomously exploring their environment (Hausman et al., 2018; Sharma et al., 2020b), via human teleoperation (Schaal et al., 2005; Gupta et al., 2019; Mandlekar et al., 2018; Lynch et al., 2020) or any combination of these. Crucially, we aim to leverage unstructured data that does not have annotations of tasks or sub-skills and does not contain reward information to allow for scalable data collection on real-world systems. In contrast to imitation learning problems we do not assume that the training data contains complete solutions for the downstream task. Hence, we focus on transferring skills to new problems.

The downstream learning problem is formulated as a Markov decision process (MDP) defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma\}$ of states, actions, transition probability, reward, initial state distribution, and discount factor. We aim to learn a policy $\pi_\theta(a|s)$ with parameters $\theta$ that maximizes the discounted sum of rewards $J(\theta) = \mathbb{E}_\pi\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$, where $T$ is the episode horizon.

2.3.2 Learning Continuous Skill Embedding and Skill Prior

We define a skill $\mathbf{a}^i$ as a sequence of actions $\{a^i_t, \ldots, a^i_{t+H-1}\}$ with fixed horizon $H$. Using fixed-length skills allows for scalable skill learning and has proven to be effective in prior works (Merel et al., 2019; 2020; Gupta et al., 2019; Whitney et al., 2020; Mandlekar et al., 2020a; Fang et al., 2019). Other work has proposed to learn semantic skills of flexible length (Kipf et al., 2019; Shankar et al., 2019; Pertsch et al., 2020b) and our model can be extended to include similar approaches, but we leave this for future work.

Figure 2.2: Deep latent variable model for joint learning of skill embedding and skill prior. Given a state-action trajectory from the dataset, the skill encoder maps the action sequence to a posterior distribution $q(z|\mathbf{a}^i)$ over latent skill embeddings. The action trajectory gets reconstructed by passing a sample from the posterior through the skill decoder. The skill prior maps the current environment state to a prior distribution $p_{\mathbf{a}}(z|s_1)$ over skill embeddings. Colorful arrows indicate the propagation of gradients from the reconstruction, regularization and prior training objectives.

To learn a low-dimensional skill embedding space $\mathcal{Z}$, we train a stochastic latent variable model $p(\mathbf{a}^i|z)$ of skills using the offline dataset (see Fig. 2.2). We randomly sample $H$-step trajectories from the training sequences and maximize the following evidence lower bound (ELBO):

$$\log p(\mathbf{a}^i) \geq \mathbb{E}_q\big[\underbrace{\log p(\mathbf{a}^i|z)}_{\text{reconstruction}} - \beta\,\underbrace{\big(\log q(z|\mathbf{a}^i) - \log p(z)\big)}_{\text{regularization}}\big]. \quad (2.1)$$

Here, $\beta$ is a parameter that is commonly used to tune the weight of the regularization term (Higgins et al., 2017).
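To make the objective in Eq. 2.1 concrete, the following is a minimal PyTorch-style sketch of the skill embedding model; it is not the exact architecture used in this thesis. The module layout, layer sizes, and the $\beta$ value are illustrative assumptions, and the Gaussian reconstruction log-likelihood is replaced by a mean-squared-error surrogate.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class SkillVAE(nn.Module):
    """Minimal skill embedding model: encoder q(z|a^i) and decoder p(a^i|z)."""
    def __init__(self, action_dim, horizon, latent_dim=10, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(action_dim * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))        # mean and log-std of q(z|a^i)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon))  # reconstructed action sequence

    def forward(self, actions):                       # actions: [B, H, action_dim]
        mu, log_std = self.encoder(actions.flatten(1)).chunk(2, dim=-1)
        q = Normal(mu, log_std.exp())                 # skill posterior q(z|a^i)
        z = q.rsample()                               # reparameterized sample
        recon = self.decoder(z).view_as(actions)
        return recon, q

def elbo_loss(model, actions, beta=5e-4):
    """Negative ELBO of Eq. 2.1: reconstruction plus beta-weighted KL to p(z) = N(0, I)."""
    recon, q = model(actions)
    p = Normal(torch.zeros_like(q.loc), torch.ones_like(q.scale))  # fixed unit Gaussian prior
    reconstruction = ((recon - actions) ** 2).mean()
    regularization = kl_divergence(q, p).sum(-1).mean()
    return reconstruction + beta * regularization
```

The state-conditioned skill prior $p_{\mathbf{a}}(z|s_1)$ described in the following paragraphs can be added as an additional network head that is trained with the reverse KL objective while gradients into the skill encoder are stopped.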
We optimize the objective in Eq. 2.1 using amortized variational inference with an inference network $q(z|\mathbf{a}^i)$ (Kingma and Welling, 2014; Rezende et al., 2014). To learn a rich skill embedding space we implement skill encoder $q(z|\mathbf{a}^i)$ and decoder $p(\mathbf{a}^i|z)$ as deep neural networks that output the parameters of the Gaussian posterior and output distributions. The prior $p(z)$ is set to be unit Gaussian $\mathcal{N}(0, I)$. Once trained, we can sample skills from our model by sampling latent variables $z \sim \mathcal{N}(0, I)$ and passing them through the decoder $p(\mathbf{a}^i|z)$. In Section 2.3.3 we show how to use this generative model of skills for learning hierarchical RL policies.

To better guide downstream learning, we learn a prior over skills along with the skill embedding model. We therefore introduce another component in our model: the skill prior $p_{\mathbf{a}}(z|\cdot)$. The conditioning of this skill prior can be adjusted to the environment and task at hand, but should be informative about the set of skills that are meaningful to explore in a given situation. Possible choices include the embedding of the last executed skill $z_{t-1}$ or the current environment state $s_t$. In this work we focus on learning a state-conditioned skill prior $p_{\mathbf{a}}(z|s_t)$. Intuitively, the current state should provide a strong prior over which skills are promising to explore and, importantly, which skills should not be explored in the current situation (see Fig. 2.1). To train the skill prior we minimize the Kullback-Leibler divergence between the predicted prior and the inferred skill posterior: $\mathbb{E}_{(s,\mathbf{a}^i)\sim\mathcal{D}}\,D_{KL}\big(q(z|\mathbf{a}^i), p_{\mathbf{a}}(z|s_t)\big)$. Using the reverse KL divergence $D_{KL}(q, p)$ instead of $D_{KL}(p, q)$ ensures that the learned prior is mode-covering (Bishop, 2006), i.e., represents all observed skills in the current situation. Instead of training the skill prior after training the skill embedding model, we can jointly optimize both models and ensure stable convergence by stopping gradients from the skill prior objective into the skill encoder. We experimented with different parameterizations for the skill prior distribution, in particular multi-modal distributions such as Gaussian mixture models and normalizing flows (Rezende and Mohamed, 2015; Dinh et al., 2017), but found simple Gaussian skill priors to work equally well in our experiments. For further implementation details, see appendix, Section B.

2.3.3 Skill Prior Regularized Reinforcement Learning

To use the learned skill embedding for downstream task learning, we employ a hierarchical policy learning scheme by using the skill embedding space as the action space of a high-level policy. Concretely, instead of learning a policy over actions $a \in \mathcal{A}$ we learn a policy $\pi_\theta(z|s_t)$ that outputs skill embeddings, which we decode into action sequences $\{a^i_t, \ldots, a^i_{t+H-1}\} \sim p(\mathbf{a}^i|z)$ using the learned skill decoder.¹ We execute these actions for $H$ steps before sampling the next skill from the high-level policy. This hierarchical structure allows for temporal abstraction, which facilitates long-horizon task learning (Sutton et al., 1999a). We can cast the problem of learning the high-level policy into a standard MDP by replacing the action space $\mathcal{A}$ with the skill space $\mathcal{Z}$, single-step rewards with $H$-step rewards $\tilde{r} = \sum_{t=1}^{H} r_t$, and single-step transitions with $H$-step transitions $s_{t+H} \sim p(s_{t+H}|s_t, z_t)$. We can then use conventional model-free RL approaches to maximize the return of the high-level policy $\pi_\theta(z|s_t)$.

¹We also experimented with models that directly condition the decoder on the current state, but found downstream RL to be less stable (see appendix, Section D).

Algorithm 1 SPiRL: Skill-Prior RL
1: Inputs: $H$-step reward function $\tilde{r}(s_t, z_t)$, discount $\gamma$, target divergence $\delta$, learning rates $\lambda_\pi, \lambda_Q, \lambda_\alpha$, target update rate $\tau$.
2: Initialize replay buffer $\mathcal{D}$, high-level policy $\pi_\theta(z_t|s_t)$, critic $Q_\phi(s_t, z_t)$, target network $Q_{\bar{\phi}}(s_t, z_t)$
3: for each iteration do
4:   for every $H$ environment steps do
5:     $z_t \sim \pi(z_t|s_t)$ ▷ sample skill from policy
6:     $s_{t'} \sim p(s_{t+H}|s_t, z_t)$ ▷ execute skill in environment
7:     $\mathcal{D} \leftarrow \mathcal{D} \cup \{s_t, z_t, \tilde{r}(s_t, z_t), s_{t'}\}$ ▷ store transition in replay buffer
8:   end for
9:   for each gradient step do
10:    $\bar{Q} = \tilde{r}(s_t, z_t) + \gamma\big[Q_{\bar{\phi}}(s_{t'}, \pi_\theta(z_{t'}|s_{t'})) - \alpha D_{KL}\big(\pi_\theta(z_{t'}|s_{t'}), p_{\mathbf{a}}(z_{t'}|s_{t'})\big)\big]$ ▷ compute Q-target
11:    $\theta \leftarrow \theta - \lambda_\pi \nabla_\theta\big[Q_\phi(s_t, \pi_\theta(z_t|s_t)) - \alpha D_{KL}\big(\pi_\theta(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)\big)\big]$ ▷ update policy weights
12:    $\phi \leftarrow \phi - \lambda_Q \nabla_\phi\big[\tfrac{1}{2}\big(Q_\phi(s_t, z_t) - \bar{Q}\big)^2\big]$ ▷ update critic weights
13:    $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha\big[\alpha \cdot \big(D_{KL}(\pi_\theta(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)) - \delta\big)\big]$ ▷ update alpha
14:    $\bar{\phi} \leftarrow \tau\phi + (1-\tau)\bar{\phi}$ ▷ update target network weights
15:  end for
16: end for
17: return trained policy $\pi_\theta(z_t|s_t)$

This naive approach struggles when training a policy on a very rich skill space $\mathcal{Z}$ that encodes many different skills. While the nominal dimensionality of the skill space might be small, its continuous nature allows the model to embed an arbitrary number of different behaviors. Therefore, the effective dimensionality of the high-level policy's action space scales with the number of embedded skills. When using a large offline dataset $\mathcal{D}$ with diverse behaviors, the number of embedded skills can grow rapidly, leading to a challenging exploration problem when training the high-level policy. For more efficient exploration, we propose to use the learned skill prior to guide the high-level policy. We will next show how the skill prior can be naturally integrated into maximum-entropy RL algorithms.

Maximum entropy RL (Ziebart, 2010; Levine, 2018) augments the training objective of the policy with a term that encourages maximization of the policy's entropy along with the return:

$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \gamma^t r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(a_t|s_t)\big)\Big] \quad (2.2)$$

The added entropy term is equivalent to the negated KL divergence between the policy and a uniform action prior $U(a_t)$: $\mathcal{H}(\pi(a_t|s_t)) = -\mathbb{E}_\pi\big[\log \pi(a_t|s_t)\big] \propto -D_{KL}\big(\pi(a_t|s_t), U(a_t)\big)$ up to a constant. However, in our case we aim to regularize the policy towards a non-uniform, learned skill prior to guide exploration in skill space. We can therefore replace the entropy term with the negated KL divergence from the learned prior, leading to the following objective for the high-level policy:

$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \tilde{r}(s_t, z_t) - \alpha D_{KL}\big(\pi(z_t|s_t), p_{\mathbf{a}}(z_t|s_t)\big)\Big] \quad (2.3)$$

We can modify state-of-the-art maximum-entropy RL algorithms, such as Soft Actor-Critic (SAC, Haarnoja et al. (2018a;b)), to optimize this objective.
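To complement the pseudocode, the snippet below is a hedged PyTorch-style sketch of one gradient step of skill-prior-regularized SAC: the entropy bonus of standard SAC is replaced by the KL divergence to the learned skill prior in both the Q-target and the policy loss, and the temperature $\alpha$ is adapted toward the target divergence $\delta$. The network interfaces (`policy(s)` and `skill_prior(s)` returning torch distributions, `critic(s, z)` returning Q-values) and all hyperparameter values are illustrative assumptions rather than the exact implementation used in this thesis.

```python
import torch
from torch.distributions import kl_divergence

def spirl_update(batch, policy, critic, critic_target, skill_prior, log_alpha,
                 gamma=0.99, target_div=5.0):
    """One gradient step of skill-prior-regularized SAC (cf. Algorithm 1).
    `batch` holds H-step transitions (s, z, r_tilde, s_next) from the replay buffer."""
    s, z, r_tilde, s_next = batch
    alpha = log_alpha.exp()

    # Q-target: the KL to the skill prior replaces SAC's entropy bonus (Alg. 1, line 10).
    with torch.no_grad():
        pi_next = policy(s_next)                                  # distribution over skills z'
        z_next = pi_next.rsample()
        kl_next = kl_divergence(pi_next, skill_prior(s_next)).sum(-1)
        q_target = r_tilde + gamma * (critic_target(s_next, z_next) - alpha * kl_next)
    critic_loss = 0.5 * ((critic(s, z) - q_target) ** 2).mean()   # Alg. 1, line 12

    # Policy loss: maximize Q while staying close to the learned skill prior (Alg. 1, line 11).
    pi = policy(s)
    z_pi = pi.rsample()
    kl = kl_divergence(pi, skill_prior(s)).sum(-1)
    policy_loss = (alpha.detach() * kl - critic(s, z_pi)).mean()

    # Temperature update: drive the policy's divergence toward the target delta (Alg. 1, line 13),
    # following the standard dual-gradient direction for a KL constraint.
    alpha_loss = (log_alpha.exp() * (target_div - kl.detach())).mean()
    return critic_loss, policy_loss, alpha_loss
```

In practice the three losses are typically stepped with separate optimizers, mirroring common SAC implementations.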
We summarize our SPiRL approach in Algorithm 1 with changes to SAC marked in red. For a detailed derivation of the update rules, see appendix, Section A. Analogous to Haarnoja et al. (2018b) we can devise an automatic tuning strategy for the regularization weightα by defining a target divergence parameterδ (see Algorithm 1, appendix A). 2.4 Experiments Our experiments are designed to answer the following questions: (1) Can we leverage unstruc- tured datasets to accelerate downstream task learning by transferring skills? (2) Can learned skill 17 Maze Navigation Block Stacking Kitchen Environment 1 2 3 Training Data Target Tasks 4 Figure 2.3: For each environment we collect a diverse dataset from a wide range of training tasks (examples on top) and test skill transfer to more complex target tasks (bottom), in which the agent needs to: navigate a maze (left), stack as many blocks as possible (middle) and manipulate a kitchen setup to reach a target configuration ( right). All tasks require the execution of complex, long-horizon behaviors and need to be learned from sparse rewards. priors improve exploration during downstream task learning? (3) Are learned skill priors necessary to scale skill transfer to large datasets? 2.4.1 Environments & Comparisons We evaluate SPiRL on one simulated navigation task and two simulated robotic manipulation tasks (see Fig. 2.3). For each environment, we collect a large and diverse dataset of agent experience that allows to extract a large number of skills. To test our method’s ability to transfer to unseen downstream tasks, we vary task and environment setup between training data collection and downstream task. Maze Navigation. A simulated maze navigation environment based on the D4RL maze environ- ment (Fu et al., 2020). The task is to navigate a point mass agent through a maze between fixed start and goal locations. We use a planner-based policy to collect 85000 goal-reaching trajectories in randomly generated, small maze layouts and test generalization to a goal-reaching task in a 18 randomly generated, larger maze. The state is represented as a RGB top-down view centered around the agent. For downstream learning the agent only receives a sparse reward when in close vicinity to the goal. The agent can transfer skills, such as traversing hallways or passing through narrow doors, but needs to learn to navigate a new maze layout for solving the downstream task. Block Stacking. The goal of the agent is to stack as many blocks as possible in an environment with eleven blocks. We collect 37000 training sequences with a noisy, scripted policy that randomly stacks blocks on top of each other in a smaller environment with only five blocks. The state is represented as a RGB front view centered around the agent and it receives binary rewards for picking up and stacking blocks. The agent can transfer skills like picking up, carrying and stacking blocks, but needs to perform a larger number of consecutive stacks than seen in the training data on a new environment with more blocks. Kitchen Environment. A simulated kitchen environment based on Gupta et al. (2019). We use the training data provided in the D4RL benchmark (Fu et al., 2020), which consists of 400 teleoperated sequences in which the 7-DoF robot arm manipulates different parts of the environment (e.g., open microwave, switch on stove, slide cabinet door). During downstream learning the agent needs to execute an unseen sequence of multiple subtasks. 
It receives a sparse, binary reward for each successfully completed manipulation. The agent can transfer a rich set of manipulation skills, but needs to recombine them in new ways to solve the downstream task.

For further details on environment setup, data collection and training, see appendix, Sections B and C.

We compare the downstream task performance of SPiRL to several flat and hierarchical baselines that test the importance of learned skill embeddings and skill prior:

• Flat Model-Free RL (SAC). Trains an agent from scratch with Soft Actor-Critic (SAC, Haarnoja et al. (2018a)). This comparison tests the benefit of leveraging prior experience.

• Behavioral Cloning w/ finetuning (BC + SAC). Trains a supervised behavioral cloning (BC) policy from the offline data and finetunes it on the downstream task using SAC.

• Flat Behavior Prior (Flat Prior). Learns a single-step action prior on the primitive action space and uses it to regularize downstream learning as described in Section 2.3.3, similar to (Siegel et al., 2020). This comparison tests the importance of temporal abstraction through learned skills.

• Hierarchical Skill-Space Policy (SSP). Trains a high-level policy on the skill-embedding space of the model described in Section 2.3.2 but without skill prior, representative of (Merel et al., 2019; Kipf et al., 2019; Shankar et al., 2019). This comparison tests the importance of the learned skill prior for downstream task learning.

Figure 2.4: Downstream task learning curves for our method and all comparisons on the maze navigation, block stacking and kitchen environments. Both learned skill embedding and skill prior are essential for downstream task performance: single-action priors without temporal abstraction (Flat Prior) and learned skills without skill prior (SSP w/o Prior) fail to converge to good performance. Shaded areas represent standard deviation across three seeds.

2.4.2 Maze Navigation

We first evaluate SPiRL on the simulated maze navigation task. This task poses a hard exploration problem since the reward feedback is very sparse: following the D4RL benchmark (Fu et al., 2020) the agent receives a binary reward only when reaching the goal and therefore needs to explore large fractions of the maze without reward feedback. We hypothesize that learned skills and a prior that guides exploration are crucial for successful learning, particularly when external feedback is sparse.

Figure 2.5: Exploration behavior of our method (SPiRL) vs. alternative transfer approaches (Flat Prior, Skills w/o Prior) on the downstream maze task vs. random action sampling. Through learned skill embeddings and skill priors our method can explore the environment more widely. We visualize positions of the agent during 1M steps of exploration rollouts in blue and mark episode start and goal positions in green and red respectively.

In Fig. 2.4 (left) we show that only SPiRL is able to successfully learn a goal-reaching policy for the maze task; none of the baseline policies reaches the goal during training. To better understand this result, we compare the exploration behaviors of our approach and the baselines in Fig. 2.5: we collect rollouts by sampling from our skill prior and the single-step action prior and record the agent's position in the maze. To visualize the exploration behavior of skill-space policies without learned priors ("Skills w/o Prior") we sample skills uniformly from the skill space. Fig. 2.5 shows that only SPiRL is able to explore large parts of the maze, since targeted sampling of skills from the prior guides the agent to navigate through doorways and traverse hallways. Random exploration in skill space, in contrast, does not lead to good exploration behavior since the agent often samples skills that lead to collisions. The comparison to single-step action priors ("Flat Prior") shows that temporal abstraction is beneficial for coherent exploration. Finally, we show reuse of a single skill prior for a variety of downstream goals in appendix, Section H.

2.4.3 Robotic Manipulation

Next, we investigate the ability of SPiRL to scale to complex robotic manipulation tasks in the block stacking problem and in the kitchen environment. For both environments we find that using learned skill embeddings together with the extracted skill prior is essential to solve the task (see Fig. 2.4, middle and right; appendix Fig. A.2 for qualitative policy rollouts). In contrast, using non-hierarchical action priors ("Flat Prior") leads to performance similar to behavioral cloning of the training dataset, but fails to solve longer-horizon tasks. This shows the benefit of temporal abstraction through skills. The approach leveraging the learned skill space without guidance from the skill prior ("SSP w/o Prior") only rarely stacks blocks or successfully manipulates objects in the kitchen environment. Due to the large number of skills extracted from the rich training datasets, random exploration in skill space does not lead to efficient downstream learning. Instead, performance is comparable to or worse than learning from scratch without skill transfer. This underlines the importance of learned skill priors for scaling skill transfer to large datasets. Similar to prior work (Gupta et al., 2019), we find that a policy initialized through behavioral cloning is not amenable to efficient finetuning on complex, long-horizon tasks.

2.4.4 Ablation Studies

Figure 2.6: Ablation analysis of skill horizon ($H \in \{3, 10, 30\}$) and skill space dimensionality ($|\mathcal{Z}| \in \{2, 10, 30\}$) on the block stacking task, measured as the number of stacked blocks over environment steps.

We analyze the influence of skill horizon $H$ and dimensionality of the learned skill space $|\mathcal{Z}|$ on downstream performance in Fig. 2.6. We see that too short skill horizons do not afford sufficient temporal abstraction. Conversely, too long horizons make the skill exploration problem harder, since a larger number of possible skills gets embedded in the skill space. Therefore, the policy converges slower. We find that the dimensionality of the learned skill embedding space needs to be large enough to represent a sufficient diversity of skills. Beyond that, $|\mathcal{Z}|$ does not have a major influence on the downstream performance. We attribute this to the usage of the learned skill prior: even though the
We propose a deep latent variable model that jointly learns an embedding space of skills and a prior over these skills from offline data. We then extend maximum-entropy RL algorithms to incorporate both skill embedding and skill prior for efficient downstream learning. Finally, we evaluate SPiRL on challenging simulated navigation and robotic manipulation tasks and show that both, skill embedding and skill prior are essential for effective transfer from rich datasets. Future work can combine learned skill priors with methods for extracting semantic skills of flexible length from unstructured data (Shankar et al., 2019; Pertsch et al., 2020b). Further, skill priors are important in safety-critical applications, like autonomous driving, where random exploration is dangerous. Skill priors learned e.g. from human demonstration, can guide exploration to skills that do not endanger the learner or other agents. 23 Chapter 3 Demonstration-Guided Reinforcement Learning with Learned Skills 3.1 Introduction Policy Environment Demonstration-Guided RL Task-Specific Demonstrations Task-Agnostic Offline Data Figure 3.1: We leverage large, task-agnostic datasets collected across many different tasks for efficient demonstration-guided reinforcement learning by (1) acquiring a rich motor skill reper- toire from such offline data and (2) understanding and imitating the demonstrations based on the skill repertoire. Humans are remarkably efficient at acquir- ing new skills from demonstrations: often a sin- gle demonstration of the desired behavior and a few trials of the task are sufficient to master it (Bekkering et al., 2000; Al-Abood et al., 2001; Hodges et al., 2007). To allow for such efficient learning, we can leverage a large number of previously learned behaviors (Al-Abood et al., 2001; Hodges et al., 2007). Instead of imitating each of the demonstrated muscle movements, humans can extract the performed skills and use the rich repertoire of already acquired skills to efficiently reproduce the desired behavior. Demonstrations are also commonly used in reinforcement learning (RL) to guide exploration and improve sample efficiency (Vecerik et al., 2017; Hester et al., 2018; Rajeswaran et al., 2018; Nair et al., 2018; Zhu et al., 2018). However, such demonstration-guided RL approaches attempt 24 to learn tasks from scratch: analogous to a human trying to imitate a completely unseen behavior by following every demonstrated muscle movement, they try to imitate the primitive actions performed in the provided demonstrations. As with humans, such step-by-step imitation leads to brittle policies (Ross et al., 2011), and thus these approaches require many demonstrations and environment interactions to learn a new task. We propose to improve the efficiency of demonstration-guided RL by leveraging prior experience in the form of an offline “task-agnostic“ experience dataset, collected not on one but across many tasks (see Figure 3.1). Given such a dataset, our approach extracts reusable skills: robust short-horizon behaviors that can be recombined to learn new tasks. Like a human imitating complex behaviors via the chaining of known skills, we can use this repertoire of skills for efficient demonstration-guided RL on a new task by guiding the policy using the demonstrated skills instead of the primitive actions. 
Concretely, we propose Skill-based Learning with Demonstrations (SkiLD), a demonstration- guided RL algorithm that learns short-horizon skills from offline datasets and leverages them for following demonstrations of a new task. Across challenging navigation and robotic manipula- tion tasks this significantly improves the learning efficiency over prior demonstration-guided RL approaches. In summary, the contributions of our work are threefold: (1) we introduce the problem of leveraging task-agnostic offline datasets for accelerating demonstration-guided RL on unseen tasks, (2) we propose SkiLD, a skill-based algorithm for efficient demonstration-guided RL and (3) we show the effectiveness of our approach on a maze navigation and two complex robotic manipulation tasks. 3.2 Related Work Imitation learning. Learning from Demonstration, also known as imitation learning (Argall et al., 2009), is a common approach for learning complex behaviors by leveraging a set of demon- strations. Most prior approaches for imitation learning are either based on behavioral cloning 25 (BC, Pomerleau (1989)), which uses supervised learning to mimic the demonstrated actions, or inverse reinforcement learning (IRL, Abbeel and Ng (2004); Ho and Ermon (2016)), which infers a reward from the demonstrations and then trains a policy to optimize it. However, BC commonly suffers from distribution shift and struggles to learn robust policies (Ross et al., 2011), while IRL’s joint optimization of reward and policy can result in unstable training. Demonstration-guided RL. A number of prior works aim to mitigate these problems by combining reinforcement learning with imitation learning. They can be categorized into three groups: (1) approaches that use BC to initialize and regularize policies during RL training (Rajeswaran et al., 2018; Nair et al., 2018), (2) approaches that place the demonstrations in the replay buffer of an off-policy RL algorithm (Vecerik et al., 2017; Hester et al., 2018), and (3) approaches that augment the environment rewards with rewards extracted from the demonstrations (Zhu et al., 2018; Peng et al., 2018; Merel et al., 2017). While these approaches improve the efficiency of RL, they treat each task as an independent learning problem and thus require many demonstrations to learn effectively, which is especially expensive since a new set of demonstrations needs to be collected for every new task. Online RL with offline datasets. As an alternative to expensive task-specific demonstrations, multiple recent works have proposed to accelerate reinforcement learning by leveraging task- agnostic experience in the form of large datasets collected across many tasks (Pertsch et al., 2020a; Siegel et al., 2020; Nair et al., 2020; Ajay et al., 2021; Singh et al., 2021; 2020). In contrast to demonstrations, such task-agnostic datasets can be collected cheaply from a variety of sources like autonomous exploration (Hausman et al., 2018; Sharma et al., 2020b) or human tele-operation (Gupta et al., 2019; Mandlekar et al., 2018; Lynch et al., 2020), but will lead to slower learning than demonstrations since the data is not specific to the downstream task. Skill-based RL. One class of approaches for leveraging such offline datasets that is particularly suited for learning long-horizon behaviors is skill-based RL (Hausman et al., 2018; Merel et al., 2019; Kipf et al., 2019; Merel et al., 2020; Shankar et al., 2019; Whitney et al., 2020; Gupta et al., 2019; Lee et al., 2020; Lynch et al., 2020; Pertsch et al., 2020b;a). 
These methods extract reusable skills from task-agnostic datasets and learn new tasks by recombining them. Yet, such approaches perform reinforcement learning over the set of extracted skills to learn the downstream task. Although more efficient than RL over primitive actions, they still require many environment interactions to learn long-horizon tasks. In our work we combine the best of both worlds: by using large, task-agnostic datasets and a small number of task-specific demonstrations, we accelerate the learning of long-horizon tasks while reducing the number of required demonstrations.

3.3 Approach

Our goal is to use skills extracted from task-agnostic prior experience data to improve the efficiency of demonstration-guided RL on a new task. We aim to leverage a set of provided demonstrations by following the performed skills as opposed to the primitive actions. Therefore, we need a model that can (1) leverage prior data to learn a rich set of skills and (2) identify the skills performed in the demonstrations in order to follow them. Next, we formally define our problem, summarize relevant prior work on RL with learned skills and then describe our demonstration-guided RL approach.

Figure 3.2: Our approach, SkiLD, combines task-agnostic experience and task-specific demonstrations to efficiently learn target tasks in three steps: (1) extract a skill representation from task-agnostic offline data, (2) learn a task-agnostic skill prior from task-agnostic data and a task-specific skill posterior from demonstrations, and (3) learn a high-level skill policy for the target task using prior knowledge from both task-agnostic offline data and task-specific demonstrations. Left: Skill embedding model with skill extractor (yellow) and closed-loop skill policy (blue). Middle: Training of skill prior (green) from task-agnostic data and skill posterior (purple) from demonstrations. Right: Training of high-level skill policy (red) on a downstream task using the pre-trained skill representation and regularization via the skill prior and posterior, mediated by the demonstration discriminator D(s).

3.3.1 Preliminaries

Problem Formulation. We assume access to two types of datasets: a large task-agnostic offline dataset and a small task-specific demonstration dataset. The task-agnostic dataset $\mathcal{D} = \{s_t, a_t, \ldots\}$ consists of trajectories of meaningful agent behaviors, but includes no demonstrations of the target task. We only assume that its trajectories contain short-horizon behaviors that can be reused to solve the target task. Such data can be collected without a particular task in mind using a mix of sources, e.g., via human teleoperation, autonomous exploration, or through policies trained for other tasks. Since it can be used to accelerate many downstream tasks that utilize similar short-term behaviors we call it task-agnostic. In contrast, the task-specific data is a much smaller set of demonstration trajectories $\mathcal{D}_{\text{demo}} = \{s^d_t, a^d_t, \ldots\}$ that are specific to a single target task.

The downstream learning problem is formulated as a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \rho, \gamma)$ of states, actions, transition probabilities, rewards, initial state distribution, and discount factor.
We aim to learn a policy π_θ(a|s) with parameters θ that maximizes the discounted sum of rewards $J(\theta) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$, where T is the episode horizon.

Skill Prior RL. Our goal is to extract skills from task-agnostic experience data and reuse them for demonstration-guided RL. Prior work has investigated the reuse of learned skills for accelerating RL (Pertsch et al., 2020a). In this section, we will briefly summarize their proposed approach Skill Prior RL (SPiRL) and then describe how our approach improves upon it in the demonstration-guided RL setting.

SPiRL defines a skill as a sequence of H consecutive actions $\mathbf{a} = \{a_t, \ldots, a_{t+H-1}\}$, where the skill horizon H is a hyperparameter. It uses the task-agnostic data to jointly learn (1) a generative model of skills $p(\mathbf{a} \mid z)$, that decodes latent skill embeddings z into executable action sequences $\mathbf{a}$, and (2) a state-conditioned prior distribution p(z|s) over skill embeddings. For learning a new downstream task, SPiRL trains a high-level skill policy π_θ(z|s) whose outputs get decoded into executable actions using the pre-trained skill decoder. Crucially, the learned skill prior is used to guide the policy during downstream RL by maximizing the following divergence-regularized RL objective:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} r(s_t, z_t) - \alpha D_{\mathrm{KL}}\big(\pi_\theta(z_t \mid s_t),\, p(z_t \mid s_t)\big)\right]. \qquad (3.1)$$

Here, the KL-divergence term ensures that the policy remains close to the learned skill prior, guiding exploration during RL. By combining this guided exploration with temporal abstraction via the learned skills, SPiRL substantially improves the efficiency of RL on long-horizon tasks.

3.3.2 Skill Representation Learning

We leverage SPiRL's skill embedding model for learning our skill representation. We follow prior work on skill-based RL (Lynch et al., 2020; Ajay et al., 2021) and increase the expressiveness of the skill representation by replacing SPiRL's low-level skill decoder $p(\mathbf{a} \mid z)$ with a closed-loop skill policy π_φ(a|s,z) with parameters φ that is conditioned on the current environment state. In our experiments we found this closed-loop decoder to improve performance (see Section C for an empirical comparison).

Figure 3.2 (left) summarizes our skill learning model. It consists of two parts: the skill inference network q_ω(z|s_{0:H-1}, a_{0:H-2}) with parameters ω and the closed-loop skill policy π_φ(a_t|s_t, z_t). Note that in contrast to SPiRL the skill inference network is state-conditioned to account for the state-conditioned low-level policy. During training we randomly sample an H-step state-action trajectory from the task-agnostic dataset and pass it to the skill inference network, which predicts the low-dimensional skill embedding z. This skill embedding is then input into the low-level policy π_φ(a_t|s_t, z) for every input state. The policy is trained to imitate the given action sequence, thereby learning to reproduce the behaviors encoded by the skill embedding z.

The latent skill representation is optimized using variational inference, which leads to the full skill learning objective:

$$\max_{\phi,\omega}\; \mathbb{E}_{q}\Big[\underbrace{\prod_{t=0}^{H-2} \log \pi_\phi(a_t \mid s_t, z)}_{\text{behavioral cloning}} \;-\; \beta\,\big(\underbrace{\log q_\omega(z \mid s_{0:H-1}, a_{0:H-2}) - \log p(z)}_{\text{embedding regularization}}\big)\Big]. \qquad (3.2)$$

We use a unit Gaussian prior p(z) and weight the embedding regularization term with a factor β (Higgins et al., 2017).
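To make the skill representation learning step concrete, below is a minimal sketch of the training objective in Equation 3.2, assuming PyTorch; the module names (SkillInference, ClosedLoopSkillPolicy), network sizes, and hyperparameter values are illustrative placeholders rather than the thesis implementation. The behavioral-cloning term is written as a mean-squared action error, which corresponds to a Gaussian action log-likelihood up to constants.

```python
# Minimal sketch of the skill embedding training step (Eq. 3.2), assuming PyTorch.
# All names and hyperparameters are illustrative, not taken from the thesis code.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

H, state_dim, action_dim, z_dim, beta = 10, 30, 7, 10, 5e-4

class SkillInference(nn.Module):           # q_w(z | s_{0:H-1}, a_{0:H-2})
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(state_dim + action_dim, 128, batch_first=True)
        self.head = nn.Linear(128, 2 * z_dim)
    def forward(self, states, actions):
        x = torch.cat([states[:, :-1], actions], dim=-1)   # align s_t with a_t
        h, _ = self.rnn(x)
        mu, log_std = self.head(h[:, -1]).chunk(2, dim=-1)
        return Normal(mu, log_std.exp())

class ClosedLoopSkillPolicy(nn.Module):    # pi_phi(a_t | s_t, z)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))
    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

q_w, pi_phi = SkillInference(), ClosedLoopSkillPolicy()
optim = torch.optim.Adam(list(q_w.parameters()) + list(pi_phi.parameters()), lr=1e-3)

def training_step(states, actions):        # states: [B, H, s_dim], actions: [B, H-1, a_dim]
    q = q_w(states, actions)
    z = q.rsample()                         # reparameterized sample of the skill embedding
    # behavioral cloning: reconstruct the H-1 demonstrated actions with the closed-loop policy
    recon = sum(((pi_phi(states[:, t], z) - actions[:, t]) ** 2).sum(-1).mean()
                for t in range(H - 1))
    # embedding regularization toward a unit Gaussian prior, weighted by beta
    reg = kl_divergence(q, Normal(torch.zeros(z_dim), torch.ones(z_dim))).sum(-1).mean()
    loss = recon + beta * reg
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```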
3.3.3 Demonstration-Guided RL with Learned Skills

Figure 3.3: We leverage prior experience data D and demonstration data D_demo. Our policy is guided by the task-specific skill posterior q_ζ(z|s) within the support of the demonstrations (green) and by the task-agnostic skill prior p(z|s) otherwise (red). The agent also receives a reward bonus for reaching states in the demonstration support.

To leverage the learned skills for accelerating demonstration-guided RL on a new task, we use a hierarchical policy learning scheme (see Figure 3.2, right): a high-level policy π_θ(z|s) outputs latent skill embeddings z that get decoded into actions using the pre-trained low-level skill policy. We freeze the weights of the skill policy during downstream training for simplicity. Our goal is to leverage the task-specific demonstrations to guide learning of the high-level policy on the new task.

In Section 3.3.1, we showed how SPiRL (Pertsch et al., 2020a) leverages a learned skill prior p(z|s) to guide exploration. However, this prior is task-agnostic, i.e., it encourages exploration of all skills that are generally meaningful to explore, independent of which task the agent is trying to solve. Even though SPiRL's objective makes learning with a large number of skills more efficient, it encourages the policy to explore many skills that are not relevant to the downstream task.

In this work, we propose to extend the skill-prior-guided approach and leverage target task demonstrations to additionally learn a task-specific skill distribution, which we call the skill posterior q_ζ(z|s) with parameters ζ (in contrast to the skill prior, it is conditioned on the target task, hence "posterior"). We train this skill posterior by using the pre-trained skill inference model q_ω(z|s_{0:H-1}, a_{0:H-2}) to extract the embeddings for the skills performed in the demonstration sequences (see Figure 3.2, middle):

$$\min_{\zeta}\; \mathbb{E}_{(s,a)\sim \mathcal{D}_{\text{demo}}}\; D_{\mathrm{KL}}\big(q_\omega(z \mid s_{0:H-1}, a_{0:H-2}),\, q_\zeta(z \mid s_0)\big). \qquad (3.3)$$

A naive approach for leveraging the skill posterior is to simply use it to replace the skill prior in Equation 3.1, i.e., to regularize the policy to stay close to the skill posterior in every state. However, the trained skill posterior is only accurate within the demonstration support ⌊D_demo⌋, because by definition it was only trained on demonstration sequences. Since |D_demo| ≪ |D| (see Figure 3.3), the skill posterior will often provide incorrect guidance in states outside the demonstrations' support.
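The pre-training of the two task-specific components introduced so far can be summarized in a short sketch, assuming PyTorch; the names (GaussianHead, q_zeta, discriminator) and hyperparameters are illustrative and continue the placeholder conventions of the previous snippet, not the released code. The first step distills the frozen skill inference network into the state-conditioned skill posterior (Equation 3.3); the second trains the demonstration discriminator D(s) as a binary classifier on demonstration vs. task-agnostic states.

```python
# Minimal sketch of skill posterior (Eq. 3.3) and discriminator pre-training, assuming PyTorch.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, z_dim = 30, 10

class GaussianHead(nn.Module):             # used here for q_zeta(z|s)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 2 * z_dim))
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return Normal(mu, log_std.exp())

q_zeta = GaussianHead()
discriminator = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1))
post_optim = torch.optim.Adam(q_zeta.parameters(), lr=1e-3)
disc_optim = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def posterior_step(demo_states, demo_actions, frozen_q_w):
    """Eq. 3.3: distill the frozen skill inference network's output on demo windows into q_zeta(z|s_0)."""
    with torch.no_grad():
        target = frozen_q_w(demo_states, demo_actions)      # q_w(z | s_{0:H-1}, a_{0:H-2})
    pred = q_zeta(demo_states[:, 0])                        # condition only on the window's first state
    loss = kl_divergence(target, pred).sum(-1).mean()
    post_optim.zero_grad(); loss.backward(); post_optim.step()

def discriminator_step(demo_states, task_agnostic_states):
    """Binary classification of demonstration support: label 1 for demo states, 0 otherwise."""
    logits = torch.cat([discriminator(demo_states), discriminator(task_agnostic_states)])
    labels = torch.cat([torch.ones(len(demo_states), 1), torch.zeros(len(task_agnostic_states), 1)])
    loss = bce(logits, labels)
    disc_optim.zero_grad(); loss.backward(); disc_optim.step()
```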
Instead, we propose to use a three-part objective that guides the policy to (1) follow the skill posterior within the support of the demonstrations, (2) follow the skill prior outside the demonstration support, and (3) reach states within the demonstration support. To determine whether a given state is within the support of the demonstration data, we train a learned discriminator D(s) as a binary classifier using samples from the demonstration and task-agnostic datasets, respectively.

In summary, our algorithm pre-trains the following components: (1) the low-level skill policy π_φ(a|s,z), (2) the task-agnostic skill prior p(z|s), (3) the task-specific skill posterior q_ζ(z|s), and (4) the learned discriminator D(s). Only the latter two need to be re-trained for a new target task. Once all components are pre-trained, we use the discriminator's output to weight terms in our objective that regularize the high-level policy π_θ(z|s) towards the skill prior or posterior. Additionally, we provide a reward bonus for reaching states which the discriminator classifies as being within the demonstration support. This results in the following term J_t for SkiLD's full RL objective:

$$J_t = \tilde{r}(s_t, z_t) - \underbrace{\alpha_q D_{\mathrm{KL}}\big(\pi_\theta(z_t \mid s_t),\, q_\zeta(z_t \mid s_t)\big)\cdot D(s_t)}_{\text{posterior regularization}} - \underbrace{\alpha D_{\mathrm{KL}}\big(\pi_\theta(z_t \mid s_t),\, p(z_t \mid s_t)\big)\cdot \big(1 - D(s_t)\big)}_{\text{prior regularization}},$$
$$\text{with}\quad \tilde{r}(s_t, z_t) = (1-\kappa)\cdot r(s_t, z_t) + \kappa\cdot \underbrace{\big[\log D(s_t) - \log\big(1 - D(s_t)\big)\big]}_{\text{discriminator reward}}. \qquad (3.4)$$

The weighting factor κ is a hyperparameter; α and α_q are either constant or tuned automatically via dual gradient descent (Haarnoja et al., 2018b). The discriminator reward follows common formulations used in adversarial imitation learning (Finn et al., 2016a; Fu et al., 2018; Zhu et al., 2018; Kostrikov et al., 2019).¹ Our formulation combines IRL-like and BC-like objectives by using learned rewards and trying to match the demonstration's skill distribution. For policy optimization, we use a modified version of the SPiRL algorithm (Pertsch et al., 2020a), which itself is based on Soft Actor-Critic (Haarnoja et al., 2018a). Concretely, we replace the environment reward with the discriminator-augmented reward and all prior divergence terms with our new, weighted prior-posterior-divergence terms from Equation 3.4 (for the full algorithm see appendix, Section A).

¹ We found that using the pre-trained discriminator weights led to stable training, but it is possible to perform full adversarial training by finetuning D(s) with rollouts from the downstream task training. We report results for initial experiments with discriminator finetuning in Section E and leave further investigation for future work.
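As a concrete illustration of Equation 3.4, the sketch below computes the discriminator-augmented reward and the discriminator-weighted divergence terms for a batch of states; it assumes PyTorch distribution objects for the policy, posterior, and prior, and all names and default coefficients are illustrative rather than the thesis implementation.

```python
# Minimal sketch of the per-step quantities of Eq. 3.4, assuming PyTorch; illustrative names only.
import torch
from torch.distributions import kl_divergence

def skild_objective_terms(reward, policy_dist, posterior_dist, prior_dist,
                          disc_logit, alpha=0.01, alpha_q=0.01, kappa=0.5):
    """All *_dist arguments are torch.distributions over the latent skill z at the current states."""
    d = torch.sigmoid(disc_logit)                                # D(s) in (0, 1)
    disc_reward = torch.log(d + 1e-8) - torch.log(1 - d + 1e-8)  # log D(s) - log(1 - D(s))
    r_tilde = (1 - kappa) * reward + kappa * disc_reward         # discriminator-augmented reward
    kl_post = kl_divergence(policy_dist, posterior_dist).sum(-1)
    kl_prior = kl_divergence(policy_dist, prior_dist).sum(-1)
    # Eq. 3.4: regularize toward the posterior where D(s) is high, toward the prior elsewhere.
    return r_tilde - alpha_q * kl_post * d - alpha * kl_prior * (1 - d)
```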
3.4 Experiments

In this chapter, we propose to leverage a large offline experience dataset for efficient demonstration-guided RL. We aim to answer the following questions: (1) Can the use of task-agnostic prior experience improve the efficiency of demonstration-guided RL? (2) Does the reuse of pre-trained skills reduce the number of required target-specific demonstrations? (3) In what scenarios does the combination of prior experience and demonstrations lead to the largest efficiency gains?

3.4.1 Experimental Setup and Comparisons

Figure 3.4: Left: Test environments, top to bottom: 2D maze navigation, robotic kitchen manipulation, and robotic office cleaning. Right: Target task performance vs. environment steps. By using task-agnostic experience, our approach leverages the demonstrations more efficiently than prior demonstration-guided RL approaches across all tasks. The comparison to SPiRL shows that demonstrations improve efficiency even for agents with access to a lot of prior experience.

To evaluate whether our method SkiLD can efficiently use task-agnostic data, we compare it to prior demonstration-guided RL approaches on three complex, long-horizon tasks: a 2D maze navigation task, a robotic kitchen manipulation task, and a robotic office cleaning task (see Figure 3.4, left).

Maze Navigation. We adapt the maze navigation task from Pertsch et al. (2020a) and increase task complexity by adding randomness to the agent's initial position. The agent needs to navigate through a maze for hundreds of time steps using planar velocity commands to receive a sparse binary reward upon reaching a fixed goal position. We collect 3000 task-agnostic trajectories using a motion planner that finds paths between randomly sampled start-goal pairs. For the target task we collect 5 demonstrations for an unseen start-goal pair.

Robot Kitchen Environment. We use the environment of Gupta et al. (2019) in which a 7-DOF robot arm needs to perform a sequence of four subtasks, such as opening the microwave or switching on the light, in the correct order. The agent observes a low-dimensional state representation and receives a binary reward upon completion of each consecutive subtask. We use 603 teleoperated sequences performing various subtask combinations (from Gupta et al. (2019)) as task-agnostic experience D and separate out a set of 20 demonstrations for one particular sequence of subtasks, which we define as our target task (see Figure 3.4, middle).

Robot Office Environment. A 5-DOF robot arm needs to clean an office environment by placing objects in their target bins or putting them in a drawer. It observes the poses of its end-effector and all objects in the scene and receives binary rewards for the completion of each subtask. We collect 2400 training trajectories by perturbing the objects' initial positions and performing random subtasks using scripted policies. We also collect 50 demonstrations for the unseen target task with unseen object locations and subtask sequence.

We compare our approach to multiple prior demonstration-guided RL approaches that represent the different classes of existing algorithms introduced in Section 3.2. In contrast to SkiLD, these approaches are not designed to leverage task-agnostic prior experience: BC + RL initializes a policy with behavioral cloning of the demonstrations, then continues to apply the BC loss while finetuning the policy with Soft Actor-Critic (SAC, Haarnoja et al. (2018a)), representative of (Rajeswaran et al., 2018; Nair et al., 2018). GAIL + RL (Zhu et al., 2018) combines rewards from the environment and adversarial imitation learning (GAIL, Ho and Ermon (2016)), and optimizes the policy using PPO (Schulman et al., 2017). Demo Replay initializes the replay buffer of an SAC agent with the demonstrations and uses them with prioritized replay during updates, representative of (Vecerik et al., 2017). We also compare our approach to RL-only methods to show the benefit of using demonstration data: SAC (Haarnoja et al., 2018a) is a state-of-the-art model-free RL algorithm; it uses neither offline experience nor demonstrations.
SPiRL (Pertsch et al., 2020a) extracts skills from task-agnostic experience and performs prior-guided RL on the target task (see Section 3.3.1).² Finally, Skill BC+RL combines skills learned from task-agnostic data with target task demonstrations: it encodes the demonstrations with the pre-trained skill encoder and runs BC for the high-level skill policy, then finetunes on the target task using SAC. For further details on the environments, data collection, and implementation, see appendix Section B.

² We train SPiRL with the closed-loop policy representation from Section 3.3.2 for fair comparison and better performance. For an empirical comparison of open and closed-loop skill representations in SPiRL, see Section C.

Figure 3.5: Visualization of our approach on the maze navigation task (visualization states collected by rolling out the skill prior). Left: the given demonstration trajectories; Middle left: output of the demonstration discriminator D(s) (the greener, the higher the predicted probability of a state being within the demonstration support; red indicates low probability). Middle right: policy divergence to the skill posterior and Right: divergence to the skill prior (blue indicates low and red high divergence). The discriminator accurately infers the demonstration support; the policy successfully follows the skill posterior only within the demonstration support and the skill prior otherwise.

3.4.2 Demonstration-Guided RL with Learned Skills

Maze Navigation. Prior demonstration-guided RL approaches struggle on the task (see Figure 3.4, right) since rewards are sparse and only five demonstrations are provided. With such small coverage, behavioral cloning of the demonstrations' primitive actions leads to brittle policies which are hard to finetune. The Replay agent improves over SAC without demonstrations and partly succeeds at the task, but learning is slow. The GAIL+RL approach is able to follow part of the demonstrated behavior, but fails to reach the final goal (see Figure B.1 for qualitative results). SPiRL and Skill BC+RL leverage task-agnostic data to learn to occasionally solve the task, but train slowly: SPiRL's learned, task-agnostic skill prior and Skill BC+RL's uniform skill prior during SAC finetuning encourage the exploration of many task-irrelevant skills.³ In contrast, our approach SkiLD leverages the task-specific skill posterior to quickly explore the relevant skills, leading to significant efficiency gains (see Figure 3.5 for qualitative analysis and Figure B.2 for a comparison of SkiLD vs. SPiRL exploration).

³ Performance of SPiRL differs from Pertsch et al. (2020a) due to increased task complexity, see Section B.4.

Robotic Manipulation. We show the performance comparison on the robotic manipulation tasks in Figure 3.4 (right).⁴ Both tasks are more challenging since they require precise control of a high-DOF manipulator. We find that approaches for demonstration-guided RL that do not leverage task-agnostic experience struggle to learn either of the tasks since following the demonstrations step-by-step is inefficient and prone to accumulating errors.
SPiRL, in contrast, is able to learn meaningful skills from the offline datasets, but struggles to explore the task-relevant skills and therefore learns slowly. Worse yet, the uniform skill prior used in Skill BC+RL's SAC finetuning is even less suited for the target task and leads the policy to deviate from the BC initialization early on in training, preventing the agent from learning the task altogether (for pure BC performance, see appendix, Figure B.6). Our approach, however, uses the learned skill posterior to guide the chaining of the extracted skills and thereby learns to solve the tasks efficiently, showing how SkiLD effectively combines task-agnostic and task-specific data for demonstration-guided RL.

⁴ For qualitative robot manipulation videos, see https://sites.google.com/view/skill-demo-rl.

3.4.3 Ablation Studies

Figure 3.6: Ablation studies. We test the performance of SkiLD for different sizes of the demonstration dataset |D_demo| on the maze navigation task (left) and ablate the components of our objective on the kitchen manipulation task (right).

In Figure 3.6 (left) we test the robustness of our approach to the number of demonstrations in the maze navigation task and compare to BC+RL, which we found to work best across different demonstration set sizes. Both approaches benefit from more demonstrations, but our approach is able to learn with far fewer demonstrations by using prior experience. While BC+RL learns each low-level action from the demonstrations, SkiLD merely learns to recombine skills it has already mastered using the offline data, thus requiring less dense supervision and fewer demonstrations. We also ablate the components of our RL objective on the kitchen task in Figure 3.6 (right). Removing the discriminator reward bonus ("no-GAIL") slows convergence since the agent lacks a dense reward signal. Naively replacing the skill prior in the SPiRL objective of Equation 3.1 with the learned skill posterior ("post-only") fails since the agent follows the skill posterior outside its support. Removing the skill posterior and optimizing a discriminator-bonus-augmented reward using SPiRL ("no-post") fails because the agent cannot efficiently explore the rich skill space. Finally, we show the efficacy of our approach in the pure imitation setting, without environment rewards, in appendix, Section E.

3.4.4 Robustness to Partial Demonstrations

Figure 3.7: Left: Robustness to partial demonstrations. SkiLD can leverage partial demonstrations by seamlessly integrating task-agnostic and task-specific datasets (see Section 3.4.4). Right: Analysis of data vs. task alignment. The benefit of using demonstrations in addition to prior experience diminishes if the prior experience is closely aligned with the target task (solid), but gains are high when data and task are not well-aligned (dashed).

Most prior approaches that aim to follow demonstrations of a target task assume that these demonstrations show the complete execution of the task. However, we can often encounter situations in which the demonstrations only show incomplete solutions, e.g., because the agent's and demonstration's initial states do not align or because we only have access to demonstrations for a subtask within a long-horizon task.
Thus, SkiLD is designed to handle such partial demonstrations: through the discriminator weighting, it relies on demonstrations only within their support and falls back to following the task-agnostic skill prior otherwise. It thus provides a robust framework that seamlessly integrates task-specific and task-agnostic data sources. We test this experimentally in the kitchen environment: we train SkiLD with partial demonstrations in which we remove one of the subskills. The results in Figure 3.7 (left) show that "SkiLD-Partial" is able to leverage the partial demonstrations to improve efficiency over SPiRL, which does not leverage demonstrations. Expectedly, using the full demonstrations in the SkiLD framework ("SkiLD-Full") leads to even higher learning efficiency.

3.4.5 Data Alignment Analysis

We aim to analyze in what scenarios the use of demonstrations in addition to task-agnostic experience is most beneficial. In particular, we evaluate how the alignment between the distribution of observed behaviors in the task-agnostic dataset and the target behaviors influences learning efficiency. We choose two different target tasks in the kitchen environment, one with good and one with bad alignment between the behavior distributions (see Section F), and compare our method, which uses demonstrations, to SPiRL, which only relies on the task-agnostic data.

In the well-aligned case (Figure 3.7, right, solid lines), we find that both approaches learn the task efficiently. Since the skill prior encourages effective exploration on the downstream task, the benefit of the additional demonstrations leveraged in our method is marginal. In contrast, if task-agnostic data and downstream task are not well-aligned (Figure 3.7, right, dashed), SPiRL struggles to learn the task since it cannot maximize task reward and minimize divergence from the mis-aligned skill prior at the same time. Our approach learns more reliably by encouraging the policy to reach demonstration-like states and then follow the skill posterior, which by design is well-aligned with the target task.

In summary, our analysis finds that approaches which leverage both task-agnostic data and demonstrations improve over methods that use either of the data sources alone across all tested tasks. We find that combining the data sources is particularly beneficial in two cases:

• Diverse Task-Agnostic Data. Demonstrations can focus exploration on task-relevant skills if the task-agnostic skill prior explores too large a set of skills (see Section 3.4.2).

• Mis-Aligned Task-Agnostic Data. Demonstrations can compensate for mis-alignment between task-agnostic data and target task by guiding exploration with the skill posterior instead of the mis-aligned prior.

3.5 Discussion

In this chapter, we proposed SkiLD, an approach for demonstration-guided RL that is able to leverage task-agnostic experience datasets and task-specific demonstrations for accelerated learning of unseen tasks. In three challenging environments, SkiLD learns new tasks more efficiently than both prior demonstration-guided RL approaches, which are unable to leverage task-agnostic data, and skill-based RL methods, which cannot effectively incorporate demonstrations. Future work should combine task-agnostic data and demonstrations for efficient learning in the real world and investigate domain-agnostic measures for data-task alignment to quantify the usefulness of prior experience for target tasks.
Chapter 4

Cross-Domain Transfer via Semantic Skill Imitation

4.1 Introduction

Figure 4.1: We address semantic imitation, which aims to leverage demonstrations from a source domain, e.g., human video demonstrations, to accelerate the learning of the same tasks in a different target domain, e.g., controlling a robotic manipulator in a simulated kitchen environment.

Consider a person imitating an expert in two scenarios: a beginner learning to play tennis, and a chef following a recipe for a new dish. In the former case, when mastering the basic skills of tennis, humans tend to imitate the precise arm movements demonstrated by the expert. In contrast, when operating in a familiar domain, such as a chef learning to cook a new dish, imitation happens at a higher level of abstraction. Instead of imitating individual movements, they follow high-level, semantically meaningful skills like "stir the mixture" or "turn on the oven". Such semantic skills generalize across environment layouts, and allow humans to follow demonstrations across substantially different environments.

Most works that leverage demonstrations in robotics imitate low-level actions. Demonstrations are typically provided by manually moving the robot (Sharma et al., 2018) or via teleoperation (Mandlekar et al., 2018). A critical challenge of this approach is scaling: demonstrations need to be collected in every new environment. On the other hand, imitation of high-level (semantic) skills has the promise of generalization: demonstrations can be collected in one kitchen and applied to any number of kitchens, eliminating the need to re-demonstrate in every new environment. Learning via imitation of high-level skills can lead to scalable and generalizable robot learning.

In this work, we present Semantic Transfer Accelerated RL (STAR), which accelerates RL using cross-domain demonstrations by leveraging semantic skills instead of low-level actions. We consider a setting with significantly different source and target environments. Figure 4.1 shows an example: a robot arm learns to do a kitchen manipulation task by following a visual human demonstration from a different (real-world) kitchen. An approach that follows the precise arm movements of the human will fail due to embodiment and environment differences. Yet, by following the demonstrated semantic skills like "open the microwave" and "turn on the stove", our approach can leverage demonstrations despite the domain differences.

Like the chef in the above example, we use prior experience for enabling this semantic transfer. We assume access to datasets of prior experience collected across many tasks, in both the source and target domains. From this data, we learn semantic skills like "open the microwave" or "turn on the stove". Next, we collect demonstrations of the task in the source domain and find "semantically similar" states in the target domain. Using this mapping, we learn a policy to follow the demonstrated semantic skills in semantically similar states in the target domain. We present results on two semantic imitation problems in simulation and on real-to-sim transfer from human videos.
In simulation, we test STAR in: (1) a maze navigation task across mazes of different layouts and (2) a sequence of kitchen tasks between two variations of the FrankaKitchen environment (Gupta et al., 2019). In both tasks, our approach matches the learning efficiency of methods with in-domain demonstrations, despite only using cross-domain demonstrations. Additionally, we show that a human demonstration video recorded within 3 minutes in a real-world kitchen can accelerate the learning of long-horizon manipulation tasks in the FrankaKitchen by hundreds of thousands of robot environment interactions.

In summary, our contributions are twofold: (1) we introduce STAR, an approach for cross-domain transfer via learned semantic skills, and (2) we show that STAR can leverage demonstrations across substantially differing domains to accelerate the learning of long-horizon tasks.

4.2 Related Work

Learning from demonstrations. Learning from Demonstrations (LfD, Argall et al. (2009)) is a popular method for learning robot behaviors using demonstrations of the target task, often collected by human operators. Common approaches include behavioral cloning (BC, Pomerleau (1989)) and adversarial imitation approaches (Ho and Ermon, 2016). A number of works have proposed approaches for combining these imitation objectives with reinforcement learning (Rajeswaran et al., 2018; Nair et al., 2018; Zhu et al., 2018; Peng et al., 2018). However, all of these approaches require demonstrations in the target domain, limiting their applicability to new domains. In contrast, our approach imitates the demonstrations' semantic skills and thus enables transfer across domains.

Skill-based Imitation. Using temporal abstraction via skills has a long tradition in hierarchical RL (Sutton et al., 1999b; Bacon et al., 2017; Nachum et al., 2018). Skills have also been used for the imitation of long-horizon tasks. Pertsch et al. (2021); Hakhamaneshi et al. (2022) learn skills from task-agnostic offline experience (Pertsch et al., 2020a; Ajay et al., 2021) and imitate demonstrated skills instead of primitive actions. But, since the learned skills do not capture semantic information, they require demonstrations in the target domain. Xu et al. (2018); Huang et al. (2019) divide long-horizon tasks into subroutines, but struggle if the two domains require a different sequence of subroutines, e.g., if skill pre-conditions are not met in the target environment. Our approach is robust to such mismatches without requiring demonstrations in the target domain.

Cross-Domain Imitation. Peng et al. (2020) assume a pre-specified mapping between source and target domain. Smith et al. (2019); Das et al. (2020) leverage offline experience to learn mappings, while Duan et al. (2017); Sharma et al. (2019); Yu et al. (2018b) rely on paired demonstrations. A popular goal is to leverage human videos for robot learning since they are easy to collect at scale. Sermanet et al. (2018); Chen et al. (2021) learn reward functions from human demonstrations and Schmeckpeper et al. (2020) add human experience to an RL agent's replay buffer, but they only consider short-horizon tasks and rely on environments being similar. Yu et al. (2018a) meta-learn cross-domain subroutines, but cannot handle different subroutines between source and target. Our approach imitates long-horizon tasks across domains without a pre-defined mapping and is robust to different semantic subroutines.

4.3 Problem Formulation

We define a source environment S and a target environment T.
In the source environment, we have N demonstrations τ^S_{1:N}, where each τ^S_i = {s^S_0, a^S_0, s^S_1, a^S_1, ...} is a sequence of states s^S and actions a^S. Our goal is to leverage these demonstrations to accelerate training of a policy π(s^T) in the target environment, acting on target states s^T and predicting actions a^T. π(s^T) maximizes the discounted target task reward $J^T = \mathbb{E}_{\pi}\big[\sum_{l=0}^{L-1} \gamma^l R(s^T_l, a^T_l)\big]$ for an episode of length L. We account for different state-action spaces (s^S, a^S) vs. (s^T, a^T) between source and target, but drop the superscript in the following sections, assuming that the context makes it clear whether we are addressing source or target states. In Section 4.4.3 we describe how we bridge this environment gap.

Without loss of generality, we assume that the source and target environments are substantially different; sequences of low-level actions that solve a task in the source environment do not lead to high reward in the target environment. In the following we will also use the term domain to refer to two environments with this property. Yet, we assume that the demonstrations show a set of semantic skills which, when followed in the target environment, can lead to task success. Here the term semantic skill refers to a high-level notion of skill, like "open the microwave" or "turn on the oven", which is independent of the environment-specific low-level actions required to perform it. We further assume that both source and target environment allow for the execution of the same set of semantic skills.

Semantic imitation requires an agent to understand the semantic skills performed in the demonstrations. We use task-agnostic datasets D^S and D^T in the source and target domains to extract such semantic skills. Each D^i consists of state-action trajectories collected across a diverse range of prior tasks, e.g., from previously trained policies or teleoperation, as is commonly assumed in prior work (Pertsch et al., 2020a; Ajay et al., 2021; Pertsch et al., 2021; Hakhamaneshi et al., 2022). We also assume discrete semantic skill annotations k_t ∈ K, denoting the skill being executed at time step t. These can be collected manually, but we demonstrate how to use pre-trained action recognition models as a more scalable alternative (Sec. 4.5.2).

4.4 Approach

Algorithm 2 STAR (Semantic Transfer Accelerated RL)
  Pre-train low-level policy π_l(a|s,k,z)  ▷ cf. Sec. 4.4.1
  Match source demos to target states  ▷ cf. Sec. 4.4.3
  Pre-train p_demo(k|s), p_TA(k|s), p_TA(z|s,k), D(s)  ▷ cf. Tab. C.1
  for each target train iteration do
    Collect online experience (s, k, z, R, s′)
    Update high-level policies with Eq. 4.3  ▷ cf. Alg. 5
  end for
  return trained high-level policies π_sem(k|s), π_lat(z|s,k)

Our approach STAR imitates demonstrations' semantic skills, instead of low-level actions, to enable cross-domain, semantic imitation. We use a two-layer hierarchical policy with a high level that outputs the semantic skill and a low level that executes the skill. We first describe our semantic skill representation, followed by the low-level and high-level policy learning. Algorithm 2 summarizes our approach.

4.4.1 Semantic Skill Representation

A skill is characterized by both its semantics, i.e., whether to open the microwave or turn on the stove, as well as the details of its low-level execution, e.g., at what angle to approach the microwave or where to grasp its door handle.
Thus, we represent skills via a low-level policyπ l (a|s,k,z) which is conditioned on the current environment state s, the semantic skill ID k and a latent variable z which captures the execution details. For example, when “turning on the stove”, a are the joint velocities, s is the robot and environment state, k is the semantic skill ID of this skill, and z captures 44 s 0 s 1 s 2 s 3 s 4 a 0 a 1 a 2 a 3 q(z|s,a,k) π l (a|s,k,z) s 2 s 0 s 1 s 3 s 4 k s a π(a|s) = In-Support Regularization = Out-of-Support Regularization = Pre-Trained & Frozen p demo (k|s) <latexit sha1_base64="y3NCKlQn/uVLDFB4qRqfCdKRMdM=">AAACAXicbVC7SgNBFJ2NrxhfqzaCzWAQYhN2RdAyaGMZwTwgWcPs5CYZMvtg5q4Yltj4KzYWitj6F3b+jZNkC008MHA451zu3OPHUmh0nG8rt7S8srqWXy9sbG5t79i7e3UdJYpDjUcyUk2faZAihBoKlNCMFbDAl9Dwh1cTv3EPSosovMVRDF7A+qHoCc7QSB37IL5rIzxg2oUgGpeGtG3SSPVJxy46ZWcKukjcjBRJhmrH/mp3I54EECKXTOuW68TopUyh4BLGhXaiIWZ8yPrQMjRkAWgvnV4wpsdG6dJepMwLkU7V3xMpC7QeBb5JBgwHet6biP95rQR7F14qwjhBCPlsUS+RFCM6qYN2hQKOcmQI40qYv1I+YIpxNKUVTAnu/MmLpH5adp2ye3NWrFxmdeTJITkiJeKSc1Ih16RKaoSTR/JMXsmb9WS9WO/Wxyyas7KZffIH1ucPNAWWsw==</latexit> <latexit sha1_base64="y3NCKlQn/uVLDFB4qRqfCdKRMdM=">AAACAXicbVC7SgNBFJ2NrxhfqzaCzWAQYhN2RdAyaGMZwTwgWcPs5CYZMvtg5q4Yltj4KzYWitj6F3b+jZNkC008MHA451zu3OPHUmh0nG8rt7S8srqWXy9sbG5t79i7e3UdJYpDjUcyUk2faZAihBoKlNCMFbDAl9Dwh1cTv3EPSosovMVRDF7A+qHoCc7QSB37IL5rIzxg2oUgGpeGtG3SSPVJxy46ZWcKukjcjBRJhmrH/mp3I54EECKXTOuW68TopUyh4BLGhXaiIWZ8yPrQMjRkAWgvnV4wpsdG6dJepMwLkU7V3xMpC7QeBb5JBgwHet6biP95rQR7F14qwjhBCPlsUS+RFCM6qYN2hQKOcmQI40qYv1I+YIpxNKUVTAnu/MmLpH5adp2ye3NWrFxmdeTJITkiJeKSc1Ih16RKaoSTR/JMXsmb9WS9WO/Wxyyas7KZffIH1ucPNAWWsw==</latexit> <latexit sha1_base64="y3NCKlQn/uVLDFB4qRqfCdKRMdM=">AAACAXicbVC7SgNBFJ2NrxhfqzaCzWAQYhN2RdAyaGMZwTwgWcPs5CYZMvtg5q4Yltj4KzYWitj6F3b+jZNkC008MHA451zu3OPHUmh0nG8rt7S8srqWXy9sbG5t79i7e3UdJYpDjUcyUk2faZAihBoKlNCMFbDAl9Dwh1cTv3EPSosovMVRDF7A+qHoCc7QSB37IL5rIzxg2oUgGpeGtG3SSPVJxy46ZWcKukjcjBRJhmrH/mp3I54EECKXTOuW68TopUyh4BLGhXaiIWZ8yPrQMjRkAWgvnV4wpsdG6dJepMwLkU7V3xMpC7QeBb5JBgwHet6biP95rQR7F14qwjhBCPlsUS+RFCM6qYN2hQKOcmQI40qYv1I+YIpxNKUVTAnu/MmLpH5adp2ye3NWrFxmdeTJITkiJeKSc1Ih16RKaoSTR/JMXsmb9WS9WO/Wxyyas7KZffIH1ucPNAWWsw==</latexit> <latexit sha1_base64="y3NCKlQn/uVLDFB4qRqfCdKRMdM=">AAACAXicbVC7SgNBFJ2NrxhfqzaCzWAQYhN2RdAyaGMZwTwgWcPs5CYZMvtg5q4Yltj4KzYWitj6F3b+jZNkC008MHA451zu3OPHUmh0nG8rt7S8srqWXy9sbG5t79i7e3UdJYpDjUcyUk2faZAihBoKlNCMFbDAl9Dwh1cTv3EPSosovMVRDF7A+qHoCc7QSB37IL5rIzxg2oUgGpeGtG3SSPVJxy46ZWcKukjcjBRJhmrH/mp3I54EECKXTOuW68TopUyh4BLGhXaiIWZ8yPrQMjRkAWgvnV4wpsdG6dJepMwLkU7V3xMpC7QeBb5JBgwHet6biP95rQR7F14qwjhBCPlsUS+RFCM6qYN2hQKOcmQI40qYv1I+YIpxNKUVTAnu/MmLpH5adp2ye3NWrFxmdeTJITkiJeKSc1Ih16RKaoSTR/JMXsmb9WS9WO/Wxyyas7KZffIH1ucPNAWWsw==</latexit> p TA (k|s) <latexit sha1_base64="vpQAmA6WDz/Qp9Ril18cg5Y6Mfw=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2AR6qbMiKDLqhuXFfqCdiyZNNOGZjJDckcsYxf+ihsXirj1N9z5N6btLLT1QOBwzrncm+PHgmtwnG8rt7S8srqWXy9sbG5t79i7ew0dJYqyOo1EpFo+0UxwyerAQbBWrBgJfcGa/vB64jfvmdI8kjUYxcwLSV/ygFMCRuraB/FdB9gDpLXLcWmIOyYLWJ907aJTdqbAi8TNSBFlqHbtr04voknIJFBBtG67TgxeShRwKti40Ek0iwkdkj5rGypJyLSXTu8f42Oj9HAQKfMk4Kn6eyIlodaj0DfJkMBAz3sT8T+vnUBw4aVcxgkwSWeLgkRgiPCkDNzjilEQI0MIVdzciumAKELBVFYwJbjzX14kjdOy65Td27Ni5SqrI48O0REqIRedowq6QVVURxQ9omf0it6sJ+vFerc+ZtGclc3soz+wPn8AN6iVjw==</latexit> <latexit 
sha1_base64="vpQAmA6WDz/Qp9Ril18cg5Y6Mfw=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2AR6qbMiKDLqhuXFfqCdiyZNNOGZjJDckcsYxf+ihsXirj1N9z5N6btLLT1QOBwzrncm+PHgmtwnG8rt7S8srqWXy9sbG5t79i7ew0dJYqyOo1EpFo+0UxwyerAQbBWrBgJfcGa/vB64jfvmdI8kjUYxcwLSV/ygFMCRuraB/FdB9gDpLXLcWmIOyYLWJ907aJTdqbAi8TNSBFlqHbtr04voknIJFBBtG67TgxeShRwKti40Ek0iwkdkj5rGypJyLSXTu8f42Oj9HAQKfMk4Kn6eyIlodaj0DfJkMBAz3sT8T+vnUBw4aVcxgkwSWeLgkRgiPCkDNzjilEQI0MIVdzciumAKELBVFYwJbjzX14kjdOy65Td27Ni5SqrI48O0REqIRedowq6QVVURxQ9omf0it6sJ+vFerc+ZtGclc3soz+wPn8AN6iVjw==</latexit> <latexit sha1_base64="vpQAmA6WDz/Qp9Ril18cg5Y6Mfw=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2AR6qbMiKDLqhuXFfqCdiyZNNOGZjJDckcsYxf+ihsXirj1N9z5N6btLLT1QOBwzrncm+PHgmtwnG8rt7S8srqWXy9sbG5t79i7ew0dJYqyOo1EpFo+0UxwyerAQbBWrBgJfcGa/vB64jfvmdI8kjUYxcwLSV/ygFMCRuraB/FdB9gDpLXLcWmIOyYLWJ907aJTdqbAi8TNSBFlqHbtr04voknIJFBBtG67TgxeShRwKti40Ek0iwkdkj5rGypJyLSXTu8f42Oj9HAQKfMk4Kn6eyIlodaj0DfJkMBAz3sT8T+vnUBw4aVcxgkwSWeLgkRgiPCkDNzjilEQI0MIVdzciumAKELBVFYwJbjzX14kjdOy65Td27Ni5SqrI48O0REqIRedowq6QVVURxQ9omf0it6sJ+vFerc+ZtGclc3soz+wPn8AN6iVjw==</latexit> <latexit sha1_base64="vpQAmA6WDz/Qp9Ril18cg5Y6Mfw=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2AR6qbMiKDLqhuXFfqCdiyZNNOGZjJDckcsYxf+ihsXirj1N9z5N6btLLT1QOBwzrncm+PHgmtwnG8rt7S8srqWXy9sbG5t79i7ew0dJYqyOo1EpFo+0UxwyerAQbBWrBgJfcGa/vB64jfvmdI8kjUYxcwLSV/ygFMCRuraB/FdB9gDpLXLcWmIOyYLWJ907aJTdqbAi8TNSBFlqHbtr04voknIJFBBtG67TgxeShRwKti40Ek0iwkdkj5rGypJyLSXTu8f42Oj9HAQKfMk4Kn6eyIlodaj0DfJkMBAz3sT8T+vnUBw4aVcxgkwSWeLgkRgiPCkDNzjilEQI0MIVdzciumAKELBVFYwJbjzX14kjdOy65Td27Ni5SqrI48O0REqIRedowq6QVVURxQ9omf0it6sJ+vFerc+ZtGclc3soz+wPn8AN6iVjw==</latexit> ⇡ sem (k|s) <latexit sha1_base64="ccNuFyzr3mzzwB0m2tA/vnk2QGE=">AAAB/XicbVDLSgMxFM34rPU1PnZugkWomzIjgi6LblxWsA9ox5JJ77ShSWZIMkIdir/ixoUibv0Pd/6NaTsLbT0QOJxzLvfmhAln2njet7O0vLK6tl7YKG5ube/sunv7DR2nikKdxjxWrZBo4ExC3TDDoZUoICLk0AyH1xO/+QBKs1jemVECgSB9ySJGibFS1z3sJOw+0yDG5SHu2KTB+rTrlryKNwVeJH5OSihHret+dXoxTQVIQznRuu17iQkyogyjHMbFTqohIXRI+tC2VBIBOsim14/xiVV6OIqVfdLgqfp7IiNC65EIbVIQM9Dz3kT8z2unJroMMiaT1ICks0VRyrGJ8aQK3GMKqOEjSwhVzN6K6YAoQo0trGhL8Oe/vEgaZxXfq/i356XqVV5HAR2hY1RGPrpAVXSDaqiOKHpEz+gVvTlPzovz7nzMoktOPnOA/sD5/AHZQJTP</latexit> <latexit sha1_base64="ccNuFyzr3mzzwB0m2tA/vnk2QGE=">AAAB/XicbVDLSgMxFM34rPU1PnZugkWomzIjgi6LblxWsA9ox5JJ77ShSWZIMkIdir/ixoUibv0Pd/6NaTsLbT0QOJxzLvfmhAln2njet7O0vLK6tl7YKG5ube/sunv7DR2nikKdxjxWrZBo4ExC3TDDoZUoICLk0AyH1xO/+QBKs1jemVECgSB9ySJGibFS1z3sJOw+0yDG5SHu2KTB+rTrlryKNwVeJH5OSihHret+dXoxTQVIQznRuu17iQkyogyjHMbFTqohIXRI+tC2VBIBOsim14/xiVV6OIqVfdLgqfp7IiNC65EIbVIQM9Dz3kT8z2unJroMMiaT1ICks0VRyrGJ8aQK3GMKqOEjSwhVzN6K6YAoQo0trGhL8Oe/vEgaZxXfq/i356XqVV5HAR2hY1RGPrpAVXSDaqiOKHpEz+gVvTlPzovz7nzMoktOPnOA/sD5/AHZQJTP</latexit> <latexit sha1_base64="ccNuFyzr3mzzwB0m2tA/vnk2QGE=">AAAB/XicbVDLSgMxFM34rPU1PnZugkWomzIjgi6LblxWsA9ox5JJ77ShSWZIMkIdir/ixoUibv0Pd/6NaTsLbT0QOJxzLvfmhAln2njet7O0vLK6tl7YKG5ube/sunv7DR2nikKdxjxWrZBo4ExC3TDDoZUoICLk0AyH1xO/+QBKs1jemVECgSB9ySJGibFS1z3sJOw+0yDG5SHu2KTB+rTrlryKNwVeJH5OSihHret+dXoxTQVIQznRuu17iQkyogyjHMbFTqohIXRI+tC2VBIBOsim14/xiVV6OIqVfdLgqfp7IiNC65EIbVIQM9Dz3kT8z2unJroMMiaT1ICks0VRyrGJ8aQK3GMKqOEjSwhVzN6K6YAoQo0trGhL8Oe/vEgaZxXfq/i356XqVV5HAR2hY1RGPrpAVXSDaqiOKHpEz+gVvTlPzovz7nzMoktOPnOA/sD5/AHZQJTP</latexit> <latexit 
sha1_base64="ccNuFyzr3mzzwB0m2tA/vnk2QGE=">AAAB/XicbVDLSgMxFM34rPU1PnZugkWomzIjgi6LblxWsA9ox5JJ77ShSWZIMkIdir/ixoUibv0Pd/6NaTsLbT0QOJxzLvfmhAln2njet7O0vLK6tl7YKG5ube/sunv7DR2nikKdxjxWrZBo4ExC3TDDoZUoICLk0AyH1xO/+QBKs1jemVECgSB9ySJGibFS1z3sJOw+0yDG5SHu2KTB+rTrlryKNwVeJH5OSihHret+dXoxTQVIQznRuu17iQkyogyjHMbFTqohIXRI+tC2VBIBOsim14/xiVV6OIqVfdLgqfp7IiNC65EIbVIQM9Dz3kT8z2unJroMMiaT1ICks0VRyrGJ8aQK3GMKqOEjSwhVzN6K6YAoQo0trGhL8Oe/vEgaZxXfq/i356XqVV5HAR2hY1RGPrpAVXSDaqiOKHpEz+gVvTlPzovz7nzMoktOPnOA/sD5/AHZQJTP</latexit> ⇡ lat (z|s,k) <latexit sha1_base64="c/Ra8OoU+P8GWDNB5OTzhF6v8xI=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUEFKIoIui25cVrAPaGKZTCft0MkkzNwINWTjr7hxoYhbP8Odf+P0sdDWAwOHc87lzj1BIrgGx/m2CkvLK6trxfXSxubW9o69u9fUcaooa9BYxKodEM0El6wBHARrJ4qRKBCsFQyvx37rgSnNY3kHo4T5EelLHnJKwEhd+8BL+H0mCOSVR+yZJGB9iocnXbvsVJ0J8CJxZ6SMZqh37S+vF9M0YhKoIFp3XCcBPyMKOBUsL3mpZgmhQ9JnHUMliZj2s8kBOT42Sg+HsTJPAp6ovycyEmk9igKTjAgM9Lw3Fv/zOimEl37GZZICk3S6KEwFhhiP28A9rhgFMTKEUMXNXzEdEEUomM5KpgR3/uRF0jyruk7VvT0v165mdRTRITpCFeSiC1RDN6iOGoiiHD2jV/RmPVkv1rv1MY0WrNnMPvoD6/MHhhWVrw==</latexit> <latexit sha1_base64="c/Ra8OoU+P8GWDNB5OTzhF6v8xI=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUEFKIoIui25cVrAPaGKZTCft0MkkzNwINWTjr7hxoYhbP8Odf+P0sdDWAwOHc87lzj1BIrgGx/m2CkvLK6trxfXSxubW9o69u9fUcaooa9BYxKodEM0El6wBHARrJ4qRKBCsFQyvx37rgSnNY3kHo4T5EelLHnJKwEhd+8BL+H0mCOSVR+yZJGB9iocnXbvsVJ0J8CJxZ6SMZqh37S+vF9M0YhKoIFp3XCcBPyMKOBUsL3mpZgmhQ9JnHUMliZj2s8kBOT42Sg+HsTJPAp6ovycyEmk9igKTjAgM9Lw3Fv/zOimEl37GZZICk3S6KEwFhhiP28A9rhgFMTKEUMXNXzEdEEUomM5KpgR3/uRF0jyruk7VvT0v165mdRTRITpCFeSiC1RDN6iOGoiiHD2jV/RmPVkv1rv1MY0WrNnMPvoD6/MHhhWVrw==</latexit> <latexit sha1_base64="c/Ra8OoU+P8GWDNB5OTzhF6v8xI=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUEFKIoIui25cVrAPaGKZTCft0MkkzNwINWTjr7hxoYhbP8Odf+P0sdDWAwOHc87lzj1BIrgGx/m2CkvLK6trxfXSxubW9o69u9fUcaooa9BYxKodEM0El6wBHARrJ4qRKBCsFQyvx37rgSnNY3kHo4T5EelLHnJKwEhd+8BL+H0mCOSVR+yZJGB9iocnXbvsVJ0J8CJxZ6SMZqh37S+vF9M0YhKoIFp3XCcBPyMKOBUsL3mpZgmhQ9JnHUMliZj2s8kBOT42Sg+HsTJPAp6ovycyEmk9igKTjAgM9Lw3Fv/zOimEl37GZZICk3S6KEwFhhiP28A9rhgFMTKEUMXNXzEdEEUomM5KpgR3/uRF0jyruk7VvT0v165mdRTRITpCFeSiC1RDN6iOGoiiHD2jV/RmPVkv1rv1MY0WrNnMPvoD6/MHhhWVrw==</latexit> <latexit sha1_base64="c/Ra8OoU+P8GWDNB5OTzhF6v8xI=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUEFKIoIui25cVrAPaGKZTCft0MkkzNwINWTjr7hxoYhbP8Odf+P0sdDWAwOHc87lzj1BIrgGx/m2CkvLK6trxfXSxubW9o69u9fUcaooa9BYxKodEM0El6wBHARrJ4qRKBCsFQyvx37rgSnNY3kHo4T5EelLHnJKwEhd+8BL+H0mCOSVR+yZJGB9iocnXbvsVJ0J8CJxZ6SMZqh37S+vF9M0YhKoIFp3XCcBPyMKOBUsL3mpZgmhQ9JnHUMliZj2s8kBOT42Sg+HsTJPAp6ovycyEmk9igKTjAgM9Lw3Fv/zOimEl37GZZICk3S6KEwFhhiP28A9rhgFMTKEUMXNXzEdEEUomM5KpgR3/uRF0jyruk7VvT0v165mdRTRITpCFeSiC1RDN6iOGoiiHD2jV/RmPVkv1rv1MY0WrNnMPvoD6/MHhhWVrw==</latexit> ⇡ l (a|s,k,z) <latexit sha1_base64="9oyZxgtvpK0SDTEixVXDmE96DWI=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARKpQyI4Iui25cVrAPaMeSSTNtaCYzJHeEOtSNv+LGhSJu/Qt3/o1pOwttPRA4nHMuN/f4seAaHOfbyi0tr6yu5dcLG5tb2zv27l5DR4mirE4jEamWTzQTXLI6cBCsFStGQl+wpj+8mvjNe6Y0j+QtjGLmhaQvecApASN17YNOzO9SMS4R3DE5wLqMh2X8cNK1i07FmQIvEjcjRZSh1rW/Or2IJiGTQAXRuu06MXgpUcCpYONCJ9EsJnRI+qxtqCQh0146vWCMj43Sw0GkzJOAp+rviZSEWo9C3yRDAgM9703E/7x2AsGFl3IZJ8AknS0KEoEhwpM6cI8rRkGMDCFUcfNXTAdEEQqmtIIpwZ0/eZE0TiuuU3FvzorVy6yOPDpER6iEXHSOquga1VAdUfSIntErerOerBfr3fqYRXNWNrOP/sD6/AFx+5WR</latexit> <latexit 
sha1_base64="9oyZxgtvpK0SDTEixVXDmE96DWI=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARKpQyI4Iui25cVrAPaMeSSTNtaCYzJHeEOtSNv+LGhSJu/Qt3/o1pOwttPRA4nHMuN/f4seAaHOfbyi0tr6yu5dcLG5tb2zv27l5DR4mirE4jEamWTzQTXLI6cBCsFStGQl+wpj+8mvjNe6Y0j+QtjGLmhaQvecApASN17YNOzO9SMS4R3DE5wLqMh2X8cNK1i07FmQIvEjcjRZSh1rW/Or2IJiGTQAXRuu06MXgpUcCpYONCJ9EsJnRI+qxtqCQh0146vWCMj43Sw0GkzJOAp+rviZSEWo9C3yRDAgM9703E/7x2AsGFl3IZJ8AknS0KEoEhwpM6cI8rRkGMDCFUcfNXTAdEEQqmtIIpwZ0/eZE0TiuuU3FvzorVy6yOPDpER6iEXHSOquga1VAdUfSIntErerOerBfr3fqYRXNWNrOP/sD6/AFx+5WR</latexit> <latexit sha1_base64="9oyZxgtvpK0SDTEixVXDmE96DWI=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARKpQyI4Iui25cVrAPaMeSSTNtaCYzJHeEOtSNv+LGhSJu/Qt3/o1pOwttPRA4nHMuN/f4seAaHOfbyi0tr6yu5dcLG5tb2zv27l5DR4mirE4jEamWTzQTXLI6cBCsFStGQl+wpj+8mvjNe6Y0j+QtjGLmhaQvecApASN17YNOzO9SMS4R3DE5wLqMh2X8cNK1i07FmQIvEjcjRZSh1rW/Or2IJiGTQAXRuu06MXgpUcCpYONCJ9EsJnRI+qxtqCQh0146vWCMj43Sw0GkzJOAp+rviZSEWo9C3yRDAgM9703E/7x2AsGFl3IZJ8AknS0KEoEhwpM6cI8rRkGMDCFUcfNXTAdEEQqmtIIpwZ0/eZE0TiuuU3FvzorVy6yOPDpER6iEXHSOquga1VAdUfSIntErerOerBfr3fqYRXNWNrOP/sD6/AFx+5WR</latexit> <latexit sha1_base64="9oyZxgtvpK0SDTEixVXDmE96DWI=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARKpQyI4Iui25cVrAPaMeSSTNtaCYzJHeEOtSNv+LGhSJu/Qt3/o1pOwttPRA4nHMuN/f4seAaHOfbyi0tr6yu5dcLG5tb2zv27l5DR4mirE4jEamWTzQTXLI6cBCsFStGQl+wpj+8mvjNe6Y0j+QtjGLmhaQvecApASN17YNOzO9SMS4R3DE5wLqMh2X8cNK1i07FmQIvEjcjRZSh1rW/Or2IJiGTQAXRuu06MXgpUcCpYONCJ9EsJnRI+qxtqCQh0146vWCMj43Sw0GkzJOAp+rviZSEWo9C3yRDAgM9703E/7x2AsGFl3IZJ8AknS0KEoEhwpM6cI8rRkGMDCFUcfNXTAdEEQqmtIIpwZ0/eZE0TiuuU3FvzorVy6yOPDpER6iEXHSOquga1VAdUfSIntErerOerBfr3fqYRXNWNrOP/sD6/AFx+5WR</latexit> p TA (z|s,k) <latexit sha1_base64="c22fNTx68cHPMust/FGx9JuE3es=">AAACBHicbVC7SgNBFJ2Nrxhfq5ZpBoMQQcKuCFpGbSwj5AXZNcxOZpMhsw9m7opx2cLGX7GxUMTWj7Dzb5w8Ck08MHA451zu3OPFgiuwrG8jt7S8srqWXy9sbG5t75i7e00VJZKyBo1EJNseUUzwkDWAg2DtWDISeIK1vOHV2G/dMal4FNZhFDM3IP2Q+5wS0FLXLMa3qQPsHtL6RZaVH7Cj04DVMR4edc2SVbEmwIvEnpESmqHWNb+cXkSTgIVABVGqY1sxuCmRwKlgWcFJFIsJHZI+62gakoApN50ckeFDrfSwH0n9QsAT9fdESgKlRoGnkwGBgZr3xuJ/XicB/9xNeRgnwEI6XeQnAkOEx43gHpeMghhpQqjk+q+YDogkFHRvBV2CPX/yImmeVGyrYt+clqqXszryqIgOUBnZ6AxV0TWqoQai6BE9o1f0ZjwZL8a78TGN5ozZzD76A+PzB8bql38=</latexit> <latexit sha1_base64="c22fNTx68cHPMust/FGx9JuE3es=">AAACBHicbVC7SgNBFJ2Nrxhfq5ZpBoMQQcKuCFpGbSwj5AXZNcxOZpMhsw9m7opx2cLGX7GxUMTWj7Dzb5w8Ck08MHA451zu3OPFgiuwrG8jt7S8srqWXy9sbG5t75i7e00VJZKyBo1EJNseUUzwkDWAg2DtWDISeIK1vOHV2G/dMal4FNZhFDM3IP2Q+5wS0FLXLMa3qQPsHtL6RZaVH7Cj04DVMR4edc2SVbEmwIvEnpESmqHWNb+cXkSTgIVABVGqY1sxuCmRwKlgWcFJFIsJHZI+62gakoApN50ckeFDrfSwH0n9QsAT9fdESgKlRoGnkwGBgZr3xuJ/XicB/9xNeRgnwEI6XeQnAkOEx43gHpeMghhpQqjk+q+YDogkFHRvBV2CPX/yImmeVGyrYt+clqqXszryqIgOUBnZ6AxV0TWqoQai6BE9o1f0ZjwZL8a78TGN5ozZzD76A+PzB8bql38=</latexit> <latexit sha1_base64="c22fNTx68cHPMust/FGx9JuE3es=">AAACBHicbVC7SgNBFJ2Nrxhfq5ZpBoMQQcKuCFpGbSwj5AXZNcxOZpMhsw9m7opx2cLGX7GxUMTWj7Dzb5w8Ck08MHA451zu3OPFgiuwrG8jt7S8srqWXy9sbG5t75i7e00VJZKyBo1EJNseUUzwkDWAg2DtWDISeIK1vOHV2G/dMal4FNZhFDM3IP2Q+5wS0FLXLMa3qQPsHtL6RZaVH7Cj04DVMR4edc2SVbEmwIvEnpESmqHWNb+cXkSTgIVABVGqY1sxuCmRwKlgWcFJFIsJHZI+62gakoApN50ckeFDrfSwH0n9QsAT9fdESgKlRoGnkwGBgZr3xuJ/XicB/9xNeRgnwEI6XeQnAkOEx43gHpeMghhpQqjk+q+YDogkFHRvBV2CPX/yImmeVGyrYt+clqqXszryqIgOUBnZ6AxV0TWqoQai6BE9o1f0ZjwZL8a78TGN5ozZzD76A+PzB8bql38=</latexit> <latexit 
sha1_base64="c22fNTx68cHPMust/FGx9JuE3es=">AAACBHicbVC7SgNBFJ2Nrxhfq5ZpBoMQQcKuCFpGbSwj5AXZNcxOZpMhsw9m7opx2cLGX7GxUMTWj7Dzb5w8Ck08MHA451zu3OPFgiuwrG8jt7S8srqWXy9sbG5t75i7e00VJZKyBo1EJNseUUzwkDWAg2DtWDISeIK1vOHV2G/dMal4FNZhFDM3IP2Q+5wS0FLXLMa3qQPsHtL6RZaVH7Cj04DVMR4edc2SVbEmwIvEnpESmqHWNb+cXkSTgIVABVGqY1sxuCmRwKlgWcFJFIsJHZI+62gakoApN50ckeFDrfSwH0n9QsAT9fdESgKlRoGnkwGBgZr3xuJ/XicB/9xNeRgnwEI6XeQnAkOEx43gHpeMghhpQqjk+q+YDogkFHRvBV2CPX/yImmeVGyrYt+clqqXszryqIgOUBnZ6AxV0TWqoQai6BE9o1f0ZjwZL8a78TGN5ozZzD76A+PzB8bql38=</latexit> Figure 4.2: Model overview for pre-training (left) and target task learning (right). We pre-train a semantic skill policyπ l (grey) and use it to decode actions from the learned high-level policies π sem andπ lat (blue and yellow) during target task learning. See training details in the main text. the robot hand orientation as it interacts with the stove. A single tuple(k,z) represents a sequence of H steps, since such temporal abstraction facilitates long-horizon imitation (Pertsch et al., 2021). We train our model as a conditional variational autoencoder (V AE) (Sohn et al., 2015) over a sequence of actions given a state and semantic skill ID. Thus, the latent variable z represents all information required to reconstruct a 0:H− 1 that is not contained in the skill ID, i.e., information about how to execute the semantic skill. Figure 4.2, left depicts the training setup forπ l . We randomly sample an H-step state-action subsequence(s 0:H ,a 0:H− 1 ) fromD T . An inference network q(z|s,a,k) encodes the sequence into a latent representation z conditioned on the semantic skill ID k at the first time step. k and z are passed toπ l , which reconstructs the sampled actions. Our training objective is a standard conditional V AE objective that combines a reconstruction and a prior regularization term: L π l =E q H− 1 ∏ t=0 logπ l (a t |s t ,k,z) | {z } reconstruction − βD KL q(z|s 0:H ,a 0:H− 1 ,k), p(z) | {z } prior regularization . (4.1) Here D KL denotes the Kullback-Leibler divergence. We use a simple uniform Gaussian prior p(z) and a weighting factorβ for the regularization objective (Higgins et al., 2017). The semantic skill ID k is pre-defined, discrete and labelled, while the latent z is learned and continuous. In this way, our formulation captures discrete aspects of manipulation skills (open a microwave vs. turn 45 on a stove) while being able to continuously modulate each semantic skill (e.g., different ways of approaching the microwave). 4.4.2 Semantic Transfer Accelerated RL After pre-training the low-level policy π l (a|s,k,z), we learn the high-level policy using the source domain demonstrations. Concretely, we train a policyπ h (k,z|s) that predicts tuples(k,z) which get executed viaπ l . Note that unlike prior work (Pertsch et al., 2021), our high-level policy outputs both, the semantic skill k and the low-level execution latent z. It is thus able to choose which semantic skill to execute and tailor its execution to the target domain. Cross-domain demonstrations solely guide the semantic skill choice, since the low-level execution might vary between source and target domains. Thus, we factorize π h into a semantic sub-policy π sem (k|s) and a latent, non-semantic sub-policyπ lat (z|s,k): π(a|s)=π l (a|s,k,z) | {z } skill policy · π lat (z|s,k)π sem (k|s) | {z } high-level policyπ h (k,z|s) . (4.2) Intuitively, this can be thought of as first deciding what skill to execute (e.g., open the microwave), followed by how to execute it. 
We pre-train multiple models via supervised learning for training π_h: (1) two semantic skill priors p_demo(k|s) and p_TA(k|s), trained to infer the semantic skill annotations from the demonstrations and the task-agnostic dataset D_T respectively, (2) a task-agnostic prior p_TA(z|s, k) over the latent skill variable z, trained to match the output of the inference network on D_T, and (3) a discriminator D(s), trained to classify whether a state is part of the demonstration trajectories. We summarize all pre-trained components and their supervised training objectives in the appendix, Table C.1. We provide an overview of our semantic imitation architecture and the used regularization terms in Figure 4.2, right.

We build on the idea of weighted policy regularization with a learned demonstration support estimator from Pertsch et al. (2021) (see Chapter 3). We regularize the high-level semantic policy π_sem (blue) towards the demonstration skill distribution p_demo(k|s) when D(s) classifies the current state as part of the demonstrations (green). For states which D(s) classifies as outside the demonstration support, we regularize π_sem towards the task-agnostic prior p_TA(k|s) (red). We always regularize the non-semantic sub-policy π_lat(z|s, k) (yellow) towards the task-agnostic prior p_TA(z|s, k), since execution-specific information cannot be transferred across domains. The overall optimization objective for π_h is:

$$\mathbb{E}_{\pi_h}\Big[\tilde{r}(s,a) \;-\; \alpha_q \underbrace{D_{KL}\big(\pi_{sem}(k|s),\, p_{demo}(k|s)\big)\cdot D(s)}_{\text{demonstration regularization}} \;-\; \alpha_p \underbrace{D_{KL}\big(\pi_{sem}(k|s),\, p_{TA}(k|s)\big)\cdot \big(1 - D(s)\big)}_{\text{task-agnostic semantic prior regularization}} \;-\; \alpha_l \underbrace{D_{KL}\big(\pi_{lat}(z|s,k),\, p_{TA}(z|s,k)\big)}_{\text{task-agnostic execution prior regularization}}\Big]. \quad (4.3)$$

α_q, α_p and α_l are either fixed or automatically tuned via dual gradient descent. We augment the target task reward using the discriminator D(s) to encourage the policy to reach states within the demonstration support: $\tilde{r}(s,a) = (1-\kappa)\cdot R(s,a) + \kappa\cdot\big(\log D(s) - \log(1 - D(s))\big)$. In the setting with no target environment rewards (pure imitation learning), we rely solely on this discriminator reward for policy training (Section C). For a summary of the full procedure, see Algorithm 5.

The final challenge is that the discriminator D(s) and the prior p_demo(k|s) are trained on states from the source domain, but need to be applied to the target domain. Since the domains differ substantially, we cannot expect the pre-trained networks to generalize. Instead, we need to explicitly bridge the state domain gap, as described next.
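To make the weighted regularization of Eq. 4.3 and the discriminator-augmented reward concrete, the snippet below sketches how the per-state loss terms could be assembled. It is a simplified, hedged illustration (the distribution objects, batching, and the surrounding RL learner are assumptions): D(s) smoothly switches the semantic sub-policy between the demonstration prior and the task-agnostic prior, while the execution sub-policy is always pulled towards the task-agnostic latent prior.

```python
import torch
from torch.distributions import kl_divergence

def policy_regularization(pi_sem_dist, pi_lat_dist, p_demo_dist, p_ta_k_dist,
                          p_ta_z_dist, d_s, alpha_q, alpha_p, alpha_l):
    """Regularization terms of Eq. 4.3 for a batch of states.

    All *_dist arguments are torch.distributions objects evaluated at the
    current states; d_s is the discriminator output D(s) in [0, 1].
    """
    demo_reg = kl_divergence(pi_sem_dist, p_demo_dist) * d_s            # on demo support
    ta_sem_reg = kl_divergence(pi_sem_dist, p_ta_k_dist) * (1.0 - d_s)  # off demo support
    exec_reg = kl_divergence(pi_lat_dist, p_ta_z_dist)                  # always applied
    return alpha_q * demo_reg + alpha_p * ta_sem_reg + alpha_l * exec_reg

def augmented_reward(env_reward, d_s, kappa=0.5, eps=1e-6):
    """Discriminator-augmented reward r~(s, a)."""
    disc_reward = torch.log(d_s + eps) - torch.log(1.0 - d_s + eps)
    return (1.0 - kappa) * env_reward + kappa * disc_reward
```

The regularization terms would be subtracted from the policy objective in the same way entropy bonuses are handled in SAC-style learners; setting κ = 1 recovers the pure imitation setting without environment rewards.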
4.4.3 Cross-Domain State Matching

Our goal is to find semantically similar states between the source and the target environment. These are states with similar distributions over likely semantic skills. For example, if the agent's hand is reaching for the handle of a closed microwave, the probability for the skill "open microwave" is high, while the probability for other skills, e.g., "turn on stove", is low. Crucially, this is true independent of the domain and independent of whether, e.g., a human or a robot is executing the action. Thus, we can use the skill prior distributions to find semantically similar states.

Figure 4.3: State matching between source and target domain. For every source domain state from the demonstrations, we compute the task-agnostic semantic skill distribution p_TA(k|s) and find the target domain state with the most similar semantic skill distribution from the task-agnostic dataset D_T. We then relabel the demonstrations with these matched states from the target domain.

Following this intuition, we find corresponding states based on the similarity between the task-agnostic semantic skill prior distributions p_TA(k|s). We illustrate an example in Figure 4.3: for a given source demonstration state s^S with high likelihood of opening the microwave, we find a target domain state s^T that has high likelihood of opening the microwave, by minimizing the symmetric KL divergence between the task-agnostic skill distributions (we omit (·)_TA for brevity):

$$\min_{s^T \in \mathcal{D}_T} \; D_{KL}\big(p^T(k|s^T),\, p^S(k|s^S)\big) + D_{KL}\big(p^S(k|s^S),\, p^T(k|s^T)\big). \quad (4.4)$$

In practice, states can be matched incorrectly when the task-agnostic prior assigns one skill much higher probability than the others. In such states, the divergence in Equation 4.4 is dominated by one skill and others are ignored, causing matching errors. Using a state's temporal context can result in more robust correspondences by reducing the influence of high-likelihood skills in any single state. We compute an aggregated skill distribution φ(k|s) using a temporal window around the current state:

$$\phi(k|s_t) = \frac{1}{Z(s)} \Big( \sum_{i=t}^{T} \gamma_+^{\,i}\, p(k|s_i) \;+\; \sum_{j=1}^{t-1} \gamma_-^{\,t-j}\, p(k|s_{t-j}) \Big). \quad (4.5)$$

Here, γ_+, γ_− ∈ [0, 1] determine the forward and backward horizon of the aggregate skill distribution, and Z(s) ensures that the aggregate probability distribution sums to one. Instead of p_TA in Equation 4.4, we use φ(k|s). By matching all source-domain demonstration states to states in the target domain via φ(k|s), we create a proxy dataset of target state demonstrations, which we use to pre-train the models p_demo(k|s) and D(s). Once trained, we use them for training the high-level policy via Equation 4.3.
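The matching step in Eqs. 4.4 and 4.5 amounts to a nearest-neighbor search under a symmetric KL distance between temporally aggregated skill distributions. The following is a minimal NumPy sketch under simplifying assumptions: per-state skill distributions are assumed to be precomputed for both domains, the target dataset is searched by brute force rather than with an approximate index, and window-relative discount exponents are used, which may differ in detail from the thesis' indexing.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-8):
    # Symmetric KL between two categorical skill distributions (Eq. 4.4).
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def aggregate(skill_probs, t, gamma_fwd=0.9, gamma_bwd=0.9):
    # Temporal aggregation of per-state skill distributions around step t (Eq. 4.5).
    T = len(skill_probs)
    phi = np.zeros_like(skill_probs[0])
    for i in range(t, T):              # forward window
        phi += gamma_fwd ** (i - t) * skill_probs[i]
    for j in range(1, t + 1):          # backward window
        phi += gamma_bwd ** j * skill_probs[t - j]
    return phi / phi.sum()             # Z(s) normalization

def match_state(source_phi, target_phis):
    # Index of the target state whose aggregated skill distribution is closest.
    dists = [symmetric_kl(target_phi, source_phi) for target_phi in target_phis]
    return int(np.argmin(dists))
```

Running this matcher over every demonstration state yields the proxy dataset of target-domain states used to pre-train p_demo(k|s) and D(s).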
4.5 Experiments

Figure 4.4: We evaluate on three pairs of source (top) and target (bottom) environments. Left: maze navigation. The agent needs to follow a sequence of colored rooms (red path), but the maze layout changes substantially between source and target domains. Middle: kitchen manipulation. A robotic arm executes a sequence of skills, but the layout of the kitchens differs. Right: Same as before, but with human demonstrations from a real-world kitchen.

Our experiments are designed to answer the following questions: (1) Can we leverage demonstrations across domains to accelerate learning via semantic imitation? (2) Can we use semantic imitation to teach a robot a new task from real-world videos of humans performing the task? (3) Is our approach robust to missing skills in the demonstrations? We test semantic imitation across two simulated maze and kitchen environments, as well as from real-world videos of humans to a simulated robot. Our results show that our approach can accelerate learning from cross-domain demonstrations, even with a real-to-sim gap.

4.5.1 Cross-Domain Imitation in Simulation

We first test our approach, STAR, in two simulated settings: a maze navigation and a robot kitchen manipulation task (see Figure 4.4, left & middle). In the maze navigation task, both domains have corresponding rooms, indicated by their color in Figure 4.4. The agent needs to follow a sequence of semantic skills like "go to red room", "go to green room", etc. In the kitchen manipulation task, a Franka arm tackles long-horizon manipulation tasks in a simulated kitchen (Gupta et al., 2019). We define 7 semantic skills, like "open the microwave" or "turn on the stove", in the source and target environments. In both environments we collect demonstrations in the source domain, and task-agnostic datasets in both the source and target domains, using motion planners and human teleoperation respectively. For further details on action and observation spaces, rewards, and data collection, see Sec. B.4.

Figure 4.5: Left: Performance on the simulated semantic imitation tasks. STAR matches the performance of the oracle, SkiLD, which has access to target domain demonstrations, and outperforms both SPiRL, which does not use demonstrations, and SkillSeq, which follows the demonstrated semantic skills sequentially. Right: Ablations in the kitchen environment, see main text for details.

We compare our approach to multiple prior skill-based RL approaches with and without demonstration guidance: SPiRL (Pertsch et al., 2020a) learns skills from D_T and then trains a high-level policy over skills; BC+RL (Rajeswaran et al., 2018; Nair et al., 2018) pre-trains with behavioral cloning and finetunes with SAC (Haarnoja et al., 2018a); SkillSeq, similar to Xu et al. (2018), sequentially executes the ground truth sequence of semantic skills as demonstrated; SkiLD (Pertsch et al., 2021) is an oracle with access to demonstrations in the target domain and follows them using learned skills. For more details on the implementation of our approach and all comparisons, see the appendix, Sections B.1-B.3.

Figure 4.5, left, compares the performance of all approaches in both tasks. BC+RL is unable to leverage the cross-domain demonstrations and makes no progress on the task. SPiRL is able to learn the kitchen manipulation task, but requires many more environment interactions to reach the same performance as our approach. SkillSeq succeeds in approximately 20% of the maze episodes and solves on average 3 out of 4 subtasks in the kitchen manipulation environment after fine-tuning. The mixed success is due to inaccuracies in the execution of the skill policies, since SkillSeq follows the ground truth sequence of high-level skills. Our approach, STAR, can use cross-domain demonstrations to match the learning efficiency of SkiLD (oracle), which has access to target domain demonstrations. This shows that our approach is effective at extracting useful information from cross-domain demonstrations. During downstream task training of the high-level semantic and execution policies, our approach can fix both errors in the high-level skill plan and errors in the low-level skill execution. The ability to jointly adapt high-level and low-level policies and, e.g., react to failures in the low-level policy rather than following a fixed high-level plan is crucial for good performance on long-horizon tasks. We find that this trend holds even in the "pure" imitation learning (IL) setting without environment rewards, where we rely solely on the learned discriminator reward to guide learning (see appendix, Section C for detailed results). Thus, STAR can be used both as a demonstration-guided RL algorithm and for cross-domain imitation learning. Qualitative results can be viewed at https://tinyurl.com/star-rl and in Figure C.1.

To study the different components of our approach, we run ablations in the FrankaKitchen environment (Fig. 4.5, right). Removing the discriminator-based weighting for the demonstration regularization (-D-weight) (Eq. 4.3)
or removing the demonstration regularization altogether (-DemoReg) leads to poor performance. In contrast, removing the discriminator-based dense reward (-D-reward) or the temporal aggregation during matching (-TempAgg) affects learning speed but has the same asymptotic performance. Finally, a model without the latent variable z (-z) cannot model the diversity of skill executions in the data; the resulting skills are too imprecise to learn long-horizon tasks. We show qualitative examples of the effect of varying matching window sizes [γ_−, γ_+] on the project website: https://tinyurl.com/star-rl.

4.5.2 Imitation from Human Demonstrations

In this section we ask: can our approach be used to leverage human video demonstrations for teaching new tasks to robots? Imitating human demonstrations presents a larger challenge since it requires bridging domain differences that span observation spaces (from images in the real world to low-dimensional states in simulation), agent morphologies (from a bimanual human to a 7-DOF robot arm), and environments (from the real world to a simulated robotic environment). To investigate this question, we collect 20 human video demonstrations in a real-world kitchen, which demonstrate a task the robotic agent needs to learn in the target simulated domain. Instead of collecting a large, task-agnostic dataset in the human source domain and manually annotating semantic skill labels, we demonstrate a more scalable alternative: we use an action recognition model, pre-trained on the EPIC Kitchens dataset (Damen et al., 2021), zero-shot to predict semantic skill distributions on the human demonstration videos. We define a mapping from the 97 verb and 300 noun classes in EPIC Kitchens to the skills present in the target domain and then use our approach as described in Section 4.4.2, using the EPIC skill distributions as the task-agnostic skill prior p_TA(k|s). For data collection details, see Section B.4.

Figure 4.6: Semantic imitation from human demonstrations. Left: Qualitative state matching results. The top row displays frames subsampled from a task demonstration in the human kitchen source domain. The bottom row visualizes the states matched to the source frames via the procedure described in Section 4.4.3. The matched states represent corresponding semantic scenes in which the agent, e.g., opens the microwave, turns on the stove, or opens the cabinet. Right: Quantitative results on the kitchen manipulation task from human video demonstrations.

We visualize qualitative matching results between the domains in Figure 4.6, left. We successfully match frames to the corresponding semantic states in the target domain. In Figure 4.6, right, we show that this leads to successful semantic imitation of the human demonstrations. Our approach, STAR, with EPIC Kitchens auto-generated skill distributions is able to reach the same asymptotic performance as the oracle approach that has access to target domain demonstrations, with only slightly reduced learning speed. It also outperforms the SkillSeq and SPiRL baselines (for qualitative results see https://tinyurl.com/star-rl).

To recap: for this experiment we did not collect a large, task-agnostic human dataset and we did not manually annotate any human videos. Collecting a few human demonstrations in an unseen kitchen was sufficient to substantially accelerate learning of the target task on the robot in simulation.
This demonstrates one avenue for scaling robot learning by (1) learning from easy-to-collect human video demonstrations and (2) using pre-trained skill prediction models to bridge the domain gap.

4.5.3 Robustness to Noisy Demonstrations and Labels

Figure 4.7: Semantic imitation with missing skills in the demonstrations. Our approach STAR still learns the full task faster than learning without demonstrations (SPiRL), while SkillSeq gets stuck at the missing skill.

In realistic scenarios agents often need to cope with noisy demonstration data, e.g., with partial demonstrations or faulty labels. Thus, we test STAR's ability to handle such noise. First, we test imitation from partial demonstrations with missing subskills. These commonly occur when there are large differences between source and target domain; e.g., the demonstration domain might already have a pot on the stove and start with "turn on the stove", but in the target domain we need to first place the pot on the stove. We test this in the simulated kitchen tasks by dropping individual subskills from the demonstrations ("w/o Task i" in Figure 4.7). Figure 4.7 shows that the SkillSeq approach struggles with such noise: it gets stuck whenever the corresponding skill is missing in the demonstration. In contrast, STAR can leverage demonstrations that are lacking complete subskills and still learn faster than the no-demonstration baseline SPiRL. When a skill is missing, the STAR agent finds itself off the demonstration support. Then the objective in Equation 4.3 regularizes the policy towards the task-agnostic skill prior, encouraging the agent to explore until it finds its way (back) to the demonstration support. This allows our method to bridge "holes" in the demonstrations. We also test STAR's robustness to noisy semantic skill labels in Section D. We find that STAR is robust to errors in the annotated skill lengths and to uncertain skill detections. Only frequent, high-confidence mis-detections of skills can lead to erroneous matches and decreased performance. Both experiments show that STAR's guidance with semantic demonstrations is robust to noise in the training and demonstration data.

4.6 Discussion

In this chapter, we presented STAR, an approach for imitation based on semantic skills that can use cross-domain demonstrations for accelerating RL. STAR is effective on multiple semantic imitation problems, including using real-world human demonstration videos for learning a robotic kitchen manipulation task. Our results present a promising way to use large-scale human video datasets like EPIC Kitchens (Damen et al., 2021) for behavior learning in robotics. However, our approach assumes a pre-defined set of semantic skills and semantic skill labels on the training data. We demonstrated how such assumptions can be reduced via the use of pre-trained skill prediction models. Yet, obtaining such semantic information from cheaper-to-collect natural language descriptions of the training trajectories, without a pre-defined skill set, is an exciting direction for future work. Additionally, strengthening the robustness to skill mis-labelings, e.g., via a more robust state matching mechanism, can further improve performance on noisy, real-world datasets.
Chapter 5
Meta Reinforcement Learning with Learned Skills

5.1 Introduction

In recent years, deep reinforcement learning methods have achieved impressive results in robot learning (Gu et al., 2017; Andrychowicz et al., 2020; Kalashnikov et al., 2021). Yet, existing approaches are sample inefficient, rendering the learning of complex behaviors through trial-and-error infeasible, especially on real robot systems. In contrast, humans are capable of effectively learning a variety of complex skills in only a few trials. This can be greatly attributed to our ability to learn how to learn new tasks quickly by efficiently utilizing previously acquired skills.

Figure 5.1: We propose a method that jointly leverages (1) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (2) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks. Our method extracts reusable skills from the offline dataset and meta-learns a policy to quickly use them for solving new tasks.

Can machines likewise learn how to learn by efficiently utilizing learned skills like humans? Meta-reinforcement learning (meta-RL) holds the promise of allowing RL agents to acquire novel tasks with improved efficiency by learning to learn from a distribution of tasks (Finn et al., 2017b; Rakelly et al., 2019). In spite of recent advances in the field, most existing meta-RL algorithms are restricted to short-horizon, dense-reward tasks. To facilitate efficient learning on long-horizon, sparse-reward tasks, recent works aim to leverage experience from prior tasks in the form of offline datasets without additional reward and task annotations (Lynch et al., 2020; Pertsch et al., 2020a; Chebotar et al., 2021b). While these methods can solve complex tasks with substantially improved sample efficiency over methods learning from scratch, millions of interactions with the environment are still required to acquire long-horizon skills.

In this chapter, we take a step towards combining the capabilities of both: learning how to quickly learn new tasks while also leveraging prior experience in the form of unannotated offline data (see Figure 5.1). We devise a method that enables meta-learning on complex, long-horizon tasks and can solve unseen target tasks with orders of magnitude fewer environment interactions than prior works. We propose to leverage the offline experience by extracting reusable skills – short-term behaviors that can be composed to solve unseen long-horizon tasks. We employ a hierarchical meta-learning scheme in which we meta-train a high-level policy to learn how to quickly reuse the extracted skills. To efficiently explore the learned skill space during meta-training, the high-level policy is guided by a skill prior which is also acquired from the offline experience data. We evaluate our method and prior approaches in deep RL, skill-based RL, meta-RL, and multi-task RL on two challenging continuous control environments: maze navigation and kitchen manipulation, which require long-horizon control and only provide sparse rewards. Experimental results show that our method can efficiently solve unseen tasks by exploiting meta-training tasks and offline datasets, while prior approaches require substantially more samples or fail completely.
In summary, the main contributions of this chapter are threefold:
• To the best of our knowledge, this is the first work to combine meta-reinforcement learning algorithms with task-agnostic offline datasets that do not contain reward or task annotations.
• We propose a method that combines meta-learning with offline data by extracting learned skills and a skill prior and meta-learning a hierarchical skill policy regularized by the prior.
• We empirically show that our method is significantly more efficient at learning long-horizon, sparse-reward tasks compared to prior methods in deep RL, skill-based RL, meta-RL, and multi-task RL.

5.2 Related Work

Meta-Reinforcement Learning. Meta-RL approaches (Duan et al., 2016a; Wang et al., 2017; Finn et al., 2017b; Yu et al., 2018b; Rothfuss et al., 2019; Gupta et al., 2018; Nagabandi et al., 2019a; Clavera et al., 2019; 2018; Rakelly et al., 2019; Vuorio et al., 2019; Yang et al., 2019; Zintgraf et al., 2019; Humplik et al., 2019; Zintgraf et al., 2020; Liu et al., 2021) hold the promise of allowing learning agents to quickly adapt to novel tasks by learning to learn from a distribution of tasks. Despite the recent advances in the field, most existing meta-RL algorithms are limited to short-horizon, dense-reward tasks. In contrast, we aim to develop a method that can meta-learn to solve long-horizon tasks with sparse rewards by leveraging offline datasets.

Offline datasets. Recently, many works have investigated the usage of offline datasets for agent training. In particular, the field of offline reinforcement learning (Levine et al., 2020; Siegel et al., 2020; Kumar et al., 2020; Yu et al., 2021) aims to devise methods that can perform RL fully offline from pre-collected data, without the need for environment interactions. However, these methods require target task reward annotations on the offline data for every new task that should be learned. These reward annotations can be challenging to obtain, especially if the offline data is collected from a diverse set of prior tasks. In contrast, our method is able to leverage offline datasets without any reward annotations.

Offline Meta-RL. Another recent line of research aims to meta-learn from static, pre-collected datasets including reward annotations (Mitchell et al., 2021; Pong et al., 2022; Dorfman et al., 2021). After meta-training with the offline datasets, these works aim to quickly adapt to a new task with only a small amount of data from that new task. In contrast to the aforementioned offline RL methods, these works aim to adapt to unseen tasks and assume access to only limited data from the new tasks. However, in addition to reward annotations, these approaches often require that the offline training data is split into separate datasets for each training task, further limiting scalability.

Skill-based Learning. An alternative approach for leveraging offline data that does not require reward or task annotations is the extraction of skills – reusable short-horizon behaviors. Methods for skill-based learning recombine these skills for learning unseen target tasks and converge substantially faster than methods that learn from scratch (Lee et al., 2018b; Hausman et al., 2018; Sharma et al., 2020b). When trained from diverse datasets, these approaches can extract a wide repertoire of skills and learn complex, long-horizon tasks (Merel et al., 2020; Lynch et al., 2020; Pertsch et al., 2020a; Ajay et al., 2021; Chebotar et al., 2021b; Pertsch et al., 2021).
Yet, although they are more efficient than training from scratch, they still require a large number of environment interactions to learn a new task. Our method instead combines skills extracted from offline data with meta-learning, leading to significantly improved sample efficiency.

5.3 Problem Formulation and Preliminaries

Our approach builds on prior work for meta-learning and learning from offline datasets and aims to combine the best of both worlds. In the following, we formalize our problem setup and briefly summarize relevant prior work.

Problem Formulation. Following prior work on learning from large offline datasets (Lynch et al., 2020; Pertsch et al., 2020a; 2021), we assume access to a dataset of state-action trajectories D = {s_t, a_t, ...} which is collected either across a wide variety of tasks or as "play data" with no particular task in mind. We thus refer to this dataset as task-agnostic. With a large number of data collection tasks, the dataset covers a wide variety of behaviors and can be used to accelerate learning on diverse tasks. Such data can be collected at scale, e.g., through autonomous exploration (Hausman et al., 2018; Sharma et al., 2020b; Dasari et al., 2019), human teleoperation (Schaal et al., 2005; Gupta et al., 2019; Mandlekar et al., 2018; Lynch et al., 2020), or from previously trained agents (Fu et al., 2020; Gulcehre et al., 2020). We additionally assume access to a set of meta-training tasks T = {T_1, ..., T_N}, where each task is represented as a Markov decision process (MDP) defined by a tuple {S, A, P, r, ρ, γ} of states, actions, transition probability, reward, initial state distribution, and discount factor.

Our goal is to leverage both the offline dataset D and the meta-training tasks T to accelerate the training of a policy π(a|s) on a target task T*, which is also represented as an MDP. Crucially, we do not assume that T* is part of the set of training tasks T, nor that D contains demonstrations for solving T*. Thus, we aim to design an algorithm that can leverage offline data and meta-training tasks for learning how to quickly compose known skills for solving an unseen target task. Next, we describe existing approaches that either leverage offline data or meta-training tasks to accelerate target task learning. Then, we describe how our approach takes advantage of the best of both worlds.

Off-Policy Meta-RL. Rakelly et al. (2019) introduced an off-policy meta-RL algorithm called probabilistic embeddings for actor-critic RL (PEARL) that leverages a set of training tasks T to enable quick learning of new tasks. Specifically, PEARL leverages the meta-training tasks for learning a task encoder q(e|c). This encoder takes in a small set of state-action-reward transitions c and produces a task embedding e. This embedding is used to condition the actor π(a|s, e) and critic Q(s, a, e). In PEARL, actor, critic, and task encoder are trained by jointly maximizing the obtained reward and the policy's entropy H (Haarnoja et al., 2018a):

$$\max_\pi \; \mathbb{E}_{\mathcal{T}\sim p_{\mathcal{T}},\, e\sim q(\cdot \mid c^{\mathcal{T}})} \Big[ \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho_{\pi|e}} \big[ r_{\mathcal{T}}(s_t,a_t) + \alpha\, \mathcal{H}\big(\pi(a \mid s_t, e)\big) \big] \Big]. \quad (5.1)$$

Additionally, the task embedding output of the task encoder is regularized towards a constant prior distribution p(e).

5.4 Approach

We propose Skill-based Meta-Policy Learning (SiMPL), an algorithm for jointly leveraging offline data as well as a set of meta-training tasks to accelerate the learning of unseen target tasks.
Our method has three phases: (1) skill extraction: we extract reusable skills and a skill prior from the offline data (Section 5.4.1); (2) skill-based meta-training: we utilize the meta-training tasks to learn how to leverage the extracted skills and skill prior to efficiently solve new tasks (Section 5.4.2); (3) target task learning: we fine-tune the meta-trained policy to rapidly adapt to solve an unseen target task (Section 5.4.3). An illustration of the proposed method is shown in Figure 5.2.

Figure 5.2: Method Overview. Our proposed skill-based meta-RL method has three phases. (1) Skill Extraction: learns reusable skills from snippets of task-agnostic offline data through a skill extractor (yellow) and low-level skill policy (blue). Also trains a prior distribution over skill embeddings (green). (2) Skill-based Meta-training: meta-trains a high-level skill policy (red) and task encoder (purple) while using the pre-trained low-level policy. The pre-trained skill prior is used to regularize the high-level policy during meta-training and guide exploration. (3) Target Task Learning: leverages the meta-trained hierarchical policy for quick learning of an unseen target task. After conditioning the policy by encoding a few transitions c* from the target task T*, we continue fine-tuning the high-level skill policy on the target task while regularizing it with the pre-trained skill prior.

5.4.1 Skill Extraction

To acquire a set of reusable skills from the offline dataset D, we leverage the skill extraction approach proposed in Chapter 2 (Pertsch et al., 2020a). Concretely, we jointly train (1) a skill encoder q(z|s_{0:K}, a_{0:K−1}) that embeds a K-step trajectory, randomly cropped from the sequences in D, into a low-dimensional skill embedding z, and (2) a low-level skill policy π(a_t|s_t, z) that is trained with behavioral cloning to reproduce the action sequence a_{0:K−1} given the skill embedding. To learn a smooth skill representation, we regularize the output of the skill encoder with a unit Gaussian prior distribution and weight this regularization by a coefficient β (Higgins et al., 2017):

$$\max_{q,\pi} \; \mathbb{E}_{z\sim q}\Big[\underbrace{\sum_{t=0}^{K-1} \log \pi(a_t \mid s_t, z)}_{\text{behavioral cloning}} \;-\; \beta\, \underbrace{D_{KL}\big(q(z \mid s_{0:K}, a_{0:K-1}),\; \mathcal{N}(0, I)\big)}_{\text{embedding regularization}}\Big]. \quad (5.2)$$

Additionally, we follow Pertsch et al. (2020a) and learn a skill prior p(z|s) that captures the distribution of skills likely to be executed in a given state under the training data distribution. The prior is trained to match the output of the skill encoder: $\min_p D_{KL}\big(\lfloor q(z \mid s_{0:K}, a_{0:K-1})\rfloor,\; p(z \mid s_0)\big)$. Here ⌊·⌋ indicates that gradient flow into the skill encoder is stopped when training the skill prior.

5.4.2 Skill-based Meta-Training

We aim to learn a policy that can quickly learn to leverage the extracted skills to solve new tasks. We leverage off-policy meta-RL (see Section 5.3) to learn such a policy using our set of meta-training tasks T. Similar to PEARL (Rakelly et al., 2019), we train a task encoder that takes in a set of sampled transitions and produces a task embedding e. Crucially, we leverage our learned skills by training a task-embedding-conditioned policy over skills instead of primitive actions: π(z|s, e), thus equipping the policy with a set of useful pre-trained behaviors and reducing the meta-training task to learning how to combine these behaviors instead of learning them from scratch.
We find that this usage of offline data through learned skills is crucial for enabling meta-training on complex, long-horizon tasks (see Section 5.5). Prior work has shown that the efficiency of RL on learned skill spaces can be substantially improved by guiding the policy with a learned skill prior (Pertsch et al., 2020a; Ajay et al., 2021). Thus, instead of regularizing with a maximum entropy objective as done in prior work on off-policy meta-RL (Rakelly et al., 2019), we propose to regularize the meta-training policy with our pre-trained skill prior, leading to the following meta-training objective:

$$\max_\pi \; \mathbb{E}_{\mathcal{T}\sim p_{\mathcal{T}},\, e\sim q(\cdot \mid c^{\mathcal{T}})} \Big[ \sum_t \mathbb{E}_{(s_t,z_t)\sim \rho_{\pi|e}} \big[ r_{\mathcal{T}}(s_t,z_t) - \alpha\, D_{KL}\big(\pi(z \mid s_t, e),\; p(z \mid s_t)\big) \big] \Big], \quad (5.3)$$

where α determines the strength of the prior regularization. We automatically tune α via dual gradient descent by choosing a target divergence δ between policy and prior (Pertsch et al., 2020a).

To compute the task embedding e, we use multiple different sizes of the conditioning set c. We found that we can increase training stability by adjusting the strength of the prior regularization to the size of the conditioning set. Intuitively, when the high-level policy is conditioned on only a few transitions, i.e., when the set c is small, it has only little information about the task at hand and should thus be regularized more strongly towards the task-agnostic skill prior. Conversely, when c is large, the policy likely has more information about the target task and thus should be allowed to deviate from the skill prior more to solve the task, i.e., have a weaker regularization strength. To implement this intuition, we employ a simple approach: we define two target divergences δ_1 and δ_2 with δ_1 < δ_2 and associated auto-tuned coefficients α_1 and α_2. We regularize the policy using the larger coefficient α_1 when the conditioning transition set is small, and otherwise we regularize using the smaller coefficient α_2. We found this technique simple yet sufficient in our experiments and leave the investigation of more sophisticated regularization approaches for future work.

5.4.3 Target Task Learning

When a target task is given, we aim to leverage the meta-trained policy for quickly learning how to solve it. Intuitively, the policy should first explore different skill options to learn about the task at hand and then rapidly narrow its output distribution to those skills that solve the task. We implement this intuition by first collecting a small set of conditioning transitions c* from the target task by exploring with the meta-trained policy. Since we have no information about the target task at this stage, we explore the environment by conditioning our pre-trained policy on task embeddings sampled from the task prior p(e). Then, we encode this set of transitions into a target task embedding e* ∼ q(e|c*). By conditioning our meta-trained high-level policy on this encoding, we can rapidly narrow its skill distribution to skills that solve the given target task: π(z|s, e*). Empirically, we find that this policy is often already able to achieve high success rates on the target task. Note that only very few interactions with the environment for collecting c* are required for learning a complex, long-horizon, unseen target task with sparse rewards. This is substantially more efficient than prior approaches such as SPiRL, which require orders of magnitude more target task interactions to achieve comparable performance.
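The target task adaptation step can be summarized in a few lines of Python. This is a hedged sketch rather than the exact SiMPL implementation: the task encoder, task prior, and the collect_episode helper are assumed interfaces introduced purely for illustration, and the number of exploration episodes is a placeholder.

```python
def adapt_to_target_task(env, task_encoder, task_prior, high_level_policy,
                         skill_decoder, collect_episode, num_explore_episodes=20):
    """Condition the meta-trained high-level policy on a new target task."""
    # 1) Explore: roll out the policy conditioned on task embeddings sampled
    #    from the task prior p(e) to collect conditioning transitions c*.
    conditioning_transitions = []
    for _ in range(num_explore_episodes):
        e = task_prior.sample()
        conditioning_transitions += collect_episode(env, high_level_policy,
                                                    skill_decoder, task_embedding=e)

    # 2) Encode: infer the target task embedding e* ~ q(e | c*).
    e_star = task_encoder(conditioning_transitions).sample()

    # 3) Act with the conditioned skill policy pi(z|s, e*); optional fine-tuning
    #    continues with the skill-prior-regularized objective (Eq. 5.4).
    return e_star
```

The key point is that steps (1)-(2) require only a handful of environment episodes, after which the conditioned policy already solves much of the task; Eq. 5.4 then refines it further.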
To further improve the performance on the target task, we fine-tune the conditioned policy with target task rewards while guiding its exploration with the pre-trained skill prior¹:

$$\max_\pi \; \mathbb{E}_{e^*\sim q(\cdot \mid c^*)} \Big[ \sum_t \mathbb{E}_{(s_t,z_t)\sim \rho_{\pi|e^*}} \big[ r_{\mathcal{T}^*}(s_t,z_t) - \alpha\, D_{KL}\big(\pi(z \mid s_t, e^*),\; p(z \mid s_t)\big) \big] \Big]. \quad (5.4)$$

In practice, we propose several techniques for stabilizing meta-training and fine-tuning: (1) adaptively regularizing the policy based on the size of the conditioning trajectory set, as described in Section 5.4.2; (2) parameterizing the policy as a residual policy that outputs differences to the pre-trained skill prior, instead of the approach from Pertsch et al. (2020a) that directly fine-tunes the skill prior; and (3) initializing the Q-function and the α parameter during fine-tuning with the meta-trained parameters instead of randomly initialized networks. We discuss these techniques in detail in Section E.

¹ Other regularization distributions are possible during fine-tuning, e.g., the high-level policy conditioned on task prior samples p(z|s, e ∼ p(e)) or the target-task-embedding-conditioned policy p(z|s, e*) before fine-tuning. Yet, we found the regularization with the pre-trained task-agnostic skill prior to work best in our experiments.
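The prior-regularization weight α in Eqs. 5.3 and 5.4 is tuned automatically against a target divergence δ. A minimal PyTorch-style sketch of this dual-gradient-descent update is shown below; the optimizer setup, learning rate, target value, and the way the KL is estimated from policy samples are assumptions for illustration.

```python
import torch

# alpha is optimized in log-space so it stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_divergence = 5.0  # target KL delta between policy and skill prior

def update_alpha(kl_policy_prior):
    """One dual-gradient-descent step on alpha.

    kl_policy_prior: batch estimate of D_KL(pi(z|s,e), p(z|s)), detached from
    the policy graph. If the policy's KL exceeds the target divergence, alpha
    grows (stronger regularization); otherwise it shrinks.
    """
    alpha = log_alpha.exp()
    alpha_loss = (alpha * (target_divergence - kl_policy_prior.detach())).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()
```

As described in Section 5.4.2, SiMPL maintains two such coefficients α_1 and α_2 with target divergences δ_1 < δ_2 and switches between them depending on the size of the conditioning set c.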
5.5 Experiments

Our experiments aim to answer the following questions: (1) Can our proposed method learn to efficiently solve long-horizon, sparse-reward tasks? (2) Is it crucial to utilize offline datasets to achieve this? (3) How can we best leverage the training tasks for efficient learning of target tasks? (4) How does the training task distribution affect target task learning?

5.5.1 Experimental Setup

We evaluate our approach in two challenging continuous control environments: a maze navigation and a kitchen manipulation environment, as illustrated in Figure 5.3. While meta-RL algorithms are typically evaluated on tasks that span only a few dozen time steps and provide dense rewards (Finn et al., 2017b; Rothfuss et al., 2019; Rakelly et al., 2019; Zintgraf et al., 2020), our tasks require learning long-horizon behaviors over hundreds of time steps from sparse reward feedback and thus pose a new challenge to meta-learning algorithms.

Figure 5.3: Environments. We evaluate our proposed framework in two domains that require the learning of complex, long-horizon behaviors from sparse rewards. These environments are substantially more complex than those typically used to evaluate meta-RL algorithms. (a) Maze Navigation: The agent needs to navigate for hundreds of steps to reach unseen target goals and only receives a binary reward upon task success. (b) Kitchen Manipulation: The 7DoF agent needs to execute an unseen sequence of four subtasks, spanning hundreds of time steps, and only receives a sparse reward upon completion of each subtask.

Maze Navigation. This 2D maze navigation domain, based on the maze navigation problem in Fu et al. (2020), requires long-horizon control with hundreds of steps for a successful episode and only provides sparse reward feedback upon reaching the goal. The observation space of the agent consists of its 2D position and velocity, and it acts via planar, continuous velocity commands. Following Fu et al. (2020), we collect a task-agnostic offline dataset by randomly sampling start-goal locations in the maze and using a planner to generate a trajectory that reaches from start to goal. Note that the trajectories are not annotated with any reward or task labels (i.e., which start-goal location is used for producing each trajectory). To generate a set of meta-training and target tasks, we fix the agent's initial position in the center of the maze and sample 40 random goal locations for meta-training and another set of 10 goals for target tasks. All meta-training and target tasks use the same sparse reward formulation. More details can be found in Section G.1.

Kitchen Manipulation. The FrankaKitchen environment of Gupta et al. (2019) requires the agent to control a 7-DoF robot arm via continuous joint velocity commands and complete a sequence of manipulation tasks like opening the microwave or turning on the stove. Successful episodes span 300-500 steps and the agent is only provided a sparse reward signal upon successful completion of each subtask. We leverage a dataset of 600 human-teleoperated manipulation sequences from Gupta et al. (2019) for offline pre-training. In each trajectory, the robot executes a sequence of four subtasks. We then define a set of 23 meta-training tasks and 10 target tasks that in turn require the consecutive execution of four subtasks (see Figure 5.3 for examples). Note that each task consists of a unique combination of subtasks. More details can be found in Section G.2.

Comparisons. We compare SiMPL to prior approaches in RL, skill-based RL, meta-RL, and multi-task RL:
• SAC (Haarnoja et al., 2018a) is a state-of-the-art deep RL algorithm. It learns to solve a target task from scratch without leveraging either the offline dataset or the meta-training tasks.
• SPiRL (Pertsch et al., 2020a) is our method from Chapter 2. It acquires skills and a skill prior from the offline dataset but does not utilize the meta-training tasks. This investigates the benefits our method can obtain from leveraging the meta-training tasks.
• PEARL (Rakelly et al., 2019) is a state-of-the-art off-policy meta-RL algorithm that learns a policy which can quickly adapt to unseen test tasks. It learns from the meta-training tasks but does not use the offline dataset. This examines the benefits of using learned skills in meta-RL.
• PEARL-ft demonstrates the performance of a PEARL (Rakelly et al., 2019) model further fine-tuned on a target task using SAC (Haarnoja et al., 2018a).
• Multi-task RL (MTRL) is a multi-task RL baseline which learns from the meta-training tasks by distilling individual policies specialized in each task into a shared policy, similar to Distral (Teh et al., 2017). Each individual policy is trained using SPiRL by leveraging skills extracted from the offline dataset. Therefore, it utilizes both the meta-training tasks and the offline dataset, similar to our method. This provides a direct comparison of multi-task learning from the training tasks (MTRL) vs. meta-learning using them (ours).
More implementation details on the baselines can be found in Section F.

Figure 5.4: Target Task Learning Efficiency. SiMPL demonstrates better sample efficiency than all baselines, verifying the efficacy of meta-learning on long-horizon tasks by leveraging skills and a skill prior extracted from an offline dataset. For both environments, we train each model on each target task with 3 different random seeds.
SiMPL and PEARL-ft first collect 20 episodes of environment interactions (vertical dotted line) for conditioning the meta-trained policy before fine-tuning it on target tasks.

5.5.2 Results

We present the quantitative results in Figure 5.4 and the qualitative results on the maze navigation domain in Figure 5.5. In Figure 5.4, SiMPL demonstrates much better sample efficiency for learning the unseen target tasks compared to all the baselines. Without leveraging the offline dataset and meta-training tasks, SAC is not able to make learning progress on most of the target tasks. While PEARL is first trained on the meta-training tasks, it still achieves poor performance on the target tasks, and fine-tuning it (PEARL-ft) does not yield significant improvement. We believe this is because both environments provide only sparse rewards yet require long-horizon and complex behaviors, which is known to be difficult for meta-RL methods (Mitchell et al., 2021). On the other hand, by first extracting skills and acquiring a skill prior from the offline dataset, SPiRL's performance consistently improves with more samples from the target tasks. Yet, it requires significantly more environment interactions than our method to solve the target tasks, since the policy is optimized using vanilla RL, which is not designed to learn to quickly learn new tasks. While the multi-task RL (MTRL) baseline first learns a multi-task policy from the meta-training tasks, its sample efficiency is similar to SPiRL on target task learning, which highlights the strength of our proposed method – meta-learning from the meta-training tasks for fast target task learning.

Compared to the baselines, our method learns the target tasks much faster. Within only a few episodes, the policy converges to solve more than 80% of the target tasks in the maze environment and two out of four subtasks in the kitchen manipulation environment. The prior-regularized fine-tuning then continues to improve performance. The rapidly increasing performance and the overall faster convergence show the benefits of leveraging meta-training tasks in addition to learning from offline data: by first learning to learn how to quickly solve tasks using the extracted skills and the skill prior, our policy can efficiently solve the target tasks.

Figure 5.5: Qualitative Results. All the methods that leverage the offline dataset (i.e., SiMPL, SPiRL, and MTRL) effectively explore the maze in the first episode. Then, SiMPL converges in far fewer episodes than SPiRL and MTRL. In contrast, PEARL-ft is not able to make learning progress.

The qualitative results presented in Figure 5.5 show that all the methods that leverage the offline dataset (i.e., SiMPL, SPiRL, and MTRL) effectively explore the maze in the first episode. Then, SiMPL converges in far fewer episodes than SPiRL and MTRL, underlining the effectiveness of meta-training. In contrast, PEARL-ft is not able to make learning progress, justifying the necessity of employing offline datasets for acquiring long-horizon, complex behaviors.

Figure 5.6: Meta-training Task Distribution Analysis.
(a) With sparser meta-training task distributions (i.e., fewer meta-training tasks), SiMPL still achieves better sample efficiency than SPiRL, highlighting the benefit of leveraging meta-training tasks. (b) When trained on a meta-training task distribution that aligns better with the target task distribution, SiMPL achieves improved performance. (c) When trained on a meta-training task distribution that is misaligned with the target tasks, SiMPL yields worse performance. For all analyses, we train each model on each target task with 3 different random seeds.

5.5.3 Meta-training Task Distribution Analysis

In this section, we aim to investigate the effect of the meta-training task distribution on our skill-based meta-training and target task learning phases. Specifically, we examine the effect of (1) the number of tasks in the meta-training task distribution and (2) the alignment between the meta-training task distribution and the target task distribution. We conduct experiments and analyses in the maze navigation domain. More details on the task distributions can be found in Section G.1.

Number of meta-training tasks. To investigate how the number of meta-training tasks affects the performance of our method, we train our method with fewer meta-training tasks (i.e., 10 and 20) and evaluate it on the same set of target tasks. The quantitative results presented in Figure 5.6a suggest that even with sparser meta-training task distributions (i.e., fewer meta-training tasks), SiMPL is still more sample efficient than the best baseline, SPiRL.

Meta-train / test task alignment. We aim to examine whether a model trained on a meta-training task distribution that aligns better or worse with the target tasks yields improved or deteriorated performance. To this end, we create biased meta-training / test task distributions: we create a meta-training set by sampling goal locations from only the top 25% portion of the maze (T_TRAIN-TOP). To rule out the effect of the density of the task distribution, we sample 10 (i.e., 40 × 25%) meta-training tasks. Then, we create two target task distributions that have good and bad alignment with this meta-training distribution, respectively, by sampling 10 target tasks from the top 25% portion of the maze (T_TARGET-TOP) and 10 target tasks from the bottom 25% portion of the maze (T_TARGET-BOTTOM).

Figure 5.6b and Figure 5.6c present the target task learning efficiency for models trained with good task alignment (meta-train on T_TRAIN-TOP, learn target tasks from T_TARGET-TOP) and bad task alignment (meta-train on T_TRAIN-TOP, learn target tasks from T_TARGET-BOTTOM), respectively. The results demonstrate that SiMPL can achieve improved performance when trained on a better-aligned meta-training task distribution. On the other hand, not surprisingly, SiMPL and MTRL perform slightly worse than SPiRL when trained with misaligned meta-training tasks (see Figure 5.6c). This is expected given that SPiRL does not learn from the misaligned meta-training tasks. In summary, from Figure 5.6, we can conclude that meta-learning from either a diverse task distribution or a better-informed task distribution can yield improved performance.

5.6 Discussion

We proposed a skill-based meta-RL method, dubbed SiMPL, that can meta-learn on long-horizon tasks by leveraging prior experience in the form of large offline datasets without additional reward and task annotations.
Specifically, our method first extracts reusable skills and a skill prior from the offline data. Then, it meta-trains a high-level policy that leverages these skills for efficient learning of unseen target tasks. To effectively utilize the learned skills, the high-level policy is regularized by the acquired prior. Experimental results on challenging, long-horizon continuous control navigation and manipulation tasks with sparse rewards demonstrate that our method outperforms prior approaches in deep RL, skill-based RL, meta-RL, and multi-task RL. In the future, we aim to demonstrate the scalability of our method to high-DoF continuous control problems on real robotic systems to show the benefits of our improved sample efficiency.

Part II
Scalable Robot Data Collection

Chapter 6
Assisted Teleoperation for Scalable Robot Data Collection

6.1 Introduction

Recently, many works have shown impressive robot learning results from diverse, human-collected demonstration datasets (Mandlekar et al., 2018; Cabi et al., 2019; Lu et al., 2021; Ebert et al., 2022), including our own works from Part I of this thesis. They underline the importance of scalable robot data collection. Yet, the current standard approach for demonstration collection, human teleoperation, is tedious and costly: tasks need to be demonstrated repeatedly and each operator can control only a single robot at a time. Research in teleoperation has focused on exploring different interfaces, such as VR controllers (Zhang et al., 2018b) and smartphones (Mandlekar et al., 2018), but does not address the aforementioned bottlenecks to scaling data collection. Thus, current teleoperation systems do not have the scalability required by modern robot learning pipelines.

Our goal is to improve the scalability of robotic data collection by providing assistance to the human operator during teleoperation. We take inspiration from other fields of machine learning, such as semantic segmentation, where costly labeling processes have been substantially accelerated by providing human annotators with learned assistance systems, e.g., in the form of rough segmentation estimates, that drastically reduce the labeling burden (Castrejon et al., 2017; Acuna et al., 2018). Similarly, we propose to train assistive policies that can automate control of repeatedly demonstrated behaviors and ask for user input only when facing a novel situation or when unsure which behavior to execute. Thereby, we aim to reduce the mental load of the human operator and enable scalable teleoperation by allowing a single operator to perform data collection with multiple robots in parallel.

Figure 6.1: Policy Assisted TeleOperation (PATO) enables large-scale data collection by minimizing human operator inputs and mental effort with an assistive policy, which autonomously performs repetitive subtasks. This allows a human operator to simultaneously manage multiple robots.

In order to build an assistive system for robotic data collection, we need to solve two key challenges: (1) we need to learn assistive policies from diverse human-collected data, which is known to be challenging (Mandlekar et al., 2021), and (2) we need to learn when to ask for operator input while keeping such interventions at a minimum.
To address these challenges, we propose to use a hierarchical stochastic policy that can learn effectively from diverse human data. Further, we use the policy's stochastic predictions to estimate its uncertainty about how to act in the current scene and which task to pursue. We then use this estimate to elicit operator input only if the assistive policy is uncertain about how to proceed.

The main contribution of this chapter is a novel Policy Assisted TeleOperation (PATO) system, which enables scalable robotic data collection using a hierarchical assistive policy. We evaluate the effectiveness of our approach in a user study in which operators collect datasets of diverse kitchen-inspired manipulation tasks with a real robot. We find that our proposed assisted teleoperation approach reduces operators' mental load and improves their demonstration throughput. We further demonstrate that our approach allows a single operator to control data collection with multiple robots simultaneously in a simulated manipulation environment – a first step towards more scalable robotic data collection.

6.2 Related Work

Robot Teleoperation. Demonstrations have played a key role in robot learning for many decades (Pomerleau, 1989; Billard et al., 2008; Argall et al., 2009), and thus many approaches have been explored for collecting such demonstrations. While kinesthetic teaching (Amor et al., 2009), in which a human operator directly moves the robot, was initially common, teleoperation has more recently become the norm (Zhang et al., 2018b; Mandlekar et al., 2018; Gupta et al., 2019; Ebert et al., 2022; Lee et al., 2021a), since separating the human operator and the robot allows for more comfortable human control inputs and is crucial for training policies with image-based inputs. Research into teleoperation systems has focused on exploring different interfaces like VR headsets (Zhang et al., 2018b; Ebert et al., 2022), joysticks (Cabi et al., 2019), and smartphones (Mandlekar et al., 2018). Yet, none of these works explores active assistance of the human operator during teleoperation. Others have investigated controlling high-DoF manipulators via low-DoF interfaces through learned embedding spaces (Losey et al., 2020; Jeon et al., 2020) to allow people with disabilities to control robotic arms. In contrast, our approach trains assistive policies that automate part of the teleoperation process with the goal of enabling more scalable data collection.

Shared Autonomy. The idea of sharing effort between humans and robots when solving tasks has a rich history in the human-robot interaction (HRI) community (Javdani et al., 2015; Selvaggio et al., 2021; Berthet-Rayne et al., 2016; Pichler et al., 2017; Gao and Chien, 2017; Nikolaidis et al., 2017; Johns et al., 2016; Argall, 2018; Fontaine and Nikolaidis, 2021; 2022). Approaches for such shared autonomy typically rely on a pre-defined set of goals and aim to infer the intent of the human operator to optimally assist them. Crucially, in the context of data collection, we cannot assume that all goals are known a priori, since a core goal of data collection is to collect previously unseen behaviors. Thus, instead of inferring the operator's intent over a fixed goal set, we leverage the model's estimate of its own uncertainty to determine when to assist and when to rely on operator input.

Interactive Human Robot Learning.
In the field of robot learning, many approaches have explored leveraging human input in the learning loop and focused on different ways to decide when to leverage such input. Based on the DAgger algorithm (Ross et al., 2011), works have investigated having the human themselves decide when to intervene (Kelly et al., 2019), using ensemble-based support estimates (Menda et al., 2019), using discrepancies between model output and human inputs (Zhang and Cho, 2017; Hoque et al., 2021b), or using risk estimates based on predicted future returns (Hoque et al., 2021a). Yet, all these approaches focus on training a policy for a single task, not on collecting a diverse dataset. Thus, they are not designed to learn from multi-modal datasets or estimate uncertainty about the desired task. We show in our user study that these are crucial for enabling scalable robot data collection.
Assisted Robot Data Collection. Clever et al. (2021) aim to assist in robot demonstration collection via a learned policy. They visualize the projected trajectory of the assistive policy to enable the human operator to intervene if necessary. However, they focus on collection of single-task, short-horizon demonstrations and require the operator to constantly monitor the robot to decide when to intervene. In contrast, our system can collect diverse, multi-task datasets and learn when to ask the user for input, enabling more scalable data collection, e.g., with multiple robots in parallel.

6.3 Approach
In this paper, we aim to improve the scalability of robotic data collection by learning an assistive policy to automate part of teleoperation (Section 6.3.2) and ask for user inputs only when necessary (Section 6.3.3).

Figure 6.2: PATO is hierarchical: a high-level subgoal predictor p(s_g|s,z) and a low-level subgoal-reaching policy π_LL(a|s,s_g). To decide when to follow the assistive policy, we measure uncertainty of both high-level (subgoal predictor) and low-level (subgoal-reaching policy) decisions. The task uncertainty is estimated using the subgoal predictor's variance, and the policy uncertainty is estimated as the disagreement among an ensemble of subgoal-reaching policies.

6.3.1 Problem Formulation
An assistive policy π(a|s) produces actions a, e.g., robot end-effector displacements, given states s, e.g., raw RGB images. To enable scalable collection of a dataset D, the policy should control the robot and minimize required human inputs, which allows the human operator to divert attention away from the robot over contiguous intervals, e.g., to attend to other robotic agents collecting data in parallel. To train the assistive policy π we assume access to a pre-collected dataset D_pre of diverse agent experience, e.g., from scripted policies, previously collected data on different tasks, or human play (Lynch et al., 2020). Crucially, we explicitly require our approach to handle scenarios in which the newly collected dataset D contains behaviors that are not present in D_pre. Thus, it is not possible to fully automate data collection given the pre-training dataset. Instead, the system needs to request human input for unseen behaviors while providing assistance for known behaviors.

6.3.2 Learning Assistive Policies from Multi-Modal Data
Figure 6.3: Our hierarchical assistive policy is trained using a pre-collected dataset D_pre. From a sampled trajectory (s_1, a_1, ..., a_{H-1}, s_H) of length H, a subgoal predictor p(s_g|s_1,z) is trained as a conditional VAE to cover a multi-modal subgoal distribution, where s_g = s_H. Then, an ensemble of subgoal-reaching policies π^(k)_LL(a_t|s_t,s_g) are trained to predict the ground-truth actions.

Learning the assistive policy π(a|s) from the diverse, multi-task data D_pre is challenging (Mandlekar et al., 2021), since it is often highly multi-modal and requires imitation of long demonstrations. In our approach, we build on prior work in imitation of long-horizon, multi-task human data (Mandlekar et al., 2020a). We propose to use a hierarchical policy with a subgoal predictor p(s_g|s,z) and a low-level subgoal-reaching policy π_LL(a_t|s_t,s_g) (see Figure 6.2). We condition the subgoal predictor on a stochastic latent variable z to allow prediction of the full distribution of possible subgoals. For example, in a multi-modal dataset in which the robot moves veggies into the oven in half of the trajectories and places them on a plate in the other half, z captures whether to predict a subgoal with veggies in the oven or on the plate.
We train the subgoal predictor as a conditional variational auto-encoder over subgoals (Sohn et al., 2015): given a randomly sampled starting state s_t from the pre-training dataset and a subgoal state s_{t+H} = s_g, H steps later in the trajectory, we use a learned inference network q(z|s_t,s_g) to encode s_t and s_g into a latent variable z. We then use the subgoal predictor p(s_g|s_t,z) to decode back to the original subgoal state. During training we apply a subgoal reconstruction loss, as well as a regularization loss on the latent variable z. Finally, the subgoal-reaching policy is trained via simple behavioral cloning. We summarize the components of our training model in Figure 6.3. Our final training objective is:

\max_{\theta,\phi,\mu} \; \mathbb{E}_{(s,a,s_g)\sim\mathcal{D}_{\text{pre}},\, z\sim q(\cdot|s,s_g)} \Big[ \underbrace{p_\theta(s_g|s,z)}_{\text{subgoal reconstruction}} + \underbrace{\pi_{LL,\phi}(a|s,s_g)}_{\text{behavioral cloning}} - \beta \underbrace{D_{KL}\big(q(z|s,s_g),\, p(z)\big)}_{\text{latent regularization}} \Big] \quad (6.1)

Here a represents the sequence of actions from the current state until s_g. We use θ, φ, µ to denote the parameters of the subgoal predictor, goal-reaching policy, and inference network, respectively. β is a regularization weighting factor, D_KL denotes the Kullback-Leibler divergence, and we use a unit Gaussian prior p(z) over the latent variable.
To execute our assistive policy π(a|s), we first sample a latent variable z from the unit Gaussian prior, then pass z and s through the subgoal predictor p(s_g|s,z) to generate a subgoal, and then use the goal-reaching policy π_LL(a|s,s_g) to predict executable actions.
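Below is a minimal PyTorch-style sketch of one training step for the objective in Eq. (6.1). It is an illustrative approximation rather than the actual implementation: module interfaces, tensor shapes, and the use of mean-squared-error losses for the reconstruction and behavioral-cloning terms are assumptions.

```python
import torch
import torch.nn.functional as F

def pato_training_step(batch, inference_net, subgoal_predictor, low_level_policies, beta):
    """One gradient step on the objective in Eq. (6.1) (illustrative sketch).

    batch contains:
      "s"       - start states s_t,                      (B, state_dim)
      "s_g"     - subgoal states s_{t+H},                (B, state_dim)
      "states"  - intermediate states s_t ... s_{t+H-1}, (B, H, state_dim)
      "actions" - corresponding ground-truth actions,    (B, H, action_dim)
    """
    s, s_g = batch["s"], batch["s_g"]

    # Inference network q(z | s, s_g): diagonal Gaussian, reparameterized sample.
    mu, log_std = inference_net(s, s_g)
    z = mu + log_std.exp() * torch.randn_like(mu)

    # Subgoal reconstruction: decode s_g from (s, z) with the subgoal predictor p(s_g | s, z).
    recon_loss = F.mse_loss(subgoal_predictor(s, z), s_g)

    # Latent regularization: KL( q(z | s, s_g) || N(0, I) ).
    kl_loss = (-log_std + 0.5 * (log_std.exp() ** 2 + mu ** 2) - 0.5).sum(-1).mean()

    # Behavioral cloning for the ensemble of subgoal-reaching policies pi_LL^(k)(a | s, s_g).
    # (In the chapter, ensemble members differ by initialization and batch ordering;
    # here they share the same batch for brevity.)
    goal = s_g.unsqueeze(1).expand(-1, batch["states"].shape[1], -1)
    bc_loss = sum(F.mse_loss(pi(batch["states"], goal), batch["actions"])
                  for pi in low_level_policies)

    # Minimizing this loss maximizes the objective in Eq. (6.1).
    return recon_loss + bc_loss + beta * kl_loss
```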
6.3.3 Deciding When to Request User Input
A core requirement of our approach is that it can actively ask for operator input while minimizing the number of such requests. This is crucial, since it allows the operator to divert their attention to other tasks, e.g., controlling other robots, while the assistive policy is controlling the robot. Thus, a key question is: when should the policy ask for human inputs? Intuitively, the policy should request help when it is uncertain about what action to take next. This can occur in two scenarios: (1) the policy faces a situation that is not present in the training data, so it does not know which action to take, or (2) the policy faces a seen situation, but the training trajectories contain multiple possible continuations and the policy is not sure which one to pick. The latter scenario commonly occurs during the collection of diverse datasets, since trajectories for different tasks often intersect. For example, in a kitchen environment, multiple tasks could start by tossing veggies in a pan, but then diverge into putting the pan on the stove or in the oven. During teleoperation, human inputs are required for such situations to decide which task to continue with. This is in contrast to single-task demonstration collection, e.g., during DAgger training, where such uncertainty over tasks usually does not occur.

Figure 6.4: Our approach asks for human inputs when the assistive policy is uncertain about which subtask or action to take. If both the task uncertainty and policy uncertainty are lower than their thresholds, our assistive policy can reliably perform a subtask, reducing the workload of the human operator.

Our hierarchical model allows us to separately estimate both classes of uncertainty. To estimate whether a given state is unseen, we follow prior work on out-of-distribution detection (Lakshminarayanan et al., 2017; Menda et al., 2019) and train an ensemble of K low-level reaching policies, all on the same data D_pre but with different initializations and batch ordering. By measuring the disagreement D(a^(1), ..., a^(K)) between the actions predicted by these ensemble policies, i.e., the mean of the variance of actions in each dimension, we can recognize unseen states as those states with high disagreement (Lakshminarayanan et al., 2017). If we determine that a state is seen in the training data, we can estimate the policy's certainty about the task by sampling from the distribution of subgoals produced by the subgoal predictor and computing the inter-subgoal variance Var(s_g^(1), ..., s_g^(N)). In cases with multiple possible task continuations, this variance will be high since the distribution of possible subgoals will widen.
We leverage both uncertainty estimates to decide on whether the assistive policy should continue controlling the robot or whether it should stop and ask for human input. We found a simple thresholding scheme sufficient, with threshold parameters γ, ω for the ensemble disagreement and subgoal variance, respectively. Future work can investigate more advanced mixing schemes, e.g., with auto-tuned thresholds (Hoque et al., 2021a) or hysteresis between switching from robot to human and back (Hoque et al., 2021b). We also include a human override H_t that allows the human to actively take control at any point during the teleoperation, e.g., in order to demonstrate a new task from a seen state, in which case the assistive policy would not ask for human input automatically. In summary, we continue executing the assistive policy if:

\text{Assist} = \underbrace{\lnot\big(D(a^{(1)},\dots,a^{(K)}) > \gamma\big)}_{\text{OOD state}} \;\land\; \underbrace{\lnot\big(\mathrm{Var}(s_g^{(1)},\dots,s_g^{(N)}) > \omega\big)}_{\text{uncertain task}} \;\land\; \underbrace{\lnot H_t}_{\text{human override}} \quad (6.2)
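The decision rule in Eq. (6.2) can be sketched in a few lines of Python. This is a hedged illustration rather than the actual system code: reducing the per-dimension subgoal variance to a scalar via a mean is an assumption, and the input shapes are hypothetical.

```python
import numpy as np

def continue_assisting(ensemble_actions, subgoal_samples, gamma, omega, human_override):
    """Return True if the assistive policy should keep control (sketch of Eq. 6.2).

    ensemble_actions: (K, action_dim) actions predicted by the K ensemble policies.
    subgoal_samples:  (N, subgoal_dim) subgoals decoded from N prior samples z ~ N(0, I).
    gamma, omega:     thresholds on policy (ensemble) and task (subgoal) uncertainty.
    human_override:   True if the operator actively takes control (H_t).
    """
    # Policy uncertainty: mean per-dimension variance of the ensemble's actions.
    policy_uncertainty = np.var(ensemble_actions, axis=0).mean()
    # Task uncertainty: spread of the predicted subgoal distribution.
    task_uncertainty = np.var(subgoal_samples, axis=0).mean()

    ood_state = policy_uncertainty > gamma       # unseen state -> request human input
    uncertain_task = task_uncertainty > omega    # multiple plausible continuations -> request input
    return (not ood_state) and (not uncertain_task) and (not human_override)
```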
6.4 Experiments
We aim to answer the following questions: (1) Does PATO reduce the mental load of human operators? (2) Does it allow the operators to divert their attention to other tasks during teleoperation? (3) Can PATO scale robotic data collection by allowing a single operator to teleoperate multiple robotic systems in parallel? To answer these questions, we perform two user studies in which (1) users teleoperate a real robot arm to perform diverse manipulation tasks, and (2) users teleoperate multiple simulated robotic arms simultaneously.

6.4.1 User Study Design
To evaluate the effectiveness of PATO, we conduct a user study (N = 16) in which users teleoperate a Kinova Jaco 2 robot arm to collect diverse robot manipulation demonstrations for kitchen-inspired long-horizon tasks, e.g., "place ingredients in bowl" and "place bowl in oven" (see Figure 6.5). Users teleoperate the robot's end-effector via joystick and buttons on a standard gamepad controller. The users are also asked to solve simple side tasks during teleoperation to measure their ability to divert attention and conduct other tasks. Specifically, they are shown a string of randomly selected characters, one of which is different from the others (e.g., 000100), and asked to specify the index of the odd character. To ensure that the users can only attend to teleoperation or the side task at a time, they perform teleoperation purely via a live-streamed video of the robot setup, without being able to see the physical robot. Crucially, they can only see either the video stream or the side task and can switch between the two views with a button press.
To train PATO, we collect a pre-training dataset of 120 demonstrations. Crucially, during the user study the operators need to collect unseen long-horizon tasks. We provide more details about the environments and user study setup in Appendix A and C, respectively.
We compare PATO to (1) teleoperation without assistance, the current standard approach to collecting robot demonstration data, and (2) ThriftyDAgger (Hoque et al., 2021a), the closest prior work to ours for interactive human-robot data collection. ThriftyDAgger is designed to minimize human inputs during single-task demonstration collection by requesting human input only in critical states where a learned value function estimates low probability of reaching the goal. We initially implemented ThriftyDAgger with a flat policy as in the original work, but found it could not learn to model the multi-modal trajectories in our training dataset, leading to poor performance. Thus, we compare to an improved version of ThriftyDAgger that uses the same hierarchical policy as ours, which we found better suited to learn from the multi-modal data.
Our goal is to measure the user's mental workload during teleoperation as well as their ability to divert their attention to other tasks, which is crucial for scaling teleoperation by allowing a single user to teleoperate multiple robots (see Section 6.4.3). After each teleoperation session, we administer the NASA TLX survey (Hart, 1986; Hart and Staveland, 1988), which measures the user's perceived workload. We aggregate the responses to obtain a single score for each user. Participants also answer a survey adopted from previous work (Nikolaidis et al., 2017), which assesses their perception of the robot's intelligence, their satisfaction, and their trust in the system (Table E.2).

6.4.2 User Study Analysis

Table 6.1: Average number of completed side tasks and teleoperation time per demonstration during the real-robot teleoperation user study.
Approach         Avg. num. of completed side tasks    Avg. teleop time (seconds)
Unassisted       0.25 (± 0.66)                        109.5 (± 31.4)
ThriftyDAgger    13.06 (± 9.63)                       105.9 (± 29.5)
PATO (ours)      15.88 (± 7.11)                       85.0 (± 18.2)

Figure 6.5: User study setup. (left) A Kinova Jaco arm, front-view and in-hand cameras, and objects for kitchen-inspired tasks are placed on the workspace. (right) A human operator watches a monitor, which shows either the camera inputs or a side task. The operator uses a gamepad to control the robot and a keyboard to solve the side task.

Comparison of Assisted vs. Unassisted Methods. We compare the standard unassisted teleoperation baseline to the two methods with assistance in Table 6.1: unassisted teleoperation requires the user's full attention, so users are unable to solve any side tasks during teleoperation, while the assisted approaches allow users to solve additional side tasks without sacrificing teleoperation speed.
Comparison of PATO with ThriftyDAgger. Now we compare PATO with the assisted baseline ThriftyDAgger. First, we statistically evaluate the user responses to the surveys to elicit the participants' trust, satisfaction, and perception of the robot's intelligence, as well as their mental workload when teleoperating with the two approaches. During the study, participants agreed more strongly that they trusted the robot to perform the correct action at the correct time with PATO (Wilcoxon signed-rank test, p = 0.001). Further, they found the robot to be significantly more intelligent with the proposed method (repeated-measures ANOVA, F(1,15) = 5.14, p = 0.039, Cronbach's α = 0.95) and were significantly more satisfied with their collaboration with the robot (F(1,15) = 5.05, p = 0.040, α = 0.91). Finally, on the NASA TLX survey, participants showed a lower mental workload using PATO compared to the baseline (F(1,15) = 5.52, p = 0.033).

Figure 6.6: Visualization of PATO on a task from the real-robot user study: place red block in bowl; place bowl on plate; put green block in oven. PATO autonomously executes familiar behaviors, but asks for user input in frames (2) and (4) to determine where to place the bowl and the green block (white vs. yellow arrow). In these cases, the task uncertainty surpasses the threshold ω since the subgoal predictor produces subgoals for both possible targets. Further, PATO asks for user input in frame (3) since the required transition between placing the bowl and picking up the green block was not part of its training data. Thus, the policy uncertainty estimate surpasses its threshold γ.

A key factor in the subjective differences between the two approaches is their ability to elicit user feedback at appropriate times: when the robot is at a decision point between two possible task continuations (see Figure 6.6, frames (2) and (4)). ThriftyDAgger's risk-based objective is not sensitive to such decision points and thus it rarely asks for user feedback. Instead, it executes one of the possible subtasks at random. In our study we found that this led to erroneous skill executions in 48% of cases. Such errors require tedious correction by the user, deteriorating their trust in the system and their teleoperation efficiency. In contrast, PATO leverages its estimate of task uncertainty (see Figure 6.6) to correctly elicit user feedback in 82% of cases, leading to higher perceived levels of trust and reduced mental load.
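As a concrete illustration of the kind of paired, within-subject analysis reported above, the snippet below runs the two tests on hypothetical per-participant scores (the data here is synthetic placeholder data, not the study results). With only two conditions, a one-factor repeated-measures ANOVA is equivalent to a paired t-test with F = t^2; the actual study may have used dedicated ANOVA tooling.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder synthetic data: one aggregated score per participant (N = 16) and condition.
trust_pato = rng.normal(5.5, 1.0, size=16)      # e.g., aggregated Likert trust ratings
trust_thrifty = rng.normal(4.6, 1.2, size=16)
tlx_pato = rng.normal(40.0, 10.0, size=16)      # e.g., NASA TLX workload scores
tlx_thrifty = rng.normal(48.0, 11.0, size=16)

# Ordinal survey responses: Wilcoxon signed-rank test on paired samples.
w_stat, p_trust = stats.wilcoxon(trust_pato, trust_thrifty)

# Workload scores: paired t-test; with two within-subject conditions, F(1, N-1) = t**2.
t_stat, p_tlx = stats.ttest_rel(tlx_pato, tlx_thrifty)
print(f"Wilcoxon p = {p_trust:.3f}, repeated-measures F(1,15) = {t_stat**2:.2f}, p = {p_tlx:.3f}")
```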
6.4.3 Scaling Data Collection to Multiple Robots
In the previous section we showed that PATO allows users to divert their attention to other tasks during data collection. An important application of this is multi-robot teleoperation, in which a single operator performs data collection with multiple robots in parallel and periodically attends to different robots. To test this, we conduct a teleoperation study (N = 10) with multiple simulated robots in a realistic physics simulator (Lee et al., 2021a) (see Figure 6.7). A human user is asked to collect demonstrations for a block stacking task with multiple robots in parallel via the same gamepad interface used in the real-robot study. The user can switch control between different robots with a button press. We measure the total number of collected demonstrations across the robot fleet in a fixed time frame of T = 4 minutes.

Figure 6.7: We use up to four simulated robots to collect demonstrations. The assistive policy asks for human input using the red indicator. The green indicator represents which environment is currently being controlled by the user.

We compare the number of collected demonstrations with increasing numbers of simultaneously teleoperated robots in Figure 6.8a. Our approach enables strong scaling of data collection throughput. As expected, the scaling is not linearly proportional, i.e., four robots do not lead to four times more demonstrations collected. This is because simultaneous teleoperation of a larger fleet requires more context switches between the robots, reducing the effective teleoperation time. Yet, PATO enables effective parallel teleoperation, leading to higher demonstration throughput with larger robot fleets. In contrast, with standard unassisted teleoperation, a single operator can only control a single robot, such that the demonstration throughput would not increase even if a larger fleet of robots were available.
To verify that our approach collects high-quality demonstrations, we train a behavior cloning policy on the data from the multi-robot teleoperation experiment and compare to a BC policy trained on single-robot teleoperation data.

Figure 6.8: Average number of demonstrations collected in 4 minutes using multiple robots in simulation. (a) Scaling experiment: with PATO, users can manage multiple robots simultaneously and collect more demonstrations with four robots. (b) Ablation study: the ablated systems with four robots show inferior demonstration collection throughput.

Table 6.2: Average success rate of a behavior cloning policy trained on data from single-robot vs. multi-robot teleoperation.
No. of robots    No. of training trajectories    Success rate (%)
1-robot          11                              34.6 (± 8.3)
4-robot          28                              56.8 (± 13.0)

The results in Table 6.2 show that demonstrations collected with our method allow for effective behavior cloning even when teleoperating four robots in parallel. Crucially, the larger demonstration throughput with four robots results in a policy with higher performance using the same amount of human teleoperator effort.

6.4.4 Ablation Study
PATO has two key components: the hierarchical policy (Section 6.3.2) and the uncertainty-based requesting of user input (Section 6.3.3). To evaluate their importance, we perform ablation studies in the 4-robot teleoperation setup from Section 6.4.3.
We compare our full method against two ablations: (1) PATO w/o hierarchy, which trains an ensemble of flat stochastic policies π(a|s), and (2) PATO w/o uncertainty, which removes the uncertainty-based requesting of user input. "PATO w/o hierarchy" uses the ensemble disagreement and the action distribution variance to determine policy and task uncertainty, respectively. We report the number of collected demonstrations per method within T = 4 min, averaged across N = 10 users, in Figure 6.8b. Both ablations perform worse than PATO. Specifically, we find that the flat policy is neither able to accurately model the multi-modal training data nor to accurately estimate the task uncertainty. As a result, the assistive policy does not ask for user input in critical decision states and exhibits frequent control failures, leading to reduced teleoperation throughput. Similarly, the ablation without uncertainty estimation does not elicit user input in critical states and thus requires the user to correct it whenever it tries to complete an incorrect task. As a result, the demonstrations collected with the ablated methods are less optimal, requiring an average of 398 and 375 steps until task completion for "PATO w/o hierarchy" and "PATO w/o uncertainty", respectively, compared to 322 steps for our method.

6.5 Discussion
A large-scale robot demonstration dataset is key to enabling the next breakthrough in robot learning. As a step towards large-scale robot data, we propose an efficient and scalable system for robotic data collection, which automates part of human teleoperation using a learned assistive policy and actively asks for human input in critical states. This allows a human operator to handle multiple robots simultaneously and significantly improve data collection throughput. The user study shows that our assisted teleoperation system requires infrequent inputs from users and that users feel less mental burden when collecting robot demonstrations. We further show significantly improved data collection throughput of our system in the multi-robot control experiments.
For simplicity, we assume access to pre-collected data D_pre to train the assistive policy. However, our assistive policy can, in theory, be learned from scratch and continuously improved as more data is collected. We leave this for future work. Moreover, deploying multi-robot data collection in the real world is a promising direction for future work.

Part III
Learning Skills from Diverse Datasets

Chapter 7
Model-Based Skill Learning via Goal-Conditioned Hierarchical Predictors

7.1 Introduction

Figure 7.1: When planning towards faraway goals, we propose to condition the prediction of candidate trajectories on the goal, which significantly reduces the search space of possible trajectories (left vs. middle) and enables hierarchical planning approaches that break a long-horizon task into a series of short-horizon tasks by placing subgoals (right).

Intelligent agents aiming to solve long-horizon tasks reason about the future, make predictions, and plan accordingly.
Several recent approaches (Ebert et al., 2018; Zhang et al., 2019; Ha and Schmidhuber, 2018; Xie et al., 2019; Nagabandi et al., 2019b; Hafner et al., 2019a) employ powerful predictive models (Finn et al., 2016b; Buesing et al., 2018; Hafner et al., 2019b; Lee et al., 2018a) to enable agents to predict and plan in complex environments directly from visual sensory observations, without needing to engineer a state estimator. To plan a sequence of actions, these approaches usually use the predictive model to generate candidate roll-outs starting from the current state and then search for the sequence that best reaches the goal using a cost function (see Fig. 7.1, left). Such predictive models can even be trained from low-performance data and achieve high performance on downstream tasks through the planning process. However, such approaches do not scale to complex long-horizon tasks (Ebert et al., 2018).
Imagine the task of planning a route from your home to the airport. The above approaches would attempt to model all possible routes starting at home and then search for those that ended up at the airport. For long-horizon problems, the number of possible trajectories grows very large, making extensive search infeasible. In contrast, we propose a planning agent that only considers trajectories that start at home and end at the airport, i.e., that makes predictions with the goal in mind. This approach both reduces prediction complexity, as a simpler trajectory distribution needs to be modeled, and significantly reduces the search space for finding the best route, as depicted in Fig. 7.1 (center). Indeed, we can produce a feasible plan with a single forward pass of the generative model, and can further refine it to find the optimal plan through iterative optimization.
However, modeling this distribution becomes challenging for long time horizons even with goal-conditioned predictors. A naive method inspired by sequential predictive approaches would predict future trajectories at a fixed frequency, one step at a time — the equivalent of starting to plan the route to the airport by predicting the very first footsteps. This can lead to large accumulating errors. Moreover, the optimization problem of finding the best trajectory remains challenging. Sequential planning approaches are unable to focus on large, important decisions, as most samples are spent optimizing local variation in the trajectory. To alleviate both shortcomings, we propose to predict in a tree-structured way, starting with a coarse trajectory and recursively filling in finer and finer details. This is achieved by recursive application of a single module that is trained to answer: given two states, what is a state that occurs between them? This hierarchical prediction model is effective at long-term prediction and further enables us to design an efficient long-horizon planning approach by employing a coarse-to-fine trajectory optimization scheme.
Hierarchical prediction models naturally lend themselves to modeling the hierarchical structure present in many long-horizon tasks by breaking them into their constituent steps. However, such procedural steps do not all occur on a regularly spaced schedule or last for equal lengths of time. Therefore, we further propose a version of our model based on a novel probabilistic formulation of dynamic time warping (Sakoe, 1971) that allows the model to select which frames to generate at each level in the tree, enabling flexible placement of intermediate predictions.
In summary, the contributions of this chapter are as follows. First, we propose a framework for goal-conditioned prediction and planning that is able to scale to visual observations by using a latent state model. Second, we extend this framework to hierarchical prediction and planning, which improves both efficiency and performance through the coarse-to-fine strategy and effective parallelization. We further extend this method to modeling the temporal variation in subtask structure. Evaluated on a complex visual navigation task, our method scales better than alternative approaches, allowing effective control on tasks longer than possible with prior visual planning methods. 7.2 Related Work Video interpolation. We propose a latent variable goal-conditioned prediction model that is able to handle high-dimensional image observations. This resembles prior work on video-interpolation where given a start and a goal image, images are filled in between. So far such work has focused on short-term interpolation, often using models based on optical flow (Liu et al., 2017; Jiang et al., 2018; Niklaus et al., 2017a;b). Recent work has proposed neural network models that predict images directly, but this work still evaluates on short-horizon prediction (Wang et al., 2019). The models introduced in our work by contrast scale to video sequences of up to 500 time steps modelling complex distributions that exhibit multi-modality. Visual planning and control. Most existing visual planning methods (Finn and Levine, 2017; Paxton et al., 2019; Xie et al., 2019; Ebert et al., 2018; Hafner et al., 2019b) use model predictive control, computing plans forward in time by sampling state or action sequences. This quickly becomes computationally intractable for longer horizon problems, as the search complexity grows exponentially with the number of time steps (Ebert et al., 2018). Instead, we propose a method for goal-conditioned hierarchical planning, which is able to effectively scale to long horizons as it both 89 reduces the search space and performs more efficient hierarchical optimization. Ichter and Pavone (2018) also perform goal-conditioned planning by constraining the search space to trajectories that end at the goal, however, the method is only validated on low-dimensional states. In this paper, we leverage latent state-space goal-conditioned predictors that scale to visual inputs and further improve the planning by using a hierarchical divide-and-conquer scheme. Other types of goal-conditioned control include inverse models and goal-conditioned imitative models (Pathak et al., 2018; Yu et al., 2018b; Torabi et al., 2018; Smith et al., 2020). However, these methods rely on imitation learning and are limited to settings where high-quality demonstrations are available. In contrast, our goal-conditioned planning and control method is able to optimize the trajectory it executes, and does not require optimal training data. Hierarchical planning. While hierarchical planning has been extensively explored in symbolic AI (Sacerdoti, 1974; Knoblock, 1990; Kaelbling and Lozano-P´ erez, 2010), these approaches are unable to cope with raw (e.g., image-based) observations, limiting their ability to solve diverse real-world tasks. Instead, we propose an approach that learns to perform hierarchical planning directly in terms of sensory observations, purely from data. 
Since our method does not require human-designed specification of tasks and environment, it is applicable in general settings where trajectory data can be collected. Recently, a number of different hierarchical planning approaches have been introduced (Jayaraman et al., 2019; Pertsch et al., 2020b; Fang et al., 2019; Nair and Finn, 2020; Kim et al., 2019; Nasiriany et al., 2019) that only work well with one or two layers of abstraction due to their architectural design or computational bottlenecks. Some of the few hierarchical planning approaches that have been shown to work with more than two layers of abstraction use tree-structured models (Chen et al., 2020; Jurgenson et al., 2020; Parascandolo et al., 2020). However, these models have not been shown to scale to high-dimensional spaces such as images. While also using a tree-structured model similar to our method, Chen et al. (2020) assume that the map of the physical workspace is known. To the best of our knowledge, our proposed hierarchical planning algorithm is the first to use a variable number of abstraction layers while scaling to high-dimensional inputs such as images.

7.3 Goal-Conditioned Prediction
In this section, we formalize the goal-conditioned prediction problem and propose several models for goal-conditioned prediction, including both auto-regressive models and tree-structured models. To define the goal-conditioned prediction problem, consider a sequence of observations [o_1, o_2, ..., o_T] of length T. Standard forward prediction approaches (Fig. 7.2, left) observe the first k observations and synthesize the rest of the sequence. That is, they model p(o_{k+1}, o_{k+2}, ..., o_{T-1} | o_1, o_2, ..., o_k). Instead, we would like our goal-conditioned predictors to produce intermediate observations given the first and last elements in the sequence (Fig. 7.2, center and right). In other words, they must model p(o_2, o_3, ..., o_{T-1} | o_1, o_T). We propose several designs for goal-conditioned predictors that operate in learned compact state spaces for scalability and accuracy.

7.3.1 Goal-Conditioned Sequential Prediction
We first present a naive design for goal-conditioned prediction based on forward auto-regressive models. Autoregressive models operating directly on observations scale poorly in terms of computational efficiency and predictive performance (Denton and Fergus, 2018; Buesing et al., 2018; Hafner et al., 2019b). We design a latent state-space model (GCP-sequential, shown in Fig. 7.2, center) that predicts in a latent space represented by a random variable s_t and then decodes the observations with a decoder p(o_t|s_t). The latent state s_t is used to allow handling partially observable settings. The likelihood of this model factorizes as follows:

p(o_2, o_3, \dots, o_{T-1} | o_1, o_T) = \int p(o_2|s_2)\, p(s_2|o_1,o_T) \prod_{t=3}^{T-1} p(o_t|s_t)\, p(s_t|s_{t-1},o_1,o_T)\, ds_{2:T-1}. \quad (7.1)

We show in Sec. 7.3.4 that this model is simple to implement and can build directly on previously proposed auto-regressive sequence prediction models. However, its computational complexity scales with the sequence length, as every state must be produced in sequence. As we show empirically, this approach also struggles with modeling longer sequences due to compounding errors, and is prone to ignoring the goal information on these longer sequences, as very long-term dependencies have to be modeled when predicting the second observation from the first observation and the goal.

Figure 7.2: Graphical models for state-space sequence generation: forward prediction (left) and the proposed goal-conditioned predictors (GCPs). Shaded circles denote observations, white circles denote unobserved latent states. Center: a sequential goal-conditioned predictor with structure similar to forward prediction. Right: a hierarchical goal-conditioned predictor that recursively applies an infilling operator to generate the full sequence. All our models leverage stochastic latent states in order to handle complex high-dimensional observations.
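To make the sequential factorization in Eq. (7.1) concrete, here is a minimal PyTorch sketch of a GCP-sequential rollout. It is an illustration only: linear layers stand in for the convolutional encoder and decoder, the split into deterministic and stochastic state components (Sec. 7.3.3) is simplified, and all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GCPSequential(nn.Module):
    """Illustrative sketch of the sequential goal-conditioned predictor (Eq. 7.1)."""

    def __init__(self, obs_dim, latent_dim, hidden_dim):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)          # stand-in for the conv encoder E
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)    # parameters of p(z_t | previous state)
        self.rnn = nn.GRUCell(latent_dim + 2 * hidden_dim, hidden_dim)
        self.decode = nn.Linear(hidden_dim, obs_dim)          # stand-in for the conv decoder

    def rollout(self, o_1, o_T, T):
        e_1, e_T = self.encode(o_1), self.encode(o_T)
        h = torch.zeros_like(e_1)
        predictions = []
        for _ in range(T - 2):                                # generate o_2 ... o_{T-1} in sequence
            mu, log_std = self.prior(h).chunk(2, dim=-1)
            z = mu + log_std.exp() * torch.randn_like(mu)     # sample the stochastic latent
            h = self.rnn(torch.cat([z, e_1, e_T], dim=-1), h) # condition every step on start & goal
            predictions.append(self.decode(h))
        return torch.stack(predictions, dim=1)                # (B, T-2, obs_dim)
```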
7.3.2 Goal-Conditioned Prediction by Recursive Infilling
In order to scale goal-conditioned prediction to longer time horizons, we now design a tree-structured GCP model that is both more efficient and more effective than the naive sequential predictor. Suppose that we have an intermediate state prediction operator p(s_t|pa(t)) that produces an intermediate latent state s_t halfway in time between its two parent states pa(t). Then, consider the following alternative process for goal-conditioned prediction, depicted in Fig. 7.2 (right): at the beginning, the observed first and last observations are encoded into the latent state space as s_1 and s_T, and the prediction operator p(s_t|pa(t)) generates s_{T/2}. The same operator may now be applied to two new sets of parents, (s_1, s_{T/2}) and (s_{T/2}, s_T). As this process continues recursively, the intermediate prediction operator fills in more and more temporal detail until the full sequence is synthesized.
We call this model GCP-tree, since it has a tree-like shape (the generation process closely mimics a graph-theoretic tree, but every node has two parents instead of one) where each predicted state is dependent on its left and right parents, starting with the start and the goal. GCP-tree factorizes the goal-conditioned sequence generation problem as:

p(o_2, o_3, \dots, o_{T-1} | o_1, o_T) = \int p(s_1|o_1)\, p(s_T|o_T) \prod_{t=2}^{T-1} p(o_t|s_t)\, p(s_t|pa(t))\, ds_{1:T}. \quad (7.2)

Adaptive binding. We have thus far described the intermediate prediction operator as always generating the state that occurs halfway in time between its two parents. While this is a simple and effective scheme, it may not correspond to the natural hierarchical structure in the sequence. For example, in the navigation example from the introduction, we might prefer the first split to correspond to visiting the bank, which partitions the prediction problem into two largely independent halves. We then design a version of GCP-tree that allows the intermediate state predictor to select which of the several states between the parents to predict, each time it is applied. In other words, the predicted state might bind to one of many observations in the sequence. In this more versatile model, we represent the time steps of the tree nodes with a discrete latent variable w that selects which nodes bind to which observations: p(o_t|s_{1:N}, w_t) = p(o_t|s_{w_t}). We can then express the prediction problem as:

p(o_{2:T-1} | o_1, o_T) = \int p(s_1|o_1)\, p(s_N|o_T) \prod_n p(s_n|pa(n)) \prod_{t=2}^{T-1} p(o_t|s_{1:N}, w_t)\, p(w_t)\, ds_{1:N}\, dw_{2:T-1}.

Appendix F shows an efficient inference procedure for w based on a novel probabilistic version of dynamic time warping (Sakoe, 1971).
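The recursive generation process behind Eq. (7.2) can be written down compactly. The sketch below is a simplified illustration, assuming an `infill` operator that samples s_t from p(s_t | pa(t)) and a `decode` function for p(o_t | s_t); it generates frames at the fixed halfway points, i.e., without adaptive binding.

```python
def gcp_tree_rollout(s_start, s_goal, infill, decode, depth):
    """Sample a sequence by recursive infilling (sketch of the GCP-tree generation process).

    infill(left, right) -> one sample of the latent state halfway between its two parents
    decode(s)           -> observation decoded from a latent state
    depth               -> number of tree layers; yields 2**depth - 1 intermediate frames
    """
    def recurse(left, right, d):
        if d == 0:
            return []
        mid = infill(left, right)                                   # apply p(s_t | pa(t))
        return recurse(left, mid, d - 1) + [mid] + recurse(mid, right, d - 1)

    latent_sequence = recurse(s_start, s_goal, depth)
    return [decode(s) for s in latent_sequence]                     # o_2, ..., o_{T-1} in temporal order
```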
7.3.3 Latent Variable Models for GCP
We have so far described the latent state s_t as being a monolithic random variable. However, an appropriate design of s_t is crucial for good performance: a purely deterministic s_t might not be able to model the variation in the data, while a purely stochastic s_t might lead to optimization challenges. Following prior work (Denton and Fergus, 2018; Hafner et al., 2019b), we therefore divide s_t into h_t and z_t, i.e., s_t = (h_t, z_t), where h_t is the deterministic memory state of a recurrent neural network and z_t is a stochastic per-time-step latent variable. To optimize the resulting model, we leverage amortized variational inference (Kingma and Welling, 2014; Rezende et al., 2014) with an approximate posterior q(\tilde{z}|o_{1:T}), where \tilde{z} = z_{2:T-1}. The deterministic state h_t does not require inference since it can simply be computed from the observed data o_1, o_T. The training objective is the following evidence lower bound on the log-likelihood of the sequence:

\ln p(o_{2:T-1} | o_{1,T}) \ge \mathbb{E}_{q(\tilde{z})}\big[\ln p(o_{2:T-1} | o_{1,T}, \tilde{z})\big] - KL\big(q(\tilde{z})\,\|\, p(\tilde{z}|o_{1,T})\big). \quad (7.3)

7.3.4 Architectures for Goal-Conditioned Prediction

Figure 7.3: Architecture for the two-layer hierarchical goal-conditioned predictor (GCP). Skip connections to the first node's decoder are omitted for clarity.

We describe how GCP models can be instantiated with deep neural networks to predict sequences of high-dimensional observations o_{1:T}, such as videos. The prior p(z_t|pa(t)) is a diagonal Gaussian whose parameters are predicted with a multi-layer perceptron (MLP). The deterministic state predictor p(h_t|z_t, pa(t)) is implemented as an LSTM (Hochreiter and Schmidhuber, 1997). We condition the recurrent predictor on the start and goal observations encoded through a convolutional encoder e_t = E(o_t). The decoding distribution p(o_t|s_t) is predicted by a convolutional decoder with input features \hat{e}_t and skip-connections from the encoder (Villegas et al., 2017; Denton and Fergus, 2018). In line with recent work (Rybkin et al., 2020), we found that learning a calibrated decoder is important for good performance, and we use a discretized logistic mixture as the decoding distribution (Salimans et al., 2017). The parameters of the diagonal Gaussian posterior distribution for each node, q(z_t|o_t, pa(t)), are predicted from the corresponding observation and parent nodes with another MLP. For a more detailed description of the architectural parameters we refer to Appendix C.
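The sketch below shows what one node of such a hierarchical predictor could look like, pairing with the `infill` operator assumed above. It is a deliberately simplified stand-in: plain MLPs replace the convolutional encoder/decoder, the LSTM, and the mixture decoding distribution described in the text, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GCPNode(nn.Module):
    """Simplified sketch of the per-node prediction module of GCP-tree (Sec. 7.3.4)."""

    def __init__(self, feat_dim, latent_dim, hidden_dim=128):
        super().__init__()
        # Prior p(z_t | pa(t)): diagonal Gaussian parameters from the two parent features.
        self.prior = nn.Sequential(nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU(),
                                   nn.Linear(hidden_dim, 2 * latent_dim))
        # Deterministic state h_t from (z_t, pa(t)); the thesis uses an LSTM here.
        self.det_state = nn.Sequential(nn.Linear(2 * feat_dim + latent_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, feat_dim))

    def forward(self, parent_left, parent_right):
        parents = torch.cat([parent_left, parent_right], dim=-1)
        mu, log_std = self.prior(parents).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)          # sample z_t
        h = self.det_state(torch.cat([parents, z], dim=-1))    # deterministic part of s_t = (h_t, z_t)
        return h                                               # fed to the decoder p(o_t | s_t)
```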
7.4 Planning & Control with Goal-Conditioned Prediction
In the previous section, we described an approach to goal-conditioned sequence prediction, or GCP. The GCP model can be directly applied to control problems since, given a goal, it can produce realistic trajectories for reaching that goal. However, in many cases our objective is to reach the goal in a specific way. For instance, we might want to spend the least amount of time or energy required to reach the goal. In those cases, explicit planning is required to obtain a trajectory from the model that optimizes a user-provided cost function C(o_t, ..., o_{t'}). In GCPs, planning is performed over the latent variables z that determine which trajectory between start and goal is predicted: min_z C(g(o_t, o_T, z)), where g is the GCP model. We propose to use the cross-entropy method (CEM; Rubinstein and Kroese, 2004) for optimization, which has proven effective in prior work on visual MPC (Ebert et al., 2018; Nagabandi et al., 2019b; Nasiriany et al., 2019; Pertsch et al., 2020b). Once a trajectory is planned, we infer the actions necessary to execute it using a learned inverse model (see Appendix, Section E).

Algorithm 3: Goal-Conditioned Hierarchical Planning
1: Inputs: hierarchical goal-conditioned predictor g, current & goal observations o_t, o_T, cost function Ĉ
2: Initialize plan: P = [o_t, o_T]
3: for d = 1 ... D do                          ▷ iterate depth of hierarchy
4:     for n = 0 ... |P| − 1 do
5:         z ∼ N(0, I)                         ▷ sample M subgoal latents
6:         o_sg = g(P[n], P[n+1], z)           ▷ predict subgoals
7:         o_{d,n} = argmin_{o ∈ o_sg} Ĉ(P[n], o) + Ĉ(o, P[n+1])
8:         INSERT(P, o_{d,n})                  ▷ insert best subgoal in plan
9:     end for
10: end for
11: return P

Goal-conditioned hierarchical planning. Instead of optimizing the full trajectory at once, the hierarchical structure of the GCP-tree model allows us to design a more efficient, hierarchical planning scheme in which the trajectories between start and goal are optimized in a coarse-to-fine manner. The procedure is detailed in Algorithm 3. We initialize the plan to consist of only the start and goal observation. Then our approach recursively adds new subgoals to the plan, leading to a more and more detailed trajectory. Concretely, we proceed by optimizing the latent variables of the GCP-tree model g(o_t, o_T, z) layer by layer: in every step we sample M candidate latents per subgoal in the current layer and pick the corresponding subgoal that minimizes the total cost with respect to both its parents. The best subgoal gets inserted into the plan between its parents and the procedure recurses.
Cost function. Evaluating the true cost function C(o_t, ..., o_{t'}) would require unrolling the full prediction tree. For more efficient, hierarchical planning, we instead want to evaluate the expected cost Ĉ(o_t, o_{t'}) = E_{(o_t, ..., o_{t'}) ∼ D} [C(o_t, ..., o_{t'})] of a trajectory between two observations under the training data distribution D. This allows us to estimate the cost of a trajectory passing through a given subgoal as the sum of the pairwise cost estimates to both its parents, without predicting all its children. We train a neural network estimator for the expected cost via supervised learning by randomly sampling two observations from a training trajectory and evaluating the true cost on the connecting trajectory segment C(o_t, ..., o_{t'}) to obtain the target value.
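For concreteness, the following Python sketch mirrors Algorithm 3. It is illustrative rather than the actual planner: `gcp_tree` and `cost_fn` are assumed callables (the predictor and the learned pairwise cost estimator), and per-layer CEM refinement is replaced by simple random sampling of the M candidates.

```python
import numpy as np

def hierarchical_plan(o_start, o_goal, gcp_tree, cost_fn, depth, num_samples, latent_dim):
    """Coarse-to-fine subgoal planning, a sketch of Algorithm 3.

    gcp_tree(o_a, o_b, z) -> predicted subgoal observation between o_a and o_b for latent z
    cost_fn(o_a, o_b)     -> learned estimate of the expected cost between two observations
    """
    plan = [o_start, o_goal]
    for _ in range(depth):                                  # iterate depth of the hierarchy
        new_plan = [plan[0]]
        for left, right in zip(plan[:-1], plan[1:]):
            # Sample M candidate subgoal latents and keep the lowest-cost subgoal.
            candidates = [gcp_tree(left, right, z) for z in np.random.randn(num_samples, latent_dim)]
            best = min(candidates, key=lambda sg: cost_fn(left, sg) + cost_fn(sg, right))
            new_plan += [best, right]                       # insert the best subgoal between its parents
        plan = new_plan
    return plan
```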
7.5 Experimental Evaluation
The aim of our experiments is to study the following questions: (1) Are the proposed GCPs able to effectively predict goal-directed trajectories in the image space and scale to long time horizons? (2) Is the proposed goal-conditioned hierarchical planning method able to solve long-horizon visual control tasks? (3) Does the version of GCP with adaptive binding find high-level events in the trajectories?

7.5.1 Goal-Conditioned Video Prediction
Most commonly used video datasets in the literature depict relatively short motions, making them poorly suited for studying long-horizon prediction capability. We therefore evaluate on one standard dataset and two synthetic datasets that we designed specifically for evaluating long-horizon prediction. The pick&place dataset contains videos of a simulated Sawyer robot arm placing objects into a bin. Training trajectories contain up to 80 frames at 64×64 px and are collected using a simple rule-based policy. The Navigation data consists of videos of an agent navigating a simulated environment with multiple rooms: we evaluate versions with 9-room and 25-room layouts, both of which use 32×32 px agent-centric top-down image observations, with up to 100 and 200 frame sequences, respectively. We collect example trajectories that reach goals in a randomized, suboptimal manner, providing a very diverse set of trajectories (details are in App. D). We further evaluate on the real-world Human 3.6M video dataset (Ionescu et al., 2013), predicting 64×64 px frames at the full frequency of 50 Hz up to 10 seconds in the future to show the scalability of our method. This is in contrast to prior work, which evaluated on subsampled sequences shorter than 100 frames (see Denton and Birodkar, 2017; Denton and Fergus, 2018; Wichers et al., 2018). Architecture and hyperparameters are detailed in Appendix C.

Table 7.1: Long-term prediction performance of the goal-conditioned predictors compared to prior work on video interpolation. Additional evaluation on FVD / LPIPS (Unterthiner et al., 2018; Zhang et al., 2018a) in Appendix, Table F.1.
                                Pick&Place        Human 3.6M        9 Rooms Nav       25 Rooms Nav
Method                          PSNR    SSIM      PSNR    SSIM      PSNR    SSIM      PSNR    SSIM
GCP-Tree                        34.34   0.965     28.34   0.928     13.83   0.288     12.88   0.279
GCP-Sequential                  34.45   0.965     27.57   0.924     12.91   0.213     11.61   0.209
DVF (Liu et al., 2017)          26.15   0.858     26.74   0.922     11.678  0.22      11.34   0.172
CIGAN (Kurutach et al., 2018)   21.16   0.613     16.89   0.453     11.96   0.222     9.91    0.150

In Tab. 7.1, we compare the GCP models to a state-of-the-art deep video interpolation method, DVF (Liu et al., 2017) (while DVF has an official trained model, we re-train DVF on each dataset for better performance), as well as a method for goal-conditioned generation of visual plans by interpolation in a learned latent space, CIGAN (Kurutach et al., 2018). Following the standard procedure for evaluation of stochastic prediction models, we report top-of-100 peak signal-to-noise ratio (PSNR) and structural similarity metric (SSIM). We observe that the interpolation methods fail to learn meaningful long-term dynamics, and instead blend between start and goal image or predict physically implausible changes in the scene. In contrast, GCP-sequential and GCP-tree, equipped with powerful latent variable models, learn to predict rich scene dynamics between distant start and goal frames (see qualitative results in Fig. 7.4 and for all methods on the project website).
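The top-of-100 evaluation mentioned above can be summarized in a short snippet. This is a sketch of the general protocol, not the exact evaluation code: the per-frame averaging and the pixel range are assumptions, and SSIM would be computed analogously (e.g., via scikit-image).

```python
import numpy as np

def best_of_n_psnr(samples, ground_truth, max_val=1.0):
    """Top-of-N PSNR for a stochastic predictor (illustrative sketch).

    samples:      (N, T, H, W, C) predicted sequences for one start/goal pair (N = 100)
    ground_truth: (T, H, W, C) true intermediate frames
    """
    best = -np.inf
    for sample in samples:
        mse = np.mean((sample - ground_truth) ** 2, axis=(1, 2, 3))     # per-frame MSE
        psnr = 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-12))   # per-frame PSNR
        best = max(best, psnr.mean())                                   # keep the best sample
    return best
```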
Table 7.2: GCP runtime on 16×16 px H3.6M sequences in sec/training batch.
Sequence length    100     500     1000
GCP-Seq            1.49    8.44    17.6
GCP-Tree           0.55    1.66    2.77
Speed-up           ×2.7    ×5.1    ×6.4

On the longer Human 3.6M and 25-room datasets, the GCP-tree model significantly outperforms the GCP-sequential model. Qualitatively, we observe that the sequential model struggles to take the goal information into account on the longer sequences, as this requires modeling long-term dependencies, while the hierarchical model is able to naturally incorporate the goal information in the recursive infilling process. Additionally, the hierarchical structure of GCP-tree enables substantially faster runtimes (see Table 7.2).

Table 7.3: Ablation of prediction performance on pick&place.
Method             PSNR     SSIM
Tree               34.34    0.965
Tree w/o skips     32.64    0.955
Tree w/o LSTM      31.44    0.947

We present an ablation study for GCP-tree in Tab. 7.3, showing that both the skip connections and the recurrence in the predictive module contribute to good performance.

7.5.2 Visual Goal-Conditioned Planning and Control

Table 7.4: Visual control performance on navigation tasks.
                              9-Room Nav              25-Room Nav
Method                        Success    Cost         Success    Cost
GC BC (Nair et al., 2017)     45%        139.75       7%         402.48
VF (Ebert et al., 2018)       84%        128.00       26%        362.82
Ours                          93%        34.34        82%        158.06
GCP-Flat                      94%        36.00        79%        181.02
GCP-Sequential                91%        50.02        14%        391.99

Next, we evaluate our hierarchical goal-conditioned planning approach (see Section 7.4) on long-horizon visual control tasks. We test our method on a challenging image-based navigation task in the 9-room and 25-room environments described in Section 7.5.1. We use the data from Section 7.5.1 to train the predictive model. We note that our method does not require optimal demonstrations, but only data that is sufficiently diverse. Such a dataset might be collected, e.g., via crowd-sourced teleoperation (Zhu et al., 2020) or with a suitable exploration policy (Eysenbach et al., 2019; Sekar et al., 2020). For evaluation with purely random data, see supp. Section G. Given the current image observation, the agent is tasked to reach the goal, defined by a goal image, on the shortest possible path. We average performance over 100 task instances for evaluation. These tasks involve crossing up to three and up to ten rooms, respectively, requiring planning over horizons of several hundred time steps, much longer than in previous visual planning methods (Ebert et al., 2017; 2018).

Figure 7.4: Samples from GCP-tree on the 25-room data. Left: hierarchical prediction process. At each layer, the infilling operator is applied between every two frames, producing a sequence with a finer and finer temporal resolution. Three layers out of eight are shown. Right: visualization of the trajectory on the map together with a plan execution (see Section 7.5.2). Bottom: two image sequences sampled given the same start and goal (subsampled to 20 frames for visualization). Our model leverages stochastic latent states that enable modeling multimodal trajectories. See additional video results on the supplementary website: orybkin.github.io/video-gcp.

We compare hierarchical planning with GCP to visual foresight (VF, Ebert et al. (2018)), which optimizes rollouts from an action-conditioned forward prediction model via CEM (Rubinstein and Kroese, 2004). We adopt the improvements to the sampling and CEM procedure introduced in Nagabandi et al. (2019b). We also compare to goal-conditioned behavioral cloning (GC BC; Nair et al., 2017) as a "planning-free" approach for learning goal-reaching from example goal-reaching behavior. In Table 7.4, we report the average success rate of reaching the goal room, as well as the average cost, which corresponds to the trajectory length (since reporting length for failed cases would skew the results towards methods that produce short, unsuccessful trajectories, we report a constant large length for failed trajectories). VF performs well on the easy task set, which requires planning horizons similar to prior work on VF, but struggles on the longer tasks as the search space becomes large. The BC method is not able to model the complexity of the training data and fails to solve these environments. In contrast, our approach performs well even on the long-horizon task set.
We compare different planning approaches in Fig. 7.5. We find that samples from the forward prediction model in VF have a low probability of reaching long-horizon goals. Using GCPs with a non-hierarchical planning scheme similar to Ebert et al. (2018) and Nagabandi et al. (2019b) (GCP-Flat) requires optimization over a large set of possible trajectories between start and goal and can struggle to find a plan with low cost. In contrast, our hierarchical planning approach finds plans with low cost by breaking the long-horizon task into shorter subtasks through multiple recursions of subgoal planning. Using GCP-sequential instead of GCP-tree for sampling performs well on short tasks, but struggles to scale to longer tasks (see Table 7.4), highlighting the importance of the hierarchical prediction model.

Figure 7.5: Comparison between planning methods. Trajectories (red) sampled while planning from start (blue) to goal (green). All methods predict image trajectories, which are shown as 2D states for visualization. Left: visual MPC (Ebert et al., 2018) with a forward predictor; middle: non-hierarchical planning with a goal-conditioned predictor (GCP); right: hierarchical planning with GCP (ours) recursively optimizes subgoals (yellow/red) in a coarse-to-fine manner and finally plans short trajectories between the subgoals. Goal-conditioning ensures that trajectories reach the long-horizon goal, while hierarchical planning decomposes the task into shorter segments which are easier to optimize.

7.5.3 Temporal Abstraction Discovery
We qualitatively evaluate the ability of GCP-tree with adaptive binding (see Section 7.3.2) to learn the temporal structure in the robotic pick&place dataset. We increase the reconstruction loss of the nodes in the first two layers of the tree by a factor of 50, forcing these nodes to bind to the frames for which the prediction is the most confident, the bottlenecks (see experimental details in Appendix F).

Figure 7.6: Temporal abstraction discovery on pick&place. Discovered tree structure with adaptive binding: nodes from the first two layers (yellow/red) bind to semantically consistent bottlenecks across sequences, e.g., frames in which the robot is about to drop the object into the bin.

In Fig. 7.6, we see that this structural prior causes the model to bind the top nodes to frames that represent semantic bottlenecks, e.g., when the robot is about to drop the object in the bin. We found that all three top-layer nodes specialize on binding to distinctive bottlenecks, leading to diverse predicted tree structures. We did not observe that adaptive binding improves the quality of predictions on our datasets, though the ability to discover meaningful bottlenecks may itself be useful (Barto and Mahadevan, 2003; Kipf et al., 2019; Goyal et al., 2019).

7.6 Discussion
We presented two models for goal-conditioned prediction: a standard sequential architecture and a hierarchical tree-structured variant, where the latter either splits the sequence into equal parts at each level of the tree or into variable-length chunks via an adaptive binding mechanism. We further proposed an efficient hierarchical planning approach based on the tree-structured model.
All variants of our method outperform prior video interpolation methods, and the hierarchical variants substantially outperform the sequential model and prior visual MPC approaches on a long-horizon image-based navigation task. Additionally, the adaptive binding model can discover bottleneck subgoals. 102 Chapter 8 Language-Conditioned Offline Reinforcement Learning with Instruction Relabeling 8.1 Introduction When humans learn a new task, e.g., how to cook a new dish, we rely on a large repertoire of previously learned skills, like “chopping vegetables” or “boiling pasta”, that make learning more efficient. Similarly, much work in reinforcement learning (RL) aims to equip agents with a set of useful skills (Sutton et al., 1999b; Eysenbach et al., 2019; Sharma et al., 2020b; Pertsch et al., 2020a) for improving learning efficiency. Recent works have employed language instructions as a way for humans to manually define a wide range of such skills for policy training, typically via hindsight annotation of large, pre-collected agent experience datasets (Mees et al., 2022; Lynch and Sermanet, 2021; Lynch et al., 2022; Brohan et al., 2022). While the resulting policies show impressive capabilities, generalization to new tasks requires a large set of pre-training tasks, prompting prior works to spend substantial human effort for annotating agent trajectory datasets with hundreds of thousands of human instruction labels (Lynch et al., 2022). In this work we ask: how can we reduce human effort when collecting a diverse set of instruction labels for pre-training? Our core idea is to use extensive automated relabeling and recombination of language instruc- tions to grow an initial set of instruction labels without additional human effort. Our work is inspired by prior works in RL that use random goal state relabeling to automatically define a wide range 103 of training tasks (Andrychowicz et al., 2017; Chebotar et al., 2021a). However, these randomly generated tasks fail to capture semantic meaning. We extend this task relabeling idea to the space of semantically meaningful tasks by generating new language instruction labels. Pre-Training Data LLM Relabeling 1. Get mug from shelf 2. Put mug in coffee machine 3. Press brew button GPT Brew Coffee SPRINT Policy Pre-Training Target Task Finetuning + Open Microwave Remove Lid Pick up Mug Automatic Pre-Training Task Generation Agent Trajectories Language Instructions Combined Pre-Training Data Cross-Trajectory Chaining τ A τ B “Clean Mug” “Place in Coffee Machine” “Clean Mug and place in coffee machine” + Figure 8.1: We propose SPRINT, a scalable approach for policy pre-training via offline instruction-conditioned RL. Given a dataset of agent experience with a limited set of costly human-annotated language instructions, SPRINT drastically expands the pre-training task set without additional human effort via language- model-based instruction relabeling and cross- trajectory skill chaining. SPRINT policies en- able efficient finetuning on unseen target tasks. We introduce SPRINT (Scalable Pre- training via Relabeling Language INsTructions), a scalable pre-training approach that equips poli- cies with a set of semantically meaningful skills (see Figure 8.1 for an illustration). SPRINT uses two core ideas to automatically grow the set of pre-training skills: (1) skill aggregation via pre-trained large language models (LLMs) and (2) cross-trajectory skill chaining via language- conditioned offline RL. 
We assume access to an offline dataset of state-action trajectories with a base set of language instruction labels for the performed skills, such as “place mug in cof- fee machine” or “press brew button”, collected from human annotators, e.g., via Mechanical Turk (Mees et al., 2022; Lynch and Sermanet, 2021; Lynch et al., 2022). Since human annota- tion is expensive, SPRINT’s core contribution is to automatically grow the instruction set with- out additional human input. We first show how to use a pre-trained large language model to re- label the language instructions, thereby creating a large number of new tasks. For example, the tasks “place mug in coffee machine” and “press brew button” can be combined into a new task: “make coffee.” Secondly, we introduce an offline RL approach for chaining behaviors across multiple 104 trajectories from the training data; starting with a skill like “pick up bread” from one trajectory and ending with “place bread on table” from another. This allows the policy to learn semantic skills that have never been seen in the training data. SPRINT trains a policy on the combined set of task instructions, thereby equipping the agent with a policy that can execute a wide range of semantically meaningful skills. Our experiments demonstrate that this allows for substantially more sample-efficient learning and better zero-shot execution of novel long-horizon downstream tasks than prior pre-training approaches. Our contributions are threefold: (1) we propose SPRINT, which leverages natural language instructions for scalable policy pre-training via instruction-conditioned offline RL, (2) we auto- matically expand the set of pre-training tasks via LLM-based skill relabeling and cross-trajectory chaining, (3) we introduce ALFRED-RL, an RL benchmark for the popular ALFRED household task simulator (Shridhar et al., 2020), to test agents on a rich set of long-horizon, semantically meaningful tasks. We demonstrate that SPRINT learns such tasks more efficiently than prior pre-training approaches. 8.2 Related Work Language in RL. There is a long-standing interest in leveraging natural language during behavior learning, e.g., to structure agent’s internal representations (Andreas et al., 2017b) or guide long-horizon task learning via recipe-like plans (Branavan et al., 2009; Andreas et al., 2017a). The recent progress in training large, general-purpose language models has enabled approaches that directly generate such plans solely from a high-level task description either to directly execute in the environment (Huang et al., 2022a; Ahn et al., 2022; Huang et al., 2022b) or to train policies in a semi-supervised setting (Sharma et al., 2022). Others have shown how language combined with vision inputs can be used to learn state representations (Nair et al., 2022b) or reward functions (Fan et al., 2022). In contrast, we use language as a tool to directly pre-train policies. Prior work on such language-conditioned policy learning has shown that language can be used to define a wide range of tasks for pre-training and that the resulting policies have impressive generalization 105 capabilities (Lynch and Sermanet, 2021; Brohan et al., 2022; Lynch et al., 2022). Yet, they require datasets with hundreds of thousands of human instruction annotations, which are expensive to collect. In contrast, our work introduces an approach for automatic generation of rich training instructions, thereby substantially reducing the amount of human labeling required for successful pre-training. 
Similar to ours, the work of Colas et al. (2020) explored using automated language generation to define tasks for policy training, but they required a hand-defined grammar for task instruction generation and focused on improving exploration in online RL. In contrast, we leverage large language models for flexible task generation and use them for scalable, offline, policy pre-training. Pre-training Policies for RL. Developing policy pre-training approaches for faster downstream learning has been investigated for many years (Ijspeert et al., 2002; Theodorou et al., 2010; Hester et al., 2018). Recent advances in offline reinforcement learning (Levine et al., 2020) enabled approaches that can pre-train agents offline and effectively finetune them on online tasks (Peng et al., 2019; Singh et al., 2020; Nair et al., 2020; Kostrikov et al., 2022). However, these approaches require target-task reward annotations on the pre-training data and the resulting policies are only pre-trained to solve the target task. In contrast, meta-RL approaches pre-train on a range of tasks and thus allow fast adaptation to unseen downstream tasks (Duan et al., 2016a; Finn et al., 2017b; Rakelly et al., 2019; Nam et al., 2022), yet require tedious manual definition of pre-training tasks by experts, making them less scalable. To avoid manual task design, other works have explored unsupervised pre-training approaches based on behavior diversification (Achiam et al., 2018; Eysenbach et al., 2019; Sharma et al., 2020a) or extraction of behavior priors from offline agent experience (Pertsch et al., 2020a; Ajay et al., 2021; Singh et al., 2021). Closest to ours is the approach of Chebotar et al. (2021a), which performs unsupervised pre-training by learning to reach randomly sampled states from offline agent experience. Yet, unsupervised pre-training approaches learn skills without semantic meaning which, as we demonstrate in Section 8.4, lead to worse transfer to downstream tasks. In contrast, we introduce a scalable pre-training approach based on natural language instructions amplified with automatic instruction generation that equips agents with semantically meaningful skills and allows for efficient transfer to unseen tasks. 106 τ 1 τ 2 D τ 1 D L “Place mug in coffee machine” “Press start button” τ 2 “Place mug in shelf” “Clean mug in sink” Hindsight Language Annotation 1 z A “Place mug in coffee machine” “Press start button” z B LLM Relabeling ̂ z “Make coffee” Language Model Skill Aggregation 2 τ 1 “Place mug in coffee machine” “Press start button” τ 2 “Place mug in shelf” “Clean mug in sink” ̂ z 1 “Clean mug in sink. Place mug in coffee machine.” Cross-Trajectory Skill Chaining 3 ̂ z 2 Figure 8.2: SPRINT overview. We assume access to a dataset of agent experience with language instructions for the performed skills (1). Collecting such instructions with human hindsight annota- tion is a flexible yet costly approach for defining pre-training tasks. Thus, SPRINT introduces two approaches for automatically growing the set of pre-training tasks without additional human effort: (2) by aggregating language instructions with an LLM and adding the relabeled trajectories back into the pre-training dataset (Section 8.3.2), (3) by performing cross-trajectory chaining of skills to enable pre-training of skills that are unseen in the offline agent experience (Section 8.3.3). 
8.3 SPRINT: Scalable Policy Pre-Training with Language Instructions In this paper, we propose SPRINT (Scalable Pre-training via Relabeling Language INsTructions), an approach for pre-training a policy to equip it with a wide range of semantic skills in order to enable efficient finetuning on unseen tasks. SPRINT performs instruction-conditioned offline RL on a set of human-provided instructions and has two core technical components for automatically expanding this instruction set, which we will describe in this section: (1) language-model-based skill aggregation and (2) cross-trajectory skill chaining. But first, we will detail the data we pre-train on and give an intuitive overview of SPRINT. Pre-training Data. Following prior work on agent pre-training, we assume access to a large offline dataset D of agent experience (Gupta et al., 2019; Lynch et al., 2020; Pertsch et al., 2020a; Chebotar et al., 2021a; Ebert et al., 2022; Pertsch et al., 2021). Such data can be collected at scale, e.g., from prior RL runs, via teleoperation, through autonomous agent exploration, or any combination thereof. We further assume that the data contains natural language descriptions of the skills performed by the agent, e.g., “put a mug in the coffee machine” or “push the brew 107 button.” Such descriptions can be collected by annotating sequences from the offline dataset D in hindsight, e.g., via crowd-sourcing through platforms like Amazon Mechanical Turk (Lynch and Sermanet, 2021; Shridhar et al., 2020). Given a randomly sampled sequenceτ of states and actions from the datasetD, annotators can label sub-trajectories τ 1 =[s 0 ,a 0 ,s 1 ,...],τ 2 = ... with free- form language descriptions z 1 ,z 2 ,... of the tasks performed in the respective sub-trajectories, for example: τ ¯ z =[τ z 1 ,τ z 2 ]=[[(s z 1 0 ,a z 1 0 ,z 1 ),...,(s z 1 T ,a z 1 T ,z 1 )] | {z } “Put a mug in the coffee machine” [(s z 2 0 ,a z 2 0 ,z 2 ),...] | {z } “Push the brew button” ]. This results in a dataset D L ={τ ¯ z 1 ,τ ¯ z 2 ,...} of diverse trajectories with language descriptions (see Figure 8.2, left). Yet, collecting such human language annotations at scale is costly. Thus, we next introduce an approach for automatically growing an initial set of language instructions without additional human effort, a crucial step for effective scaling of semantic policy pre-training. Approach Overview. Our approach leverages skill descriptions inD L as instructions for the policy during pre-training: each description defines a task and the policy is rewarded for successfully executing the instruction. Intuitively, the richer the set of language instructions during pre-training, the more semantic skills the policy will learn and the more downstream tasks it can finetune on efficiently. Thus, SPRINT introduces two approaches for increasing the scale and diversity of the pre-training instructions without additional human inputs. Firstly, SPRINT leverages pre-trained language models to aggregate the human labels into new instructions (Figure 8.2, middle). Secondly, SPRINT introduces an approach for cross-trajectory skill-chaining that combines behaviors across multiple training trajectories to generate completely unseen instruction chains (Figure 8.2, right). Both approaches expand the set of pre-training tasks, leading to more effective pre-training. Once the policy is trained, we can transfer it to a new task for RL finetuning. 
8.3.1 Instruction-Conditioned Offline RL To pre-train our policyπ with the natural language instruction datasetD L , we take inspiration from goal-conditioned RL (Kaelbling, 1993; Schaul et al., 2015; Chebotar et al., 2021a): instead of conditioning the policy on goal states and rewarding it for reaching these states, we condition our 108 policy π(a|s,z) on language instructions z fromD L and provide a sparse reward R(s,a,z) to the agent for reaching the end-state s T of the respective sub-trajectory. Formally, we define the reward as: R(s,a,z)= 1, for s= s z T 0, otherwise. (8.1) We train our policyπ(a|s,z) to maximize this reward with offline RL (Levine et al., 2020) using an instruction-conditioned critic Q(s,a,z). Specifically, we use Implicit Q-Learning (Kostrikov et al., 2022) since we found it to be easy to tune. 8.3.2 Language-Model-Based Instruction Aggregation LLM Prompt Example Summarize the following steps. 1: Pick up the tomato slice. 2: Heat it up in the microwave. Summary: Microwave a tomato slice. 1: [SKILL 1] 2: [SKILL 2] ... Summary: Figure 8.3: A shortened example of the LLM prompt. See the full prompt in appendix, Sec- tion A. Large language models (LLMs), trained on massive corpora of internet text data, have been shown to be effective at performing a variety of tasks – from question answering to program syn- thesis – when prompted with relevant text (De- vlin et al., 2019; Brown et al., 2020; Wang and Komatsuzaki, 2021; Rae et al., 2021; Hoffmann et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022). Here we use LLMs to aggregate, i.e., paraphrase, the existing language annota- tions inD L (see Figure 8.2, middle, for an illus- tration). Given a trajectory that contains multi- ple sub-trajectories, we can aggregate adjacent sub-trajectories into a longer trajectory and relabel its natural language annotation with a sum- mary of the individual instructions generated by the LLM, thereby generating a new higher-level 109 pre-training task that encompasses instructions from multiple sub-trajectories. 1 We use a simple summarization prompt to instruct the language model (see Figure 8.3 for an example, Section A for the full prompt). Specifically, we aggregate with OPT-13B (Zhang et al., 2022), an open-source 13 billion parameter LLM (see Sections D.1 and D.2 for qualitative examples and comparisons to other LLMs). Like in Section 8.3.1, the reward for this new aggregated sub-trajectory is 1 at the last transition and 0 otherwise. For example, we prompt the LLM to summarize the two skills (z 1 : “Put a mug in the coffee machine,” z 2 : “Push the brew button”), resulting in a new semantic annotation ˆ z 1:2 describing both skills (e.g., “Make coffee”). We then add the new trajectory back to our dataset D L . Using this technique, we generate new language annotations for all N 2 tuples of consecutive sub-trajectories in our dataset. In practice, this allows us to increase the number of pre-training task instructions by 2.5x without additional human effort. 8.3.3 Cross-Trajectory Chaining τ A τ B s T A s 0 B s T B s 0 A Q(s,a,z B ) 1 “Clean Mug” “Place in Shelf” “Pick up Mug” “Place in Coffee Machine” “Clean Mug and place in coffee machine” Figure 8.4: SPRINT chains skills from two trajectories τ 1 and τ 2 into a new composite trajectory. The two trajecto- ries do not get concatenated, but added to the replay buffer as two separate tra- jectories with updated language instruc- tion and the appropriate reward values on their terminal states. 
Agents trained with offline RL can combine behav- iors from multiple trajectories via value propagation, i.e., perform “stitching” (Levine et al., 2020). For example, if trajectory (A) shows cleaning the mug in the sink and then placing it on the shelf, while trajectory (B) starts with plac- ing the mug in the coffee machine, offline RL algorithms are able to learn to clean the mug in the sink and then place it in the coffee machine (see Figure 8.2, right). In our case of instruction-conditioned offline RL, enabling such stitching behavior requires special care. Due to the different language instruction conditionings for the critic Q(s,a,z A ) and Q(s,a,z B ), values do not naturally propagate from trajectory (B) back to trajectory 1 Other relabeling operations, such as splitting an instruction into lower-level instructions, can also be performed by the LLM. However, such operations require grounding the LLM to the agent’s observations to determine sub-trajectory split points. We leave investigating this to future work. 110 (A). Instead, we must actively add “chaining examples” to our training dataset (Chebotar et al., 2021a). For this, we randomly sample a sub-trajectoryτ z A =[s 0:t ,a 0:t ,z A ] from the training dataset D L , e.g., part of trajectory (A), and replace its language instruction with the instruction z B from another trajectoryτ z B , e.g., “place mug in coffee machine.” Note that we do not need to sample the full trajectory (A). Instead, s t can be any state from trajectory (A) from which we try to execute skill (B). Crucially, we cannot use the same reward function as before: since the last state s t of the sampled sub-trajectory, e.g., a mug in the sink, does not solve the instruction z B , “put mug in coffee machine,” we cannot set its reward to 1. Which reward should we use instead? Let’s recap that when using temporal-difference (TD) learning (Sutton and Barto, 2018), Q functions for the sparse reward definition from Equation 8.1 intuitively represent a value that is proportional to the probability of reaching the goal at time T (Eysenbach et al., 2022; Chebotar et al., 2021a): Q π (s t ,a t ,z)=E[ ∑ t ′ =t γ t ′ R(s t ′,a t ′,z)] =E γ T− t 1[s T = g z ] ∝ P π (s T = g z |s t ,a t ). (8.2) Here,γ∈(0,1) denotes the discount factor. Similarly, we want Q(s,a,z B ) to represent the probability of reaching the goal g z B , the mug being in the coffee machine, from state s. Thus, we need to set the reward of the last state s t in the sampled sub-trajectory to the probability of reaching the goal g z B from s t , i.e., Q(s t ,a t ,z B ): R(s,a,z B )= Q(s,a,z B ), for s= s t 0, otherwise. (8.3) Finally, we can apply the skill aggregation approach from Section 8.3.2 in the cross-trajectory case to generate even more training instructions. Instead of sampling two sub-trajectories from the same training trajectory as in Section 8.3.2, we can now sample τ z A and τ z B from different 111 trajectories and chain them together (see Figure 8.4). The aggregate instruction ˆ z implies that the agent first finishes skill (A) and then finishes skill (B), e.g., “ clean the coffee mug and place it in the coffee machine.” Thus, following the logic above, we need to set the reward for the final state in trajectory (B), s T B , to 1 since it solves the task ˆ z, and the reward in the final state of trajectory (A), s T A , to the probability of solving skill (B) from there: R(s,a, ˆ z)= 1, for s= s T B Q(s,a,z B ), for s= s T A 0, otherwise. 
(8.4) Note that unlike in Section 8.3.2, we treat both relabeledτ z A ,τ z B as separate trajectories as we do not know the states and actions required to transition from the last state of (A) to states in (B). 2 To generate the aggregate instruction ˆ z we can use the LLM summarization from Section 8.3.2. Yet, in practice we found the resulting summaries are often confusing, since randomly paired instructions from different trajectories can rarely be summarized meaningfully. Instead, we saw better performance by simply concatenating the natural language instructions from τ A and τ B , thereby allowing the agent to focus on individual instructions even if the combined instruction is not meaningful. Finally, since Q changes during training, we compute the rewards in Eqs. 8.3 and 8.4 online while training and relabel trajectories at equal proportions for both rewards. See appendix, Alg. 6 for SPRINT pseudocode. 8.4 Experiments In our experiments, we investigate how well an agent pre-trained with SPRINT performs on unseen, semantically meaningful tasks. Specifically, we answer the following questions: (1) Does 2 See appendix, Section B.4.1 for a more detailed discussion on how chaining preserves the original state transition distribution. 112 SPRINT enable more efficient finetuning on unseen target tasks than previous pre-training ap- proaches? (2) Can SPRINT agents execute unseen language instructions zero-shot? (3) Does semantic pre-training lead to better generalization to unseen environments than unsupervised pre-training? 8.4.1 The ALFRED-RL Benchmark 1 2 3 4 5 6 " w a l k t o t h e co f f e e m a ke r o n t h e r i g h t " " w a sh t h e m u g i n t h e si n k" " p u t t h e cl e a n m u g i n t h e co f f e e m a ke r " " p i ck u p t h e m u g a n d g o b a ck t o t h e co f f e e m a ke r " "p i ck u p t h e d i r t y m u g f r o m t h e co f f e e m a ke r " " t u r n a n d w a l k t o t h e si n k" vi su a l n a vi g a t i o n vi su a l n a vi g a t i o n m e m o r y o b j e ct i n t e r a ct i o n st a t e ch a n g e s vi su a l n a vi g a t i o n o b j e ct i n t e r a ct i o n o b j e ct i n t e r a ct i o n G o a l : " R i n se o f f a m u g a n d p l a ce i t i n t h e co f f e e m a ke r " t 0 = t 10 = t 21 = t 50 = t 27 = t 36 = Figure 8.5: The ALFRED household task simu- lator (Shridhar et al., 2020) provides a rich set of long-horizon, semantically meaningful tasks and a dataset of 6.6k demonstrations with language instructions. We propose the ALFRED-RL Bench- mark, which tests finetuning of RL agents on un- seen tasks and scenes in ALFRED. Figure drawn from Shridhar et al. (2020), with permission. Our goal is to compare different pre-training approaches on a diverse set of semantically meaningful, long-horizon tasks. Yet, existing multi-task RL environments typically evalu- ate only on short-horizon or semantically not meaningful tasks (Yu et al., 2019; Mees et al., 2022). Thus, we introduce a new RL benchmark based on the ALFRED household task simula- tor (Shridhar et al., 2020). ALFRED’s 100+ rich indoor scenes with many interactable ob- jects allow to evaluate an agent’s capabilities for solving long-horizon household tasks. While the original ALFRED benchmark focuses on imitation learning approaches, we extend the environment to support RL-based training of agents. We define an OpenAI gym interface with 300× 300 egocentric RGB observations and an action space consisting of 12 discrete action choices (e.g. 
turn left, look up, pick up object), along with 82 interactable object types (Pashevich et al., 2021). We use ALFRED’s task success detector to define a sparse reward function that provides the agent with a return of 1 for every solved subtask of the target long-horizon tasks. For offline pre-training we use a dataset from the original ALFRED benchmark with 6.6k episodes of offline agent experience and 73k crowd-sourced natural language 113 instruction annotations (e.g., “pick up the dirty mug”, “wash it in the sink”, . . . , see Figure 8.5). For more environment and dataset details, see Section C. Evaluation Tasks. The original ALFRED imitation benchmark defines a set of evaluation tasks, however, we cannot compute sub-task rewards with ALFRED’s success detector on these tasks since no demonstrations are provided, making RL infeasible. Thus, we create three new sets of evaluation tasks that test generalization to unseen instructions (EVAL INSTRUCT ), task horizons (EVAL LENGTH ) and household scenes (EVAL SCENE ). EVAL INSTRUCT consists of 100 unseen evaluation tasks with varied horizon within the range of task horizons observed during training. We create it by sampling sequences of 1 to 6 withheld instructions from the ALFRED dataset and asking human annotators to provide a high-level language instruction that describes the sequence. For example, the 3-instruction sequence “pick up a spoon from the counter”, “put the spoon in the sink, rinse it and pick up the spoon”, “pull the cabinet drawer beside the stove and put inside the spoon and close the drawer”, was summarized as “put the clean spoon on the cabinet drawer” (for more examples, see appendix, Section C.2). We ensure that EVAL INSTRUCT has an equal distribution of length 1, 2, 3, 4, and 5+ tasks (20 of each category). EVAL LENGTH is a set of 20 evaluation commands that test the agent’s long-horizon chaining capabilities: we withhold trajectories and instruction summaries of sequences with 7 to 8 instructions from the training dataset, making sure that the respective skill chains are not present in any single training trajectory. Thus, the agent needs to learn to chain behaviors across multiple trajectories to solve these tasks. EVAL SCENE tests environment generalization capabilities with a set of 10 tasks in household floor plans not seen in any trajectory in the training dataset, consisting in equal parts of 1 to 5-instruction sequences. More task set construction details can be found in appendix, Section C.2. 114 Zero-Shot Overall Zero-Shot Split by Length Finetuning on EVAL SCENE EVAL INSTRUCT EVAL LENGTH Figure 8.6: Evaluation results on the ALFRED-RL benchmark. Left: Zero shot performance on the EV AL INSTRUCT and EV AL LENGTH task sets. SPRINT (green) is able to complete substantially more subtasks than prior pre-training approaches. Middle: Breakdown of zero shot performance by task length. SPRINT shows strong zero-shot performance, especitally on challenging, long-horizon tasks. See appendix, Table G.2 for numerical results. Right: Finetuning performance in unseen floor plans of the EV AL SCENE task set. Policies pre-trained with SPRINT learn tasks in new floor plans much quicker as semantic skills learned from language instructions generalize better than skills learned via unsupervised approaches (AM). 8.4.2 Experimental Setup Comparisons. We compare SPRINT against common policy pre-training approaches: behav- ioral cloning and offline goal-conditioned RL. 
Specifically, we compare to the following prior works: • Language-conditioned BC (L-BC) (Jang et al., 2021; Lynch and Sermanet, 2021): Behavior cloning (BC) conditioned on the individual ALFRED language instructions. • Episodic Transformers (ET) (Pashevich et al., 2021): A transformer BC architecture that conditions on full sequences of language instructions—SOTA on certain ALFRED tasks. • Actionable Models (AM) (Chebotar et al., 2021a): Goal-conditioned offline RL with ran- domly sampled goal observations from the ALFRED training set. We implement all methods with the same architecture and hyperparameters where possible, and pre-train them for the same number of steps. For more implementation details, see appendix Section B. All results reported are means and standard deviations over 3 seeds. 115 Task: “Warm up a piece of apple” “Pick up apple” “Place apple on table” “Pick up knife” “Slice apple” “Place knife on table” “Pick up apple slice” “Heat apple slice in the microwave and take it back out” “Place apple slice on the table” Figure 8.7: Example successful task execution of our pre-trained SPRINT agent for the challenging “Warm up a piece of apple” task. Successful execution requires solving 8 subtasks in sequence and a total of 50 steps. This sequence of subtasks was never observed in the training data. SPRINT uses cross-trajectory stitching and LLM aggregation to learn unseen tasks. 8.4.3 SPRINT Solves Long-Horizon Tasks Zero-Shot We first test the effectiveness of SPRINT’s pre-training by analyzing zero-shot performance across 100 unseen tasks in the EVAL INSTRUCT evaluation set. We report results in Figure 8.6 (left). Our approach, SPRINT, achieves 3x higher zero-shot task performance than prior pre-training approaches AM and L-BC. Even though ET is designed for ALFRED evaluations and benefits from a transformer architecture, SPRINT still outperforms it overall by 1.5x. To better understand the differences between the methods, we report the breakdown of returns by length of the evaluation task in Figure 8.6 (middle). We find that L-BC achieves good performance on short-horizon tasks. However, on long-horizon tasks, SPRINT achieves much higher returns, since it can leverage the language-model to automatically generate longer-horizon pre-training tasks. In contrast, standard L-BC approaches train only on the human-provided, shorter-horizon annotations and thus cannot zero-shot perform long-horizon tasks. This trend holds even when evaluating with detailed, step- by-step instructions (see appendix Section D.3). Similar to our approach, AM trains to reach long-horizon goals during pre-training but the results in Figure 8.6 (left) show that its pre-training with goal-state conditioning is less effective than our language-conditioned pre-training. 116 To evaluate the pre-training approaches’ ability to generalize beyond the task horizons seen during training, we evaluate zero-shot performance on the EVAL LENGTH task set in Figure 8.6 (mid- dle). Unsurprisingly, both L-BC and AM largely fail to generate returns on these long-horizon tasks. ET does slightly better as it is trained to attend to long-horizon instruction sequences, but it still performs poorly on these tasks as it cannot generalize to the higher-level human-written task descriptions. In contrast, SPRINT can solve unseen long-horizon tasks with up to 8 subtasks (see Figure 8.7 for an example trajectory, appendix Section D.5 for more qualitative comparisons). 
This shows that, with the help of cross-trajectory chaining, SPRINT is able to go beyond the skill combinations in the training trajectories and execute unseen tasks. 8.4.4 SPRINT Agents Finetune Effectively in Unseen Environments To test downstream task performance, we finetune the pre-trained agents on the most challenging EVAL SCENE task set in unseen household floor plans. To implement finetuning for SPRINT and AM, we condition the policy on a language instruction or goal image from the target task respectively and then run IQL with online data collection. For L-BC and ET, we first pre-train a language-conditioned Q-function with IQL on the pre-training dataset and then finetune both the policy and critic with online IQL. We report finetuning results in Figure 8.6 (right), with qualitative examples in appendix, Sec- tion D.5. SPRINT is able to achieve 1.5x higher downstream task return than the best prior work. Specifically, we find L-BC converges quickly to a low return as it is not pre-trained for longer horizon tasks. ET is also unable to learn new tasks effectively; we found it challenging to optimize its transformer architecture with online RL despite extensive tuning. Meanwhile, AM struggles to transfer the learned skills to the new environment, possibly since the goal states from the new environment are unseen. In contrast, our method’s pre-training with language conditioning allows for effective transfer even to unseen environments since the semantics of the tasks transfer well: the language description “place cup in coffee machine” transfers to many environments while the goal image for the same task might look very different. Thus, pre-training with semantic language 117 instructions can enable better transfer for learning tasks in new environments than pre-training to reach goal states. 8.4.5 Ablation Studies We verify the effectiveness of the components of our approach, with the following ablations: • SPRINT w/o chain: removes cross-trajectory chaining (Section 8.3.3), instead trains only on within-trajectory human-provided and LLM-aggregated tasks • SPRINT w/o LLM-agg: additionally removes the LLM aggregation (Section 8.3.2), thus trains offline RL agent only on the human-provided task annotations. Table 8.1: EVAL INSTRUCT and EVAL LENGTH returns. When combining chaining and instruction aggregation, SPRINT achieves the highest return across both evaluations. EVAL INSTRUCT EVAL LENGTH SPRINT (ours) 1.27± 0.13 2.59± 0.66 SPRINT w/o chain 0.91± 0.03 2.04± 0.04 SPRINT w/o LLM-agg 0.38± 0.05 0.10± 0.04 We report zero-shot evaluation results in Table 8.1. The results show the importance of the different components of our approach. With- out cross-trajectory chaining, perfor- mance drops by∼ 20-30%. Interest- ingly, we find that performance also degrades on the EVAL INSTRUCT task set, indicating that cross-trajectory chaining can improve performance even on task horizons seen in the training set. We observe an even larger performance loss when removing the language model aggregation of pre-training tasks: the resulting agent is solely trained on the shorter horizon human task annotations, and thus struggles to solve long-horizon evaluation tasks, similar to the L-BC approach in Section 8.4.3. For a per-task length breakdown of the ablation performances, see appendix, Section D.4. 
This section also presents additional ablation experiments that show that instruction aggregation via simple concatenation instead of LLM summarization reduces performance, further supporting the importance of the usage of LLMs in our approach. 118 8.5 Discussion In this chapter, we presented SPRINT, an approach for scalable agent pre-training with offline instruction-conditioned RL. We demonstrated how given a base set of natural language instructions on offline agent experience, we can use LLM relabeling and cross-trajectory skill chaining to automatically expand the set of pre-training tasks. In our experimental evaluations on the ALFRED- RL benchmark we demonstrated that SPRINT pre-training allows for substantially more efficient finetuning on downstream tasks in unseen environments than prior works. SPRINT represents a step towards scalable approaches for pre-training agents with a set of semantically meaningful skills. Currently, SPRINT can only leverage language-annotated data. While we can collect such annotations via crowd-sourcing, there will always be a lot of experience data without language annotations. Investigating approaches that can combine language-annotated and unannotated data for effective pre-training is an interesting future direction. 119 Chapter 9 Conclusion In this thesis, we introduced a skill-based framework for robot learning. We first proposed scalable approaches for extracting a large repertoire of skills from prior agent experience data. We then demonstrated how we can leverage these skills for accelerating downstream task learning in reinforcement learning, imitation learning and meta-learning setups. We also showed how skills can be used to leverage human datasets as prior experience for robot task learning. We then introduced an approach for scalable robotic data collection. We showed that by assisting human teleoperators with a learned policy, we can substantially increase demonstration throughput and even allow a single operator to teleoperate multiple robots simultaneously. Finally, we discussed alternative approaches for skill learning that are robust to diverse quality data and can learn high-performing policies even from sub-optimal experience. We introduced skill- learning algorithms based on model-based reinforcement learning and language-conditioned offline RL and showed that they can learn skill repertoires that lead to effective learning of long-horizon downstream tasks. The skill-based robot learning framework we introduced is a step towards building intelligent agents that can assist humans in their everyday lives. Yet, the field of data driven robot learning is still in its infantry. The scale and diversity of data we can use for robot learning today is orders of magnitude smaller than in other fields of machine learning and the range of tasks we can teach robots is still very limited. This leaves ample opportunity for future research into how to obtain 120 diverse training datasets, how to curate them and how to best use them for scalable task learning. We will next discuss these future research directions in more detail. Using Scalable Data Sources for Robot Learning In this thesis, we focused on learning from offline robot experience data, typically collected via human teleoperation. 
We introduced approaches for scaling such teleoperation via learned assistance (Chapter 6), yet the practical challenges of collecting large and diverse robotic datasets (Brohan et al., 2022) make it unlikely we will collect data at comparable scale to vision or natural language datasets, which can be scraped from the internet, any time soon. Thus, an exciting direction for future work is how to leverage diverse non-robotic datasets, e.g., of human videos or natural language, for robot learning. We have shown promising results for learning from human video demonstrations (Chapter 4), but many open problems remain, e.g., can we learn low-level manipulation behavior from human data and can we leverage other sources of information like text? Curating Datasets for Behavior Learning Targeted curation of diverse datasets has been shown to be crucial for effective learning in domains that have ample data access, e.g., natural language processing (Xie et al., 2023). With robotic datasets increasing in size and diversity, the question of dataset curation becomes increasingly important: how can we curate the best datasets for behavior learning? What data characteristics are best suited for policy training? To which extent should we use demonstrations vs. autonomously collected data? These questions are particularly important in robot learning, since they can also inform what data should be collected in the first place, in order to most efficiently use data collection resources. New Approaches for Transfer from Diverse Datasets In this thesis, we have investigated differ- ent approaches for transfer from prior experience: extracting skills via behavioral cloning (Chap- ter 2), using model-based approaches (Chapter 7) and offline RL (Chapter 8). Other works have shown successful transfer via representation learning (Nam et al., 2022; Nair et al., 2022b). Each of these approaches has their own strengths and weaknesses, e.g., offline RL can handle sub-optimal 121 training data while representation learning can easily incorporate non-robotic data sources like human video. An interesting direction for future work is to develop algorithms for transfer from prior experience that combine the strengths of these approaches. 122 References Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004. Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv, 2018. David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as i can and not as i say: Grounding language in robotic affordances. In CoRL, 2022. Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. 
Opal: Offline primitive discovery for accelerating offline reinforcement learning. ICLR, 2021. Saleh A Al-Abood, Keith Davids, and Simon J Bennett. Specificity of task constraints and effects of visual demonstrations and verbal instructions in directing learners’ search during skill acquisition. Journal of motor behavior, 33(3):295–305, 2001. Heni Ben Amor, Erik Berger, David V ogt, and Bernhard Jung. Kinesthetic bootstrapping: Teaching motor skills to humanoid robots through physical interaction. In Annual conference on artificial intelligence , 2009. Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, 2017a. Jacob Andreas, Dan Klein, and Sergey Levine. Learning with latent language. In NAACL, 2017b. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, 2017. 123 OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 2020. Brenna D Argall. Autonomy in rehabilitation robotics: An intersection. Annual Review of Control, Robotics, and Autonomous Systems, 1:441, 2018. Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009. Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Associa- tion for the Advancement of Artificial Intelligence , 2017. D. Bahdanau, K. Cho, and Y . Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 2015. Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, January 2003. ISSN 0924-6703. doi: 10.1023/A:1022140919877. URLhttps://doi.org/10.1023/A:1022140919877. Harold Bekkering, Andreas Wohlschlager, and Merideth Gattis. Imitation of gestures in children is goal-directed. The Quarterly Journal of Experimental Psychology: Section A, 53 (1):153–164, 2000. Pierre Berthet-Rayne, Maura Power, Hawkeye King, and Guang-Zhong Yang. Hubot: A three state human-robot collaborative framework for bimanual surgical tasks based on learned models. In ICRA. IEEE, 2016. Aude Billard, Sylvain Calinon, Ruediger Dillmann, and Stefan Schaal. Survey: Robot programming by demonstration. Handbook of robotics, 59, 2008. Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. Satchuthananthavale RK Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009. Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. John Bronskill, Daniela Massiceti, Massimiliano Patacchiola, Katja Hofmann, Sebastian Nowozin, and Richard Turner. Memory efficient meta-learning with large images. 
In Neural Information Processing Systems, 2021. 124 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. Lars Buesing, Theophane Weber, S´ ebastien Racani` ere, S. M. Ali Eslami, Danilo Jimenez Rezende, David P. Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, and Daan Wierstra. Learning and querying fast generative models for reinforcement learning. arXiv:1802.03006, 2018. Serkan Cabi, Sergio Gomez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. RSS, 2019. Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. CVPR, 2020. Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017. Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jacob Varley, Alex Irpan, Benjamin Eysenbach, Ryan C Julian, Chelsea Finn, and Sergey Levine. Actionable models: Unsupervised offline reinforcement learning of robotic skills. In ICML, 2021a. Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, et al. Actionable models: Unsupervised offline reinforcement learning of robotic skills. ICML, 2021b. Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from” in-the-wild” human videos. RSS, 2021. Binghong Chen, Bo Dai, and Le Song. Learning to plan via neural exploration-exploitation trees. ICLR, 2020. Maxime Chevalier-Boisvert. gym-miniworld environment for openai gym. https:// github.com/maximecb/gym-miniworld, 2018. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchin- son, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, 125 Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. 
Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018. Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019. Henry M Clever, Ankur Handa, Hammad Mazhar, Kevin Parker, Omer Shapira, Qian Wan, Yashraj Narang, Iretiayo Akinola, Maya Cakmak, and Dieter Fox. Assistive tele- op: Leveraging transformers to collect robotic task demonstrations. 4th Robot Learning Workshop: Self-Supervised and Lifelong Learning at Neurips, 2021. C´ edric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Cl´ ement Moulin-Frier, Peter Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity driven exploration. NeurIPS, 2020. Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021. Marco Cuturi and Mathieu Blondel. Soft-dtw: a differentiable loss function for time-series. In ICML, 2017. Dima Damen, Hazel Doughty, Giovanni Maria Farinella, , Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic- kitchens-100. International Journal of Computer Vision (IJCV), 2021. Neha Das, Sarah Bechtle, Todor Davchev, Dinesh Jayaraman, Akshara Rai, and Franziska Meier. Model-based inverse reinforcement learning from visual demonstrations. CoRL, 2020. Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeck- peper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019. Shivin Dass, Karl Pertsch, Hejia Zhang, Youngwoon Lee, Joseph J Lim, and Stefanos Nikolaidis. Pato: Policy assisted teleoperation for scalable robot data collection. arXiv preprint arXiv:2212.04708, 2022. 126 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. E. Denton and R. Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, 2018. Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Neural Information Processing Systems, 2017. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ICLR, 2017. Ron Dorfman, Idan Shenfeld, and Aviv Tamar. Offline meta learning of exploration. In Neural Information Processing Systems, 2021. Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL 2 : Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016a. Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b. Yan Duan, Marcin Andrychowicz, Bradly C Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. 
NeurIPS, 2017. Nikita Dvornik, Cordelia Schmid, and Julien Mairal. Selecting relevant features from a multi- domain representation for few-shot classification. In European Conference on Computer Vision, 2020. Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, 2017. Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568, 2018. Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Robotics: Science and Systems, 2022. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. ICLR, 2019. Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive learning as goal-conditioned reinforcement learning, 2022. 127 Haoqi Fan, Yanghao Li, Bo Xiong, Wan-Yen Lo, and Christoph Feichtenhofer. Pyslowfast, 2020. Linxi Fan, Yuke Zhu, Jiren Zhu, Zihua Liu, Orien Zeng, Anchit Gupta, Joan Creus-Costa, Silvio Savarese, and Li Fei-Fei. Surreal: Open-source reinforcement learning framework and robot manipulation benchmark. In CoRL, 2018. Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. arXiv preprint arXiv: Arxiv- 2206.08853, 2022. Kuan Fang, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Dynamics learning with cascaded variational inference for multi-step manipulation. CoRL, 2019. Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In IEEE International Conference on Robotics and Automation, 2017. Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a. Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Neural Information Processing Systems, 2016b. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation, pages 512–519. IEEE, 2016c. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017a. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017b. Matthew Fontaine and Stefanos Nikolaidis. A quality diversity approach to automatically generating human-robot interaction scenarios in shared autonomy. RSS, 2021. Matthew C Fontaine and Stefanos Nikolaidis. Evaluating human–robot interaction algorithms in shared autonomy via quality diversity scenario generation. ACM Transactions on Human- Robot Interaction (THRI), 11(3):1–30, 2022. Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018. 
URLhttps://openreview.net/forum?id=rkHywl-A-. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. 128 Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019. Yang Gao and Steve Chien. Review on space robotics: Toward top-level science through space exploration. Science Robotics, 2(7), 2017. Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Hugo Larochelle, Matthew Botvinick, Yoshua Bengio, and Sergey Levine. Infobot: Transfer and exploration via the information bottleneck. In ICLR, 2019. Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learn- ing for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation, 2017. Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio G´ omez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. Rl unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020. Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta- reinforcement learning of structured exploration strategies. In Neural Information Processing Systems, 2018. Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Re- lay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Conference on Robot Learning, 2019. David Ha and J¨ urgen Schmidhuber. Recurrent world models facilitate policy evolution. In Neural Information Processing Systems. 2018. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865, 2018a. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b. Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to con- trol: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. International Conference on Machine Learning, 2019b. Kourosh Hakhamaneshi, Ruihan Zhao, Albert Zhan, Pieter Abbeel, and Michael Laskin. Hierarchical few-shot imitation with skill transition models. ICLR, 2022. 129 Sandra G Hart. Nasa task load index (tlx). 1986. Sandra G Hart and Lowell E Staveland. Development of nasa-tlx (task load index): Results of empirical and theoretical research. In Advances in psychology, volume 52, pages 139–183. Elsevier, 1988. Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Hor- gan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Association for the Advancement of Artificial Intelligence , 2018. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-V AE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016. Sepp Hochreiter and J¨ urgen Schmidhuber. Long short-term memory. Neural computation, 9 (8):1735–1780, 1997. Nicola J Hodges, A Mark Williams, Spencer J Hayes, and Gavin Breslin. What is modelled during observational learning? Journal of sports sciences, 25(5):531–545, 2007. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning. CoRL, 2021a. Ryan Hoque, Ashwin Balakrishna, Carl Putterman, Michael Luo, Daniel S Brown, Daniel Seita, Brijen Thananjeyan, Ellen Novoseller, and Ken Goldberg. Lazydagger: Reducing con- text switching in interactive imitation learning. In 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), 2021b. De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. In CVPR, 2019. 130 Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. ICML, 2022a. Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner mono- logue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608, 2022b. Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424, 2019. Brian Ichter and Marco Pavone. Robot motion planning in learned latent spaces. RAL, 2018. Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In ICRA. IEEE, 2002. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013. 
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002, 2021. Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019. Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. In RSS, 2015. Dinesh Jayaraman, Frederik Ebert, Alexey Efros, and Sergey Levine. Time-agnostic predic- tion: Predicting predictable video frames. International Conference on Learning Representa- tions, 2019. Hong Jun Jeon, Dylan P Losey, and Dorsa Sadigh. Shared autonomy with learned latent actions. RSS, 2020. Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, 2018. Mishel Johns, Brian Mok, David Sirkin, Nikhil Gowda, Catherine Smith, Walter Talamonti, and Wendy Ju. Exploring shared control in automated driving. In HRI. IEEE, 2016. 131 Nicholas K Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 299–306. Citeseer, 2008. Tom Jurgenson, Or Avner, Edward Groshev, and Aviv Tamar. Sub-goal trees–a framework for goal-based reinforcement learning. ICML, 2020. Leslie Pack Kaelbling. Learning to achieve goals. In IN PROC. OF IJCAI-93, pages 1094–1098. Morgan Kaufmann, 1993. Leslie Pack Kaelbling and Tom´ as Lozano-P´ erez. Hierarchical planning in the now. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence , 2010. Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018. Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. CoRL, 2021. Lydia E Kavraki, Petr Svestka, J-C Latombe, and Mark H Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE transactions on Robotics and Automation, 12(4):566–580, 1996. Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In ICRA, 2019. Taesup Kim, Sungjin Ahn, and Yoshua Bengio. Variational temporal abstraction. In NeurIPS. 2019. Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014. Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Edward Grefenstette, Pushmeet Kohli, and Peter Battaglia. Compositional imitation learning: Explaining and executing one task at a time. ICML, 2019. Craig A Knoblock. Learning abstraction hierarchies for problem solving. In AAAI, 1990. 
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In European Conference on Computer Vision, 2020. 132 Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2019. Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URLhttps: //openreview.net/forum?id=68n2s9ZJWF8. Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016. Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, pages 11784–11794, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Neural Information Processing Systems, 2020. Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart J Russell, and Pieter Abbeel. Learning plannable representations with causal infogan. In NeurIPS, 2018. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017. A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv:1804.01523, abs/1804.01523, 2018a. Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, 2019a. Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward S Hu, and Joseph J Lim. Composing complex skills by learning transition policies. In ICLR, 2018b. Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward S. Hu, and Joseph J. Lim. Composing complex skills by learning transition policies. In International Confer- ence on Learning Representations, 2019b. URL https://openreview.net/forum?id= rygrBhC5tQ. Youngwoon Lee, Jingyun Yang, and Joseph J. Lim. Learning to coordinate manipulation skills via skill behavior diversification. In International Conference on Learning Representa- tions, 2020. Youngwoon Lee, Edward S Hu, and Joseph J Lim. IKEA furniture assembly environment for long-horizon complex manipulation tasks. In IEEE International Conference on Robotics and Automation, 2021a. URLhttps://clvrai.com/furniture. 133 Youngwoon Lee, Andrew Szot, Shao-Hua Sun, and Joseph J. Lim. Generalizable imitation learning from observation via inferring goal proximity. In Neural Information Processing Systems, 2021b. Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. 
Evan Z Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In International Conference on Machine Learning, 2021. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2020. Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017. Dylan P Losey, Krishnan Srinivasan, Ajay Mandlekar, Animesh Garg, and Dorsa Sadigh. Controlling assistive robots with learned latent actions. In ICRA, 2020. Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In ICLR, 2019. Yao Lu, Karol Hausman, Yevgen Chebotar, Mengyuan Yan, Eric Jang, Alexander Herzog, Ted Xiao, Alex Irpan, Mohi Khansari, Dmitry Kalashnikov, and Sergey Levine. Aw-opt: Learning robotic skills with imitation and reinforcement at scale. In CoRL, 2021. Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention- based neural machine translation. In EMNLP, 2015. Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. RSS, 2021. Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, pages 1113–1132. PMLR, 2020. 134 Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022. Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893, 2018. Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. ICRA, 2020a. Ajay Mandlekar, Danfei Xu, Roberto Mart´ ın-Mart´ ın, Silvio Savarese, and Li Fei-Fei. Gti: Learning to generalize across long-horizon tasks from human demonstrations. In Robotics: Science and Systems, 2020b. Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart´ ın-Mart´ ın. What matters in learning from offline human demonstrations for robot manipulation. CoRL, 2021. Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 2022. Kunal Menda, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In IROS, 2019. Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. 
arXiv preprint arXiv:1707.02201, 2017. Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. Catch & carry: Reusable neu- ral controllers for vision-guided whole-body tasks. ACM Transactions on Graphics (TOG), 2020. Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, and Chelsea Finn. Offline meta-reinforcement learning with advantage weighting. In International Conference on Machine Learning, 2021. Kaichun Mo, Haoxiang Li, Zhe Lin, and Joon-Young Lee. The AdobeIndoorNav Dataset: To- wards deep reinforcement learning based real-world indoor robot visual navigation. preprint arXiv:1802.08824, 2018. 135 Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchi- cal reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018. Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta- learning: Continual adaptation for model-based rl. In International Conference on Learning Representations, 2019a. Anusha Nagabandi, Kurt Konoglie, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. CoRL, 2019b. Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In IEEE International Conference on Robotics and Automation, pages 2146– 2153, 2017. Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE Interna- tional Conference on Robotics and Automation, pages 6292–6299, 2018. Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020. Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. ICLR, 2020. Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, 2022a. Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation, 2022b. Taewook Nam, Shao-Hua Sun, Karl Pertsch, Sung Ju Hwang, and Joseph J. Lim. Skill-based meta-reinforcement learning. In ICLR, 2022. Soroush Nasiriany, Vitchyr Pong, Steven Lin, and Sergey Levine. Planning with goal- conditioned policies. In NeurIPS, 2019. Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017a. Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017b. Stefanos Nikolaidis, Yu Xiang Zhu, David Hsu, and Siddhartha Srinivasa. Human-robot mutual adaptation in shared autonomy. In HRI. IEEE, 2017. Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generaliza- tion with multi-task deep reinforcement learning. In ICML, 2017. 
136 Giambattista Parascandolo, Lars Buesing, Josh Merel, Leonard Hasenclever, John Aslanides, Jessica B Hamrick, Nicolas Heess, Alexander Neitz, and Theophane Weber. Divide-and- conquer monte carlo tree search for goal-directed planning. arXiv preprint arXiv:2004.11410, 2020. Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic Transformer for Vision- and-Language Navigation. In ICCV, 2021. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In ICLR, 2018. Chris Paxton, Yotam Barnoy, Kapil D. Katyal, Raman Arora, and Gregory D. Hager. Visual robot task planning. ICRA, 2019. Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans- actions on Graphics (TOG), 37(4):1–14, 2018. Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019. Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. RSS, 2020. Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In Association for the Advancement of Artificial Intelligence , 2018. Karl Pertsch, Youngwoon Lee, and Joseph J. Lim. Accelerating reinforcement learning with learned skill priors. In Conference on Robot Learning, 2020a. Karl Pertsch, Oleh Rybkin, Jingyun Yang, Shenghao Zhou, Konstantinos Derpanis, Kostas Daniilidis, Joseph Lim, and Andrew Jaegle. Keyframing the future: Keyframe discovery for visual prediction and planning. In Learning for Dynamics and Control, pages 969–979. PMLR, 2020b. Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J. Lim. Demonstration-guided reinforce- ment learning with learned skills. In Conference on Robot Learning, 2021. Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, and Akshara Rai. Cross-domain transfer via semantic skill imitation. CoRL, 2022. Andreas Pichler, Sharath Chandra Akkaladevi, Markus Ikeda, Michael Hofmann, Matthias Plasch, Christian W¨ ogerer, and Gerald Fritz. Towards shared autonomy for robotic tasks in manufacturing. Procedia Manufacturing, 11:72–82, 2017. Marc Pickett and Andrew G Barto. Policyblocks: An algorithm for creating useful macro- actions in reinforcement learning. In International Conference on Machine Learning, vol- ume 19, pages 506–513, 2002. 137 Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989. Vitchyr H Pong, Ashvin Nair, Laura Smith, Catherine Huang, and Sergey Levine. Offline meta-reinforcement learning with online self-supervision. ICML, 2022. Jack W. 
Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsim- poukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021. Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Robotics: Science and Systems, 2018. Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, 2019. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019. Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014. St´ ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011. Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations, 2019. 138 Reuven Y . Rubinstein and Dirk P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer- Verlag New York, 2004. Oleh Rybkin, Kostas Daniilidis, and Sergey Levine. Simple and effective vae training with calibrated decoders. arXiv preprint arXiv: 2006.13202, 2020. Earl D Sacerdoti. Planning in a hierarchy of abstraction spaces. Artificial intelligence , 5(2): 115–135, 1974. Hiroaki Sakoe. Dynamic-programming approach to continuous speech recognition. In 1971 Proc. the International Congress of Acoustics, Budapest, 1971. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. ICLR, 2017. Stefan Schaal. 
Dynamic Movement Primitives - A Framework for Motor Control in Humans and Humanoid Robotics. Springer Tokyo, 2006. Stefan Schaal, Jan Peters, Jun Nakanishi, and Auke Ijspeert. Learning movement primitives. In Paolo Dario and Raja Chatila, editors, Robotics Research. Springer Berlin Heidelberg, 2005. Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In ICML, 2015. Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. CoRL, 2020. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pages 8583–8592. PMLR, 2020. Mario Selvaggio, Marco Cognetti, Stefanos Nikolaidis, Serena Ivaldi, and Bruno Sicil- iano. Autonomy in physical human-robot interaction: A brief survey. IEEE Robotics and Automation Letters, 2021. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. In IEEE International Conference on Robotics and Automation, pages 1134–1141, 2018. Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, and Abhinav Gupta. Discovering motor programs by recomposing demonstrations. In ICLR, 2019. 139 Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics- aware unsupervised discovery of skills. ICLR, 2020a. Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics- aware unsupervised discovery of skills. ICLR, 2020b. Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. In Conference on robot learning, pages 906–915. PMLR, 2018. Pratyusha Sharma, Deepak Pathak, and Abhinav Gupta. Third-person visual imitation learning via decoupled hierarchical controller. NeurIPS, 2019. Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In ACL, 2022. Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR, 2020. Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020. Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, and Sergey Levine. Cog: Connecting new skills to past experience with offline reinforcement learning. In Conference on Robot Learning, 2020. Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. ICLR, 2021. Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, and Sergey Levine. Avid: Learning multi-stage tasks via pixel-level translation of human videos. RSS, 2019. Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, and Sergey Levine. Avid: Learning multi-stage tasks via pixel-level translation of human videos. 
RSS, 2020. Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, 2015. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence , 112 (1-2):181–211, 1999a. 140 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence , 112: 181–211, 1999b. Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representa- tions from tree-structured long short-term memory networks. ACL, 2015. Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009. Yee Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Neural Information Processing Systems, 2017. Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In ICRA, 2010. Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems, 1995. Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012. Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018. Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020. Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michal- ski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Association for the Advancement of Artificial Intelligence , 2016. Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Roth¨ orl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017. Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017. Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J Lim. Multimodal model-agnostic meta-learning via task-aware modulation. In Neural Information Processing Systems, 2019. 141 Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Lan- guage Model. https://github.com/kingoflolz/mesh-transformer-jax, 2021. Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. 
In Annual Meeting of the Cognitive Science Society (CogSci), 2017. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. Neural Information Processing Systems, 2018. Tsun-Hsuan Wang, Yen-Chi Cheng, Chieh Hubert Lin, Hwann-Tzong Chen, and Min Sun. Point-to-point video generation. In ICCV, 2019. William Whitney, Rajat Agarwal, Kyunghyun Cho, and Abhinav Gupta. Dynamics-aware embeddings. ICLR, 2020. Nevan Wichers, Ruben Villegas, Dumitru Erhan, and Honglak Lee. Hierarchical long-term video prediction without supervision. ICML, 2018. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist rein- forcement learning. Machine learning, 8(3-4):229–256, 1992. Robert S Woodworth and EL Thorndike. The influence of improvement in one mental function upon the efficiency of other functions.(i). Psychological review, 8(3):247, 1901. Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019. Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018. Annie Xie, Frederik Ebert, Sergey Levine, and Chelsea Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight. RSS, 2019. Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. arXiv preprint arXiv:2302.03169, 2023. Danfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In ICRA, 2018. Yuxiang Yang, Ken Caluwaerts, Atil Iscen, Jie Tan, and Chelsea Finn. Norml: No-reward meta learning. In International Conference on Autonomous Agents and Multiagent Systems, 2019. Tianhe Yu, Pieter Abbeel, Sergey Levine, and Chelsea Finn. One-shot hierarchical imitation learning of compound visuomotor tasks. RSS, 2018a. Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta- learning. RSS, 2018b. 142 Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019. Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. NeurIPS, 2021. Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end au- tonomous driving. AAAI, 2017. Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J Johnson, and Sergey Levine. Solar: deep structured representations for model-based reinforcement learning. ICLR, 2019. Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018a. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022. Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In ICRA, 2018b. 
Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems, 2018.

Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, 2019.

Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. In International Conference on Learning Representations, 2020.

Appendices

Appendix A
Reinforcement Learning with Learned Skills

A Action-prior Regularized Soft Actor-Critic

The original derivation of the SAC algorithm assumes a uniform prior over actions. We extend the formulation to the case with a non-uniform action prior p(a|·), where the dot indicates that the prior can be non-conditional or conditioned on e.g., the current state or the previous action. Our derivation closely follows Haarnoja et al. (2018a) and Levine (2018) with the key difference that we replace the entropy maximization in the reward function with a term that penalizes divergence from the action prior. We derive the formulation for single-step action priors below, and the extension to skill priors is straightforward by replacing actions a_t with skill embeddings z_t.

We adopt the probabilistic graphical model (PGM) described in Levine (2018), which includes optimality variables O_{1:T}, whose distribution is defined as p(O_t | s_t, a_t) = exp(r(s_t, a_t)), where r(s_t, a_t) is the reward. We treat O_{1:T} = 1 as evidence in our PGM and obtain the following conditional trajectory distribution:

\[
p(\tau \mid \mathcal{O}_{1:T}) = p(s_1) \prod_{t=1}^{T} p(\mathcal{O}_t \mid s_t, a_t)\, p(s_{t+1} \mid s_t, a_t)\, p(a_t \mid \cdot)
= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, p(a_t \mid \cdot) \cdot \exp\Big( \sum_{t=1}^{T} r(s_t, a_t) \Big)
\]

Crucially, in contrast to Levine (2018) we did not omit the action prior p(a_t | ·) since we assume it to be generally not uniform.

Our goal is to derive an objective for learning a policy that induces such a trajectory distribution. Following Levine (2018) we will cast this problem within the framework of structured variational inference and derive an expression for an evidence lower bound (ELBO). We define a variational distribution q(a_t | s_t) that represents our policy. It induces a trajectory distribution q(τ) = p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) q(a_t | s_t). We can derive the ELBO as:

\[
\begin{aligned}
\log p(\mathcal{O}_{1:T}) &\geq -D_{KL}\big( q(\tau) \,\|\, p(\tau \mid \mathcal{O}_{1:T}) \big) \\
&\geq \mathbb{E}_{\tau \sim q(\tau)} \Big[ \log p(s_1) + \sum_{t=1}^{T} \big( \log p(s_{t+1} \mid s_t, a_t) + \log p(a_t \mid \cdot) \big) + \sum_{t=1}^{T} r(s_t, a_t) \\
&\qquad\qquad - \log p(s_1) - \sum_{t=1}^{T} \big( \log p(s_{t+1} \mid s_t, a_t) + \log q(a_t \mid s_t) \big) \Big] \\
&\geq \mathbb{E}_{\tau \sim q(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) + \log p(a_t \mid \cdot) - \log q(a_t \mid s_t) \Big] \\
&\geq \mathbb{E}_{\tau \sim q(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) - D_{KL}\big( q(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) \Big]
\end{aligned}
\]

Note that in the case of a uniform action prior the KL divergence is equivalent to the negative entropy −H(q(a_t | s_t)). Substituting the KL divergence with the entropy recovers the ELBO derived in Levine (2018).
To maximize this ELBO with respect to the policy q(a_t | s_t), Levine (2018) proposes an inference procedure based on a message passing algorithm. Following this derivation for the "messages" V(s_t) and Q(s_t, a_t) (Levine (2018), Section 4.2), but substituting the policy entropy −log q(a_t | s_t) with the prior divergence D_KL(q(a_t | s_t) || p(a_t | ·)), the modified Bellman backup operator can be derived as:

\[
\mathcal{T}^{\pi} Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p} \big[ V(s_{t+1}) \big]
\quad \text{where} \quad
V(s_t) = \mathbb{E}_{a_t \sim \pi} \big[ Q(s_t, a_t) - D_{KL}\big( \pi(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) \big]
\]

To show convergence of this operator to the optimal Q-function we follow the proof of Haarnoja et al. (2018a) in Appendix B1 and introduce a divergence-augmented reward:

\[
r_{\pi}(s_t, a_t) = r(s_t, a_t) - \mathbb{E}_{s_{t+1} \sim p} \big[ D_{KL}\big( \pi(a_{t+1} \mid s_{t+1}) \,\|\, p(a_{t+1} \mid \cdot) \big) \big].
\]

Then we can recover the original Bellman update:

\[
Q(s_t, a_t) \leftarrow r_{\pi}(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi} \big[ Q(s_{t+1}, a_{t+1}) \big],
\]

for which the known convergence guarantees hold (Sutton and Barto, 2018). The modifications to the messages Q(s_t, a_t) and V(s_t) directly lead to the following modified policy improvement operator:

\[
\arg\min_{\theta}\, \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi} \big[ D_{KL}\big( \pi(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) - Q(s_t, a_t) \big]
\]

Finally, the practical implementations of SAC introduce a temperature parameter α that trades off between the reward and the entropy term in the original formulation, and between the reward and the divergence term in our formulation. Haarnoja et al. (2018b) propose an algorithm to automatically adjust α by formulating policy learning as a constrained optimization problem. In our formulation we derive a similar update mechanism for α. We start by formulating the following constrained optimization problem:

\[
\max_{\pi_{1:T}}\, \mathbb{E}_{p_{\pi}} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]
\quad \text{s.t.} \quad
D_{KL}\big( \pi(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) \leq \delta \quad \forall t
\]

Here δ is a target divergence between policy and action prior, similar to the target entropy H̄ in the original SAC formulation. We can formulate the dual problem by introducing the temperature α:

\[
\min_{\alpha > 0} \max_{\pi}\, \sum_{t=1}^{T} r(s_t, a_t) + \alpha \Big( \delta - D_{KL}\big( \pi(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) \Big)
\]

This leads to the modified update objective for α:

\[
\arg\min_{\alpha > 0}\, \mathbb{E}_{a_t \sim \pi} \big[ \alpha \delta - \alpha\, D_{KL}\big( \pi(a_t \mid s_t) \,\|\, p(a_t \mid \cdot) \big) \big]
\]

We combine the modified objectives for Q-value function, policy and temperature α in the skill-prior regularized SAC algorithm, summarized in Algorithm 1.

B Implementation Details

B.1 Model Architecture and Training Objective

We instantiate the skill embedding model described in Section 2.3.2 with deep neural networks. The skill encoder is implemented as a one-layer LSTM with 128 hidden units. After processing the full input action sequence, it outputs the parameters (μ_z, σ_z) of the Gaussian posterior distribution in the 10-dimensional skill embedding space Z. The skill decoder mirrors the encoder's architecture and is unrolled for H steps to produce the H reconstructed actions. The sampled skill embedding z is passed as input in every step. The skill prior is implemented as a 6-layer fully-connected network with 128 hidden units per layer. It parametrizes the Gaussian skill prior distribution N(μ_p, σ_p). For image-based state inputs in the maze and block stacking environments, we first pass the state through a convolutional encoder network with three layers, a kernel size of three and (8, 16, 32) channels respectively. The resulting feature map is flattened to form the input to the skill prior network. We use leaky-ReLU activations and batch normalization throughout our architecture.
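For concreteness, the sketch below instantiates the skill embedding model as just described: a 1-layer LSTM encoder that maps the H-step action sequence to a Gaussian skill posterior, an LSTM decoder that reconstructs the H actions from a sampled skill, and a 6-layer MLP skill prior. It assumes vector-valued states (the image-based variant would prepend the convolutional encoder described above); class and variable names and the exact output heads are illustrative assumptions rather than the exact implementation.

```python
# A minimal PyTorch sketch of the skill embedding model, skill decoder and skill prior.
# Hyperparameters follow the text (10-dim skills, 128 hidden units, horizon H).
import torch
import torch.nn as nn

class SkillModel(nn.Module):
    def __init__(self, action_dim, state_dim, skill_dim=10, hidden=128, horizon=10):
        super().__init__()
        self.horizon = horizon
        # skill encoder: 1-layer LSTM over the H-step action sequence
        self.encoder = nn.LSTM(action_dim, hidden, batch_first=True)
        self.enc_head = nn.Linear(hidden, 2 * skill_dim)      # -> (mu_z, log_sigma_z)
        # skill decoder: LSTM unrolled for H steps, fed the sampled z at every step
        self.decoder = nn.LSTM(skill_dim, hidden, batch_first=True)
        self.dec_head = nn.Linear(hidden, action_dim)
        # skill prior: 6-layer MLP over the first state of the sequence
        layers, inp = [], state_dim
        for _ in range(6):
            layers += [nn.Linear(inp, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU()]
            inp = hidden
        layers += [nn.Linear(hidden, 2 * skill_dim)]           # -> (mu_p, log_sigma_p)
        self.prior = nn.Sequential(*layers)

    def forward(self, states, actions):
        # encode the action sequence into the skill posterior parameters
        _, (h, _) = self.encoder(actions)                      # actions: [B, H, action_dim]
        mu_z, log_sig_z = self.enc_head(h[-1]).chunk(2, dim=-1)
        z = mu_z + log_sig_z.exp() * torch.randn_like(mu_z)    # reparameterized sample
        # decode H actions, passing z as input at every step
        dec_in = z.unsqueeze(1).repeat(1, self.horizon, 1)
        dec_out, _ = self.decoder(dec_in)
        recon_actions = self.dec_head(dec_out)
        # skill prior parameters predicted from the first state
        mu_p, log_sig_p = self.prior(states[:, 0]).chunk(2, dim=-1)
        return recon_actions, (mu_z, log_sig_z), (mu_p, log_sig_p)
```

The three outputs correspond directly to the three terms of the training objective in Eq. (B.1) below: action reconstruction, regularization of the skill posterior towards a unit Gaussian, and training of the skill prior towards the (gradient-stopped) posterior.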
We optimize our model using the RAdam optimizer with parameters β_1 = 0.9 and β_2 = 0.999, batch size 16 and learning rate 1×10^{-3}. Training on a single NVIDIA Titan X GPU takes approximately 8 hours. Assuming a unit-variance Gaussian output distribution, our full training objective is:

\[
\mathcal{L} = \underbrace{\sum_{i=1}^{H} (a_i - \hat{a}_i)^2}_{\text{reconstruction}}
\; - \; \beta\, \underbrace{D_{KL}\big( \mathcal{N}(\mu_z, \sigma_z) \,\|\, \mathcal{N}(0, I) \big)}_{\text{regularization}}
\; + \; \underbrace{D_{KL}\big( \mathcal{N}(\lfloor \mu_z \rfloor, \lfloor \sigma_z \rfloor) \,\|\, \mathcal{N}(\mu_p, \sigma_p) \big)}_{\text{prior training}}.
\tag{B.1}
\]

Here ⌊·⌋ indicates that gradients flowing through these variables are stopped. For Gaussian distributions the KL divergence can be computed analytically. For non-Gaussian prior parametrizations (e.g., with Gaussian mixture model or normalizing flow priors) we found that sampling-based estimates also suffice to achieve reliable, albeit slightly slower, convergence. We tune the weighting parameter β separately for each environment and use β = 1×10^{-2} for maze and block stacking and β = 5×10^{-4} for the kitchen environment.

B.2 Reinforcement Learning Setup

The architectures of the policy and critic mirror that of the skill prior network. The policy outputs the parameters of a Gaussian action distribution while the critic outputs a single Q-value estimate. Empirically, we found it important to initialize the weights of the policy with the pre-trained skill prior weights in addition to regularizing towards the prior (see Section F). We use the hyperparameters of the standard SAC implementation (Haarnoja et al., 2018a) with batch size 256, replay buffer capacity of 1×10^6 and discount factor γ = 0.99. We collect 5000 warmup rollout steps to initialize the replay buffer before training. We use the Adam optimizer with β_1 = 0.9, β_2 = 0.999 and learning rate 3×10^{-4} for updating policy, critic and temperature α. Analogous to SAC, we train two separate critic networks and compute the Q-value as the minimum over both estimates to stabilize training. The corresponding target networks get updated at a rate of τ = 5×10^{-3}. The policy's actions are limited to the range [−2, 2] by a tanh "squashing function" (see Haarnoja et al. (2018a), appendix C). We tune the target divergence δ separately for each environment and use δ = 1 for the maze navigation task and δ = 5 for both robot manipulation tasks.
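To make the prior-regularized objectives from Section A concrete, the sketch below computes the critic target, the policy loss, and the temperature loss with the KL-to-prior term in place of the entropy term, assuming diagonal-Gaussian policy and skill prior so the KL divergence is analytic. The function names, the convention that `policy(s)` and `prior(s)` return `(mu, sigma)`, and the batch layout are assumptions for illustration, not the exact implementation.

```python
# A minimal sketch of the prior-regularized SAC losses (critic, policy, temperature).
# `alpha` is assumed to be a scalar tensor (e.g., log_alpha.exp()); networks are callables.
import torch
import torch.distributions as D

def kl_regularized_losses(policy, q1, q2, q1_target, q2_target, prior,
                          batch, alpha, target_div, gamma=0.99):
    s, z, r, s_next, done = batch                            # high-level transitions
    # --- critic target: reward + discounted (min Q - alpha * KL(policy || prior)) ---
    with torch.no_grad():
        pi_next = D.Normal(*policy(s_next))
        p_next = D.Normal(*prior(s_next))
        z_next = pi_next.rsample()
        kl_next = D.kl_divergence(pi_next, p_next).sum(-1)   # analytic Gaussian KL
        q_next = torch.min(q1_target(s_next, z_next), q2_target(s_next, z_next))
        q_target = r + gamma * (1 - done) * (q_next - alpha * kl_next)
    critic_loss = ((q1(s, z) - q_target) ** 2 + (q2(s, z) - q_target) ** 2).mean()
    # --- policy: minimize alpha * KL(policy || prior) - Q(s, z) ---
    pi = D.Normal(*policy(s))
    p = D.Normal(*prior(s))
    z_new = pi.rsample()
    kl = D.kl_divergence(pi, p).sum(-1)
    policy_loss = (alpha.detach() * kl - torch.min(q1(s, z_new), q2(s, z_new))).mean()
    # --- temperature: minimize alpha * (delta - KL), which raises alpha when KL > delta ---
    alpha_loss = (alpha * (target_div - kl.detach())).mean()
    return critic_loss, policy_loss, alpha_loss
```

Because both the policy and the learned prior are Gaussian, the KL term is computed in closed form rather than from samples, which, as discussed in Section F, gives a more stable divergence estimate than sampling-based approximations.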
C Environments and Data Collection

Figure A.1: Image-based state representation for the maze (left) and block stacking (right) environments (downsampled to 32×32 px for the policy).

Maze Navigation. The maze navigation environment is based on the maze environment in the D4RL benchmark (Fu et al., 2020). Instead of using a single, fixed layout, we generate random layouts for training data collection by placing walls with doorways in randomly sampled positions. For each collected training sequence we sample a new maze layout and randomly sample start and goal positions for the agent. Following Fu et al. (2020), we collect goal-reaching examples through a combination of a high-level planner with access to a map of the maze and a low-level controller that follows the plan. For the downstream task we randomly sample a maze that is four times larger than the training data layouts. We keep the maze layout, as well as start and goal location for the agent, fixed throughout downstream learning. The policy outputs (x, y)-velocities for the agent. The state is represented as a local top-down view around the agent (see Fig. A.1). To represent the agent's velocity, we stack two consecutive 32×32 px observations as input to the policy. The agent receives a per-timestep binary reward when the distance to the goal is below a threshold.

Block Stacking. The block stacking environment is simulated using the Mujoco physics engine. For data collection, we initialize the five blocks in the environment to be randomly stacked on top of each other or placed at random locations in between. We use a hand-coded data collection policy to generate trajectories with up to three consecutive stacking manipulations. The location of blocks and the movement of the agent are limited to a 2D plane and a barrier prevents the agent from leaving the table. To increase the support of the collected trajectories we add noise to the hard-coded policy by placing pseudo-random subgoals in between and within stacking sequences. The downstream task of the agent is to stack as many blocks as possible in a larger version of the environment with 11 blocks. The environment state is represented through a front view of the agent (see Fig. A.1). The policy's input is a stack of two 32×32 px images and it outputs (x, z)-displacements for the robot as well as a continuous action in the range [0...1] that represents the opening degree of the gripper. The agent receives per-timestep binary rewards for lifting a block from the ground and moving it on top of another block. It further receives a reward proportional to the height of the highest stacked tower.

Kitchen environment. We use the kitchen environment from the D4RL benchmark (Fu et al., 2020), which was originally published by Gupta et al. (2019). For training we use the data provided in D4RL (dataset version "mixed"). It consists of trajectories collected via human tele-operation that each perform four consecutive manipulations of objects in the environment. There are seven manipulatable objects in the environment. The downstream task of the agent consists of performing an unseen sequence of four manipulations: while the individual manipulations have been observed in the training data, the agent needs to learn to recombine these skills in a new way to solve the task. The state is a 30-dimensional vector representing the agent's joint velocities as well as poses of the manipulatable objects. The agent outputs 7-dimensional joint velocities for robot control as well as a 2-dimensional continuous gripper opening/closing action. It receives a one-time reward whenever fulfilling one of the subtasks.

D State-Conditioned Skill Decoder

Following prior works (Merel et al., 2019; Kipf et al., 2019), we experimented with conditioning the skill decoder on the current environment state s_1. Specifically, the current environment state is passed through a neural network that predicts the initial hidden state of the decoding LSTM. We found that conditioning the skill decoder on the state did not improve downstream learning and can even lead to worse performance. In particular, it does not improve the exploration behavior of the skill-space policy without learned prior (see Fig. A.3, left); a learned skill prior is still necessary for efficient exploration on the downstream task. Additionally, we found that conditioning the skill decoder on the state can reduce downstream learning performance (see Fig. A.3, right). We hypothesize that state-conditioning can make the learning problem for the high-level policy more challenging: due to the state-conditioning, the same high-level action z can result in different decoded action sequences depending on the current state, making the high-level policy's action space dynamics more complex.

Figure A.2: Comparison of policy execution traces on the kitchen environment (rows: Ours, Flat Prior, SSP w/o Prior; frames progress left to right in time). Following Fu et al. (2020), the agent's task is to (1) open the microwave, (2) move the kettle backwards, (3) turn on the burner and (4) switch on the light. Red frames mark the completion of subtasks. Our skill-prior guided agent (top) is able to complete all four subtasks. In contrast, the agent using a flat single-action prior (middle) only learns to solve two subtasks, but lacks temporal abstraction and hence fails to solve the complete long-horizon task. The skill-space policy without prior guidance (bottom) cannot efficiently explore the skill space and gets stuck in a local optimum in which it solves only a single subtask. Best viewed electronically and zoomed in. For videos, see: clvrai.com/spirl.

Figure A.3: Results for the state-conditioned skill decoder network. Left: Exploration visualization as in Fig. 2.5 (panels: Skills + Prior (Ours) vs. Skills w/o Prior, both with conditioned decoder). Even with a state-conditioned skill decoder, exploration without the skill prior is not able to cover a large fraction of the maze. In contrast, skills sampled from the learned skill prior lead to wide-ranging exploration when using the state-conditioned decoder. Right: Downstream learning performance of our approach and the skill-space policy w/o learned skill prior, with vs. without state-conditioning for the skill decoder. Only guidance through the learned skill prior enables learning success. A state-conditioned skill decoder can make the downstream learning problem more challenging, leading to lower performance ("ours" vs. "ours w/ state cond.").

Figure A.4: Ablation of prior regularization during downstream RL training (x-axis: environment steps (1M); y-axis: stacked blocks; curves: Ours, SSP PriorInit + SAC). Initializing the high-level policy with the learned prior but finetuning with conventional SAC is not sufficient to learn the task well.

Figure A.5: Ablation of prior initialization. Initializing the downstream task policy with the prior network improves training stability and convergence speed. However, the "w/o Init" runs demonstrate that the tasks can also be learned with prior regularization only.

E Prior Regularization Ablation

We ablate the influence of the prior regularization during downstream learning as described in Section 2.3.3. Specifically, we compare to an approach that initializes the high-level policy with the learned skill prior, but then uses conventional SAC (Haarnoja et al., 2018a) (with uniform skill prior) to finetune on the downstream task. Fig. A.4 shows that the prior regularization during downstream learning is essential for good performance: conventional maximum-entropy SAC ("SSP PriorInit + SAC") quickly leads the prior-initialized policy to deviate from the skill prior by encouraging it to maximize the entropy of the distribution over skills, slowing the learning substantially.

F Prior Initialization Ablation

For the RL experiments in Section 2.4 we initialize the weights of the high-level policy with the learned skill prior network. Here, we compare the performance of this approach to an ablation that does not perform the initialization. We find that prior initialization improves convergence speed and stability of training (see Fig. A.5).
We identify two major challenges that make training policies "from scratch", without initialization, difficult: (1) sampling-based divergence estimates between the randomly initialized policy and the prior distribution can be inaccurate in early phases of training, and (2) learning gets stuck in local optima in which the divergence between policy and prior is minimized on a small subset of the state space, but the policy does not explore beyond this region. Since both our learned prior and the high-level policy are parametrized with Gaussian output distributions, we can analytically compute the KL divergence to get a more stable estimate and alleviate problem (1). To address problem (2) when training from scratch, we encourage exploration by sampling a fraction ω of the rollout steps during experience collection directly from the prior instead of the policy. For the "w/o Init" experiments in Fig. A.5 we set ω = 1.0 for the first 500k steps (i.e., always sample from the prior) and then anneal it to 0 (i.e., always use the policy for rollout collection) over the next 500k steps. Note that policies trained with prior initialization do not require these additions and still converge faster.

G Training with Sub-Optimal Data

Table A.1: Number of blocks stacked vs. fraction of random training data.

% Random Data      | 0%  | 50% | 75%
# Blocks Stacked   | 3.5 | 2.0 | 1.0

We investigate the influence of the training data on the performance of our approach. In particular, we test whether it is possible to learn effective skill priors from heavily sub-optimal data. For the experiments in Section 2.4 we trained the skill prior from high-quality experience collected using expert-like policies, albeit on tasks that differ from the downstream task (see Section C). However, in many practical cases the quality of training data can be mixed. In this section we investigate two scenarios for training from sub-optimal data.

Mixed Expert and Random Data. We assume that a large fraction of the data is collected by inexperienced users, leading to very low quality trajectories, while another part of the data is collected by experts and has high quality. We emulate this in the block stacking environment by combining rollouts collected by executing random actions with parts of the high-quality rollouts we used for the experiments in Section 2.4.3. The results are shown in Table A.1. Our approach achieves good performance even when half of the training data consists of very low quality rollouts, and can learn meaningful skills for stacking blocks when 75% of the data is of low quality. The best baseline was only able to stack 1.5 blocks on average, even though it was trained from only high-quality data (see Fig. 2.4, middle).

Figure A.6: Success rate on the maze environment with sub-optimal training data (x-axis: environment steps (1M); y-axis: success rate; curves: Ours (BC-Data), SSP w/o Prior). Our approach, using a prior learned from sub-optimal data generated with the BC policy, is able to reliably learn to reach the goal while the baseline that does not use the learned prior fails.

Only Non-Expert Data. We assume access to a dataset of only mediocre quality demonstrations, without any expert-level trajectories. We generate this dataset in the maze environment by training a behavior cloning (BC) policy on expert-level trajectories and using it to collect a new dataset. Due to the limited capacity of the BC policy, this dataset is of substantially lower quality than the expert dataset, e.g., the agent collides with walls on average 2.5 times per trajectory while it never collides in expert trajectories. While we find that a skill-prior-regularized agent trained on the mediocre data explores the maze less widely than one trained on the expert data, it still works substantially better than the baseline that does not use the skill prior, achieving 100% success rate of reaching a faraway goal after <1M environment steps, while the baseline does not reach the goal even after 3M environment steps.

Both scenarios show that our approach can learn effective skill embeddings and skill priors even from substantially sub-optimal data.

Figure A.7: Reuse of one learned skill prior for multiple downstream tasks. We train a single skill embedding and skill prior model and then use it to guide downstream RL for multiple tasks. Left: We test prior reuse on three different maze navigation tasks in the form of different goals that need to be reached. (1)-(3): Agent rollouts during training; the darker the rollout paths, the later during training they were collected. The same prior enables efficient exploration for all three tasks, yet allows for convergence to task-specific policies that reach each of the goals.

H Reuse of Learned Skill Priors

Our approach has two separate stages: (1) learning of skill embedding and skill prior from offline data and (2) prior-regularized downstream RL. Since the learning of the skill prior is independent of the downstream task, we can reuse the same skill prior for guiding learning on multiple downstream tasks. To test this, we learn a single skill prior on the maze environment depicted in Fig. 2.3 (left) and use it to train multiple downstream task agents that reach different goals. In Fig. A.7 we show a visualization of the training rollouts in a top-down view, similar to the visualization in Fig. 2.5; darker trajectories are more recent. We can see that the same prior is able to guide downstream agents to efficiently learn to reach diverse goals. All agents achieve ∼100% success rate upon convergence. Intuitively, the prior captures the knowledge that it is more meaningful to, e.g., cross doorways instead of bumping into walls, which helps exploration in the maze independent of the goal position.

Appendix B
Demonstration-Guided Reinforcement Learning with Learned Skills

A Full Algorithm

We detail our full SkiLD algorithm for demonstration-guided RL with learned skills in Algorithm 4. It is based on the SPiRL algorithm for RL with learned skills (Pertsch et al., 2020a), which in turn builds on Soft Actor-Critic (Haarnoja et al., 2018a), an off-policy model-free RL algorithm. We mark changes of our algorithm with respect to SPiRL and SAC in red in Algorithm 4. The hyperparameters α and α_q can either be constant, or they can be automatically tuned using dual gradient descent (Haarnoja et al., 2018b; Pertsch et al., 2020a). In the latter case, we need to define a set of target divergences δ, δ_q. The parameters α and α_q are then optimized to ensure that the expected divergence between policy and skill prior and posterior distributions is equal to the chosen target divergence (see Algorithm 4).

Algorithm 4 SkiLD (Skill-based Learning with Demonstrations)
1: Inputs: H-step reward function r̃(s_t, z_t), reward weight γ, discount η, target divergences δ, δ_q, learning rates λ_π, λ_Q, λ_α, target update rate τ.
2: Initialize replay buffer D, high-level policy π_θ(z_t | s_t), critic Q_φ(s_t, z_t), target network Q_φ̄(s_t, z_t)
3: for each iteration do
4:   for every H environment steps do
5:     z_t ∼ π(z_t | s_t)                                         ▷ sample skill from policy
6:     s_t' ∼ p(s_{t+H} | s_t, z_t)                               ▷ execute skill in environment
7:     D ← D ∪ {s_t, z_t, r̃(s_t, z_t), s_t'}                      ▷ store transition in replay buffer
8:   end for
9:   for each gradient step do
10:    r_Σ = (1 − γ) · r̃(s_t, z_t) + γ · [log D(s_t) − log(1 − D(s_t))]                ▷ compute combined reward
11:    Q̄ = r_Σ + η [ Q_φ̄(s_t', π_θ(z_t' | s_t')) − ( α_q D_KL(π_θ(z_t' | s_t'), q_ζ(z_t' | s_t')) · D(s_t')
12:         + α D_KL(π_θ(z_t' | s_t'), p(z_t' | s_t')) · (1 − D(s_t')) ) ]             ▷ compute Q-target
13:    θ ← θ − λ_π ∇_θ [ Q_φ(s_t, π_θ(z_t | s_t)) − ( α_q D_KL(π_θ(z_t | s_t), q_ζ(z_t | s_t)) · D(s_t)
14:         + α D_KL(π_θ(z_t | s_t), p(z_t | s_t)) · (1 − D(s_t)) ) ]                  ▷ update policy weights
15:    φ ← φ − λ_Q ∇_φ [ ½ (Q_φ(s_t, z_t) − Q̄)² ]                                      ▷ update critic weights
16:    α ← α − λ_α ∇_α [ α · (D_KL(π_θ(z_t | s_t), p(z_t | s_t)) − δ) ]                ▷ update alpha
17:    α_q ← α_q − λ_α ∇_{α_q} [ α_q · (D_KL(π_θ(z_t | s_t), q_ζ(z_t | s_t)) − δ_q) ]  ▷ update alpha-q
18:    φ̄ ← τ φ + (1 − τ) φ̄                                                             ▷ update target network weights
19:  end for
20: end for
21: return trained policy π_θ(z_t | s_t)
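As an illustration of lines 10-12 of Algorithm 4, the sketch below computes the combined reward and the divergence-weighted Q-target, assuming diagonal-Gaussian policy, skill prior, and skill posterior networks that return `(mu, sigma)`, and a discriminator that outputs probabilities. All function names and signatures are hypothetical stand-ins for the corresponding components.

```python
# A minimal sketch of the SkiLD combined reward and Q-target computation.
# gamma_w is the reward weight (gamma in Algorithm 4), eta the discount.
import torch
import torch.distributions as D

def skild_q_target(policy, q_target_net, prior, posterior, discriminator,
                   s, z, env_reward, s_next, alpha, alpha_q, gamma_w, eta=0.99):
    with torch.no_grad():
        # combined reward: weighted environment reward plus discriminator-based demo reward
        d_s = discriminator(s).clamp(1e-6, 1 - 1e-6)
        r_comb = (1 - gamma_w) * env_reward + gamma_w * (torch.log(d_s) - torch.log(1 - d_s))
        # divergence-weighted bootstrap value at the next state
        pi_next = D.Normal(*policy(s_next))
        z_next = pi_next.rsample()
        kl_post = D.kl_divergence(pi_next, D.Normal(*posterior(s_next))).sum(-1)
        kl_prior = D.kl_divergence(pi_next, D.Normal(*prior(s_next))).sum(-1)
        d_next = discriminator(s_next)
        q_next = q_target_net(s_next, z_next) \
                 - (alpha_q * kl_post * d_next + alpha * kl_prior * (1 - d_next))
        return r_comb + eta * q_next
```

The discriminator output D(s) interpolates between the two regularizers: where it judges the state to lie within the demonstration support, the policy is pulled towards the task-specific skill posterior, and elsewhere it falls back to the task-agnostic skill prior.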
B Implementation and Experimental Details

B.1 Implementation Details: Pre-Training

We introduce our objective for learning the skill inference network q_ω(z | s, a) and the low-level skill policy π_φ(a_t | s_t, z) in Section 3.3.2. In practice, we instantiate all model components with deep neural networks Q_ω, Π_φ respectively, and optimize the full model using back-propagation. We also jointly train our skill prior network P. We follow the common assumption of Gaussian, unit-variance output distributions for low-level policy actions, leading to the following network loss:

\[
\mathcal{L} = \underbrace{\sum_{t=0}^{H-2} \big( a_t - \Pi_{\phi}(s_t, z) \big)^2 + \beta\, D_{KL}\big( Q_{\omega}(s_{0:H-1}, a_{0:H-2}) \,\|\, \mathcal{N}(0, I) \big)}_{\text{skill representation training}}
\; + \; \underbrace{D_{KL}\big( \lfloor Q_{\omega}(s_{0:H-1}, a_{0:H-2}) \rfloor \,\|\, P(s_0) \big)}_{\text{skill prior training}}.
\]

Here ⌊·⌋ indicates that we stop gradient flow from the prior training objective into the skill inference network for improved training stability. After training the skill inference network with the above objective, we train the skill posterior network Q_ζ by minimizing the KL divergence to the skill inference network's output on trajectories sampled from the demonstration data. We minimize the following objective:

\[
\mathcal{L}_{\text{post}} = D_{KL}\big( \lfloor Q_{\omega}(s_{0:H-1}, a_{0:H-2}) \rfloor \,\|\, Q_{\zeta}(s_0) \big)
\]

We use a 1-layer LSTM with 128 hidden units for the inference network and 3-layer MLPs with 128 hidden units in each layer for the low-level policy. We encode skills of horizon 10 into 10-dimensional skill representations z. Skill prior and posterior networks are implemented as 5-layer MLPs with 128 hidden units per layer. They both parametrize mean and standard deviation of Gaussian output distributions. All networks use batch normalization after every layer and leaky ReLU activation functions. We tune the regularization weight β to be 1×10^{-3} for the maze and 5×10^{-4} for the kitchen and office environments.

For the demonstration discriminator D(s) we use a 3-layer MLP with only 32 hidden units per layer to avoid overfitting. It uses a sigmoid activation function on the final layer and leaky ReLU activations otherwise. We train the discriminator with a binary cross-entropy loss on samples from the task-agnostic and demonstration datasets:

\[
\mathcal{L}_D = -\frac{1}{N} \Big[ \underbrace{\sum_{i=1}^{N/2} \log D(s_i^d)}_{\text{demonstrations}} + \underbrace{\sum_{j=1}^{N/2} \log\big( 1 - D(s_j) \big)}_{\text{task-agnostic data}} \Big]
\]
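A minimal sketch of this discriminator and its binary cross-entropy objective is given below, following the 3-layer, 32-unit architecture described above; the class and function names and the mixing of the two data sources in equal halves of a batch are illustrative assumptions.

```python
# A minimal sketch of the demonstration discriminator D(s) and its BCE training loss.
import torch
import torch.nn as nn

class DemoDiscriminator(nn.Module):
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),              # D(s) in (0, 1)
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def discriminator_loss(disc, demo_states, task_agnostic_states):
    # demonstration states are labeled 1, task-agnostic states are labeled 0
    d_demo = disc(demo_states)
    d_ta = disc(task_agnostic_states)
    return -(torch.log(d_demo + 1e-8).mean() + torch.log(1 - d_ta + 1e-8).mean()) / 2
```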
For the demonstration discriminator D(s) we use a 3-layer MLP with only 32 hidden units per layer to avoid overfitting. It uses a sigmoid activation function on the final layer and leaky ReLU activations otherwise. We train the discriminator with a binary cross-entropy loss on samples from the task-agnostic and demonstration datasets:

L_D = −(1/N) · [ Σ_{i=1}^{N/2} log D(s_i^d) + Σ_{j=1}^{N/2} log(1 − D(s_j)) ],

where the first sum runs over demonstration states s_i^d and the second sum over states s_j from the task-agnostic data.

We optimize all networks using the RAdam optimizer (Liu et al., 2020) with parameters β_1 = 0.9 and β_2 = 0.999, batch size 128 and learning rate 1×10⁻³. On a single NVIDIA Titan X GPU we can train the skill representation and skill prior in approximately 5 hours, the skill posterior in approximately 3 hours and the discriminator in approximately 3 hours.

B.2 Implementation Details: Downstream RL

The architecture of the policy mirrors that of the skill prior and posterior networks. The critic is a simple 2-layer MLP with 256 hidden units per layer. The policy outputs the parameters of a Gaussian action distribution while the critic outputs a single Q-value estimate. We initialize the policy with the weights of the skill posterior network.

We use the hyperparameters of the standard SAC implementation (Haarnoja et al., 2018a) with batch size 256, replay buffer capacity of 1×10⁶ and discount factor γ = 0.99. We collect 5000 warmup rollout steps to initialize the replay buffer before training. We use the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.999 and learning rate 3×10⁻⁴ for updating policy, critic and temperatures α and α_q. Analogous to SAC, we train two separate critic networks and compute the Q-value as the minimum over both estimates to stabilize training. The corresponding target networks get updated at a rate of τ = 5×10⁻³. The policy's actions are limited to the range [−2, 2] by a tanh "squashing function" (see Haarnoja et al. (2018a), appendix C).

We use automatic tuning of α and α_q in the maze navigation task and set the target divergences to 1 and 10, respectively. In the kitchen and office environments we obtained the best results by using constant values of α = α_q = 1×10⁻¹. In all experiments we set κ = 0.9. For all RL results we average the results of three independently seeded runs and display mean and standard deviation.

B.3 Implementation Details: Comparisons

BC+RL. This comparison is representative of demonstration-guided RL approaches that use BC objectives to initialize and regularize the policy during RL (Rajeswaran et al., 2018; Nair et al., 2018). We pre-train a BC policy on the demonstration dataset and use it to initialize the RL policy. We use SAC to train the policy on the target task. Similar to Nair et al. (2018), we augment the policy update with a regularization term that minimizes the L2 loss between the predicted mean of the policy's output distribution and the output of the BC pre-trained policy.¹

¹ We also tried sampling action targets directly from the demonstration replay buffer, but found using a BC policy as target more effective on the tested tasks.

Figure B.1: Qualitative results for GAIL+RL on maze navigation. Even though it makes progress towards the goal (red), it fails to ever obtain the sparse goal-reaching reward.

Figure B.2: We compare the exploration behavior in the maze. We roll out skills sampled from SPiRL's task-agnostic skill prior (left) and our task-specific skill posterior (right) and find that the latter leads to more targeted exploration towards the goal (red).

Demo Replay. This comparison is representative of approaches that initialize the replay buffer of an off-policy RL agent with demonstration transitions (Vecerik et al., 2017; Hester et al., 2018). In practice we use SAC and initialize a second replay buffer with the demonstration transitions. Since the demonstrations do not come with rewards, we heuristically set the reward of each demonstration trajectory to a high value (100 for the maze, 4 for the robotic environments) on the final transition and zero everywhere else. During each SAC update, we sample half of the training mini-batch from the normal SAC replay buffer and half from the demonstration replay buffer. All other aspects of SAC remain unchanged.
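As a concrete illustration of this Demo Replay pipeline, the sketch below relabels demonstration transitions with the heuristic sparse reward and draws half of every training batch from the demonstration buffer. The dictionary-based transition format and buffer objects are simplifying assumptions, not the data structures of the underlying SAC implementation.

import random
from collections import deque

agent_buffer = deque(maxlen=1_000_000)  # filled by online rollouts
demo_buffer = []                        # filled once from demonstration trajectories

def add_demo_trajectory(trajectory, final_reward=100.0):
    # Heuristic relabeling: only the final transition receives a high reward.
    for i, (s, a, s_next) in enumerate(trajectory):
        last = i == len(trajectory) - 1
        demo_buffer.append({"s": s, "a": a, "r": final_reward if last else 0.0,
                            "s_next": s_next, "done": last})

def sample_mixed_batch(batch_size=256):
    # Half of every SAC update batch comes from the demonstration buffer.
    half = batch_size // 2
    batch = random.sample(list(agent_buffer), min(half, len(agent_buffer)))
    batch += random.choices(demo_buffer, k=batch_size - len(batch))
    random.shuffle(batch)
    return batch

# Toy usage:
add_demo_trajectory([((0.0, 0.0), (1.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (1.0, 0.0), (2.0, 0.0))])
agent_buffer.extend({"s": (0.0, 0.0), "a": (0.0, 0.0), "r": 0.0,
                     "s_next": (0.0, 0.0), "done": False} for _ in range(4))
print(len(sample_mixed_batch(batch_size=4)))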
B.4 Environment Details

Maze Navigation. We adapt the maze navigation task from Pertsch et al. (2020a), which extends the maze navigation tasks from the D4RL benchmark (Fu et al., 2020). The starting position is sampled uniformly from a start region and the agent receives a one-time sparse reward of 100 when reaching the fixed goal position, which also ends the episode. The 4D observation space contains the 2D position and velocity of the agent. The agent is controlled via 2D velocity commands.

Figure B.3: Office cleanup task. The robot agent needs to place three randomly sampled objects (1-7) inside randomly sampled containers (a-c). During task-agnostic data collection we apply random noise to the initial position of the objects.

Robot Kitchen Environment. We use the kitchen environment from Gupta et al. (2019). For solving the target task, the agent needs to execute a fixed sequence of four subtasks by controlling a 7-DOF Franka Emika Panda robot via joint velocity and continuous gripper actuation commands. The 30-dimensional state space contains the robot's joint angles as well as object-specific features that characterize the position of each of the manipulatable objects. We use 20 state-action sequences from the dataset of Gupta et al. (2019) as demonstrations. To increase demonstration support, we collect a demonstration dataset by initializing the environment at states along the demonstrations and rolling out a random policy for 10 steps.

Robot Office Environment. We create a novel office cleanup task in which a 5-DOF WidowX robot needs to place a number of objects into designated containers, requiring the execution of a sequence of pick, place, and drawer open and close subtasks (see Figure B.3). The agent controls the position and orientation of the end-effector and a continuous gripper actuation, resulting in a 7-dimensional action space. For simulating the environment we build on the Roboverse framework (Singh et al., 2020). During collection of the task-agnostic data we randomly sample a subset of three of the seven objects as well as a random order of target containers and use scripted policies to execute the task. We only save successful executions. For the target task we fix the object positions and require the agent to place three objects in fixed target containers. The 97-dimensional state space contains the agent's end-effector position and orientation as well as the position and orientation of all objects and containers.

Differences to Pertsch et al. (2020a). While both the maze navigation and kitchen environments are based on the tasks in Pertsch et al. (2020a), we made multiple changes to increase task complexity, resulting in the lower absolute performance of the SPiRL baseline in Figure 3.4. For the maze navigation task we added randomness to the starting position and terminate the episode upon reaching the goal position, reducing the maximum reward obtainable for successfully solving the task.
We also switched to a low-dimensional state representation for simplicity. For the kitchen environment, the task originally used in Gupta et al. (2019) as well as Pertsch et al. (2020a) was well aligned with the training data distribution and there were no demonstrations available for this task. In our evaluation we use a different downstream task (see Section F) which is less well-aligned with the training data and therefore harder to learn. This also allows us to use sequences from the dataset of Gupta et al. (2019) as demonstrations for this task.

C Skill Representation Comparison

Figure B.4: Comparison of our closed-loop skill representation with the open-loop representation of Pertsch et al. (2020a). Top: Skill prior rollouts for 100k steps in the maze environment. Bottom: Subtask success rates for prior rollouts in the kitchen environment.

In Section 3.3.2 we described our skill representation based on a closed-loop low-level policy as a more powerful alternative to the open-loop action decoder-based representation of Pertsch et al. (2020a). To compare the performance of the two representations we perform rollouts with the learned skill prior: we sample a skill from the prior and roll out the low-level policy for H steps. We repeat this until the episode terminates and visualize the results for multiple episodes in the maze and kitchen environments in Figure B.4.

In Figure B.4 (top) we see that both representations lead to effective exploration in the maze environment. Since the 2D maze navigation task does not require control in high-dimensional action spaces, both skill representations are sufficient to accurately reproduce behaviors observed in the task-agnostic training data.

Figure B.5: Downstream task performance for prior demonstration-guided RL approaches with combined task-agnostic and task-specific data. All prior approaches are unable to leverage the task-agnostic data, showing a performance decrease when attempting to use it.

In contrast, the results on the kitchen environment (Figure B.4, bottom) show that the closed-loop skill representation is able to more accurately control the high-DOF robotic manipulator and reliably solve multiple subtasks per rollout episode.² We hypothesize that the closed-loop skill policy is able to learn more robust skills from the task-agnostic training data, particularly in high-dimensional control problems.

² See https://sites.google.com/view/skill-demo-rl for skill prior rollout videos with both skill representations in the kitchen environment.

D Demonstration-Guided RL Comparisons with Task-Agnostic Experience

In Section 3.4.2 we compared our approach to prior demonstration-guided RL approaches which are not designed to leverage task-agnostic datasets. We applied these prior works in the setting they were designed for: using only task-specific demonstrations of the target task. Here, we conduct experiments in which we run these prior works using the combined task-agnostic and task-specific datasets to give them access to the same data that our approach used.

From the results in Figure B.5 we can see that none of the prior works is able to effectively leverage the additional task-agnostic data. In many cases the performance of the approaches is worse than when only using task-specific data (see Figure 3.4). Since prior approaches are not designed to leverage task-agnostic data, applying them in the combined-data setting can hurt learning on the target task. In contrast, our approach can effectively leverage the task-agnostic data for accelerating demonstration-guided RL.
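To make the skill-prior rollout procedure from Section C above concrete, the following sketch repeatedly samples a skill latent from the prior at the current state and executes the closed-loop low-level policy for H steps until the episode ends. The dummy environment and network stand-ins exist only to make the sketch self-contained; they are illustrative assumptions, not part of our implementation.

import numpy as np

H = 10  # skill horizon

class DummyEnv:
    # Stand-in for the maze/kitchen environments, just to make the sketch executable.
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(4)
    def step(self, action):
        self.t += 1
        return np.random.randn(4), self.t >= 40  # (next state, done)

def dummy_skill_prior(state):
    # Placeholder for the learned prior p(z|s): a distribution over 10-dim skill latents.
    class _Dist:
        def sample(self):
            return np.random.randn(10)
    return _Dist()

def dummy_low_level_policy(state, z):
    # Placeholder for the closed-loop skill policy pi(a|s, z).
    return np.tanh(z[:2] + 0.1 * state[:2])

def skill_prior_rollout(env, skill_prior, low_level_policy, max_steps=200):
    state, done, visited = env.reset(), False, []
    while not done and len(visited) < max_steps:
        z = skill_prior(state).sample()          # sample a skill for the current state
        for _ in range(H):                        # execute it for H low-level steps
            action = low_level_policy(state, z)   # closed-loop: re-reads the current state
            state, done = env.step(action)
            visited.append(state)
            if done:
                break
    return visited

print(len(skill_prior_rollout(DummyEnv(), dummy_skill_prior, dummy_low_level_policy)))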
Figure B.6: Imitation learning performance on maze navigation and kitchen tasks. Compared to prior imitation learning methods, SkiLD can leverage prior experience to enable the imitation of complex, long-horizon behaviors. Finetuning the pre-trained discriminator D(s) further improves performance on more challenging control tasks like in the kitchen environment.

E Skill-Based Imitation Learning

We ablate the influence of the environment reward feedback on the performance of our approach by setting the reward weight κ = 1.0, thus relying solely on the learned discriminator reward. Our goal is to test whether our approach SkiLD is able to leverage task-agnostic experience to improve the performance of pure imitation learning, i.e., learning to follow demonstrations without environment reward feedback.

We compare SkiLD to common approaches for imitation learning: behavioral cloning (BC, Pomerleau (1989)) and generative adversarial imitation learning (GAIL, Ho and Ermon (2016)). We also experiment with a version of our skill-based imitation learning approach that performs online finetuning of the pre-trained discriminator D(s) using data collected during training of the imitation policy.

We summarize the results of the imitation learning experiments in Figure B.6. Learning purely by imitating the demonstrations, without additional reward feedback, is generally slower than demonstration-guided RL on tasks that require more challenging control, like in the kitchen environment, where the pre-trained discriminator does not capture the desired trajectory distribution accurately. Yet, we find that our approach is able to leverage task-agnostic data to effectively imitate complex, long-horizon behaviors while prior imitation learning approaches struggle. Further, online finetuning of the learned discriminator improves imitation learning performance when the pre-trained discriminator is not accurate enough.

In the maze navigation task the pre-trained discriminator represents the distribution of solution trajectories well, so pure imitation performance is comparable to demonstration-guided RL. We find that finetuning the discriminator on the maze "sharpens" the decision boundary of the discriminator, i.e., increases its confidence in correctly estimating the demonstration support. Yet, this does not lead to faster overall convergence since the pre-trained discriminator is already sufficiently accurate.

F Kitchen Data Analysis

For the kitchen manipulation experiments we use the dataset provided by Gupta et al. (2019) as task-agnostic pre-training data. It consists of 603 teleoperated sequences, each of which shows the completion of four consecutive subtasks. In total there are seven possible subtasks: opening the microwave, moving the kettle, turning on the top and bottom burner, flipping the light switch and opening a slide and a hinge cabinet. In Figure B.7 we analyze the transition probabilities between subtasks in the task-agnostic dataset. We can see that these transition probabilities are not uniformly distributed, but instead certain transitions are more likely than others, e.g., it is much more likely to sample a training trajectory in which the agent first opens the microwave than one in which it starts by turning on the bottom burner.
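The transition statistics behind Figure B.7 can be computed with a few lines of code. The sketch below counts consecutive subtask pairs and normalizes them into empirical transition probabilities; the three example sequences are illustrative placeholders, not actual statistics of the Gupta et al. (2019) dataset.

from collections import Counter, defaultdict

sequences = [
    ["microwave", "kettle", "light switch", "slide cabinet"],
    ["microwave", "bottom burner", "top burner", "slide cabinet"],
    ["kettle", "bottom burner", "top burner", "slide cabinet"],
]

def transition_probabilities(subtask_sequences):
    # Count each (current subtask -> next subtask) pair, then normalize per subtask.
    counts = defaultdict(Counter)
    for seq in subtask_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in counts.items()
    }

for subtask, nexts in transition_probabilities(sequences).items():
    print(subtask, "->", nexts)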
Figure B.7: Subtask transition probabilities in the kitchen environment's task-agnostic training dataset from Gupta et al. (2019). Each dataset trajectory consists of four consecutive subtasks, of which we display three (yellow: first, green: second, grey: third subtask). The transition probability to the fourth subtask is always near 100%.

In Section 3.4.5 we test the effect this bias in transition probabilities has on the learning of target tasks: we evaluate our approach on a target task with good alignment to the task-agnostic data and on a target task which is mis-aligned with the data. For the well-aligned case we choose the target task Kettle - Bottom Burner - Top Burner - Slide Cabinet, since the required subtask transitions are likely under the training data distribution. For the mis-aligned case we choose Microwave - Light Switch - Slide Cabinet - Hinge Cabinet as the target task, since particularly the transition from opening the microwave to flipping the light switch is very unlikely to be observed in the training data.

Appendix C

Cross-Domain Transfer via Semantic Skill Imitation

A Full Algorithm

We present a detailed description of the downstream RL algorithm for our STAR approach in Algorithm 5. It builds on soft actor-critic (Haarnoja et al., 2018a;b). In contrast to the original SAC, we operate in a hybrid action space with mixed discrete and continuous actions: π_sem(k|s) outputs discrete semantic skill IDs and π_lat(z|s, k) outputs continuous latent variables. For all input hyperparameters we use the default values from Pertsch et al. (2020a; 2021) and only adapt the regularization weights α_q, α_p and α_l for each task. They can either be set to a fixed value or automatically tuned via dual gradient descent in lines 24-26 by setting target parameters δ_q, δ_p and δ_l (Haarnoja et al., 2018b).

B Implementation Details

B.1 Skill Learning

We summarize the pre-training objectives of all model components in Table C.1. We instantiate all components with deep neural networks. We use a single-layer LSTM with 128 hidden units for the inference network and 3-layer MLPs with 128 hidden units for the low-level policy. The skill representation z is a 10-dimensional continuous latent variable. All skill priors are implemented as 5-layer MLPs with 128 hidden units. The semantic skill priors output the logits of a categorical distribution over k; the non-semantic prior outputs the mean and log-variance of a diagonal Gaussian distribution over z. We use batch normalization after every layer and leaky ReLU activations. We auto-tune the regularization weight β for training the low-level skill policy using dual gradient descent and set the target to 2×10⁻² for the maze and 1×10⁻² for all kitchen experiments.
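As an illustration of how such a categorical semantic skill prior can be trained (cf. the cross-entropy objective summarized in Table C.1 below), consider the following minimal sketch. The two-layer network and the random toy labels are stand-ins for the 5-layer MLPs and the labeled task-agnostic dataset described above; this is an assumption-laden illustration, not our training code.

import torch
import torch.nn as nn

state_dim, num_skills = 60, 7
semantic_prior = nn.Sequential(nn.Linear(state_dim, 128), nn.LeakyReLU(),
                               nn.Linear(128, num_skills))  # stand-in for p_TA(k|s)
optimizer = torch.optim.RAdam(semantic_prior.parameters(), lr=1e-3, betas=(0.9, 0.999))

states = torch.randn(128, state_dim)                 # states from the task-agnostic data
skill_labels = torch.randint(0, num_skills, (128,))  # semantic skill label k for each state

logits = semantic_prior(states)
loss = nn.functional.cross_entropy(logits, skill_labels)  # -sum_i k_i * log p_TA(k_i | s)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))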
When training on image-based human data we add a 6-layer CNN encoder to the semantic skill prior p_TA(k|s) trained on the source domain dataset D_S. The encoder reduces the image resolution by half and doubles the number of channels in each layer, starting with a resolution of 64x64 and 8 channels in the first layer. We use batch normalization and leaky ReLU activations for this encoder too.

The demonstration discriminator D(s) is implemented as a 2-layer MLP with 32 hidden units and no batch normalization to avoid overfitting. We use a sigmoid activation in its final layer to constrain its output to the range (0, 1). For cross-domain state matching we use a symmetric temporal window with γ⁻, γ⁺ = 0.99. Only in the experiments with missing skills (see Section 4.5.3) we set γ⁻, γ⁺ = 0.

All networks are optimized using the RAdam optimizer (Liu et al., 2020) with parameters β_1 = 0.9 and β_2 = 0.999, batch size 128 and learning rate 1×10⁻³. The computational complexity of our approach is comparable to that of prior skill-based RL approaches like Pertsch et al. (2020a). On a single NVIDIA V100 GPU we can train the low-level policy and all skill priors in approximately 10 hours and the demonstration discriminator in approximately 3 hours.

B.2 Semantic Imitation

The high-level policies π_sem(k|s) and π_lat(z|s, k) are implemented as 5-layer MLPs with batch normalization and ReLU activations. The former outputs the logits of a categorical distribution over k, the latter the mean and log-variance of a diagonal Gaussian distribution over z. We initialize the semantic high-level policy with the pre-trained demonstration skill prior p_demo(k|s) and the non-semantic high-level policy with the pre-trained task-agnostic latent skill prior p_TA(z|s, k).

Table C.1: List of all pre-trained model components, their respective functionality and their pre-training objectives. We use ⌊·⌋ to indicate stopped gradients and τ_T to denote demonstration trajectories relabeled with matched target domain states from Section 4.4.3.
• Skill Policy π_l(a|s, k, z): Executes a given skill, defined by semantic skill ID and low-level execution latent. Training objective: Equation (1).
• Demonstration Semantic Skill Distribution p_demo(k|s): Captures the semantic skill distribution of the demonstration sequences. Training objective: E_{s,a,k∼τ_T} [ −Σ_{i=1}^{K} k_i · log p_demo(k_i|s) ].
• Task-Agnostic Semantic Skill Prior p_TA(k|s): Captures the semantic skill distribution of the task-agnostic prior experience. Training objective: E_{s,a,k∼D_T} [ −Σ_{i=1}^{K} k_i · log p_TA(k_i|s) ].
• Task-Agnostic Low-level Execution Prior p_TA(z|s, k): Captures the distribution over low-level execution latents from the task-agnostic prior experience. Training objective: E_{s,a,k∼D_T} [ D_KL( ⌊q(z|s, a, k)⌋, p_TA(z|s, k) ) ].
• Demonstration Support Discriminator D(s): Determines whether a state is within the support of the demonstrations. Training objective: −1/2 · ( E_{s∼τ_T} [log D(s)] + E_{s∼D_T} [log(1 − D(s))] ).

We implement the critic as a 5-layer MLP with 256 hidden units per layer that outputs a |K|-dimensional vector of Q-values. The scalar Q-value is then computed as the expectation under the output distribution of π_sem. We use batch size 256, a replay buffer capacity of 1×10⁶ and discount factor γ = 0.99. We warm-start training by initializing the replay buffer with 2000 steps. We use the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.999 and learning rate 3×10⁻⁴ for updating policy and critic. Analogous to SAC, we train two separate critic networks and compute the Q-value as the minimum over both estimates to stabilize training. The target networks get updated at a rate of τ = 5×10⁻³. The latent high-level policy's actions are limited to the range [−2, 2] by a tanh "squashing function" (see Haarnoja et al. (2018a), appendix C).
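The expectation-based Q-value computation described above can be illustrated with a short sketch: the critic returns one Q-value per semantic skill, and the scalar Q-value is the expectation of this vector under the categorical distribution of π_sem. The network sizes and random inputs below are illustrative assumptions, not our actual architecture code.

import torch
import torch.nn as nn

state_dim, num_skills = 60, 7
critic = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, num_skills))
pi_sem = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_skills))

def expected_q(states: torch.Tensor) -> torch.Tensor:
    q_per_skill = critic(states)                         # [B, |K|]
    skill_probs = torch.softmax(pi_sem(states), dim=-1)  # pi_sem(k|s), [B, |K|]
    return (skill_probs * q_per_skill).sum(dim=-1)       # E_{k ~ pi_sem}[Q(s, k)], [B]

print(expected_q(torch.randn(4, state_dim)).shape)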
We set all α parameters to fixed values of 10 in the maze navigation task and 5×10⁻² in all kitchen tasks.

For reward computation we set the factor κ = 0.9, i.e., we blend environment and discriminator-based rewards. In practice, we find that we can improve convergence speed by using a shaped discriminator reward that increases towards the end of the demonstration. This is comparable to goal-proximity based rewards used in in-domain imitation, e.g., Lee et al. (2021b). To compute the shaped reward, we pre-train a progress predictor P(s) along with the discriminator D(s). P(s) estimates the time step of a state within a demonstration relative to the total length of the demonstration, so its outputs are bound to the range [0, 1]. We implement the progress predictor as a simple 3-layer MLP with a sigmoid output activation. During RL training we can then compute the shaped reward as

r(s, a) = κ · R(s, a) + (1 − κ) · { P(s) · R_D  if P(s) ≥ 0.5;  R_D  otherwise },   with   R_D = log D(s_t) − log(1 − D(s_t)).   (B.1)

For all RL results we average the results of three independently seeded runs and display the mean and standard deviation across seeds. The computation time for these experiments varies by environment and is mainly determined by the simulation time of the used environments. Across all environments we can typically finish downstream task training within 12h on a single NVIDIA V100 GPU.

B.3 Comparisons

SPiRL. We follow the approach of Pertsch et al. (2020a), which first trains a latent skill representation from task-agnostic data and then uses a pre-trained task-agnostic prior to regularize the policy during downstream learning. To allow for a fair comparison, we adapt the SPiRL approach to work with our semantic skill model. In this way both SPiRL and STAR use the same set of learned semantic skills. During downstream task learning we regularize both high-level policies π_sem(k|s) and π_lat(z|s, k) using the corresponding task-agnostic skill priors p_TA(k|s) and p_TA(z|s, k), analogous to the task-agnostic skill prior regularization in the original SPiRL work.

SkiLD. We similarly adapt SkiLD (Pertsch et al., 2021) to work with our learned semantic skill model. In contrast to the SPiRL comparison, we now regularize both high-level policies with skill distributions trained on the target domain demonstrations whenever D(s) classifies a state as being part of the demonstration support (see Chapter 3).

SkillSeq. We pre-train a skill-ID conditioned policy on the task-agnostic target domain dataset D_T using behavioral cloning. We split this policy into a 3-layer MLP encoder and a 3-layer MLP policy head that produces the output action. The policy has an additional 3-layer MLP output head that is trained to estimate whether the current skill terminates in the input state. We use the semantic skill labels k in the task-agnostic dataset to determine states in which a skill ends and train the termination predictor as a binary classifier. During downstream learning, we use a programmatic high-level policy that has access to the true sequence of semantic skills required to solve the downstream task and conditions the low-level policy on these skill IDs one-by-one.
The skill ID is switched to the next skill when the pre-trained termination predictor infers the current state as a terminal state for the current skill. For fair comparison we use online RL for finetuning the skill-conditioned policy via soft actor-critic (SAC, Haarnoja et al. (2018a)).

BC+RL. We train a policy directly on the source domain demonstrations via behavioral cloning. We then use this pre-trained policy to initialize the policy during target task training in the target domain. We fine-tune this initialization using SAC with the rewards provided by the target environment. Similar to Rajeswaran et al. (2018); Nair et al. (2018) we regularize the policy towards the pre-trained BC policy during downstream learning.

Figure C.1: Qualitative maze results. We visualize the trajectories of the different policies during training. The SkiLD approach leverages in-domain demonstrations to quickly learn how to reach the goal. SPiRL leverages skills from the task-agnostic dataset to widely explore the maze, but fails to reach the goal. SkillSeq makes progress towards the goal, but can get stuck in intermediate rooms, leading to a substantially lower success rate than the oracle method. Our approach, STAR, is able to match the performance of the oracle baseline and quickly learn to reach the goal while following the sequence of demonstrated rooms.

B.4 Environments and Data Collection

Maze navigation. We generate two maze layouts with the same number of rooms. We indicate a room's semantic ID via its color in Figure C.3. We ensure the same "room connectivity" between both layouts, i.e., corresponding semantic rooms have the same openings between each other. For example, the yellow room connects to the blue room but not to the green room in both layouts. This ensures that we can follow the same sequence of semantic rooms in both environments. While we ensure that the semantic layout of the mazes is equivalent, their physical layout is substantially different: the mazes are rotated by 180 degrees, for example the red room is in the bottom right corner for the first maze but in the top left corner for the second. Additionally, the layout of individual rooms and the positions of obstacle walls change between the domains. As a result, simple imitation of the low-level planar velocity actions from one domain will not lead to successfully following the same sequence of semantic rooms in the other domain. We define a total of 48 semantic skills: one for each room-to-room traversal, e.g., "go from the red room to the green room", and one for reaching a goal within each room, e.g., "reach a goal in the green room".

Figure C.2: Three semantically equivalent kitchen environments. Left: FrankaKitchen environment (Gupta et al., 2019); middle: rearranged FrankaKitchen environment; right: real-world kitchen. In all three environments we define the same set of seven semantic object manipulation skills like "open the microwave", "turn on the stove" etc. The two simulated kitchen environments require different robot joint actuations to perform the same semantic skills. The real-world kitchen has a different agent embodiment (robot vs. human), layout and observation domain (low-dimensional state vs. image observations).
Thus, the semantic description of a demonstrated trajectory could for example be: "Go from the red room to the green room, then from the green to the beige room, ..., then from the blue to the orange room, and then reach a goal in the orange room."

Figure C.3: Source and target semantic maze navigation environments. A room's color indicates its semantic ID. The red trajectory shows the traversal of semantic rooms demonstrated in the source domain and the corresponding trajectory in the target domain. The low-level planar velocity commands required to follow the demonstration in the target domain are substantially different.

Simulated Kitchen. We use the FrankaKitchen environment of Gupta et al. (2019) (see Figure C.2, left) and define a set of seven semantic manipulation skills: opening the microwave, opening the slide and hinge cabinet, turning on the bottom and top stove and flipping the light switch. We also create a rearranged version of the kitchen environment (Figure C.2, middle) with a different layout and visual appearance but the same set of semantic interaction options. In both environments we use the state-action definition of Gupta et al. (2019): (1) a 60-dimensional state representation consisting of the agent's joint state as well as object states like opening angles or object pose, and (2) a 9-dimensional action space consisting of 7 robot joint velocities and two gripper finger positions. For the FrankaKitchen environment we can use the data provided by Gupta et al. (2019): 600 human teleoperated sequences, each solving four different semantic tasks in sequence. In the newly created rearranged kitchen environment we collect a comparable dataset by controlling the robot via trajectory optimization on a dense reward function. We use the CEM implementation of Lowrey et al. (2019). For both datasets we label the semantic skills by programmatically detecting the end state of an object interaction using the low-dimensional state representation.

Real-World Kitchen. Data collection is performed by attaching a GoPro camera to the head of a human data collector, who then performs a sequence of semantic skills. The camera is angled to widely capture the area in front of the human. During data collection and within each trajectory we vary the viewpoint, the skill execution speed and the hand used for skill execution. We collect 20 human demonstration sequences for the task sequence: open microwave, move kettle, turn on stove, open cabinet. We then automatically generate semantic skill predictions via zero-shot inference with a pre-trained action recognition model. Specifically, we use the publicly available SlowFast model trained on the EPIC Kitchens 100 dataset (Damen et al., 2021; Fan et al., 2020). The model takes in a window of 32 consecutive video images at a 256×256 px resolution and outputs a distribution over 97 verb and 300 object classes. Since our simulated FrankaKitchen target environment does not support the same set of skills, we define a mapping from the output of the EPIC Kitchens model to the applicable skills in the FrankaKitchen environment, e.g., we map outputs for the verb "open" and the noun "microwave" to the "open microwave" skill in FrankaKitchen. Note that some skill distinctions in the FrankaKitchen environment are not supported by EPIC Kitchens, like "turn on top burner" vs. "turn on bottom burner". In such cases we map the outputs of the EPIC Kitchens model to a single skill in the target environment. With this skill mapping we finetune the EPIC Kitchens model for outputting the relevant classes. Note that this model finetuning is performed with the original EPIC Kitchens data, i.e., no additional, domain-specific data is used in this step and no additional annotations need to be collected. This finetuning is performed such that the resulting model directly outputs a distribution over the relevant skills. Alternatively, the relevant skills could be extracted from the output of the original model and the distribution could be renormalized.
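To illustrate the many-to-one mapping from EPIC Kitchens (verb, noun) predictions to FrankaKitchen skills described above, a minimal sketch could look as follows. The concrete verb and noun class names in the dictionary are assumptions for illustration only; the actual EPIC Kitchens vocabulary and our full mapping differ.

from typing import Optional

# Illustrative mapping; entries are examples only.
epic_to_franka = {
    ("open", "microwave"): "open microwave",
    ("turn-on", "hob"): "turn on stove",      # top vs. bottom burner are not distinguished
    ("move", "kettle"): "move kettle",
    ("open", "cupboard"): "open cabinet",
}

def map_prediction(verb: str, noun: str) -> Optional[str]:
    # Returns the FrankaKitchen skill for a predicted (verb, noun) pair,
    # or None if the prediction has no counterpart in the target environment.
    return epic_to_franka.get((verb, noun))

print(map_prediction("open", "microwave"))  # -> "open microwave"
print(map_prediction("wash", "plate"))      # -> None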
To generate the skill predictions for the human video demonstrations, we move a sliding window of 32 frames over the demonstrations and generate a prediction in each step using the EPIC Kitchens model. We pad the resulting skill distribution sequence with the first and last predicted skill distribution to obtain the same number of skill predictions as there are frames in the demonstration video. Then we use the sequence of skill distributions to perform cross-domain matching and semantic imitation as detailed in Section 4.4.2, without any changes to the algorithm.

C Imitation Learning Results

Figure C.4: Imitation learning on the simulated FrankaKitchen task. Our approach STAR is able to learn the target task even without access to any environment rewards, while common imitation learning approaches fail to learn the task due to the large domain gap between source demonstrations and target task environment.

We evaluate our approach in the "pure" imitation learning setting in the kitchen environment. Here, we assume no access to environment rewards. Instead, we rely solely on the discriminator-based reward learned from the cross-domain demonstrations to guide learning (see Section 4.4.2).

We present evaluations in the FrankaKitchen environment in Figure C.4. Our approach STAR is able to learn the target task from demonstrations without any environment rewards, although learning is somewhat slower than in the demonstration-guided RL setting with environment reward access. In contrast, standard imitation learning approaches are unable to learn the task since they struggle with the large domain gap between source domain demonstrations and target domain execution. These results show that our approach STAR is applicable both in the demonstration-guided RL setting with environment rewards and in the imitation learning setting without environment rewards.

D Label Noise Robustness Analysis

An important aspect of the scalability of an approach is its ability to cope with noise in the training data. While prior work on skill-based RL has investigated the robustness of such approaches to suboptimal behavior in the training data (Pertsch et al., 2020a), we will focus on an aspect of the training data that is specifically important for our cross-domain imitation approach: the semantic skill labels. In this section, we investigate the robustness of our approach to different forms of noise on the semantic skill labels. Such noise can either be introduced through inaccuracies in the manual labeling process or via an automated form of skill labeling, as performed with the EPIC Kitchens model in Section 4.5.2.

To cleanly investigate the robustness to different forms of skill label noise, we start from a noise-free set of labels, which we can easily obtain programmatically in the simulated FrankaKitchen environment. We then artificially perturb the labels to introduce artifacts that mimic realistic labeling errors.
This allows us to (1) investigate different forms of noise independently and (2) vary the magnitude of the introduced noise in a controlled way. Specifically, we introduce noise along three axes:

• skill length noise: artificially perturbs the length of a labeled skill within a range [1 − l_n, 1 + l_n] of the true length of the skill, mimicking a labeler's uncertainty about when exactly a skill ends

• skill uncertainty noise: perturbs the distribution over detected skills around N_n transitions between skills by adding probability weight to erroneous skills produced via a random walk, mimicking the uncertainty produced, e.g., by a pre-trained action recognition model

• skill misdetection noise: adds N_m incorrectly detected skill segments of randomly sampled lengths at randomly sampled points throughout the sequence, mimicking mis-labelings which can (rarely) occur in human data or (more frequently) in auto-labeled data

We show evaluations of our approach with different levels of noise along all three axes in Figure C.5. We perform these evaluations in the simulated FrankaKitchen environment and average performance across 10 seeds to reduce the noise-induced variance in the results. The parameters of the different tested noise levels are detailed in Table C.2.

Figure C.5: Robustness of our approach, STAR, to different forms of noise in the semantic skill labels. Our approach is robust to noise in the length of annotated skills and uncertainty between different skills. While STAR also shows some robustness to completely incorrect skill labels, frequent and confident mis-detections / mis-labelings can lead to errors during the cross-domain matching and thus impair learning performance on the target task.

Table C.2: Parametrization of the noise levels for the skill label robustness experiment. The varied parameters are l_n (relative length noise window) for skill length noise, N_n (number of uncertain segments) for skill uncertainty noise, and N_m (number of misdetected segments) for skill misdetection noise.
• Weak noise: l_n = 10%, N_n = 1, N_m = 1
• Middle noise: l_n = 20%, N_n = 2, N_m = 2
• Strong noise: l_n = 30%, N_n = 3, N_m = 3

The results in Figure C.5 show that STAR is robust to a wide range of noise levels in the annotated skill length and in the uncertainty between skills: the performance does not significantly change even with increased noise levels. However, we find that confident mis-predictions / mis-labelings of skills can have a negative impact on the performance. Particularly if mis-predictions happen frequently ("Noise (strong)"), states between the source and target domain can be mismatched, leading to worse target task performance. But we find that even in the case of mis-detections STAR is able to handle a moderate amount of such noise robustly, which is important for STAR's scalability to large and noisy real-world datasets.
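As an example of how such label perturbations can be generated, the sketch below implements one plausible version of the skill length noise described above: every labeled segment keeps its skill ID, but its length is rescaled by a random factor in [1 − l_n, 1 + l_n] while keeping the segments contiguous. The (skill, start, end) tuple format is an assumption for illustration, not the label format used in our experiments.

import random

def perturb_skill_lengths(segments, l_n=0.1, seed=0):
    # segments: list of (skill_id, start_frame, end_frame), assumed contiguous.
    rng = random.Random(seed)
    noisy = []
    cursor = segments[0][1]
    for skill_id, start, end in segments:
        length = end - start
        new_length = max(1, round(length * rng.uniform(1.0 - l_n, 1.0 + l_n)))
        noisy.append((skill_id, cursor, cursor + new_length))
        cursor += new_length
    return noisy

labels = [("open microwave", 0, 60), ("move kettle", 60, 130), ("turn on stove", 130, 200)]
print(perturb_skill_lengths(labels, l_n=0.2))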
Figure C.6: Visualization of all model components. The colors indicate the objective type used for training (pre-trained with supervised learning and frozen vs. trained online with RL). Only the high-level policy is trained with online RL on the downstream task; all other components are pre-trained fully offline via supervised learning and frozen during downstream training.

E Detailed Ablation Description

We provide an overview of the components of our approach in Figure C.6. The figure highlights that most components are trained offline with simple supervised objectives and then frozen during downstream task learning, making their training straightforward and reproducible. Only the high-level semantic and execution policies are trained via online RL on the downstream task. We also provide a more detailed description of the performed ablation studies from Figure 4.5, right, below. These ablation studies demonstrate the importance of the different components of our model. Finally, we visualize the resulting models for multiple of our ablation studies in Figure C.7.

STAR - D-reward. Ablates the discriminator-based dense reward (see Section 4.4.2). Instead trains the high-level policy only based on the environment-provided reward on the downstream task.

STAR - TempAgg. Ablates the temporal aggregation during cross-domain matching (see Section 4.4.3). Instead uses single-state semantic skill distributions to find matching states.
[Figure C.7 diagram, three panels: the "-DemoReg" variant (semantic policy with task-agnostic regularization, execution latent policy with execution regularization, low-level policy), the "-D-weight" variant (semantic policy with demonstration regularization, execution latent policy with execution regularization, low-level policy), and the "-z" variant (semantic policy with discriminator-weighted demonstration and task-agnostic regularization, low-level policy π_l(a|s,k) without an execution latent).]
Figure C.7: Overview of some of the performed ablations.
-DemoReg: removes the demonstration regularization for the high-level semantic policy and uses only the task-agnostic prior for regularization; -D-weight: removes the discriminator-based weighting between demonstration and task-agnostic regularization and uses only the former for guiding the policy during downstream training; -z: removes the latent variable z from the low-level policy and instead uses a deterministic low-level policy and no execution latent policy.

STAR - DemoReg. Ablates the policy regularization with cross-domain skill distributions. Instead simply regularizes with task-agnostic skill priors derived from the target-domain play data (see Figure C.7, left).

STAR - D-weight. Ablates the discriminator-based weighting between demonstration and task-agnostic skill distributions. Instead always regularizes the high-level semantic policy towards the demonstration skill distribution (see Figure C.7, middle).

STAR - z. Ablates the use of the latent execution variable z in the skill policy. Instead trains a simpler low-level policy without the latent variable z and removes the execution latent policy (see Figure C.7, right).

F Additional Ablation Experiments

We perform an additional ablation experiment to test whether replacing the high-level policy's weighted KL-regularization scheme from equation 4.3 with a simpler behavioral cloning regularization objective can lead to comparable performance. Concretely, we replace the policy's objective from equation 4.3 with:

$$\max_{\pi}\;\Big[\,Q(s,a) \;-\; \alpha\,\underbrace{\mathbb{E}_{k\sim\pi^{\mathrm{sem}}(k|s)}\,p^{\mathrm{demo}}(k|s)}_{\text{BC regularization}} \;-\; \alpha_l\,\underbrace{D_{\mathrm{KL}}\big(\pi^{\mathrm{lat}}(z|s,k),\,p^{\mathrm{TA}}(z|s,k)\big)}_{\text{task-agnostic execution prior regularization}}\,\Big]. \tag{F.1}$$

We also experimented with removing the execution prior regularization term, i.e., setting α_l = 0, but found it to be crucial for training since the initial policy rapidly degrades without it.

[Figure: success-rate comparison between STAR (ours), BC Regularization, and BC Planner on the target task.]

We report quantitative results on the human-video-demonstration to simulated kitchen manipulation task in the figure on the right. The BC-Reg objective in equation F.1 obtains 75% lower performance than our full objective from equation 4.3. This is because the behavioral cloning regularization is also computed on states outside the demonstrations' support, leading to incorrect regularization. Instead, our approach uses the discriminator to only apply regularization within the support of the demonstrations. We also add a comparison to an even simpler baseline that clones the transferred semantic skill embeddings from the demonstrations, equivalent to a semantic-level BC planner. This approach does not perform well due to accumulating errors of the high-level planner (see figure on the right). Without online training, this approach cannot correct the shortcomings of the planner.

Algorithm 5 STAR (Semantic Transfer Accelerated RL) – High-level Policy Optimization
1: Inputs: H-step reward function r̃(s_t, k_t, z_t), reward weight κ, discount η, target divergences δ_q, δ_p, δ_l, learning rates λ_π, λ_Q, λ_α, target update rate τ.
2: Load pre-trained & freeze: low-level policy π^l(a_t|s_t,k_t,z_t), skill priors p_demo(k_t|s_t), p_TA(k_t|s_t), p_TA(z_t|s_t,k_t), discriminator D(s), progress predictor P(s)
3: Initialize: replay buffer D, high-level policies π^sem_σ(k_t|s_t), π^lat_θ(z_t|s_t,k_t), critic Q_φ(s_t,k_t,z_t), target network Q_φ̄(s_t,k_t,z_t)
4:
5: for each iteration do
6:   for every H environment steps do
7:     k_t ∼ π^sem(k_t|s_t)   ▷ sample semantic skill from policy
8:     z_t ∼ π^lat(z_t|s_t,k_t)   ▷ sample latent skill from policy
9:     s_t' ∼ p(s_{t+H}|s_t, π^l(a_t|s_t,k_t,z_t))   ▷ execute skill in environment
10:    D ← D ∪ {s_t, k_t, z_t, R_Σ, s_t'}   ▷ store transition in replay buffer; R_Σ = H-step summed reward
11:  end for
12:  for each gradient step do
13:    r̃ = (1−κ)·R_Σ + κ·(log D(s_t) − log(1 − D(s_t)))   ▷ compute combined reward
14:    r̃ = 1(D(s) < 0.5)·r̃ + 1(D(s) ≥ 0.5)·r̃·P(s)   ▷ optionally shape reward with progress predictor
15:    Q̄ = r̃ + η Q_φ̄(s_t', π^lat_θ(z_t'|s_t', π^sem_σ(k_t'|s_t')))
16:        − α_q D_KL(π^sem_σ(k_t'|s_t'), p_demo(k_t'|s_t')) · D(s_t')
17:        + α_p D_KL(π^sem_σ(k_t'|s_t'), p_TA(k_t'|s_t')) · (1 − D(s_t'))
18:        + α_l D_KL(π^lat_θ(z_t'|s_t',k_t'), p_TA(z_t'|s_t',k_t'))   ▷ compute Q-target
19:    (σ,θ) ← (σ,θ) − λ_π ∇_(σ,θ) [ Q_φ(s_t, π^lat_θ(z_t|s_t, π^sem_σ(k_t|s_t)))
20:        − α_q D_KL(π^sem_σ(k_t|s_t), p_demo(k_t|s_t)) · D(s_t)
21:        + α_p D_KL(π^sem_σ(k_t|s_t), p_TA(k_t|s_t)) · (1 − D(s_t))
22:        + α_l D_KL(π^lat_θ(z_t|s_t,k_t), p_TA(z_t|s_t,k_t)) ]   ▷ update HL policy
23:    φ ← φ − λ_Q ∇_φ ½ (Q_φ(s_t,k_t,z_t) − Q̄)²   ▷ update critic weights
24:    α_q ← α_q − λ_α ∇_{α_q} [ α_q · (D_KL(π^sem_σ(k_t|s_t), p_demo(k_t|s_t)) − δ_q) ]   ▷ update alpha values
25:    α_p ← α_p − λ_α ∇_{α_p} [ α_p · (D_KL(π^sem_σ(k_t|s_t), p_TA(k_t|s_t)) − δ_p) ]
26:    α_l ← α_l − λ_α ∇_{α_l} [ α_l · (D_KL(π^lat_θ(z_t|s_t,k_t), p_TA(z_t|s_t,k_t)) − δ_l) ]
27:    φ̄ ← τφ + (1−τ)φ̄   ▷ update target network weights
28:  end for
29: end for
30: return trained high-level policies π^sem_σ(k_t|s_t), π^lat_θ(z_t|s_t,k_t)

Appendix D
Meta Reinforcement Learning with Learned Skills

A Meta-reinforcement Learning Method Ablation

In this section, we compare the learning efficiency of different meta-RL algorithms with respect to the length of the training tasks. Specifically, we hypothesize that our approach SiMPL, which extracts temporally extended skills from offline experience, is better suited for learning long-horizon tasks than prior meta-RL algorithms. To cleanly investigate the importance of temporally extended skills vs. the importance of using prior experience, we include two additional comparisons to methods that leverage prior experience for meta-RL via flat behavioral cloning instead of through temporally extended skills:

• BC+PEARL first learns a behavior cloning (BC) policy through supervised learning from the offline dataset. Then, analogous to our approach SiMPL, during the meta-training phase a task encoder and a meta-learned policy are meta-trained with the BC-policy-constrained SAC objective (a minimal sketch of such a constrained actor update is given after this list). For a fair comparison, we use the same residual policy parameterization as described in Section E.1.

• BC+MAML follows the same learning procedure described above, but uses MAML (Finn et al., 2017b) for meta-training instead of PEARL. We follow the original learning objective of Finn et al. (2017b) (i.e., REINFORCE (Williams, 1992) for task adaptation and TRPO (Schulman et al., 2017) for meta-policy optimization).
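To make the constrained objective referenced above concrete, the following is a minimal sketch of a KL-regularized actor update of the general form used in this chapter: maximize the Q-value while staying close to a frozen reference distribution (a skill prior for SiMPL, a behavior-cloned policy for the BC+PEARL comparison). All module and variable names are illustrative; the exact objective used by SiMPL is given in Equation 5.3.

```python
import torch
from torch.distributions import kl_divergence

def constrained_actor_loss(critic, policy, prior, states, task_emb, alpha):
    """Prior-constrained actor update: maximize Q while penalizing KL to a frozen prior."""
    pi = policy(states, task_emb)              # distribution over skills z (e.g., a Normal)
    z = pi.rsample()                           # reparameterized sample for the critic
    q = critic(states, z, task_emb)            # task-conditioned Q-value
    kl = kl_divergence(pi, prior(states))      # divergence from the frozen prior / BC policy
    return (-q + alpha * kl).mean()            # minimizing this maximizes Q - alpha * KL
```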
We compare these methods as well as the standard meta-RL approach PEARL (Rakelly et al., 2019) on three meta-training task distributions of increasing complexity in the maze navigation environment (see Figure D.1): (1) short-range goals with small variance, T_TRAIN-EASY, (2) short-range goals with larger variance, T_TRAIN-MEDIUM, and (3) long-range goals with large variance, T_TRAIN-HARD, which we used in our original maze experiments. By increasing the variance and length of the tasks in each task distribution, we can investigate the learning capability of the meta-RL algorithms.

We present the quantitative results in Figure D.2 and the corresponding qualitative analysis in Figure D.3. On the simplest task distribution, we find that all approaches can learn to solve the tasks efficiently, except for BC+MAML. While the latter also learns to solve the task eventually (see performance upon convergence as the dashed orange line in Figure D.2a), it uses on-policy meta-RL and thus requires substantially more environment interactions during meta-training. We therefore only consider the more sample-efficient off-policy BC+PEARL method in the remaining comparisons. On the more complex task distributions T_TRAIN-MEDIUM and T_TRAIN-HARD, we find that using prior data for meta-learning is generally beneficial: both BC+PEARL and SiMPL learn more efficiently on the medium-difficulty task distribution T_TRAIN-MEDIUM, as shown in Figure D.2b, since the policy pre-trained from offline data allows for more efficient exploration during meta-training. Importantly, on the hardest task distribution T_TRAIN-HARD, as shown in Figure D.2c, which consists exclusively of long-horizon tasks, we find that only SiMPL is able to learn effectively, highlighting the importance of leveraging the offline data via temporally extended skills instead of flat behavioral cloning. This supports our intuition that the abstraction provided by skills is particularly beneficial for meta-learning on long-horizon tasks.

B Learning Efficiency on Target Tasks with Few Episodes of Experience

In this section, we examine the data efficiency of the compared methods on the target tasks, specifically when provided with only a few (<20) episodes of online interaction with an unseen target task.

[Figure D.1 panels: (a) T_TRAIN-EASY, (b) T_TRAIN-MEDIUM, (c) T_TRAIN-HARD]
Figure D.1: Task Distributions for Task Length Ablation. We propose three meta-training task distributions of increasing difficulty to compare different meta-RL algorithms: T_TRAIN-EASY uses short-horizon tasks with adjacent goal locations, making exploration easier during meta-training; T_TRAIN-MEDIUM uses a similar task horizon but increases the goal position variance; T_TRAIN-HARD contains long-horizon tasks with high variance in goal position and is thus the hardest of the tested task distributions.

[Figure D.2 panels: (a) T_TRAIN-EASY, (b) T_TRAIN-MEDIUM, (c) T_TRAIN-HARD]
Figure D.2: Meta-Training Performance for Task Length Ablation. We find that most meta-learning approaches can solve the simplest task distribution, but using prior experience in BC+PEARL and SiMPL helps for the more challenging distributions (b) and (c). We find that only our approach, which uses the prior data by extracting temporally extended skills, is able to learn the challenging long-horizon tasks efficiently.
[Figure D.3 panels: (a) PEARL, (b) BC+PEARL, (c) SiMPL on T_TRAIN-EASY; (d)-(f) the same methods on T_TRAIN-MEDIUM; (g)-(i) the same methods on T_TRAIN-HARD]
Figure D.3: Qualitative Result of Meta-reinforcement Learning Method Ablation. Top: All methods can learn to solve the short-horizon tasks T_TRAIN-EASY. Middle: On medium-horizon tasks T_TRAIN-MEDIUM, PEARL struggles to explore further, while BC+PEARL exhibits more consistent exploration yet still fails to solve some of the tasks. SiMPL can explore well and solve all the tasks. Bottom: On long-horizon tasks T_TRAIN-HARD, PEARL falls into a local minimum, focusing only on a single task on the left. BC+PEARL explores slightly better and can solve a few more tasks. SiMPL can effectively learn all the tasks.

[Figure D.4: learning curves for SiMPL (Ours), SPiRL, MTRL, PEARL, and SAC.]
Figure D.4: Performance with few episodes of target task interaction. We find that our skill-based meta-RL approach SiMPL is able to learn complex, long-horizon tasks within a few episodes of online interaction with a new task, while prior meta-RL approaches and non-meta-learning baselines require many more interactions or fail to learn the task altogether.

Being able to learn new tasks this quickly is a major strength of meta-RL approaches (Finn et al., 2017b; Rakelly et al., 2019). We hypothesize that our skill-based meta-RL algorithm SiMPL can learn similarly fast, even on long-horizon, sparse-reward tasks. In our original evaluations in Section 5.5, we used 20 episodes of initial exploration to condition our meta-trained policy. In Figure D.4, we instead compare the performance of the different approaches when only provided with very few episodes of online interaction. We find that SiMPL learns to solve the unseen tasks substantially faster than all alternative approaches. On the kitchen manipulation tasks, our approach learns to almost solve two out of four subtasks within a time span equivalent to only a few minutes of real-robot execution time. In contrast, prior meta-RL methods struggle to make progress at all on such long-horizon tasks, showing the benefit of combining meta-RL with prior offline experience.

C Investigating Offline Data vs. Target Domain Shift

To provide more insight into the comparison between SiMPL and SPiRL (Pertsch et al., 2020a), we evaluate SiMPL in the maze navigation task setup proposed in Pertsch et al. (2020a). This tests whether our approach can scale to image-based observations: Pertsch et al. (2020a) use 32x32px observations centered around the agent. Even more importantly, it allows us to investigate the robustness of the approach to domain shifts between the offline pre-training data and the target task: we use the maze navigation offline dataset from Pertsch et al. (2020a), which was collected on randomly sampled 20x20 maze layouts, and test on tasks in the unseen, randomly sampled 40x40 test maze layout from Pertsch et al. (2020a). We visualize the meta-training task distribution in Figure D.5a and the target task distribution in Figure D.5b. We compare the performance of our method to the best-performing baseline, SPiRL (Pertsch et al., 2020a), in Figure D.5c. Similar to the result presented in Figure 5.4, SiMPL can learn the target task faster by combining skills learned from the offline dataset with efficient meta-training.
This shows that our approach can scale to image-based inputs and is robust to substantial domain shifts between the offline pre-training data and the target tasks.

Note that the above results are obtained by comparing our proposed method and SPiRL with the exact same setup used in the SPiRL paper (Pertsch et al., 2020a). Specifically, we use the same initial position of the agent and sample tasks of comparable complexity to the ones used in the SPiRL paper for our evaluation (see Figure 13 in the SPiRL paper for the tasks used in their evaluation). While the test tasks do not fully cover the entire maze, they are already considerably long-horizon, requiring on average 710 steps until completion while providing only sparse goal-reaching rewards. To further explore the performance of our proposed method and SPiRL, we experimented with learning from goals sampled across the entire maze. However, SPiRL cannot learn such target tasks, and our proposed method consequently does not converge well on the meta-training tasks. This highlights a limitation of skill-based RL methods that could potentially be addressed by learning a more expressive skill prior, e.g., using flow models (Dinh et al., 2017); this is outside the scope of this work.

D Extended Related Work

We present an extended discussion of the related work in this section.

[Figure D.5 panels: (a) T_TRAIN-IMAGE-BASED, (b) T_TARGET-IMAGE-BASED, (c) Target Task Learning Efficiency]
Figure D.5: Image-Based Maze Navigation with Distribution Shift. (a-b): Meta-training and target task distributions. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represents the initial location of the agent, which is equivalent to the one used in Pertsch et al. (2020a). (c): Performance on the target task. Our approach SiMPL can leverage skills learned from offline data for efficient meta-RL on the maze navigation task and is robust to the domain shift between the offline data environments and the target environment.

Pre-training in Meta-learning. Leveraging pre-trained models to improve meta-learning methods has been explored in (Bronskill et al., 2021; Dvornik et al., 2020; Kolesnikov et al., 2020; Triantafillou et al., 2020) with a focus on few-shot image classification. One can also view our proposed framework as a meta-reinforcement learning method with a pre-training phase. Specifically, in the pre-training phase, we propose to first extract reusable skills and a skill prior from offline datasets without reward or task annotations in a self-supervised fashion. Then, our proposed method meta-learns from a set of meta-training tasks, which significantly accelerates learning on unseen target tasks.

E Implementation Details on Our Method

In this section, we describe additional implementation details of our proposed method. The details of the model architecture are presented in Section E.1, followed by the training details in Section E.2.

E.1 Model architecture

We describe the details of our model architecture in this section.

Skill Prior. We follow the architecture and learning procedure of Pertsch et al. (2020a) for learning a low-level skill policy and a skill prior. Please refer to Pertsch et al. (2020a) for more details on the architectures for learning skills and skill priors from offline datasets.

Task Encoder. Following Rakelly et al. (2019), our task encoder is a permutation-invariant neural network.
Specifically, we adopt a Set Transformer (Lee et al., 2019a) consisting of the layers [2x ISAB_32 → PMA_1 → 3x MLP] for expressive and efficient set encoding. All hidden layers are 128-dimensional and all attention layers have 4 attention heads. The encoder takes a set of high-level transitions as input, where each transition is the vector concatenation of a high-level transition tuple. The output of the encoder is (µ_e, σ_e), the parameters of the Gaussian task posterior p(e|c) = N(e; µ_e, σ_e). We vary the task vector dimension dim(e) depending on the complexity of the task distribution: dim(e) = 10 for Kitchen Manipulation, dim(e) = 6 for Maze Navigation with 40 meta-training tasks, and dim(e) = 5 otherwise.

Policy. We parameterize our policy with a neural network: a 4-layer MLP with 256 hidden units for the Maze Navigation experiments and a 6-layer MLP with 128 hidden units for the Kitchen Manipulation experiment. Instead of directly parameterizing the policy, the network output is added to the skill prior to make learning more stable. Specifically, the policy network takes the concatenation of (s, e) as input and outputs residual parameters (µ_z, log σ_z) with respect to the skill-prior distribution p(z|s) = N(z | µ_p, σ_p). The distribution resulting from this residual parameterization is π(z|s) = N(z | µ_p + µ_z, exp(log σ_p + log σ_z)) (an illustrative sketch of this parameterization is given below).

Critic. The critic network takes the concatenation of s, e, and the skill z as input and outputs an estimate of the task-conditioned Q-value Q(s, z, e). We employ double Q networks (Van Hasselt et al., 2016) to mitigate Q-value overestimation. The critic architecture follows that of the policy.

E.2 Training details

For all network updates, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4, β_1 = 0.9, and β_2 = 0.999. We describe the training details of the skill-based meta-training phase in Section E.2.1 and of the target task learning phase in Section E.2.2.

E.2.1 Skill-based Meta-training

Our meta-training procedure is similar to the one adopted in Rakelly et al. (2019). The encoder and critic networks are updated to minimize the MSE between the Q-value prediction and the target Q-value. The policy network is updated to optimize Equation 5.3 without updating the encoder network. All networks are updated with gradients averaged over 20 randomly sampled tasks. Each gradient batch is computed from 1024 and 256 transitions for the Maze Navigation and Kitchen Manipulation experiments, respectively. We train our models for 10000, 18000, and 16000 episodes for the Maze Navigation experiments with 10, 20, and 40 meta-training tasks, respectively, and for 3450 episodes for Kitchen Manipulation. As stated in Section 5.4.2, we apply different regularization coefficients depending on the size of the conditioning transition set. In the Maze Navigation experiment, we set the target KL divergence to 0.1 for batches conditioned on 4 transitions and to 0.4 for batches conditioned on 8192 transitions. In the Kitchen Manipulation experiment, we set the target KL divergence to 0.4 for batches conditioned on 1024 transitions, while the KL coefficient for batches conditioned on 2 transitions is fixed to 0.3.

E.2.2 Target Task Learning

We initialize the Q-function and the auto-tuned coefficient α with the values learned in the skill-based meta-training phase. The policy is initialized after observing and encoding 20 episodes obtained from rollouts of the task-unconditioned policy. For the target task learning phase, the target KL δ is 1 for the Maze Navigation experiments and 2 for the Kitchen Manipulation experiments.
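As a concrete illustration of the residual policy parameterization described in Section E.1 above, consider the following minimal sketch. The module names, layer construction, and the assumption that the frozen prior returns (µ_p, log σ_p) are ours; this is not the released implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ResidualSkillPolicy(nn.Module):
    """Outputs residual (mu_z, log_sigma_z) that shift a frozen skill prior N(mu_p, sigma_p)."""

    def __init__(self, state_dim, task_dim, skill_dim, hidden=256, layers=4):
        super().__init__()
        dims = [state_dim + task_dim] + [hidden] * layers
        body = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            body += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.body = nn.Sequential(*body)
        self.head = nn.Linear(hidden, 2 * skill_dim)   # residual mean and log-std

    def forward(self, state, task_emb, prior):
        mu_p, log_sigma_p = prior(state)               # frozen skill prior p(z|s), assumed interface
        h = self.body(torch.cat([state, task_emb], dim=-1))
        mu_z, log_sigma_z = self.head(h).chunk(2, dim=-1)
        # residual parameterization: pi(z|s,e) = N(mu_p + mu_z, exp(log_sigma_p + log_sigma_z))
        return Normal(mu_p + mu_z, torch.exp(log_sigma_p + log_sigma_z))
```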
To compute a gradient step, 256 high-level transitions are sampled from a replay buffer of size 20000. Note that we use the same setup for the baselines that use SPiRL fine-tuning (SPiRL and MTRL).

F Implementation Details on Baselines

In this section, we describe additional implementation details for producing the results of the baselines.

SAC. The SAC (Haarnoja et al., 2018a) baseline learns to solve a target task from scratch without leveraging the offline dataset or the meta-training tasks. We initialize α to 0.1 and set the target entropy to H = −dim(A). To compute a gradient step, 4096 and 1024 environment transitions are sampled from a replay buffer for the Maze Navigation and Kitchen Manipulation experiments, respectively.

PEARL and PEARL-ft. PEARL (Rakelly et al., 2019) learns from the meta-training tasks but does not use the offline dataset. Therefore, we directly train PEARL models on the meta-training tasks without the phase of learning from offline datasets. We use gradients averaged over 20 randomly sampled tasks, where each task gradient is computed from a batch sampled from a per-task buffer. The target entropy is set to H = −dim(A) and α is initialized to 0.1. While the method proposed in Rakelly et al. (2019) does not fine-tune on target/meta-testing tasks, we extend PEARL to be fine-tuned on target tasks for a fair comparison, calling the result PEARL-ft. Since PEARL does not use learned skills or a skill prior, the target task learning of PEARL simply runs SAC with task-encoded initialization. Similar to the target task learning of our method, we initialize the Q-function and the entropy coefficient α to the values learned during the meta-training phase. Also, we initialize the policy to the task-conditioned policy after observing 20 episodes of experience from rollouts of the task-unconditioned policy. The hyperparameters used for fine-tuning are the same as for SAC.

SPiRL. Similar to our method, for target task learning with SPiRL we initialize the high-level policy to the skill prior while keeping the low-level policy fixed. α is initialized to 0.01 and we use the same hyperparameters for the SPiRL models as for our method.

Multi-task RL (MTRL). Inspired by Distral (Teh et al., 2017), our multi-task RL baseline first learns a set of individual policies, each specialized in one task; then, a shared multi-task policy is learned by distilling the individual policies. Since it is inefficient to learn an individual policy from scratch, we learn each individual policy using SPiRL with learned skills and a skill prior. Then, we distill the individual policies using the following objective:

$$\max_{\pi_0}\; \mathbb{E}_{\mathcal{T}\sim p_{\mathcal{T}}}\sum_t \mathbb{E}_{(s_t,z_t)\sim\rho_{\pi_0}}\Big[\,r_{\mathcal{T}}(s_t,z_t) - \alpha\, D_{\mathrm{KL}}\big(\pi_0(z|s_t,e),\,p(z|s_t)\big)\Big]. \tag{F.1}$$

We use the same setup for α as in our method, where α is auto-tuned to satisfy a target KL, δ = 0.1 for Maze Navigation and δ = 0.2 for Kitchen Manipulation. The target task learning phase for MTRL is similar to ours, except that MTRL is not initialized with a meta-trained Q-function and a learned α.

G Meta-training Tasks and Target Tasks

In this section, we present the meta-training tasks and target tasks used in the maze navigation domain and the kitchen manipulation domain.

G.1 Maze Navigation

The meta-training tasks and target tasks are visualized in Figure D.6 and Figure D.7.
[Figure D.6 panels: (a) Meta-training 40 Tasks, (b) Meta-training 20 Tasks, (c) Meta-training 10 Tasks, (d) Target Tasks]
Figure D.6: Maze Meta-training and Target Task Distributions. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represents the initial location of the agent.

[Figure D.7 panels: (a) T_TRAIN-TOP, (b) T_TARGET-TOP, (c) T_TARGET-BOTTOM]
Figure D.7: Maze Meta-training and Target Task Distributions for the Meta-training Task Distribution Analysis. The green dots represent the goal locations of meta-training tasks and the red dots represent the goal locations of target tasks. The yellow cross represents the initial location of the agent.

G.2 Kitchen Manipulation

The meta-training tasks are:
• microwave → kettle → bottom burner → slide cabinet
• microwave → bottom burner → top burner → slide cabinet
• microwave → top burner → light switch → hinge cabinet
• kettle → bottom burner → light switch → hinge cabinet
• microwave → bottom burner → hinge cabinet → top burner
• kettle → top burner → light switch → slide cabinet
• microwave → kettle → slide cabinet → bottom burner
• kettle → light switch → slide cabinet → bottom burner
• microwave → kettle → bottom burner → top burner
• microwave → kettle → slide cabinet → hinge cabinet
• microwave → bottom burner → slide cabinet → top burner
• kettle → bottom burner → light switch → top burner
• microwave → kettle → top burner → light switch
• microwave → kettle → light switch → hinge cabinet
• microwave → bottom burner → light switch → slide cabinet
• kettle → bottom burner → top burner → light switch
• microwave → light switch → slide cabinet → hinge cabinet
• microwave → bottom burner → top burner → hinge cabinet
• kettle → bottom burner → slide cabinet → hinge cabinet
• bottom burner → top burner → slide cabinet → light switch
• microwave → kettle → light switch → slide cabinet
• kettle → bottom burner → top burner → hinge cabinet
• bottom burner → top burner → light switch → slide cabinet

The target tasks are:
• microwave → bottom burner → light switch → top burner
• microwave → bottom burner → top burner → light switch
• kettle → bottom burner → light switch → slide cabinet
• microwave → kettle → top burner → hinge cabinet
• kettle → bottom burner → slide cabinet → top burner
• kettle → light switch → slide cabinet → hinge cabinet
• kettle → bottom burner → top burner → slide cabinet
• microwave → bottom burner → slide cabinet → hinge cabinet
• bottom burner → top burner → slide cabinet → hinge cabinet
• microwave → kettle → bottom burner → hinge cabinet

Appendix E
Assisted Teleoperation for Scalable Robot Data Collection

A Environments

Real World Kitchen. We evaluate our algorithm and the baselines in a real-world kitchen environment (Figure 6.5) with a Kinova Jaco 2 robot arm. Observations are 224x224x3 cropped RGB images recorded from a Logitech Pro Webcam C920 for the front view and an Intel RealSense D435 for the wrist view (see Figure E.1). In addition to images, the neural network policies have access to the robot end-effector position, velocity, and gripper state. The action space for all policies is the delta in end-effector position plus the gripper opening/closing command. Actions are communicated at a rate of 10 Hz and the joint-level controller uses PyBullet (Coumans and Bai, 2016-2021) for inverse kinematics. The environment contains 8 long-horizon kitchen tasks composed by combining the sub-tasks {cook meat, serve meat} x {cook veggies, serve veggies} x {2 starting locations}. Each sub-task itself involves several different skills such as pick meat from table and put veggies in oven. We use a Dual Shock 4 gamepad to control the robot's end-effector position.
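To illustrate how such delta end-effector actions can be turned into joint position targets via PyBullet inverse kinematics, here is a minimal sketch assuming an already connected PyBullet client. The robot ID, end-effector link index, and joint indices are placeholders; the real controller additionally handles the gripper command and safety limits.

```python
import pybullet as p

# Placeholder handles; in the real setup these come from loading the robot URDF.
ROBOT_ID = 0
EE_LINK_INDEX = 8
ARM_JOINT_INDICES = list(range(7))

def apply_delta_ee_action(delta_xyz, max_step=0.05):
    """Convert a delta end-effector position command into joint position targets."""
    delta_xyz = [max(-max_step, min(max_step, d)) for d in delta_xyz]   # clip the command
    ee_pos = p.getLinkState(ROBOT_ID, EE_LINK_INDEX)[4]                 # current EE world position
    target = [ee_pos[i] + delta_xyz[i] for i in range(3)]
    joint_targets = p.calculateInverseKinematics(ROBOT_ID, EE_LINK_INDEX, target)
    p.setJointMotorControlArray(
        ROBOT_ID, ARM_JOINT_INDICES, p.POSITION_CONTROL,
        targetPositions=joint_targets[: len(ARM_JOINT_INDICES)],
    )
```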
[Figure E.1 image: front-view (top) and wrist-view (bottom) frames of a kitchen task trajectory.]
Figure E.1: Visualization of the front view (top) and wrist view (bottom) of a (serve meat, serve veggies) task trajectory in the real-world kitchen environment.

Simulated IKEA Furniture Assembly. We evaluate the scalability of data collection with assisted teleoperation on the simulated IKEA Furniture Assembly environment of Lee et al. (2021a), which builds on the MuJoCo simulator (Todorov et al., 2012). We collect data for a block stacking task in which the middle block needs to be stacked on either the left block or the right block. We use the ground-truth pose and orientation measurements of the end-effector and the blocks provided by the simulator as observations for all policies. The policies control the robot via end-effector control, as in the real-world setup. We use multiple instances of this environment to simulate data collection on multiple robots (Figure 6.7). We use a Dual Shock 4 gamepad to control each robot's end-effector position and gripper open/close state, and to switch between the robots.

B Implementation Details

Here we provide a detailed overview of the model architecture and the hyperparameters used (for a detailed listing of all hyperparameters, see Table E.1).

B.1 Sub-goal Predictor

The conditional VAE for the sub-goal predictor consists of an encoder and a decoder, both implemented as 5-layer MLPs with 128 hidden units in each layer, batch norm, and LeakyReLU activations. The input to the encoder is the state and the goal H steps in the future, where H is the goal horizon. For image observations in the real-world kitchen environment, we first run the inputs through a pre-trained visual encoder, R3M (Nair et al., 2022a), to obtain a 2048-dimensional vector representation. The output of the encoder is the mean and log standard deviation of a 128-dimensional multivariate Gaussian latent space. The decoder uses the current state and a sample from the encoder's output distribution to generate a goal prediction. During inference, we generate goals with the decoder by sampling the latent vector from a standard Gaussian prior. H is set to 35 for the real-world kitchen environment and 7 for the IKEA Furniture Assembly environment.

To model the task uncertainty when deciding whether to request human input, we sample N = 1024 goals and measure the variance between the predicted end-effector positions of the robot. In case of high task uncertainty, the goals predicted by the sub-goal predictor diverge, leading to high variance. When the variance is higher than a threshold τ_task, we query the human to disambiguate which task to perform next.

B.2 Low-level Sub-goal Reaching Policy

The inputs to the sub-goal reaching policy are the current state s_t and the goal state s_{t+H}, where H is the goal horizon used in the sub-goal predictor. The policy is implemented as an LSTM (Hochreiter and Schmidhuber, 1997) with 256-dimensional hidden units, which autoregressively predicts the actions for the next L steps using s_t and s_{t+H}. Note that in prior work (Mandlekar et al., 2020b) the skill horizon L and the goal horizon H are set to the same value. However, in our experiments we noticed that a higher goal horizon helps model the task uncertainty better, whereas a higher skill horizon would reduce the decision frequency of the policy. Hence, in this work, we decouple the two and show that the goal horizon H can be much greater than the skill horizon L without affecting policy performance.
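As an illustration of the sampling-based task-uncertainty check described in Section B.1 above, consider the following minimal sketch. The module names, shapes, and the assumption that the first three goal dimensions encode the end-effector position are illustrative, not taken from the released implementation.

```python
import torch

def should_query_human(subgoal_decoder, state, tau_task, n_samples=1024, latent_dim=128):
    """Sample many sub-goals from the CVAE prior and query the human if they disagree.

    subgoal_decoder(states, z) -> predicted goal states; we assume the first three
    dimensions of each goal encode the end-effector position.
    """
    z = torch.randn(n_samples, latent_dim)                   # standard Gaussian prior samples
    states = state.expand(n_samples, -1)                     # repeat the current state
    goals = subgoal_decoder(states, z)                       # (N, state_dim) predicted goals
    ee_positions = goals[:, :3]                              # assumed end-effector slice
    task_uncertainty = ee_positions.var(dim=0).sum().item()  # spread of the predicted goals
    return task_uncertainty > tau_task                       # high variance -> ask the human
```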
The hidden state of the LSTM is processed by an output MLP to generate actions. To model policy uncertainty, we use an ensemble of K=5 sub-goal reaching policies. Similar to (Menda et al., 2019), we measure the variance between the actions predicted by each of the policies. When the state input to the policies is seen in the training data, we expect the action predictions of the ensemble policies to agree, leading to a low variance. When the state input is outside the distribution of seen training data, the policy predictions diverge, leading to a high variance. When variance is higher than a thresholdτ policy , the policy queries human help. 198 Hyperparameters Kitchen IKEA Assembly Sub-goal Predictor train iterations 200 500 batch size 16 16 learning rate 0.001 0.001 optimizer Adam(0.9, 0.999) Adam(0.9, 0.999) encoder-mlp (width x depth) 128x5 128x5 decoder-mlp (width x depth) 128x5 128x5 latent dimension 128 128 normalization batch batch activation LeakyReLU(0.2) LeakyReLU(0.2) goal horizon (H ) 35 7 loss ELBO ELBO Low-level Sub-goal Reacher train iterations 300 4000 batch size 16 16 learning rate 0.001 0.001 optimizer Adam(0.9, 0.999) Adam(0.9, 0.999) input-mlp (width x depth) 256x3 256x3 output-mlp (width x depth) 256x3 256x3 lstm hidden dimension 256 256 normalization batch batch activation LeakyReLU(0.2) LeakyReLU(0.2) no. of ensemble policies 5 5 skill horizon (L ) 15 7 loss - delta EEF positions MSE MSE loss - gripper CE Loss CE Loss Table E.1: PATO hyperparameters. Parameters that differ between environments are marked in red. B.3 Q-Function (ThriftyDAgger) The ThriftyDAgger baseline (Hoque et al., 2021a) uses a risk measure derived from a trained Q-function for choosing when to request human input. The Q-function is modeled using a 3-layer MLP with 128 hidden units and LeakyReLU(0.2) activation. More details on training the Q-funtion and using it for risk calculation can be found in (Hoque et al., 2021a). C User Study Setup Real-World Kitchen We give each participant (N=16) 5 minutes of unassisted training time before starting the experiment and an additional training task for each of the assisted methods (PATO and ThriftyDAgger) so that participants can familiarize themselves with the behavior of the policy. During the experiment, the participants are required to perform 3 randomly sampled long horizon tasks from a list of 8 possible tasks using each method while simultaneously doing a side task. To simulate a real world setting where operator and robot may not be physically co-located, we 199 (a) Side task is hidden when participant operates robot (b) Robot view is hidden when participant performs the side task Figure E.2: Participant’s screen when (a) teleoperating the robot and (b) performing the side task only allow teleoperation through a monitor that displays a stream of the robot’s camera observations. Further, to model a scenario where the operator may need to divert their attention from the robot for working on the side task, we only show either the robot view or the side task at a time and allow the users to toggle between the two via a button press. To query human intervention, the assisted methods signal the operator by highlighting the robot view tab. Figure E.2 shows what the participant sees during the study. To measure the participants’ mental load, we use a side task that requires the operators full attention for limited intervals of time: they need to determine the index of an odd one out character in a given string (see Figure E.2b). 
Specifically, we generate a string of random alphanumeric characters with varying lengths one of which is different from the rest. The participants are required to input the index of this odd alphanumeric character. For example, in Figure E.2b, the participant needs to find the index of the character ’U’ in the string of ’C’s. As soon as the participant provides 200 an answer, we present them with the next string. We leave it up to the participant how to allocate time between the robot manipulation task and the side task. Table E.2: Post-execution survey (Likert scales with 7-option response format) Trust: Q1. I trusted the robot to do the right thing at the right time. Robot intelligence (α = 0.95): Q2. The robot was intelligent. Q3. The robot perceived accurately what my goals are. Q4. The robot and I worked towards mutually agreed upon goals. Q5. The robot’s actions were reasonable. Q6. The robot did the right thing at the right time. Human satisfaction (α = 0.91): Q7. I was satisfied with the robot and my performance. Q8. The robot and I collaborated well together. Q9. The robot was responsive to me. Simulated IKEA Furniture Assembly We give each participant (N=10) 1 minute of training time with each of the policies in a single-robot version of the environment. During the experiment the partic- ipants are asked to collect as much data as possible while controlling multiple robots simultaneously (1, 2 or 4). Each participant completes data collection with all three robot fleet sizes. They are given 4 min- utes for each data collection session. For 2 robots or more, the participants can switch between which robot they want to control via a button press and the policies can request user feedback via a visual indicator (see Figure 6.7). 201 Appendix F Model-Based Planning via Goal-Conditioned Hierarchical Predictors A Additional results We include additional qualitative and quantitative results here as well as at the supplementary website: sites.google.com/view/video-gcp. B Evidence lower bound (ELBO) derivation We wish to optimize the likelihood of the sequence conditioned on the start and the goal frame p(o 2:T− 1 |o 1,T ). However, due to the use of latent variable models, this likelihood is intractable, and we resort to variational inference to optimize it. Specifically, we introduce an approximate posterior network q(z 2:T− 1 |o 1:T ), where that approximates the true posterior (Kingma and Welling, 2014; Rezende et al., 2014). The ELBO can be derived from the objective that consists of likelihood and a term that enforces that the approximate posterior matches the true posterior: ln p(o 2:T− 1 |o 1,T )≥ ln p(o 2:T− 1 |o 1,T )− KL(q(z 2:T− 1 |o 1:T )||p(z 2:T− 1 |o 1:T )) =E q(z 2:T− 1 |o 1:T ) [ln p(o 2:T− 1 |o 1,T ,z 2:T− 1 )]− KL(q(z 2:T− 1 |o 1:T ))|| p(z 2:T− 1 |o 1,T )), (B.1) where the last equality is simply a rearrangement of terms. 202 Figure F.1: Predictions on Human 3.6M. We see that the GCP models are able to faithfully capture the human trajectory. The optical flow-based method (DVF) captures the background but fails to generate complex motion needed for long-term goal-conditioned prediction. Causal InfoGan also struggles to capture the structure of these long sequences and produce implausible interpolations. Full qualitative results are on the supplementary website: sites.google.com/view/gcp-hier/ home. Further, in order to efficiently parametrize these distributions, we factorize the distributions as follows according to the graphical model in Fig 7.2 (right) and Eq. 
7.2:

p(o_{2:T-1} | o_{1,T}, z_{2:T-1}) = \prod_{t=2}^{T-1} p(o_t | o_{1,T}, z_t), (B.2)

p(z_{2:T-1} | o_{1,T}) = \prod_{t=2}^{T-1} p(z_t | pa(t)), (B.3)

q(z_{2:T-1} | o_{1:T}) = \prod_{t=2}^{T-1} q(z_t | o_t, pa(t)). (B.4)

We therefore require the following distributions to define our model: p(o_t | o_{1,T}, z_t), p(z_t | pa(t)), and q(z_t | o_t, pa(t)). The parameterization of these distributions is defined in Section 7.3.4. The parent operator pa(t) returns the parent nodes of s_t according to the graphical model in Fig. 7.2 (right).

Figure F.2: Prior samples from GCP-tree on the Human 3.6M dataset. Each row is a different prior sample conditioned on the same information.

Table F.1: Prediction performance on perceptual metrics.

DATASET                        PICK&PLACE       HUMAN 3.6M       9-ROOM MAZE       25-ROOM MAZE
METHOD                         FVD      LPIPS   FVD      LPIPS   FVD       LPIPS   FVD       LPIPS
GCP-TREE                       430.3    0.02    1314.3   0.05    655.50    0.174   413.31    0.168
GCP-SEQUENTIAL                 328.9    0.02    1541.8   0.06    860.04    0.214   638.95    0.238
DVF (Liu et al., 2017)         2879.9   0.06    1704.6   0.05    1320.34   0.231   1476.44   0.215
CIGAN (Kurutach et al., 2018)  3252.6   0.12    2528.5   0.17    1440.6    0.190   677.40    0.219

Using these factorized distributions, we can write out the ELBO in more detail as:

\ln p(o_{2:T-1} | o_{1,T}) \geq \mathbb{E}_{q(z_{2:T-1} | o_{1:T})} \sum_{t=2}^{T-1} [\ln p(o_t | o_{1,T}, z_t) - KL(q(z_t | o_t, pa(t)) || p(z_t | pa(t)))]. (B.5)

C Architecture

We use a convolutional encoder and decoder similar to the standard DCGAN discriminator and generator architectures, respectively. The latent variables z_n as well as e_n are 256-dimensional. All hidden layers in the multi-layer perceptron have 256 neurons. We add skip-connections from the encoder activations of the first image to the decoder for all images. For the inference network we found it beneficial to use a 2-layer 1D temporal convolutional network that adds temporal context into the latent vectors e_t. For the recursive predictor that predicts e_n, we use group normalization (Wu and He, 2018). We found that batch normalization (Ioffe and Szegedy, 2015) does not work as well as group normalization for the recursive predictor and conjecture that this is due to the activation distributions being non-i.i.d. across different levels of the tree. We use batch normalization in the convolutional encoder and decoder, and use local per-image batch statistics at test time. Further, for the simple RNN (without the LSTM architecture) ablation of our tree model, we activate e_n with a hyperbolic tangent (tanh). We observed that without this, the magnitude of activations can explode in the lower levels of the tree and conjecture that this is due to the recursive application of the same network. We found that using a TreeLSTM (Tai et al., 2015) as the backbone of the hierarchical predictor significantly improved performance over vanilla recurrent architectures. To increase the visual fidelity of the generated results when predicting images, we use a foreground-background generation procedure similar to (Wang et al., 2018).

Figure F.3: Comparison of visual planning & control approaches. Execution traces of Visual Foresight (left), GCP-tree with non-hierarchical planning (middle) and GCP-tree with hierarchical planning (right) on two 25-room navigation tasks. Visualized are start and goal observations for all approaches as well as predicted subgoals for hierarchical planning. Both GCP-based approaches can reach faraway goals reliably, but GCP with hierarchical planning finds shorter trajectories to the goal.
The decoding distribu- tion p(o t |s t ) is a mixture of discretized logistics (Salimans et al., 2017), which we found to work better than alternative distributions. We use the mean of the decoding distribution as the prediction. For the adaptive binding model, the frame o t corresponding to the node s n is not known before the s n is produced. We therefore conditioned the inference distribution on the entire evidence sequence o 1:T via the attention mechanism over the embeddings (Bahdanau et al., 2015; Luong et al., 2015): q(z t )= Att(enc(o 1:T ),pa(t)). We reuse the same observation embeddings e t for the attention mechanism values. The different paths between the same start and goal may have very different lengths (see e.g. Fig. F.3), so it is necessary for GCP models to predict sequences of different lengths. We do so by training a termination classifier that predicts how long the sequence is. For GCP-Sequential, the termination classifier simply outputs the number of frames in the sequence, and the sequence is produced by recurrently unrolling that many frames. For the GCP-Tree model, to account for varied shapes of the tree, we instead predict a binary termination value at each node. To sample a trajectory, we recursively expand the tree, but stop the expansion where a particular node was classified as terminal (determined by a threshold on the classifier output). This procedure enables us to model even datasets with sequences of variable lengths. Hyperparameters. The convolutional encoder and decoder both have five layers. We use the Rectified Adam optimizer (Liu et al., 2019; Kingma and Ba, 2015) with β 1 = 0.9 andβ 2 = 0.999, batch size of 16 for GCP-sequential and 4 for GCP-tree, and a learning rate of 2× 10 − 4 . On each dataset, we trained each network for the same number of epochs on a single high-end NVIDIA 206 GPU. Training took a day for all datasets except the 25-room dataset, where we train the models for 3 days. D Data processing and generation Figure F.4: Example trajectory distribu- tions between fixed start (red) and goal (green) rooms on the 25-room naviga- tion task. The example goal-reaching behavior is highly suboptimal, with both strong multimodality in the space of pos- sible solutions as well as low-level noise in each individual trajectory. For training GCPs we use a dataset of example agent goal-reaching behavior. Below we describe how we col- lect those examples on the pick&place and navigation tasks and the details of the Human3.6M dataset. The data can be found on the following links: • 9-room: https://www.seas.upenn.edu/ ~ oleh/ datasets/gcp/nav_9rooms.zip • 25-room: https://www.seas.upenn.edu/ ~ oleh/ datasets/gcp/nav_25rooms.zip • pick&place: https://www.seas.upenn.edu/ ~ oleh/datasets/gcp/sawyer.zip • Pre-processed H3.6: https://www.seas.upenn. edu/ ~ oleh/datasets/gcp/h36m.zip pick&place. We generate the pick&place dataset using the RoboSuite framework (Fan et al., 2018) that is based on the Mujoco physics simulator (Todorov et al., 2012). We generate example goal-reaching trajectories by placing two objects at random locations on the table and using a rule-based policy to move them into the box that is located at a fixed position on the right of the workspace. We sample the object type randomly from a set of two possible object types, bread and can, with replacement. 207 Human 3.6M. For the Human 3.6 dataset, we downsample the original videos to 64 by 64 resolution. 
We obtain videos of length of roughly 800 to 1600 frames, which we randomly crop in time to 500-frame sequences. We split the Human 3.6 into training, validation and test set by correspondingly 95%, 5% and 5% of the data. Navigation. For the navigation task the agent is asked to plan and execute a path between a given 2D start and goal position. The environment is simulated using the Gym-Miniworld framework (Chevalier-Boisvert, 2018). We collect goal-reaching examples by randomly sampling start and goal positions in the 2D maze and plan trajectories using the Probabilistic Roadmap (PRM, Kavraki et al. (1996)) planner. The navigation problem is designed such that multiple possible room sequences can be traversed to reach from start to goal for any start and goal combination. During planning we sample one possible room sequence at random, but constrain the selection to only such sequences that do not visit any room more than once, i.e. that do not have loops. This together with the random sampling of waypoints of the PRM algorithm leads to collected examples of goal reaching behavior with substantial suboptimality. We show an example trajectory distribution from the data in Fig. F.4. While GCPs support training on sequences of variable length we need to set an upper bound on the length of trajectories to bound the required depth of the hierarchical predictive model and allow for efficient batch computation (e.g. at most 200 frames for the 25-room environment). If plans from the PRM planner exceed this threshold we subsample them to the maximum lenght using spline interpolation before executing them in the environment. The training data consists of 10,000 and 23,700 sequences for the 9-room and the 25-room task respectively, which we split at a ration of 99%, 1%, 1% into training, validation and test. E Planning Experimental Setup For planning with GCPs we use the model architectures described in Section C trained on the navigation data described in Section D. The hyperparameters for the hierarchical planning experiments are listed in Table F.2. We keep the hyperparameters constant across both 9-room 208 and 25-room tasks except for the maximum episode length which we increase to 400 steps for the 25-room task. Note that the cost function is only used at training time to train the cost estimator described in Section 7.4, which we use to estimate all costs during planning. To infer the actions necessary to execute a given plan, we train a separate inverse model a t = f inv (o t ,o t+1 ) that infers the action a t which leads from observation o t to o t+1 . We train the inverse model with action labels from the training dataset and, in practice, input predicted feature vectors ˆ e t instead of the decoded observations to not be affected by potential inaccuracies in the decoding process. We use a simple 3-layer MLP with 128 hidden units in each layer to instantiate f inv . At every time step the current observation along with the next observation from the plan is passed to the inverse model and the predicted action is executed. We found it crucial to perform such closed-loop control to avoid accumulating errors that posed a central problem when inferring the actions for the whole plan once and then executing them open-loop. Table F.2: Hyperparameters for hierarchical planning with GCPs on 9-room and 25-room navigation tasks. 
Hierarchical Planning Parameters
Hierarchical planning layers (D)        2
Samples per subgoal (M)                 10
Final Segment Optimization
Sequence samples per segment            5
General Parameters
Max. episode steps                      200 / 400
Cost function                           \sum_{t=0}^{T-1} (x_{t+1} - x_t)^2

We separately tuned the hyperparameters for the visual foresight baseline and found that substantially more samples are required to achieve good performance, even on the shorter 9-room tasks. Specifically, we perform three iterations of CEM with a batch size of 500 samples each. For sampling and refitting of action distributions we follow the procedure described in (Nagabandi et al., 2019b). We use a planning horizon of 50 steps and replan after the current plan is executed. We cannot use the cost function from Table F.2 for this baseline as it leads to degenerate solutions: in contrast to GCPs, VF searches over the space of all trajectories, not only those that reach the goal. Therefore, the VF planner could minimize the trajectory length cost used for the GCP models by predicting trajectories in which the agent does not move. We instead use a cost function that measures whether the predicted trajectory reached the goal by computing the L2 distance between the final predicted observation of the trajectory and the goal observation.

We run all experiments on a single NVIDIA V100 GPU and find that we need approximately 30 min / 1 h to evaluate all 100 task instances on the 9-room and 25-room tasks, respectively, when using hierarchical GCP planning. The VF evaluation requires many more model rollouts and therefore increases the runtime by a factor of approximately five, even though we increase the model rollout batch size by a factor of 20 for VF to parallelize trajectory sampling as much as possible.

F Adaptive Binding with Dynamic Programming

F.1 An efficient inference procedure

To optimize the model with adaptive binding, we perform variational inference on both w and z:

\log p(x) \geq \mathbb{E}_{q(z,w)}[p(x|w,z)] - D_{KL}(q(z|x) || p(z)) - D_{KL}(q(w|x,z) || p(w)). (F.1)

To infer q(w|x,z), we want to produce a distribution over possible alignments between the tree and the evidence sequence. Moreover, certain alignments, such as ones that violate the ordering of the sequence, are forbidden. We define such a distribution over alignment matrices A via Dynamic Time Warping. We define the energy of an alignment matrix as the cost, and the following distribution over alignment matrices: p(A|x,z) = \frac{1}{Z} e^{-A * c(x,z)}, where the partition function is Z = \mathbb{E}_A[e^{-A * c(x,z)}], and c is the MSE error between the ground-truth frame x_t and the decoded frame associated with z_n. We are interested in computing the marginal edge distributions w = \mathbb{E}_A[A]. Given these, we can compute the reconstruction error efficiently. We next show how to efficiently compute the marginal edge distributions.

Given two sequences x_{0:T}, z_{0:N}, denote the partition function of aligning the two subsequences x_{0:i}, z_{0:j} as f_{i,j} = \sum_{A \in A_{0:i,0:j}} e^{-A * c(x_{0:i}, z_{0:j})}. Cuturi and Blondel (2017) show that these can be computed efficiently as: f_{i,j} = c(x_i, z_j) * (f_{i-1,j-1} + f_{i-1,j}). We note that we do not include the third term f_{i,j-1}, as we do not want a single predicted frame to match multiple ground-truth frames. Furthermore, denote the partition function of aligning x_{i:T}, z_{j:N} as b_{i,j} = \sum_{A \in A_{i:T,j:N}} e^{-A * c(x_{i:T}, z_{j:N})}. Analogously, we can compute it as: b_{i,j} = c(x_i, z_j) * (b_{i+1,j+1} + b_{i+1,j}).
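For concreteness, the following minimal NumPy sketch implements these forward and backward recursions and the resulting edge marginals. It is only an illustration, not our actual implementation: it assumes the per-cell factor is the exponentiated negative matching cost exp(-c(x_i, z_j)), consistent with the partition-function definition above, and the boundary handling (alignments start at the first cell and end at the last) is our assumption.

import numpy as np

def edge_marginals(cost):
    """cost[i, j] = c(x_i, z_j): matching cost between ground-truth frame x_i and node z_j."""
    T, N = cost.shape
    score = np.exp(-cost)        # per-cell factor contributed to exp(-A * c(x, z))
    f = np.zeros((T, N))         # f[i, j]: partition function of aligning x_0:i with z_0:j
    b = np.zeros((T, N))         # b[i, j]: partition function of aligning x_i:T with z_j:N
    for i in range(T):           # forward: f[i, j] = score[i, j] * (f[i-1, j-1] + f[i-1, j])
        for j in range(N):
            if i == 0 and j == 0:
                prev = 1.0       # base case; boundary handling is an assumption
            else:
                prev = (f[i - 1, j - 1] if i > 0 and j > 0 else 0.0) + (f[i - 1, j] if i > 0 else 0.0)
            f[i, j] = score[i, j] * prev
    for i in reversed(range(T)): # backward: b[i, j] = score[i, j] * (b[i+1, j+1] + b[i+1, j])
        for j in reversed(range(N)):
            if i == T - 1 and j == N - 1:
                nxt = 1.0        # base case; boundary handling is an assumption
            else:
                nxt = (b[i + 1, j + 1] if i < T - 1 and j < N - 1 else 0.0) + (b[i + 1, j] if i < T - 1 else 0.0)
            b[i, j] = score[i, j] * nxt
    Z = f[-1, -1]                # total partition function over all alignments
    e = f * b / score            # unnormalized mass of alignments containing edge (i, j)
    return e / Z                 # marginal edge probabilities w[i, j]

Given the returned marginals w, the expected reconstruction cost is (w * cost).sum(), mirroring the quadratic-time computation stated next.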
Proposition 1. The total unnormalized density of all alignment matrices that include the edge (i, j) can be computed as e_{i,j} = f_{i,j} * b_{i,j} / c(x_i, z_j) = c(x_i, z_j) * (f_{i-1,j-1} + f_{i-1,j}) * (b_{i+1,j+1} + b_{i+1,j}). Moreover, the probability of the edge (i, j) can be computed as w_{i,j} = e_{i,j} / Z.

Proposition 1 enables us to compute the expected reconstruction loss in quadratic time: p(x|z) = w * c(x, z).

F.2 Bottleneck Discovery Experimental Setup

In order to use the adaptive binding model to discover bottleneck frames that are easier to predict, we increase the reconstruction loss on those nodes as described in the main text. Specifically, we use a Gaussian decoding distribution for this experiment and set the variance of the decoding distribution for several top layers in the hierarchy to a fraction of the value used for the lower layers. This encourages the model to bind frames that are easier to predict higher in the hierarchy, as the low variance severely penalizes poor predictions. We found this simple variance re-weighting scheme effective at discovering bottleneck frames in several environments.

To generate the visualization of the discovered tree structure in Fig. 7.6 we evenly subsample the original 80-frame sequences and display those nodes that bound closest to the subsampled frames such that the resulting graph structure still forms a valid 2-connected tree. The variations in tree structure arise because the semantic bottlenecks which the nodes specialize on binding to appear at different time steps in the sequences, due to variations in the speed and initial position of the robot arm as well as the initial placement of the objects.

G Training from Random Data

Table F.3: Average Trajectory Length. Planning with GCP finds shorter paths than the training distribution.

                   ORIGINAL DATA    RANDOM DATA
TRAINING DATA      31.4             62.6
GCP-TREE (OURS)    20.7             42.6

In the room navigation experiments we train our model with noisy trajectories that reach diverse goals with considerable suboptimality (see Fig. F.4). To test whether our method can work with even more suboptimal training data, we conduct preliminary experiments with completely random exploration data and observe that our method still successfully solves navigation tasks in the 9-room environment (see Fig. F.5). This suggests that the proposed method is scalable even to situations where no good planners exist that can be used for data collection.

Figure F.5: Left: random exploration data. Right: execution of our method trained on random data.

In Tab. F.3, we compare the average trajectory length of the training data and our method on both the dataset used for the experiments in Section 7.5.1 and the random action data. We find that planning with our method leads to substantially shorter trajectories, further showing the ability of our approach to improve upon low-quality training data.

G.1 Runtime Complexity

Computational efficiency. While the sequential forward predictor performs O(T) sequential operations to produce a sequence of length T, the hierarchical prediction can be more efficient due to parallelization. As the depth of the tree is ⌈log T⌉, it only requires O(log T) sequential operations to produce a sequence, assuming all operations that can be conducted in parallel are parallelized perfectly. We therefore batch the branches of the tree and process them in parallel at every level to utilize the benefit of efficient computation on modern GPUs.
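To illustrate this level-wise batching, the schematic PyTorch sketch below shows that expanding a full tree over a length-T sequence requires only ⌈log2 T⌉ sequential prediction calls; the module is a hypothetical stand-in for the TreeLSTM-based recursive predictor, not our actual architecture.

import math
import torch
import torch.nn as nn

class LevelwiseTreeExpander(nn.Module):
    """Hypothetical stand-in for the recursive subgoal predictor: predicts a midpoint
    embedding from the embeddings of the segment's two endpoint (parent) nodes."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.node_predictor = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, e_start, e_goal, seq_len):
        # e_start, e_goal: [batch, embed_dim] embeddings of the first and last observation.
        batch = e_start.shape[0]
        frontier = [(e_start, e_goal)]                   # segments that still need a midpoint
        nodes = []
        for _ in range(math.ceil(math.log2(seq_len))):   # O(log T) sequential levels
            left = torch.cat([l for l, _ in frontier], dim=0)
            right = torch.cat([r for _, r in frontier], dim=0)
            mid = self.node_predictor(torch.cat([left, right], dim=-1))  # one batched call per level
            mids = mid.split(batch, dim=0)
            nodes.extend(mids)
            # every segment spawns two children: (left, mid) and (mid, right)
            frontier = [seg for (l, r), m in zip(frontier, mids) for seg in ((l, m), (m, r))]
        return nodes

expander = LevelwiseTreeExpander()
subgoal_embeddings = expander(torch.randn(4, 256), torch.randn(4, 256), seq_len=16)

Each level is a single forward pass over all nodes at that depth, so the wall-clock cost grows with the tree depth rather than with the sequence length.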
We note that the benefits of the GCP-tree runtime lie in parallelization, and thus diminish with large batch sizes, where the parallel processing capacity of the GPU is already fully utilized. We notice that, when predicting video sequences of 500 frames, GCP-sequential can use up to 4 times bigger batches than GCP-Tree without significant increase in runtime cost. This benefit is applicable both during training and inference. When training tree-structured networks we exploit the provided parallelism in the structure of the model and batch recursions in the tree that are independent when conditioned on their parents. 213 Appendix G Language-Conditioned Offline Reinforcement Learning with Instruction Relabeling 214 Algorithm 6 SPRINT Algorithm Require: DatasetD L w/ language instruction labels, LLM 1: AGGREGATESKILLS(D L , LLM) ▷ Automated LLM skill relabeling (Sec. 8.3.2) 2: while not converged do 3: τ z ← D L : Sample an annotated skill (sub-)trajectory 4: Train offline RL on τ z 5: 6: τ chain z ← CROSSCHAINSKILLS(D L ) ▷ Cross-trajectory skill chaining (Sec. 8.3.3) 7: Train offline RL on τ chain z 8: 9: τ agg 1 ,τ agg 2 ← CROSSAGGREGATESKILLS(D L , LLM) ▷ Cross-traj. aggregation (Sec. 8.3.3) 10: Train offline RL on τ agg 1 ,τ agg 2 11: end while 12: 13: Function: AGGREGATESKILLS(D L , LLM) 14: for composite trajectoryτ ¯ z inD L do 15: Separateτ ¯ z into language annotated sub-trajectories[τ z 1 ,...,τ z N ] 16: for all adjacent sub-trajectories τ z i ...τ z j do 17: Assign name from LLM: LLM(z i ...z j )= ˆ z i: j 18: τ ˆ z i: j ← Concat τ z i ,...,τ z j and relabel with ˆ z i: j 19: T← length ofτ ˆ z i: j 20: R(s T ,a T , ˆ z i: j )= 1 ▷ Label last reward with 1 (Equation 8.1). 21: D L =D L ∪ τ ˆ z i: j 22: end for 23: end for 24: 25: Function: CROSSCHAINSKILLS(D L ) 26: Sample random (sub)trajectoriesτ z 1 ,τ z 2 ∼ D L 27: Sample random endpoint j inτ z 1 28: τ chain z 2 ← [(s 0 ,a 0 ,0,z 2 ),..., s j ,a j ,Q π (s j ,a j ,z 2 ),z 2 ] ▷ Relabel reward w/ Equation 8.3. 29: 30: return τ chain z 2 31: 32: Function: CROSSAGGREGATESKILLS(D L , LLM) 33: Sample random (sub)trajectoriesτ z 1 ,τ z 2 ∼ D L with lengths T 1 ,T 2 34: Assign new name : ˆ z= “{z 1 }.{z 2 }” 35: τ agg 1 ← [(s 0 ,a 0 ,0, ˆ z),...,(s T 1 ,a T 1 ,Q π (s T ,a T |z 2 ), ˆ z)] ▷ Relabel reward w/ Equation 8.4 36: τ agg 2 ← [(s 0 ,a 0 ,0, ˆ z),...,(s T 2 ,a T 2 ,1, ˆ z)] ▷ Relabel reward w/ Equation 8.4. 37: 38: return τ agg 1 ,τ agg 2 215 Instructions: summarize the following ordered steps describing common household tasks. Task Steps: 1: Pick up the smaller knife on the counter to the left of the stove. 2: Slice the tomato with the smaller knife. 3: Put the knife in the sink. 4: Pick up a slice of tomato from the countertop. 5: Heat up the slice of tomato in the microwave, removing it afterwards. Summary: Microwave the tomato slice after slicing it with the smaller knife on the counter. Task Steps: 1: Pick up the vase. 2: Turn on the lamp. Summary: Look at the vase under the light. Task Steps: 1: Grab the pencil off of the desk. 2: Put the pencil in the bowl. 3: Grab the container off of the desk. 4: Put the container down at the back of the desk. Summary: Put a bowl with a pencil in it on the desk. Task Steps: 1: Pick up the bar of soap from the back of the toilet. 2: Put the bar of soap in to the sink, turn on the faucet to rinse off the soap, pick up the soap out of the sink. 3: Put the soap in the cabinet under the sink and on the left. Summary: Put a rinsed bar of soap in the cabinet under the sink. 
Task Steps: 1: [SKILL 1]. 2: [SKILL 2]. 3: [SKILL 3]. ... N: [SKILL N]. Summary: Figure G.1: The full prompt that we use for summarization. Following the suggestions of Ahn et al. (2022) for prompt design, we explicitly number each step. The LLM completion task begins after “Summary:”. For brevity, we omit the new line characters between all numbered steps. 216 A Large Language Model Prompt We list the full large language model summarization prompt in Figure G.1. The examples in the prompt are fixed for all summarization queries. These examples are selected from the ALFRED validation dataset (which is not otherwise used in our work) at random: We spell out the primitive skill annotations in the “Task Steps:” part of each prompt example. Then, the “Summary” for each of these is the high-level, human-written annotation for that trajectory from ALFRED. We repeatedly sampled these trajectories until each example mentioned a different object to prevent biasing the LLM towards certain types of objects. We note that the “Look at the vase under the light” example is important to make the LLM give reasonable summaries for similar tasks in ALFRED where the agent picks something up and turns on a light. This is because most of the human labels for turning on the lamp do not mention the object in the previous step, making it difficult for the LLM to realize that the task has to do with looking at the held object under a lamp. B Baselines and Implementation We implement IQL (Kostrikov et al., 2022) as the base offline RL algorithm for all goal- conditioned offline RL pretraining baselines and ablations due to its strong offline and finetuning performance on a variety of dense and sparse reward environments. At a high level, IQL trains on in-distribution(s,a,s ′ ,r,a ′ ) tuples from the dataset rather than sampling a policy for a ′ to ensure that the Q and value functions represent accurate estimated returns constrained to actions in the dataset. The value function is trained with an expectile regression loss controlled by a hyperparameterτ, whereτ= 0.5 results in standard mean squared error loss andτ→ 1 approximates the max operator, resulting in a more optimistic value function that can better “stitch” together trajectories to obtain distant reward in sparse reward settings. The IQL policy is trained to maximize the following objective: e β(Q(s,a)− V(s)) logπ(a|s), 217 which performs advantage-weighted regression (Peng et al., 2019) with an inverse temperature term β. In practice, the exponential advantage term is limited to a maximum value to avoid numerical overflow issues. We detail shared training and implementation details below, with method-specific information and hyperparameters in the following subsections. Observation space. The state space of the ALFRED environment consists of 300× 300 RGB images. Following the baseline method in ALFRED (Shridhar et al., 2020), we preprocess these images by sending them through a frozen ResNet-18 encoder (He et al., 2016) pretrained on ImageNet (Deng et al., 2009). This results in a 512× 7× 7 feature map that we use as the observation input to all networks. Furthermore, as ALFRED is a partially observable, egocentric navigation environment, we concatenate the last 5 frames as the full observation, resulting in an observation that is of the shape(512∗ 5)× 7× 7. Action space. The agent chooses from 12 discrete low-level actions. 
There are 5 navigation actions: MoveAhead,RotateRight,RotateLeft,LookUp, andLookDown and 7 interaction actions: Pickup, Put, Open, Close, ToggleOn, ToggleOff, and Slice. For interaction actions the agent additionally selects one of 82 object types to interact with, as defined by Pashevich et al. (2021). In total, the action space consists of 5+ 7∗ 82= 579 discrete action choices. Note that this action space definition is different from the action space in Shridhar et al. (2020), which used a pixel-wise mask output to determine the object to interact with. In contrast to Shridhar et al. (2020) we aim to train agents with reinforcement learning instead of imitation learning and found the discrete action parametrization more amenable to RL training than dense mask outputs. For all methods, due to the large discrete action space, we perform some basic action masking to prevent agents from taking actions that are not possible. For example, we do not allow the agent toClose objects that aren’t closeable nor can theyToggleOn objects that can’t be turned on. Language embedding. We embed language instructions with the all-mpnet-base-v2 pre- trained sentence embedding language model from the SentenceTransformers python package (Reimers and Gurevych, 2019). This produces a 768-dimensional language embedding which is used as input for language-conditioned policy and critic functions, as detailed below. 218 Policy and critic networks. We train a discrete policies with two output heads of size 12 and 82 for the action and interaction object outputs respectively. Critic networks are conditioned on both the observation and the discrete action output of the policy. In all policy and critic networks, we process the ResNet feature observation inputs with 4 convolutional layers. In networks with language input, we flatten the output of the convolutional layers and concatenate the observation features with the 768-dim language embedding, before passing the concatenated image-language features through a series of fully connected layers. Additionally, we use FiLM (Perez et al., 2018) to condition convolutional layers on the language embeddings. Param Value Batch Size 1024 # Training Batches 140k Learning Rate 2e-3 Optimizer AdamW Dropout Rate 0.2 Weight Decay 0.05 Discountγ 0.98 Q Update Polyak Averaging Coefficient 0.005 Q-Network Action Embedding Size 48 Q-Network Object Embedding Size 24 Policy and Q Update Period 1/train iter Batch Norm True Nonlinearity ReLU IQL Advantage Clipping [0, 100] IQL Advantage Inverse Temperatureβ 10 IQL Quantileτ 0.9 Pre-training hyperparameters. A hyperparameter search was performed first on the language-conditioned BC-baseline to optimize for training accuracy. These hyperparameters were carried over to the IQL implemen- tation, and another search for IQL-specific hyperpameters were performed on a baseline IQL policy conditioned on semantic instructions. With these parameters fixed, we performed one more hyperparameter search specific to Actionable Models but for the final implementation of SPRINT we re-used the same hyperparameters and only selected SPRINT-specific parameters heuristically. Hy- perparameters for each method are detailed in a separate table. Shared hyperparameters for all methods (where applicable) are listed below: Finetuning details and hyperparameters. For all models, we finetune only on newly collected task data by running online IQL (without any of the chaining or aggregation steps). 
Each method is finetuned on every task in the EVAL SCENE task set individually; that is, we pre-train once and then finetune policies for each task in the task set. We then average returns over all tasks, then report 219 metrics averaged over all random seeds. For each task, we define a maximum rollout time horizon of 2 timesteps per environment action required by the expert planner. When not specified, finetuning parameters are identical to pre-training parameters. Finetuning hyperparameters are specified below: Param Value Dropout Rate 0 # Initial Rollouts 50 Training to Env Step Ratio 0.5 ε inε-greedy action sampling 0.25: annealed to 0.05 # Parallel Rollout Samplers 4 B.1 Language-conditioned Behavior Cloning Our language-conditioned behavior cloning (L-BC) comparison method is inspired by and replicates BC-Zero (Jang et al., 2021) and LangLfP (Lynch and Sermanet, 2021). BC-Zero performs FiLM-conditioned semantic imitation learning (Perez et al., 2018) and both BC-Zero and LangLfP have an additional image/video-language alignment objective. In BC-Zero, their video alignment objective aligns language embeddings with videos of humans performing tasks related to those the BC-Zero robot agent trains on. LangLfP’s image-language alignment objective allows their policy to accept both image and natural language goals as input due to only having a subset of their data labeled with hindsight language labels. As we don’t have human videos of these tasks and our entire dataset is labeled with language labels, we do not add a video or image alignment objective. Therefore, we implement L-BC by using the same architecture as described above with just a single policy network that trains to maximize the log-likelihood of actions in the dataset. As our entire dataset consists of expert trajectories, this baseline ideally learns optimal actions for the instructions. Hyperparameters for the L-BC baseline are identical to the shared parameters above, where applicable. 220 B.2 Episodic Transformers Episodic Transformers (ET) (Pashevich et al., 2021) trains a transformer architecture on full sequences of ALFRED instructions with a behavior cloning objective. This is currently state of the art in the “Seen Path-Length Weighted Success Rate” evaluation metric on the ALFRED leader- board. We used the ET implementation from the official code repository and used the pre-tuned hyperparameters. For fair comparison, we make a few modifications to make it as close as possible to SPRINT and the baselines: 1) we train it on the same dataset as all baselines, so we do not generate new synthetic training data like the original implementation Pashevich et al. (2021) since it assumes access to an expert planner , 2) we encode visual frames with a Resnet-18 instead of Resnet-50 backbone, the same we use for all other models, 3) we use a context window of 5 frames just like for other policies, 4) we remove the high-level goal specification from the input text tokens as we do not assume access to those, and 5) we train the model for longer to match the number of training steps for all methods. B.3 Actionable Models (AM) Actionable Models (Chebotar et al., 2021a) pretrains a goal-conditioned Q function conditioned on randomly sampled image goals and also performs a goal-chaining procedure very similar to our semantic skill chaining procedure. We implement AM by modifying the base IQL policy and critic networks to take in image goals instead of natural language embeddings as goals. 
These goals are provided in the same way as the observations, i.e., as a concatenated stack of 5 frames (the last 5 frames in the trajectory) processed by a frozen ResNet-18. Therefore, goals are the same shape as observations: (512∗ 5)× 7× 7. To allow for fair comparison between our approach and AM, we implement AM with the same powerful offline RL algorithm, IQL (Kostrikov et al., 2022), used in our method. IQL ensures that the policy does not choose out-of-distribution actions by using advantage-weighted regression on 221 in-distribution actions for policy extraction. With this, we found the conservative auxiliary loss AM adds to push down Q-values for out-of-distribution actions to be unnecessary and even hurtful to its overall performance, so we omit this additional loss term. We also pre-train AM on the same long-horizon trajectories as those generated by SPRINT during LLM-based skill aggregation. This ensures a fair comparison in terms of the types and lengths of tasks seen during pre-training. Finally, after consulting the authors of AM, we tried varying maximum trajectory lengths when sampling random goals. We found that allowing random goals to be sampled from anywhere within a trajectory resulted in the best zero-shot evaluation performance for AM, so our numbers are reported with this implementation detail. B.4 SPRINT The implementation details of SPRINT follow from the discussion about implementing IQL at the top of this section. The key differences are in (1) language model skill aggregation and (2) cross-trajectory skill chaining, detailed below. LLM Skill Aggregation. We perform LLM skill aggregation fully offline by iterating through every ALFRED trajectory and aggregating sequences of adjacent primitive skill sub-trajectories. Assuming a trajectory with N primitive skills, we select all N 2 pairs of start and end skills and aggregate all instructions from start to end with the LLM. With 73k original language-annotated trajectories, this procedure allows us to generate an additional 110k aggregated trajectories. We then add these trajectories to the original dataset and train on the entire set. Cross-trajectory skill chaining. We perform cross-trajectory skill chaining in-batch. Instead of sampling a second trajectory to perform chaining on, we simply permute the batch indicies to generate a set of randomly sampled second trajectories. Then, we perform a second loss function update, in addition to the original update on the sampled trajectories, with equal loss weighting, 222 to apply the skill-chaining update. We apply both chaining procedures from Eq. 8.3 and Eq. 8.4 in-batch with equal weight. Empirically, we found that cross-trajectory skill chaining works slightly better with the on-policy Value function obtained through IQL, therefore we use state values at the chaining targets instead of state-action Q-values. 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Number of Skills 0 250 500 750 1000 1250 1500 1750 2000 Number of Trajectories (a) Skills per trajectory in the origi- nal ALFRED dataset. 2 3 4 5 6 7 8 9 10 Number of Skills 0 250 500 750 1000 1250 1500 1750 2000 Number of Trajectories (b) Skills per trajectory in the merged dataset. 0 10 20 30 40 50 60 70 Number of Actions 0 200 400 600 800 1000 1200 1400 Number of Skills (c) Actions per skill in the merged dataset. Figure G.2: Left: distribution of the number of skills in each trajectory in the original ALFRED dataset. 
Middle: distribution of skills per trajectory in the “merged” dataset with merged navigation skills. Right: distribution of number of actions per skill in the “merged” dataset. SPRINT-specific hyperparameters follow: Param Value LLM OPT-13B LLM Token Filtering Top-p 0.9 LLM Token Sampling Temperature 0.8 B.4.1 Cross-trajectory chaining preserves the MDP. When performing cross-trajectory chaining, particularly using Eq. 8.4, special care must be taken to preserve the dynamics of the original Markov Decision Process (MDP). When chaining together two trajectoriesτ A andτ B , we concatenate the two sentences of each trajectory together and relabel their rewards with Eq. 8.4. The new language annotation used to chain together these trajectories is the concatenation of the two sentences, implying that the agent finishes skill (A) 223 Table G.1: Evaluation Task Specifics. Note that the “number of env actions per task” corresponds to the number of environment actions the ALFRED expert planner required to complete that task. EVAL INSTRUCT EVAL LENGTH EVAL SCENE Number of Tasks 100 20 10 Task Lengths (# primitive skills) [1, 2, 3, 4, 5, 6, 7] [7, 8] [1, 2, 3, 4, 5] Min Number of Env Actions per Task 1 34 2 Avg Number of Env Actions per Task 39.1 60.9 46.6 Max Number of Env Actions per Task 113 104 124 and then skill (B). However, we cannot concatenate the two trajectories together into one longer trajectory, as doing so would imply that the agent can instantaneously jump from the last state of skill (A) to the first state of skill (B), which may not be possible. Therefore, we instead treat the relabeled trajectories as separate trajectories with the same language annotation (lines 36 and 37 of Algorithm 6). However, this introduces two possible complications: 1) Language annotations differing in structure from those in the original dataset, and 2) Possible instruction ambiguity. We detail how these complications are resolved in SPRINT below: 1. Language annotations differing in structure. Language annotations produced by the chaining procedure will result in annotations that implicitly skip certain steps. For example, when chaining skill (A), “make the bed,” and skill (B), “make a cup of coffee,” the resulting chained annotation will be “Make the bed. Make a cup of coffee.” However to perform skill (B) the agent needs to first move to the kitchen from the bedroom to make the cup of coffee, which is skipped in this annotation. LLM-based skill aggregation (Section 8.3.2) helps bridge this gap by summarizing long-horizon sequences while skipping certain implied steps. For example, one real LLM summary (listed in Figure G.8) summarized the sequence: “1: Pick 224 up the plaid pillow that is on the left end of the couch. 2: Place the pillow on the ottoman” into the instruction “Place a plaid pillow on the ottoman,” which skipped the step of picking up the pillow as it is implied that you must do so before placing the pillow down. Using the LLM augments our original dataset such that, in ALFRED, we have 2.5X more data after performing offline skill aggregation. Therefore after performing LLM aggregation, there are many examples of similar instructions to those used for chained trajectories that imply certain steps without mentioning them explicitly. 2. Instruction ambiguity. 
When chaining trajectories, there will be some ambiguity introduced as we do not have intermediate instructions for going from the last state of A to the initial state of B (obtaining these instructions requires additional human effort). This ambiguity is only present in the states of trajectory A, as when training on trajectory B, the agent can easily infer that the instructions for trajectory A are finished and the just follow the instructions relevant for trajectory B. We believe that the effects of the ambiguity on pre-training performance depends greatly on the given dataset. In complex and diverse environments like ALFRED, hindsight- labeled annotations will contain details specific to certain scenes, resolving this ambiguity. In ALFRED, the annotations usually contain information about the specific objects that the agent must interact with or locations that the agent must go to. For example, annotations for rinsing mugs typically are of the form “clean the MUG in the sink,” or annotations for picking up a candle will often say something like “pick up the YELLOW CANDLE on the COUNTER,” highlighting specific details regarding what the agent is supposed to do to complete the trajectory. C Dataset and Environment Details C.1 Dataset Details For training and evaluation we leverage the ALFRED benchmark and dataset (Shridhar et al., 2020). The ALFRED training dataset contains∼ 6.6k trajectories collected by an optimal planner 225 following a set of 7 high-level tasks with randomly sampled objects (e.g., pick up an object and heat it). Each trajectory has at least three crowd-sourced sets of language instruction annotations. Each trajectory consists of a sequence of 3-19 individually annotated skills (see Figure G.2, left). This results in a total of 141k language-annotated skill trajectories. However, nearly half of the language instructions in the ALFRED dataset are navigation skill instructions like “turn left, then look up and walk to the counter on the right”. To get a more balanced skill annotation dataset, we merge all navigation skills with the skill that immediately follows them, using only the annotation of the next skill. After this processing step, the resulting dataset contains 73k language-annotated primitive skill trajectories. After we merge the navigation skills, the average number of skills in each trajectory is 3.5 skills per trajectory (Figure G.2, middle), and the average number of actions in each skill is 14.3 (Figure G.2, right). C.2 Evaluation Tasks Overview. We evaluate agents through zero-shot policy evaluation and finetuning on three sets of evaluation tasks in the ALFRED environment: (1) EVAL INSTRUCT to measure the ability of pre-trained agents to execute semantically meaningful instructions at varied levels of abstraction, (2) EVAL LENGTH to measure the ability of agents to chain behaviors across multiple trajectories to solve long tasks, and (3) EVAL SCENE to evaluate generalization performance when finetuning to unseen household floor plans. We did not use the official ALFRED benchmark test sets to construct EVAL SCENE since we require a task demonstration to compute how many subtasks the agent solved, which we don’t have for the test set tasks. However, the tasks we evaluate on generally are designed to be representative of the tasks in the ALFRED test set: they test the agent on unseen instruction-scene combinations and consist of varied-length, compositional tasks. 
Like the ALFRED test set, our evaluation consists of long-horizon tasks that require sequential execution of multiple subtasks. 226 Figure G.3: Data collection jupyter notebook page. Note that there is a “Skip” button so that human annotators can skip an instruction sequence if they do not feel it is semantically meaningful or easy to summarize. 227 Collecting evaluation task data. The ALFRED dataset provides high-level language annotations for each of the trajectories in the dataset. We could use these annotations as unseen task-instructions to evaluate our agents. However, we found that the different skills are not equally distributed across trajectories of different skill lengths, e.g., most 2-skill trajectories perform pick-and-place tasks while tasks involving heating skills only appear in length 7+ trajectories. To allow evaluation with a less biased skill distribution, we create the EVAL INSTRUCT task set by randomly choosing a trajectory from the ALFRED dataset and then randomly sampling a subsequence of skills of a certain length from this trajectory. To obtain a high-level language instruction that summarizes this new subsequence, we crowd-source labels from human annotators. For labeling, each annotator is presented with a remotely hosted Jupyter notebook interface (see Figure G.3). Whenever we by chance sample a full ALFRED trajectory for annotation, we directly used the existing high- level annotation from the ALFRED dataset. We annotate 80 trajectories with human annotators and combine them with 20 randomly sampled single-skill trajectories, resulting in a total of 100 evaluation tasks (see Figure G.4 for example instructions). This results in 20 tasks of length 1 skills, 20 tasks of length 2 skills, 20 tasks of length 3 skills, 20 tasks of length 4 skills, and 20 tasks of lengths 5+ (5-7) skills. For EVAL LENGTH , we randomly sampled 20 full trajectories from the ALFRED dataset that had sequences of 7 or 8 skills (10 of length 7, 10 of length 8) and removed these trajectories from the post-LLM aggregated training dataset. We did not remove any of the LLM-aggregated trajectories made up of subsequences of skills within that trajectory. This allows AM and SPRINT to perform skill chaining to solve these tasks by ensuring that there were valid sequences of skills to chain together to be able to solve these removed tasks. For example, assume a (shortened for clarity) sampled skill sequence is “pick up apple,” then “put apple in microwave”, then “slice the apple.” Then, either Actionable Models or SPRINT can chain together the sub-trajectory associated with “pick up apple” then “put apple in microwave” with the “slice the apple” sub-trajectory to solve this task. These trajectories all had annotations from ALFRED annotators, so we used those annotations directly (see Figure G.6 for example instructions). 228 Skills to Summarize: 1: Grab the knife on the counter. 2: Place the knife in the sink then turn the faucet on so water fills the sink. Turn the faucet off and pick up the knife again. 3: Place the knife on the table to the left of the wooden bowl. Annotator Summary: Wash the knife from the counter, put in on the table. Skills to Summarize: 1: Pick up the blue book closest to your and the phone from the bed. 2: Turn on the lamp to take a look at the book in the light. Annotator Summary: Examine the book by the light of a lamp. Skills to Summarize: 1: Pick up yellow candle on counter. 2: Open cabinet, put candle in cabinet, close cabinet 3: Pick up yellow candle from toilet. 
Annotator Summary: Move the candle from the sink to the cabinet under the sink, close it and and then pick the candle from the top of the toilet in front of you. Skills to Summarize: 1: Pick the pot on the left side up from the stove. 2: Set the bowl and knife on the table next to the tomato. Annotator Summary: Put the bowl with the knife in it next to the tomato. Skills to Summarize: 1: Pick up the pen that’s in front of you that’s under the mug. 2: Put the pencil in the mug that was above it. 3: Pick up the mug with the pencil in it. Annotator Summary: Put the pen into the mug and pick up the mug. Figure G.4: Randomly sampled, human language instruction annotations from the EVAL INSTRUCT task set. Finally, for EVAL SCENE , we collected a set of 10 full-length trajectories from the ALFRED “valid-unseen” dataset consisting of validation tasks in unseen floor plans. We collected 2 of each length from 1 through 5 for a total of 10 tasks by sampling random full-length trajectories from this dataset, with the exception of length 1 tasks (we just sample random skills to create length 1 tasks). As these are full trajectories, they already have human annotations from ALFRED, which we directly use as the task description (see Figure G.5 for example instructions). We list additional details about the tasks in each evaluation set in Table G.1. Finally, we display 5 randomly sampled tasks, along with their human annotations, from each of our task sets in Figures G.4, G.5, and G.6. 229 Skills to Summarize: 1: Pick up the lettuce on the counter. 2: Chill the lettuce in the fridge. 3: Put the chilled lettuce on the counter, in front of the bread. Annotator Summary: Put chilled lettuce on the counter. Skills to Summarize: 1: Pick up an egg from off of the kitchen counter. 2: Open the fridge, put the egg in to chill for a few seconds and then take it back out. 3: Place the cold egg in the sink. Annotator Summary: Chill an egg and put it in the sink. Skills to Summarize: 1: Pick up the butter knife off of the right side of the kitchen island. 2: Put the knife handle down in the frying pan that is on the front left burner of the stove. 3: Pick up the frying pan with the knife in it off of the stove. 4: Put the frying pan with the knife in it into the sink basin to the right of the potato. Annotator Summary: Put a frying pan with a knife in it into the sink. Skills to Summarize: 1: Take the pencil from the desk. 2: Put the pencil on the desk. Annotator Summary: Take the pencil from the desk, put it on the other side of the desk. Skills to Summarize: 1: Pick up the left pillow on the chair. 2: Put the pillow on the sofa right of the newspaper. 3: Pick up the pillow on the chair. 4: Put the pillow on the sofa left of the newspaper. Annotator Summary: Place two pillows on a sofa. Figure G.5: Randomly sampled, human language instruction annotations from the EVAL SCENE task set. 230 Skills to Summarize: 1: Pick up the knife in front of the lettuce. 2: Slice the apple in the sink with the knife. 3: Place the knife into the sink. 4: Pick up the sliced apple from the sink. 5: Place the apple slice into the pot on the stove. 6: Pick up the pot from the stove. 7: Pick up the pot from the stove. Annotator Summary: Slice an apple for the pot on the stove and put the pot on the counter to the right of the door. Skills to Summarize: 1: Take the apple from the counter in front of you. 2: Place the apple in the sink in front of you. 3: Take the knife by the sink in front of you. 4: Cut the apple in the sink in front of you. 
5: Place the knife in the sink in front of you. 6: Take an apple slice from the sink in front of you. 7: Heat the apple in the microwave, take it out and close the microwave. 8: Place the apple slice in the sink in front of you. Annotator Summary: Place a warm apple slice in the sink. Skills to Summarize: 1: Pick up the loaf of bread. 2: Put the bread on the counter above the spatula. 3: Pick up the knife that’s above and to the right of the loaf of bread. 4: Cut the top half of the loaf of bread into slices. 5: Put the knife on the edge of the counter in front of you horizontally. 6: Pick up a slice of bread from the middle of the loaf. 7: Cook the bread in the microwave then take it out and close the microwave door. 8: Throw the cooked slice of bread away. Annotator Summary: Put a microwaved slice of bread in the oven. Skills to Summarize: 1: Pick the knife up from off of the table. 2: Open the microwave, slice the potato, and close the microwave. 3: Open the microwave, place the knife inside of it, and close the microwave. 4: Open the microwave, pick up the potato slice inside, close the microwave. 5: Place the potato slice in the pan on the stove. 6: Pick up the pan from the stove. 7: Open the refrigerator, place the pan inside, and close the refrigerator. Annotator Summary: Move the pan from the stove top to inside the black refrigerator. Skills to Summarize: 1: Pick up the red tomato on the counter to the right of the stove. 2: Put the tomato onto the island below the butter knife. 3: Pick up the butter knife off of the kitchen island. 4: Slice up the tomato on the kitchen island. 5: Place the butter knife onto the island to the right of the sliced tomato. 6: Pick up a tomato slice off of the kitchen island. 7: Open the fridge and put the tomato slice on the bottom shelf, then close the door, after a couple seconds open the fridge and remove the tomato slice then close the door. 8: Open the microwave door and place the tomato slice inside the microwave in front of the egg. Annotator Summary: Put a chilled tomato slice into the microwave. Figure G.6: Randomly sampled, human language instruction annotations from the EVAL LENGTH task set. 231 Online finetuning environment setup. During online-finetuning we initialize the agent in the same house floor plan as the trajectory the task was extracted from to ensure executability. During finetuning, we give each episode a time horizon of 2x the number of environment actions needed by the expert planner to solve the task. We give sparse rewards for each skill solved by the agent during the episode. Therefore for length 1 tasks, the agent can only be rewarded once before the episode ends, while for length 5 tasks, the episode terminates on the fifth reward signal. We give a reward of 1 num total skills for each skill the agent successfully executes so that the return sums to 1. We found that this helped to finetune all comparison methods more stably, possibly due to the fact that giving larger rewards (e.g., 1 for each skill) results in out-of-distribution critic values (when compared to pre-training) that de-stabilize online reinforcement learning. 232 Skills to summarize: 1. Pick the clock up from the desk. 2. Turn the lamp on. OPT-350m Summary: Look at the clock. OPT-1.3b Summary: Turn on the lamp. OPT-13b Summary: Look at the clock. Skills to summarize: 1. Take the plate from the cabinet to the left of the stove. 2. Warm the plate in the microwave, above the stove. 3. Put the plate in the cabinet. 
Skills to summarize: 1. Pick the clock up from the desk. 2. Turn the lamp on.
OPT-350m Summary: Look at the clock. OPT-1.3b Summary: Turn on the lamp. OPT-13b Summary: Look at the clock.

Skills to summarize: 1. Take the plate from the cabinet to the left of the stove. 2. Warm the plate in the microwave, above the stove. 3. Put the plate in the cabinet.
OPT-350m Summary: Place the plate in the microwave. OPT-1.3b Summary: Put a warmed plate in the cabinet. OPT-13b Summary: Put a warmed plate in the cabinet.

Skills to summarize: 1. Open the microwave, take out the potato and shut the door. 2. Put the potato in the sink, turn the water on and off again. Take the potato out of the sink. 3. Place the potato on the counter, to the right of the sink.
OPT-350m Summary: Put the potato on the counter, to the left of the sink. OPT-1.3b Summary: Place the potato on the counter, to the right of the sink. OPT-13b Summary: Put a rinsed potato on the counter to the right of the sink.

Figure G.7: Comparison of OPT-350m, 1.3b, and 13b summaries on 3 randomly sampled sequences from our dataset. In general, summaries become better as the model size increases. However, there are some sequences for which none of the three models generates a good summary, such as the first sequence in these examples.

D Extended Experiments, Results, and Analysis

Table G.2: EVAL INSTRUCT and EVAL LENGTH eval dataset per-length and overall skill completion rates. See Section 8.4 for experiment setup.

EVAL INSTRUCT                          AM            L-BC          SPRINT        ET
Number of Completed Subtasks Overall   0.46 ± 0.05   0.41 ± 0.02   1.27 ± 0.13   0.84 ± 0.01
Length 1 Progress                      0.23 ± 0.01   0.77 ± 0.04   0.53 ± 0.07   0.78 ± 0.09
Length 2 Progress                      0.22 ± 0.03   0.24 ± 0.03   0.49 ± 0.03   0.53 ± 0.06
Length 3 Progress                      0.17 ± 0.03   0.16 ± 0.04   0.48 ± 0.03   0.33 ± 0.12
Length 4 Progress                      0.10 ± 0.01   0.06 ± 0.01   0.35 ± 0.06   0.24 ± 0.07
Length 5 Progress                      0.16 ± 0.08   0.02 ± 0.01   0.46 ± 0.06   0.09 ± 0.01
Length 6 Progress                      0.16 ± 0.01   0.02 ± 0.03   0.36 ± 0.12   0.08 ± 0.04
Length 7 Progress                      0.00 ± 0.00   0.00 ± 0.01   0.01 ± 0.02   0.00 ± 0.00

EVAL LENGTH                            AM            L-BC          SPRINT        ET
Number of Completed Subtasks Overall   0.67 ± 0.09   0.04 ± 0.05   2.59 ± 0.66   1.04 ± 0.35
Length 7 Progress                      0.04 ± 0.01   0.00 ± 0.00   0.25 ± 0.10   0.09 ± 0.00
Length 8 Progress                      0.13 ± 0.02   0.01 ± 0.01   0.43 ± 0.08   0.03 ± 0.01

Here, we present additional results complementary to the experiments in the main paper in Section 8.4. We present and analyze LLM annotation examples in Section D.1, zero-shot evaluations with step-by-step task instructions in Section D.3, and an extended ablation analysis in Section D.4.

D.1 LLM Summary Examples

We randomly sample 12 OPT-13B task summaries produced while performing skill aggregation (explained in Section 8.3.2) using the prompt in Figure G.1 and display them in Figure G.8. Because we sample the tokens sequentially during text generation, some of the summaries mostly describe only one or a few of the skills, sometimes in a nonsensical manner (e.g., LLM(“Pick up the plastic bowl on the left. Turn on the lamp.”) = “Look at the plastic bowl on the left”). The summaries could also likely be improved by including more examples in the prompt. However, after analyzing many more examples, we see that OPT-13B generally provides fitting high-level summaries for most sequences by skipping over implied sub-tasks (although it sometimes also skips over important sub-tasks, likely due to the prompt). For example, it summarizes the task “Pick up the plaid pillow that is on the left end of the couch. Place the pillow on the ottoman” as “Place a plaid pillow on the ottoman,” in which picking up the plaid pillow is implied. It also often demonstrates object relation understanding, e.g., by stating “Place a plaid pillow” when only the first sub-task in the sequence mentions that the pillow is plaid.
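For concreteness, the sketch below shows one way such skill summaries could be produced with a public OPT checkpoint through the Hugging Face transformers library. It is only an illustrative approximation of the skill-aggregation step: the exact prompt text (Figure G.1), the decoding parameters, and the summarize_skills helper are assumptions, while the model names refer to the publicly released OPT checkpoints.

```python
# Minimal sketch of LLM-based skill summarization (illustrative only).
# Assumes a prompt in the style of Figure G.1: a few in-context examples
# followed by the skill annotations to be summarized.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-350m"  # or "facebook/opt-1.3b" / "facebook/opt-13b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def summarize_skills(skill_annotations, prompt_examples=""):
    """Summarize a sequence of skill annotations into one high-level instruction."""
    numbered = " ".join(f"{i + 1}: {s}" for i, s in enumerate(skill_annotations))
    prompt = (
        f"{prompt_examples}\n"
        f"Skills to Summarize: {numbered}\n"
        f"Annotator Summary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # Tokens are sampled sequentially, which is why some summaries latch onto
    # a single sub-task (see the discussion above).
    outputs = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=30)
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Example usage with annotations similar to those in Figure G.8; in practice,
# prompt_examples would hold the few-shot examples from Figure G.1.
print(summarize_skills([
    "Pick up the plaid pillow that is on the left end of the couch.",
    "Place the pillow on the ottoman.",
]))
```

A real pipeline would likely also post-process the generation, e.g., truncating at the first line break, since a few-shot-prompted causal LM tends to continue with further made-up examples.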
D.2 Comparing Summaries from Different LLMs

In this section, we justify the use of OPT-13b as our summarization model by comparing its summaries of three different skill sequences to those of OPT-350m (350 million parameters) and OPT-1.3b (1.3 billion parameters). The summaries and sequences are displayed in Figure G.7. Although these are 3 randomly sampled examples, they highlight differences between the models that hold true across many examples that we examined. As the language model size increases, the model captures more of the relationships between sub-tasks in each sequence and therefore generates better summaries. An example of this is the second sequence in Figure G.7, where both OPT-1.3b and OPT-13b understand the high-level task of putting a warmed plate in the cabinet, as opposed to OPT-350m, which essentially repeats the second sub-task in the sequence. Yet sometimes only OPT-13b produces a reasonable summary: in the last example, OPT-13b produces a summary that implies that an agent must wash the potato (“Put a rinsed potato on the counter...”), while the other two models’ summaries only mention the last step of putting the potato down.

D.3 Step-by-Step Zero-Shot Evaluations

Humans may sometimes prefer to give more detailed instructions, e.g., if they do not trust a household robot to successfully execute a high-level instruction. In that case, they are likely to give detailed, step-by-step instructions instead. In this section, we demonstrate the effectiveness of our pre-training strategy in enabling execution of longer-horizon tasks when specified by such step-by-step instructions. To generate these step-by-step instructions, we combine all subtask instructions into one paragraph and condition agents on it instead of the high-level human annotation. For example, a task with 2 subtasks, “Pick up the knife” and “Slice the potato,” is given the simple task annotation “Pick up the knife. Slice the potato.”

Skills to Summarize: 1: Pick up the plaid pillow that is on the left end of the couch. 2: Place the pillow on the ottoman.
LLM-generated Summary: Place a plaid pillow on the ottoman.

Skills to Summarize: 1: Pick up the piece of potato furthest to the right. 2: Open the microwave, place the potato inside, close the door, microwave for two seconds, then remove the potato. 3: Place the cooked potato piece to the right of the salt shaker.
LLM-generated Summary: Place a cooked potato piece on the table.

Skills to Summarize: 1: Put the spoon in the pot on the left. 2: Pick up the pot with the spoon.
LLM-generated Summary: Put the pot with the spoon on the left.

Skills to Summarize: 1: Open the microwave above the stove and slice the apple inside with the knife. 2: Place the knife standing up on the blue dish. 3: Remove a slice of apple from the microwave. 4: Place the apple slice on the blue plate.
LLM-generated Summary: Put an apple slice on the blue plate.

Skills to Summarize: 1: Pick up the butter knife that is next to the fork. 2: Open the refrigerator. 3: Open the refrigerator and place the knife on the bottom shelf in the refrigerator then close the door.
LLM-generated Summary: Open the refrigerator and place the butter knife on the bottom shelf in the refrigerator then close the door.

Skills to Summarize: 1: Open the fridge and put the knife inside. 2: Take a piece of the apple out of the microwave.
LLM-generated Summary: Take a slice of apple out of the microwave.

Skills to Summarize: 1: Pick up the remote that is on the shelf. 2: Put the remote on the couch.
LLM-generated Summary: Put the remote on the couch.

Skills to Summarize: 1: Pick up the dirty green towel to the right of the bar of soap on the vanity. 2: Put the towel into the sink, rinse it for a few seconds and pick it back up. 3: Open the far left bottom drawer to the left of the sink, put the rinsed green towel inside and close the drawer.
LLM-generated Summary: Put the towel in the drawer.

Skills to Summarize: 1: Pick up the plastic bowl on the left. 2: Turn on the lamp.
LLM-generated Summary: Look at the plastic bowl on the left.

Skills to Summarize: 1: Slice up the fruit that is in the sink. 2: Place the knife in the second row of the fridge. 3: Grab a slice of the fruit. 4: Place the fruit slice on the back right pan. 5: Grab the pan that has the fruit in it.
LLM-generated Summary: Put a slice of fruit on a pan in the fridge

Skills to Summarize: 1: Pick up the pot on the front right burner. 2: Chill the pot in the fridge then get it back out and close the door.
LLM-generated Summary: Put a pot on the stove, put it in the fridge, get it out, close the door.

Skills to Summarize: 1: Pick up the knife on the counter. 2: Cut the potato in the sink into slices. 3: Put the knife in the sink with the potato. 4: Pick up a slice of potato from in the sink. 5: Warm the potato slice in the microwave and then remove it.
LLM-generated Summary: Warm the potato slice in the microwave.

Figure G.8: LLM Summary examples. We randomly sample 12 language combinations generated by the OPT-13B text generation model (Zhang et al., 2022) and display them without cherry-picking specific examples. The LLM often creates a high-level description by skipping implied steps (like humans do), e.g., “Put the remote on the couch” implies the need to pick up the remote.

We compare SPRINT and L-BC on both EVAL INSTRUCT and EVAL LENGTH tasks with the combined, step-by-step task annotations in Table G.3. L-BC and SPRINT both perform similarly on the step-by-step EVAL INSTRUCT tasks compared to the original annotations, while both perform better on EVAL LENGTH tasks than with the original annotations. This is likely because EVAL LENGTH tasks are on average much longer (7 or 8 subtasks), so instructions spelling out exactly what to do allow agents to more easily understand which subtasks should be performed and in which order.

Table G.3: Step-by-step evaluation: number of completed sub-tasks.

                                    L-BC          SPRINT
EVAL INSTRUCT Completed Subtasks    0.46 ± 0.01   1.24 ± 0.19
EVAL LENGTH Completed Subtasks      0.22 ± 0.03   3.11 ± 0.08

D.4 Extended Ablation Study

In this section, we examine a series of additional questions regarding specific design choices of our method, complementary to Section 8.4.5. We also report all ablations’ zero-shot policy evaluation performance on both the EVAL INSTRUCT and EVAL LENGTH task sets in Table G.4.

Table G.4: Zero-shot ablation subtask completion.

                                    EVAL INSTRUCT Subtasks Completed   EVAL LENGTH Subtasks Completed
SPRINT (ours)                       1.27 ± 0.13                        2.59 ± 0.66
SPRINT w/ LLM-chain                 1.15 ± 0.01                        2.46 ± 0.21
SPRINT w/o chain                    0.91 ± 0.03                        2.04 ± 0.04
SPRINT w/o chain, w/ concat-agg     0.77 ± 0.06                        0.67 ± 0.20
L-BC w/ LLM-agg                     0.93 ± 0.04                        1.86 ± 0.12
SPRINT w/o LLM-agg                  0.38 ± 0.05                        0.10 ± 0.04

How much does the LLM contribute to skill aggregation? To answer this question, we compare SPRINT with LLM aggregation on adjacent sub-trajectory sequences but no chaining (SPRINT w/o chain) to SPRINT with skill aggregation where the skills are instead relabeled by naïvely concatenating the sentences together (SPRINT w/o chain, w/ concat-agg). Across both task sets, SPRINT w/o chain, w/ concat-agg is outperformed by SPRINT w/o chain, especially in EVAL LENGTH. This signifies that using the LLM helps with understanding very long-horizon, high-level semantic instructions, as the LLM generates relevant task summaries for consecutive skills in a trajectory.
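To make the difference between the two aggregation variants concrete, here is a minimal sketch of how the language relabeling could differ when two adjacent sub-trajectories are merged. The data layout and the llm_summarize argument are hypothetical placeholders (for instance, a call like the OPT sketch shown earlier), not the implementation used in this thesis.

```python
# Illustrative sketch of the two relabeling variants compared above
# (concat-agg vs. LLM aggregation); names and data layout are assumptions.
from typing import Callable, Dict, List


def concat_aggregate(annotations: List[str]) -> str:
    """Naive aggregation: concatenate the skill annotations verbatim."""
    return " ".join(annotations)


def llm_aggregate(annotations: List[str],
                  llm_summarize: Callable[[List[str]], str]) -> str:
    """LLM aggregation: ask a language model for a high-level summary of
    the consecutive skills (hypothetical llm_summarize helper)."""
    return llm_summarize(annotations)


def merge_adjacent(traj_a: Dict, traj_b: Dict, relabel) -> Dict:
    """Merge two adjacent sub-trajectories and relabel their instruction."""
    return {
        "transitions": traj_a["transitions"] + traj_b["transitions"],
        "annotation": relabel([traj_a["annotation"], traj_b["annotation"]]),
    }


a = {"transitions": ["..."], "annotation": "Pick up the knife."}
b = {"transitions": ["..."], "annotation": "Slice the potato."}
print(merge_adjacent(a, b, concat_aggregate)["annotation"])
# -> "Pick up the knife. Slice the potato."
```

Under concat-agg the relabeled instruction simply grows with the number of merged skills, whereas LLM aggregation compresses the sequence into a single high-level instruction, which is the difference the ablation above isolates.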
What if we relabel with the LLM during chaining? We also examine using the LLM to label skills during cross-trajectory skill aggregation (SPRINT w/ LLM-chain), rather than concatenating the skill annotations together (SPRINT). Overall, SPRINT w/ LLM-chain performs slightly worse in average return and success rates. When analyzing the summaries generated by the LLM, we found that randomly paired instructions can rarely be summarized meaningfully, resulting in noisy and sometimes meaningless instructions. Therefore, we implemented SPRINT w/ LLM-chain by only utilizing the top 1% of in-batch chaining candidates (the top 10 candidates with our batch size of 1024), ranked by the LLM’s prediction of what the next skill should be. Even so, the performance is not better than simply concatenating the instructions, as even these top 1% candidates still have a high chance of not being sensible sentences to summarize.

Does LLM skill aggregation help L-BC? We perform LLM skill aggregation with the L-BC baseline in order to examine whether LLM skill aggregation alone can help pure imitation learning methods (L-BC w/ LLM-agg). This change alone doubles the average number of subtasks completed compared to regular L-BC on EVAL INSTRUCT and increases the number of completed subtasks on EVAL LENGTH from 0.04 to 1.86 (see Table G.2 for the original L-BC results). These results demonstrate the significant impact LLM aggregation alone can make.

D.5 Qualitative Comparison Results

Zero-shot evaluation. We compare SPRINT, AM, and L-BC zero-shot evaluation results on long EVAL LENGTH tasks in Figure G.9. In general, SPRINT is able to make substantially more progress on EVAL LENGTH tasks as it leverages the large language model to generate longer-horizon, semantically meaningful pre-training tasks and performs cross-trajectory chaining to learn to chain together its existing dataset tasks. In the visualized examples, SPRINT is able to understand and successfully execute many of the sub-tasks implied but not directly stated by the natural language task instruction. L-BC makes very little progress on these tasks, not even identifying what the first sub-task to complete should be, as the task annotation is out of distribution relative to what it saw while training. Finally, AM is able to make some progress on some of these tasks due to its long-horizon goal pre-training objective. However, this is less effective than our language-conditioned pre-training in such zero-shot evaluations.

Finetuning. We finetune SPRINT, AM, and L-BC on EVAL SCENE tasks, in household floor plans that were never seen while training, and visualize qualitative policy rollout examples after finetuning in Figure G.10. In general, SPRINT is able to finetune to longer-horizon tasks, while AM and L-BC both struggle to make progress on longer-horizon tasks despite receiving rewards for every completed sub-task.
SPRINT’s ability to complete more sub-tasks on many of the longer-horizon tasks is demonstrated in Figure G.10a, while a case in which both SPRINT and AM make partial progress throughout finetuning is demonstrated in Figure G.10b. We believe that AM has more trouble finetuning on these tasks than SPRINT because the task specification for AM (goal images) is out of distribution; pre-training on semantic tasks with SPRINT allows agents to more easily learn longer-horizon behaviors, as the task specifications may still be within the distribution of the pre-training tasks that LLM skill aggregation and skill chaining produce.

(a) Task: Throw away a microwaved slice of potato. Completed subtasks: SPRINT 8/8, AM 1/8, L-BC 0/8. SPRINT successfully solves this task, while AM fails to slice the potato and repeatedly alternates between putting the potato in the fridge and the microwave. L-BC fails to even pick up the potato, as the task annotation does not directly describe picking up a potato.

(b) Task: Place a cooked potato slice inside the refrigerator. Completed subtasks: SPRINT 7/8, AM 0/8, L-BC 0/8. SPRINT nearly solves this task, while AM picks up an egg instead of a potato. L-BC picks up random objects not related to the annotation.

(c) Task: Put a chilled tomato slice into the microwave. Completed subtasks: SPRINT 8/8, AM 1/8, L-BC 0/8. SPRINT completes the entire task. AM picks up the tomato but fails to put it down onto the counter and slice it. L-BC aimlessly wanders and picks up random objects.

Figure G.9: Visualizations of zero-shot policy rollouts on three tasks in the EVAL LENGTH task set.

(a) Task: Place two pillows on a sofa. Completed subtasks: SPRINT 3/4, AM 1/4, L-BC 1/4. SPRINT picks up and places one of the pillows on the sofa and picks up the second, but does not manage to place the second on the sofa, thus completing 3/4 subtasks. AM and L-BC both learn to pick up a pillow but never learn to place it in the correct spot.

(b) Task: Take the pencil from the desk, put it on the other side of the desk. Completed subtasks: SPRINT 1/2, AM 1/2, L-BC 0/2. SPRINT and AM both learn to pick up a pencil from the desk, although neither manages to put the pencil down in the correct place “on the other side of the desk.” Meanwhile, L-BC never picks up the pencil.

Figure G.10: Visualizations of policy rollouts on two tasks in the EVAL SCENE task set, after finetuning each method. These floor plans were originally unseen to all agents until finetuning.
Abstract
Humans can efficiently learn a wide range of complex tasks, like cooking new dishes or using unfamiliar tools, from only a few trials. In order to build autonomous agents that can assist us in our everyday lives, we need to develop approaches that can learn a comparably wide range of long-horizon tasks efficiently. Recently, deep reinforcement learning works have demonstrated impressive manipulation capabilities in unstructured environments. Yet, a major bottleneck of existing learning algorithms is their sample efficiency: since agents are often trained from scratch, they require many online interactions to explore complex environments and learn long-horizon tasks. But why train from scratch? Humans use a large repertoire of skills for learning new tasks – can we enable the same in artificial agents? In this thesis, we present a skill-based robot learning framework that can effectively reuse prior experience to accelerate the learning of new tasks.
We first present a scalable approach for extracting a large skill repertoire from offline agent experience. We then introduce algorithms that leverage the skill repertoire to accelerate the learning of new tasks in reinforcement learning (RL), imitation and meta-RL settings. We further demonstrate how our skill-based framework can incorporate diverse human video data to teach robotic agents. Finally, we introduce assisted teleoperation approaches for collecting the large robotic experience datasets we use for skill learning and discuss methods for skill learning from diverse-quality data via model-based and language-conditioned offline reinforcement learning.
The work presented here is a step towards data-driven learning of diverse, long-horizon robotic tasks. We hope that it will contribute to building agents that can eventually assist humans in a wide range of tasks in their everyday lives.
Conceptually similar
Scaling robot learning with skills
Algorithms and systems for continual robot learning
Leveraging cross-task transfer in sequential decision problems
Efficiently learning human preferences for proactive robot assistance in assembly tasks
High-throughput methods for simulation and deep reinforcement learning
Program-guided framework for interpreting and acquiring complex skills with learning robots
Accelerating robot manipulation using demonstrations
Rethinking perception-action loops via interactive perception and learned representations
Data-driven acquisition of closed-loop robotic skills
Sample-efficient and robust neurosymbolic learning from demonstrations
Characterizing and improving robot learning: a control-theoretic perspective
Advancing robot autonomy for long-horizon tasks
Quickly solving new tasks, with meta-learning and without
Learning from planners to enable new robot capabilities
Data scarcity in robotics: leveraging structural priors and representation learning
Decision making in complex action spaces
Closing the reality gap via simulation-based inference and control
Active sensing in robotic deployments
Learning objective functions for autonomous motion generation
Machine learning of motor skills for robotics
Asset Metadata
Creator
Pertsch, Karl (author)
Core Title
Leveraging prior experience for scalable transfer in robot learning
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-05
Publication Date
04/11/2023
Defense Date
03/09/2023
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
deep learning,OAI-PMH Harvest,reinforcement learning,robot learning,robotics,transfer learning
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Lim, Joseph (committee chair), Sukhatme, Gaurav (committee chair), Bansal, Somil (committee member), Nikolaidis, Stefanos (committee member), Thomason, Jesse (committee member)
Creator Email
karl.pertsch@gmail.com,pertsch@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113010444
Unique identifier
UC113010444
Identifier
etd-PertschKar-11602.pdf (filename)
Legacy Identifier
etd-PertschKar-11602
Document Type
Dissertation
Format
theses (aat)
Rights
Pertsch, Karl
Internet Media Type
application/pdf
Type
texts
Source
20230412-usctheses-batch-1020 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu